Efficiently Read last 'n' rows of CSV into DataFrame

I don't think pandas offers a way to do this in read_csv.

Perhaps the neatest (in one pass) is to use collections.deque:

from collections import deque
from io import StringIO  # on Python 2: from StringIO import StringIO

import pandas as pd

with open(fname, 'r') as f:
    q = deque(f, 2)  # replace 2 with n (the number of lines to keep from the end)

In [12]: q
Out[12]: deque(['7,8,9\n', '10,11,12'], maxlen=2)  # these are the last two lines of my csv

In [13]: pd.read_csv(StringIO(''.join(q)), header=None)

Another option worth trying is to count the lines in a first pass and then read the file again, skipping all but the last n rows via read_csv's skiprows parameter.
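A minimal sketch of that two-pass approach (the file name and its contents are made up purely for illustration):

```python
import pandas as pd

# Hypothetical sample file, just for illustration.
fname = 'data.csv'
with open(fname, 'w') as f:
    f.write('1,2,3\n4,5,6\n7,8,9\n10,11,12\n')

n = 2  # number of trailing rows wanted

# First pass: count the lines without loading them.
with open(fname) as f:
    total = sum(1 for _ in f)

# Second pass: let read_csv skip everything except the last n rows.
df = pd.read_csv(fname, header=None, skiprows=max(total - n, 0))
```

Here df ends up holding only the last two rows of the file. The first pass still streams the whole file, but it avoids building any Python objects per line, so it is cheap compared with parsing everything.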


Here's a handy way to do it, using the tailer package; it works well for this:

import io

import pandas as pd
import tailer as tl

with open(fname) as file:
    last_lines = tl.tail(file, 15)  # read the last 15 lines; change 15 to any value

df = pd.read_csv(io.StringIO('\n'.join(last_lines)), header=None)


Files are simply streams of bytes. Lines do not exist as separate entities; they are an artifact of treating certain bytes as newline characters. As such, you must read from the beginning of the file to identify lines in order.

If the file doesn't change (often) and this is an operation you need to perform often (say, with different values of n), you can store the byte offsets of the newline characters in a second file. You can use this much-smaller file and the seek command to quickly jump to a given line in the first file and read from there.
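A minimal sketch of that index-then-seek idea, using only the standard library (the file name and contents are invented for illustration; in practice the offsets list would be written to and reloaded from the second file):

```python
fname = 'data.csv'
with open(fname, 'w') as f:
    f.write('1,2,3\n4,5,6\n7,8,9\n10,11,12\n')

# One pass: record the byte offset at which each line starts.
# offsets[i] is where line i begins; offsets[-1] is the file size.
offsets = [0]
with open(fname, 'rb') as f:
    for line in f:
        offsets.append(offsets[-1] + len(line))

n = 2
start = offsets[max(len(offsets) - 1 - n, 0)]

# Jump straight to the n-th line from the end and read to EOF.
with open(fname, 'rb') as f:
    f.seek(start)
    tail = f.read().decode()
```

The resulting tail string ('7,8,9\n10,11,12\n' for this sample file) can then be handed to pd.read_csv via io.StringIO, exactly as in the answers above, so only the last n lines are ever parsed.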

(Some operating systems provide record-oriented files that have more complex internal structure than the common flat file. The above does not apply to them.)