
Length of an XML file


31 GB is a really big text file. I'd guess it would compress down to about 1.5 GB. I would create these files in a compressed format to begin with; then you can stream a decompressed version of the file through wc. This greatly reduces the amount of disk I/O needed to process the file, and because the data is streamed, memory use stays low. gzip can read and write compressed streams.
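As a rough sketch of that streaming idea, the snippet below counts newlines in a gzip-compressed file by decompressing it one chunk at a time. The function name count_lines_gz and the 1 MiB chunk size are my own choices for illustration, and it assumes an ASCII-compatible encoding such as UTF-8, where a newline is always the single byte 0x0A.

```python
import gzip


def count_lines_gz(path, chunk_size=1 << 20):
    """Count newline bytes in a gzip file, decompressing chunk by chunk."""
    total = 0
    # gzip.open in binary mode yields decompressed bytes on read();
    # only one chunk is held in memory at a time.
    with gzip.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:  # empty bytes object means end of stream
                break
            total += chunk.count(b"\n")
    return total
```

Like wc -l, this counts newline characters, so a final line without a trailing newline is not counted.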

But I would also make the following comments:

  • Line counts are not really that informative for XML, because whitespace between elements is generally insignificant (except in mixed content), so the same document could be written on one line or on millions. What do you really want to know about the dataset? I bet counting elements would be more useful.
  • Make sure your XML file is not unnecessarily redundant. For example, are you repeating the same namespace declarations all over the document?
  • Perhaps XML is not the best way to represent this data. If it is, try looking into something like Fast Infoset, a binary encoding of the XML Infoset.
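If counting elements turns out to be what you want, here is a minimal sketch using the streaming parser in Python's standard library. The function name count_elements is hypothetical, and the memory trick assumes the file is one root element holding a very large number of records that are each small; a different shape would need a different clearing strategy.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source):
    """Count elements per tag name while streaming through the document."""
    counts = Counter()
    depth = 0
    root = None
    # source may be a filename or any binary file object with read().
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first start event is the root element
            depth += 1
        else:
            counts[elem.tag] += 1
            depth -= 1
            if depth == 1:
                # A direct child of the root just finished: drop everything
                # collected under the root so memory does not keep growing.
                root.clear()
    return counts
```

Because iterparse accepts a file object, you can pass it gzip.open(path, "rb") and count elements straight from the compressed file. Namespaced tags are reported in the form {namespace-uri}localname.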


If all you need is the line count, wc -l will be as fast as anything else.

The problem is the 31GB text file.


If accuracy isn't an issue, find the average line length and divide the file size by it. That way you can get a really fast approximation. (The file size is in bytes, so measure the line length in bytes too, or otherwise account for the character encoding used.)
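A minimal sketch of that estimate: read a sample from the start of the file, compute the average bytes per line in the sample, and scale up to the full file size. The name estimate_line_count and the 64 MiB default sample are assumptions of mine, and it again assumes an ASCII-compatible encoding such as UTF-8. Working in bytes throughout avoids having to convert between characters and bytes.

```python
import os


def estimate_line_count(path, sample_bytes=64 * 1024 * 1024):
    """Estimate the number of lines from the average line length in a sample."""
    size = os.path.getsize(path)  # total file size in bytes
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)  # only the head of the file is read
    newlines = sample.count(b"\n")
    if newlines == 0:
        # No newline in the sample: empty file, or one very long line.
        return 0 if size == 0 else 1
    avg_line_bytes = len(sample) / newlines
    return round(size / avg_line_bytes)
```

The estimate is only as good as the sample. If the head of the file is not representative (a long header, or records that change shape later on), sampling a few chunks from different offsets would give a better average.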