Is there an elegant way to count tag elements in an XML file using lxml in Python?

Is there an elegant way to count tag elements in an XML file using lxml in Python?


If you want to count all author tags:

    import lxml.etree

    doc = lxml.etree.parse(xml)
    count = doc.xpath('count(//author)')
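Note that XPath's count() returns a float (3.0 in the results further down). If an integer is wanted, a minimal sketch along these lines may help. The sample document and the name SAMPLE are made up for illustration, and the import falls back to the standard library's ElementTree only so that the snippet still runs where lxml is not installed:

```python
try:
    from lxml import etree  # preferred, as in the answer above
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical sample document, not taken from the original question.
SAMPLE = b"""<library>
  <book><author>A</author><author>B</author></book>
  <book><author>C</author></book>
</library>"""

root = etree.fromstring(SAMPLE)

# len() of the list of matching elements gives an int directly.
count_findall = len(root.findall(".//author"))

# iter() walks the tree lazily, which avoids building a list.
count_iter = sum(1 for _ in root.iter("author"))

# With lxml (not the fallback), the XPath float can simply be cast to int.
count_xpath = int(root.xpath("count(//author)")) if hasattr(root, "xpath") else None

print(count_findall, count_iter, count_xpath)
```

findall() and iter() behave the same way in lxml and in ElementTree, so this variant is also a reasonable choice when the code has to work with either library.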


One must be careful when using the re module on SGML/XML/HTML text, because not every treatment of such files can be performed with a regex (regexes aren't able to parse SGML/HTML/XML text in general).

But here, in this particular problem, it seems to me that it is possible (re.DOTALL is mandatory because an element may extend over more than one line; apart from that, I can't think of any other pitfall).
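The re.DOTALL point can be checked on a small made-up sample, using the same pattern as the benchmark below. The sample also includes an element with an attribute, a case this pattern would miss if the real file contained one (an assumption, since the original file is not shown):

```python
import re

# Hypothetical sample: the second element spans two lines,
# the third one carries an attribute.
TEXT = """<authors>
<author>Ann</author>
<author>Bob
Smith</author>
<author id="3">Cy</author>
</authors>"""

pattern = "<author>.*?</author>"

# Without re.DOTALL, '.' does not match a newline,
# so the two-line element is not counted.
without_dotall = len(re.findall(pattern, TEXT))

# With re.DOTALL, the two-line element is counted as well.
with_dotall = len(re.findall(pattern, TEXT, re.DOTALL))

# Neither call matches <author id="3">, because the pattern
# requires the literal opening tag '<author>'.
print(without_dotall, with_dotall)
```

If the markup may contain attributes, comments or CDATA sections, counting with the parser is probably the safer choice.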

    from time import clock

    n= 10000
    print 'n ==',n,'\n'

    import lxml.etree
    doc = lxml.etree.parse('xml.txt')

    te = clock()
    for i in xrange(n):
        countlxml = doc.xpath('count(//author)')
    tf = clock()
    print 'lxml\ncount:',countlxml,'\n',tf-te,'seconds'

    import re
    with open('xml.txt') as f:
        ch = f.read()
    regx = re.compile('<author>.*?</author>',re.DOTALL)
    te = clock()
    for i in xrange(n):
        countre = sum(1 for mat in regx.finditer(ch))
    tf = clock()
    print '\nre\ncount:',countre,'\n',tf-te,'seconds'

Result:

    n == 10000

    lxml
    count: 3.0
    2.84083032899 seconds

    re
    count: 3
    0.141663256084 seconds
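The benchmark above is Python 2 code: it uses the print statement, xrange and time.clock, the last of which was removed in Python 3.8. A rough Python 3 equivalent might look like the sketch below. SAMPLE, N and the helper names are illustrative, the input is an in-memory string rather than xml.txt, and the timings will differ from machine to machine:

```python
import re
import timeit

try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the contents of xml.txt.
SAMPLE = """<books>
<book><author>Ann</author></book>
<book><author>Bob
Smith</author></book>
<book><author>Cy</author></book>
</books>"""

N = 1000  # illustrative repeat count, smaller than the original 10000

root = etree.fromstring(SAMPLE)
regx = re.compile("<author>.*?</author>", re.DOTALL)


def count_parser():
    # lxml elements have .xpath(); the stdlib fallback does not,
    # so findall() is used there instead.
    if hasattr(root, "xpath"):
        return int(root.xpath("count(//author)"))
    return len(root.findall(".//author"))


def count_regex():
    # Same counting expression as in the original benchmark.
    return sum(1 for _ in regx.finditer(SAMPLE))


for name, func in (("parser", count_parser), ("re", count_regex)):
    seconds = timeit.timeit(func, number=N)
    print(f"{name}\ncount: {func()}\n{seconds:.6f} seconds\n")
```

timeit is used here instead of manual clock() bookkeeping, but it measures the same thing: the total time for N repeated counts on an already parsed document and an already compiled pattern.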