How can I use the python HTMLParser library to extract data from a specific div tag?

python html parsing html-parsing

import HTMLParser  # Python 2 module; in Python 3 the class lives in html.parser

class LinksParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.recording = 0  # nesting depth inside the triggering div
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            # Already inside the triggering div: count the nested div
            self.recording += 1
            return
        for name, value in attributes:
            if name == 'id' and value == 'remository':
                break
        else:
            # No matching id attribute: not the triggering div
            return
        self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.

At the end of the parse, the data are left in self.data (a list of strings, empty if no triggering tag was met). Code outside the class can read the list directly from the instance once parsing is done, or you can add accessor methods for the purpose, depending on what exactly your goal is.
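As a rough illustration of reading the result back, here is a Python 3 port of the class above (the answer itself targets the Python 2 HTMLParser module). The sample markup and the get_text accessor are assumptions added for this sketch, not part of the original answer.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class LinksParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recording = 0  # nesting depth inside the triggering div
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            # Nested div inside the triggering one: go one level deeper
            self.recording += 1
            return
        # attributes is a list of (name, value) tuples
        if ('id', 'remository') in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

    def get_text(self):
        # Hypothetical accessor: join the collected chunks into one string
        return ''.join(self.data).strip()


# Made-up sample input: only the div with id="remository" should be captured
sample = ('<div>skipped</div>'
          '<div id="remository">outer <div>inner</div> tail</div>'
          '<p>after</p>')

parser = LinksParser()
parser.feed(sample)
parser.close()

print(parser.data)        # list of text chunks, read straight from the instance
print(parser.get_text())  # → outer inner tail
```

The nested div raises the counter to 2 and its closing tag brings it back to 1, so the trailing text is still captured; the text in the first div and in the p element is ignored.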

The class could easily be made more general by replacing the literal strings 'div', 'id', and 'remository' in the code above with instance attributes self.tag, self.attname and self.attvalue, set by __init__ from arguments passed to it. I skipped that generalization to avoid obscuring the core points: keep a count of nested tags, and accumulate data into a list while the recording state is active.
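A minimal sketch of that generalization, written for Python 3 (html.parser) rather than the Python 2 module used above. The class name TagDataParser and the sample markup are hypothetical; the attribute names self.tag, self.attname and self.attvalue follow the description in the previous paragraph.

```python
from html.parser import HTMLParser


class TagDataParser(HTMLParser):
    """Collect text inside the element <tag attname="attvalue">."""

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag            # element name that triggers recording
        self.attname = attname    # attribute that must be present
        self.attvalue = attvalue  # value that attribute must have
        self.recording = 0        # nesting depth inside the triggering element
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            self.recording += 1
            return
        if (self.attname, self.attvalue) in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Made-up sample input: capture the span whose class is "price"
sample = ('<span class="price">42 <span>EUR</span></span>'
          '<span class="note">n/a</span>')

parser = TagDataParser('span', 'class', 'price')
parser.feed(sample)
parser.close()

text = ''.join(parser.data)
print(text)  # → 42 EUR
```

Calling TagDataParser('div', 'id', 'remository') gives the same behavior as the hard-coded class.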


Have you tried BeautifulSoup?

from bs4 import BeautifulSoup

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup('<div id="remository">20</div>', 'html.parser')
tag = soup.div
print(tag.string)

This prints 20.


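A small note on the base-class initializer call in the first answer:

HTMLParser.HTMLParser.__init__(self)

That form is correct when the module is imported with import HTMLParser. If you import the class directly with from HTMLParser import HTMLParser, as in the code below, it should be

HTMLParser.__init__(self)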

The following worked for me, though:

import urllib2
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.recording = 0
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag == 'required_tag':
            for name, value in attrs:
                if name == 'somename' and value == 'somevalue':
                    print name, value
                    print "Encountered the beginning of a %s tag" % tag
                    self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'required_tag':
            # Only decrement while recording, so the counter never goes negative
            if self.recording:
                self.recording -= 1
            print "Encountered the end of a %s tag" % tag

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

p = MyHTMLParser()
f = urllib2.urlopen('http://www.someurl.com')
html = f.read()
p.feed(html)
p.close()  # flush any buffered input before reading the result
print p.data
