error with parse function in lxml
lxml.html.parse does not fetch URLs. Here's how to do it with urllib2:
>>> from urllib2 import urlopen
>>> from lxml.html import parse
>>> page = urlopen('http://www.google.com')
>>> p = parse(page)
>>> p.getroot()
<Element html at 1304050>
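For reference, the same pattern in Python 3 (a sketch of my own, not part of the original answer): urllib2 became urllib.request, and lxml.html.parse accepts any file-like object. Here a BytesIO stands in for the network response so the example runs without a connection:

```python
from io import BytesIO
from lxml.html import parse

# In real use this would be:
#   from urllib.request import urlopen
#   page = urlopen('http://www.google.com')
# A BytesIO with a tiny document stands in for the HTTP response here.
page = BytesIO(b'<html><body><p>hello</p></body></html>')

doc = parse(page)    # parse() reads from the file-like object
root = doc.getroot()
print(root.tag)      # -> 'html'
```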
Update
Steven is right: lxml.etree.parse should accept and load URLs, and I missed that. I've tried deleting this answer, but I'm not allowed to, so I retract my statement about it not fetching URLs.
According to the API docs, it should work: http://lxml.de/api/lxml.html-module.html#parse
This seems to be a bug in lxml 2.2.2. I just tested on Windows with Python 2.6 and 2.7, and it does work with lxml 2.3.0.
So: upgrade your lxml and you'll be fine.
I don't know exactly which versions of lxml are affected, but I believe the problem was not so much with lxml itself as with the version of libxml2 used to build the Windows binary (certain versions of libxml2 had a problem with HTTP on Windows).
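Since the fix depends on which libraries you actually have, it helps to check: lxml exposes both its own version and the libxml2 version it was built against as tuples on lxml.etree. A quick sketch:

```python
from lxml import etree

# lxml's own version, as a tuple, e.g. (2, 3, 0, 0)
print(etree.LXML_VERSION)

# The libxml2 version lxml is running against, e.g. (2, 7, 8)
print(etree.LIBXML_VERSION)

# The same information as a string is in etree.__version__
print(etree.__version__)
```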
Since line breaks are not allowed in comments, here's my implementation of MattH's answer:
from urllib2 import urlopen
from lxml.html import parse

site_url = 'http://www.google.com'
try:
    page = parse(site_url).getroot()
except IOError:
    page = parse(urlopen(site_url)).getroot()
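An alternative sketch (my own suggestion, not from the answers above): if you fetch the page content yourself anyway, you can skip parse() and feed the raw bytes to lxml.html.fromstring, which returns the root element directly:

```python
from lxml.html import fromstring

# Pretend this came back from urlopen(...).read()
html = b'<html><body><a href="/x">link</a></body></html>'

root = fromstring(html)  # no file object or URL handling needed
print(root.tag)          # -> 'html'
print([a.get('href') for a in root.iter('a')])  # -> ['/x']
```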