error with parse function in lxml


lxml.html.parse does not fetch URLs.

Here's how to do it with urllib2:

>>> from urllib2 import urlopen
>>> from lxml.html import parse
>>> page = urlopen('http://www.google.com')
>>> p = parse(page)
>>> p.getroot()
<Element html at 1304050>

Update
Steven is right. lxml.etree.parse should accept and load URLs. I missed that. I've tried deleting this answer, but I'm not allowed.

I retract my statement about it not fetching URLs.


According to the API docs, it should work: http://lxml.de/api/lxml.html-module.html#parse

This seems to be a bug in lxml 2.2.2. I just tested on Windows with Python 2.6 and 2.7, and it does work with lxml 2.3.0.

So: upgrade your lxml and you'll be fine.

I don't know exactly which versions of lxml are affected, but I believe the problem was not so much with lxml itself as with the version of libxml2 used to build the Windows binary (certain versions of libxml2 had a problem with HTTP on Windows).
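
To see which lxml and libxml2 versions an installation actually uses, something like the following should help. It is only a sketch: the helper names lxml_versions and older_than are made up for illustration, and the (2, 3, 0) threshold is simply the release reported to work above, not a confirmed boundary for the bug.

```python
def lxml_versions():
    """Return the lxml and libxml2 version tuples, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
    }


def older_than(version, minimum=(2, 3, 0)):
    """True if a version tuple is below minimum.

    The default of 2.3.0 is an assumption taken from the test reported
    above; the exact set of affected releases is not known.
    """
    return tuple(version[:3]) < tuple(minimum)
```

Calling lxml_versions() in the interpreter shows the three tuples, and older_than(lxml_versions()["lxml"]) gives a rough hint about whether an upgrade is worth trying. Since the suspected culprit is libxml2, the libxml2 entries are probably the more interesting ones.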


Since line breaks are not allowed in comments, here's my implementation of MattH's answer:

from urllib2 import urlopen
from lxml.html import parse

site_url = 'http://www.google.com'
try:
    page = parse(site_url).getroot()
except IOError:
    page = parse(urlopen(site_url)).getroot()
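
The same fallback can be wrapped in a function so it is reusable. This is a rough sketch rather than a definitive version: get_root and its parse and opener parameters are names invented here, and the Python 3 import branch is an addition that the original answers (written for Python 2) do not cover.

```python
try:
    from urllib2 import urlopen          # Python 2, as in the answers above
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent (assumption)


def get_root(site_url, parse=None, opener=urlopen):
    """Return the root element of the HTML document at site_url.

    parse and opener are hypothetical hooks, injectable so the fallback
    can be exercised without lxml or network access. By default
    lxml.html.parse and urlopen are used.
    """
    if parse is None:
        import lxml.html                 # imported lazily; assumes lxml is installed
        parse = lxml.html.parse
    try:
        # Let lxml (via libxml2) fetch the URL itself.
        return parse(site_url).getroot()
    except IOError:
        # Affected builds fail to load the URL; fetch the page ourselves
        # and hand the open stream to the parser instead.
        return parse(opener(site_url)).getroot()
```

With that in place, page = get_root('http://www.google.com') behaves like the snippet above: it tries the URL directly and only falls back to urlopen when lxml raises IOError.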