Python Web Crawlers and "getting" html source code


Use Python 2.7; it has more 3rd-party libraries at the moment. (Edit: see below.)

I recommend using the stdlib module urllib2; it will let you fetch web resources comfortably. Example:

import urllib2

response = urllib2.urlopen("http://google.de")
page_source = response.read()

For parsing the code, have a look at BeautifulSoup.
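A minimal sketch of parsing with BeautifulSoup (installed via `pip install beautifulsoup4`); the HTML snippet here is made up for illustration:

```python
from bs4 import BeautifulSoup

# a tiny stand-in for a downloaded page
html = '<html><head><title>Example</title></head><body><a href="/a">A</a></body></html>'

soup = BeautifulSoup(html, "html.parser")

# pull out the title text and all link targets
title = soup.title.string
links = [a["href"] for a in soup.find_all("a")]
```

In real use you would pass `page_source` from the download step instead of the hard-coded string.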

BTW, what exactly do you want to do?

Just for background, I need to download a page and replace any img with ones I have
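For that specific task, a hedged sketch with BeautifulSoup: rewrite every `img` tag's `src` to point at a local copy. The `local/` prefix and the sample HTML are assumptions for illustration, not part of the original question:

```python
from bs4 import BeautifulSoup

# stand-in for a downloaded page containing a remote image
html = '<p><img src="http://example.com/remote.png"> some text</p>'

soup = BeautifulSoup(html, "html.parser")

# point each img at a hypothetical local file with the same basename
for img in soup.find_all("img"):
    img["src"] = "local/" + img["src"].rsplit("/", 1)[-1]

result = str(soup)
```

You would then write `result` back out as the modified page.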

Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library which is easier to use than urllib2.


An example with Python 3 and the requests library, as mentioned by @leoluk:

pip install requests

Script req.py:

import requests

url = 'http://localhost'

# in case you need a session
cd = {'sessionid': '123..'}
r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)

print(r.content)

Now, execute it and you will get the HTML source of localhost:

python3 req.py


If you are using Python 3.x you don't need to install any libraries; this is built into the standard library. The old urllib2 module has been split into urllib.request and urllib.error:

from urllib import request

response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)