asyncio web scraping 101: fetching multiple urls with aiohttp
I would use gather instead of wait. With return_exceptions=True, gather returns exceptions as objects instead of raising them, so you can then check whether each result is an instance of an exception.
    import asyncio

    import aiohttp


    async def fetch(session, url):
        # allow at most 10 seconds for the whole request
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
            return await response.text()


    async def fetch_all(session, urls):
        results = await asyncio.gather(
            *[fetch(session, url) for url in urls],
            return_exceptions=True  # default is False, which would raise
        )

        # for testing purposes only
        # gather returns results in the order of the coroutines
        for idx, url in enumerate(urls):
            print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
        return results


    async def main(urls):
        async with aiohttp.ClientSession() as session:
            return await fetch_all(session, urls)


    if __name__ == '__main__':
        # without return_exceptions=True, this breaks because of the first url
        urls = [
            'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
        the_results = asyncio.run(main(urls))
Tests:
    $ python test.py
    http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
    http://google.com: OK
    http://twitter.com: OK
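To act on the results rather than just print them, the same isinstance check can split them into successes and failures. This is only a minimal sketch: `fake_fetch` and `fetch_and_split` are hypothetical names, and `fake_fetch` is a stand-in that does no network access, so the example runs without aiohttp.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for a real request: fails for one URL, succeeds otherwise
    await asyncio.sleep(0)
    if 'bad' in url:
        raise OSError('cannot connect to {}'.format(url))
    return '<html>{}</html>'.format(url)


async def fetch_and_split(urls):
    results = await asyncio.gather(
        *[fake_fetch(url) for url in urls],
        return_exceptions=True
    )
    ok, failed = {}, {}
    # gather preserves input order, so zip pairs each URL with its own result
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = result
        else:
            ok[url] = result
    return ok, failed


ok, failed = asyncio.run(
    fetch_and_split(['http://bad.invalid', 'http://example.com']))
```

Here `ok` maps each successful URL to its body and `failed` maps each failing URL to the exception object, which can then be logged or retried.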
I am far from an asyncio expert, but if you want to catch the error you need to catch a socket error:
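    import socket

    import aiohttp


    async def fetch(session, url):
        try:
            # allow at most 10 seconds for the whole request
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
                print(response.status == 200)
                return await response.text()
        except socket.error as e:
            # connection-level failures (e.g. a failed DNS lookup) end up here
            print(e.strerror)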
Running the code and printing the_results:
    Cannot connect to host sdfkhskhgklhskljhgsdfksjh.com:80 ssl:False [Can not connect to sdfkhskhgklhskljhgsdfksjh.com:80 [Name or service not known]]
    True
    True
    ({<Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!DOCTYPE ht...y>\n</html>\n'>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result=None>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!doctype ht.../body></html>'>}, set())
You can see that we catch the error and that the remaining calls still succeed, returning the HTML. The failed request shows up as a task with result=None.
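The printed value above is the (done, pending) pair that asyncio.wait returns. A rough sketch of pulling the actual page contents out of it, assuming a hypothetical `fake_fetch` stand-in that returns None on failure the same way the `fetch` above does (no network access, so aiohttp is not needed to run it):

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for fetch(): prints the error and returns None on failure
    await asyncio.sleep(0)
    try:
        if 'bad' in url:
            raise OSError('Name or service not known')
        return '<html>{}</html>'.format(url)
    except OSError as e:
        print(e)


async def main(urls):
    tasks = [asyncio.ensure_future(fake_fetch(url)) for url in urls]
    done, pending = await asyncio.wait(tasks)
    # done is an unordered set, so read results from the original list to keep URL order
    pages = [task.result() for task in tasks]
    return pages, pending


pages, pending = asyncio.run(main(['http://bad.invalid', 'http://example.com']))
```

Because the error is swallowed inside the coroutine, a None entry in `pages` is the only sign that a URL failed, which is worth keeping in mind when choosing between this approach and gather with return_exceptions=True.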
We should really be catching OSError, since socket.error has been a deprecated alias of OSError since Python 3.3:
    import aiohttp


    async def fetch(session, url):
        try:
            # allow at most 10 seconds for the whole request
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
                return await response.text()
        except OSError as e:
            print(e)
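The alias relationship is easy to confirm with the standard library alone. A small sketch (the variable names are just for illustration, and the error number and message are made-up sample values):

```python
import socket

# socket.error is the very same class object as OSError (Python 3.3+)
same_class = socket.error is OSError

# DNS lookup failures raise socket.gaierror, which is a subclass of OSError
dns_is_oserror = issubclass(socket.gaierror, OSError)

try:
    # sample errno/message pair, raised by hand instead of doing a real lookup
    raise socket.gaierror(-2, 'Name or service not known')
except OSError as e:
    caught = e.strerror
```

So an `except OSError` clause covers everything `except socket.error` did, including name resolution failures.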
If you also want to check that the response status is 200, put your if inside the try as well; you can use the reason attribute to get more information:
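    import aiohttp


    async def fetch(session, url):
        try:
            # allow at most 10 seconds for the whole request
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
                if response.status != 200:
                    # reason holds the HTTP status text, e.g. 'Not Found'
                    print(response.reason)
                return await response.text()
        except OSError as e:
            print(e.strerror)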