asyncio web scraping 101: fetching multiple urls with aiohttp


I would use gather instead of wait. With return_exceptions=True, gather returns exceptions as ordinary result objects instead of raising them, so you can then check whether each result is an instance of Exception.

    import asyncio

    import aiohttp


    async def fetch(session, url):
        # 10-second total timeout for this request (aiohttp 3.3+)
        timeout = aiohttp.ClientTimeout(total=10)
        async with session.get(url, timeout=timeout) as response:
            return await response.text()


    async def fetch_all(session, urls):
        results = await asyncio.gather(
            *[fetch(session, url) for url in urls],
            return_exceptions=True  # default is False, which would raise
        )
        # for testing purposes only
        # gather returns results in the order of the coroutines
        for url, result in zip(urls, results):
            print('{}: {}'.format(url, 'ERR' if isinstance(result, Exception) else 'OK'))
        return results


    async def main(urls):
        # ClientSession must be used with "async with", not a plain "with"
        async with aiohttp.ClientSession() as session:
            return await fetch_all(session, urls)


    if __name__ == '__main__':
        # the first url fails to resolve; the others still succeed
        urls = [
            'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
        the_results = asyncio.run(main(urls))

Tests:

    $ python test.py
    http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
    http://google.com: OK
    http://twitter.com: OK
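If you want to try the return_exceptions behaviour without hitting the network, here is a minimal sketch using only the standard library. The fake_fetch and split_results helpers and the example URLs are made up for illustration; fake_fetch just stands in for a real aiohttp request.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request: fail for one "bad" host.
    await asyncio.sleep(0)
    if 'invalid' in url:
        raise OSError('Name or service not known')
    return '<html>{}</html>'.format(url)


async def fetch_all(urls):
    # With return_exceptions=True, exceptions land in the result list,
    # in the same order as the input coroutines.
    return await asyncio.gather(
        *[fake_fetch(u) for u in urls], return_exceptions=True)


def split_results(urls, results):
    # Partition the results into successes and failures, keyed by URL.
    ok = {u: r for u, r in zip(urls, results) if not isinstance(r, Exception)}
    failed = {u: r for u, r in zip(urls, results) if isinstance(r, Exception)}
    return ok, failed


urls = ['http://example.invalid', 'http://example.com']
results = asyncio.run(fetch_all(urls))
ok, failed = split_results(urls, results)
print(sorted(ok), sorted(failed))
```

Because gather preserves input order, zipping the urls with the results is enough to know which request failed.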


I am far from an asyncio expert, but if you want to catch the error, you need to catch a socket error:

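    import socket  # needed for socket.error below


    async def fetch(session, url):
        # 10-second total timeout for this request (aiohttp 3.3+)
        timeout = aiohttp.ClientTimeout(total=10)
        try:
            async with session.get(url, timeout=timeout) as response:
                print(response.status == 200)
                return await response.text()
        except socket.error as e:
            # connection failures end up here; the coroutine then returns None
            print(e.strerror)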

Running the code and printing the_results:

    Cannot connect to host sdfkhskhgklhskljhgsdfksjh.com:80 ssl:False [Can not connect to sdfkhskhgklhskljhgsdfksjh.com:80 [Name or service not known]]
    True
    True
    ({<Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!DOCTYPE ht...y>\n</html>\n'>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result=None>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!doctype ht.../body></html>'>}, set())

You can see that we catch the error, and that the remaining calls still succeed and return the HTML. The failed request shows up as a task whose result is None.
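The output above is the (done, pending) pair that asyncio.wait returns, and both are unordered sets. A rough sketch of how you might read the results back in request order follows; fake_fetch is a hypothetical stand-in for the fetch() shown earlier (it swallows the error and returns None), and the URLs are placeholders.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for fetch(): print the error and return None.
    try:
        await asyncio.sleep(0)
        if 'invalid' in url:
            raise OSError('Name or service not known')
        return '<html>ok</html>'
    except OSError as e:
        print(e)
        return None


async def main(urls):
    # Keep the tasks in a list so the original order is not lost.
    tasks = [asyncio.create_task(fake_fetch(u)) for u in urls]
    done, pending = await asyncio.wait(tasks)
    # By default wait() returns once every task has finished,
    # so calling result() on each task is safe here.
    return [t.result() for t in tasks], pending


bodies, pending = asyncio.run(
    main(['http://example.invalid', 'http://example.com']))
print(bodies, pending)
```

A None entry marks a request that failed inside fetch, which is why the middle task in the output above has result=None.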

We should really be catching OSError, since socket.error has been a deprecated alias of OSError since Python 3.3:

    async def fetch(session, url):
        # 10-second total timeout for this request (aiohttp 3.3+)
        timeout = aiohttp.ClientTimeout(total=10)
        try:
            async with session.get(url, timeout=timeout) as response:
                return await response.text()
        except OSError as e:
            print(e)
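If you want to convince yourself of the alias, a quick standard-library check looks roughly like this. The error number and message are just example values I picked, not something taken from a real lookup.

```python
import socket

# On Python 3.3+, socket.error is the very same class object as OSError.
is_alias = socket.error is OSError
print(is_alias)

# DNS lookup failures raise socket.gaierror, which subclasses OSError,
# so "except OSError" covers them as well.
is_subclass = issubclass(socket.gaierror, OSError)
print(is_subclass)

try:
    # Example values only; a real failed lookup supplies its own.
    raise socket.gaierror(-2, 'Name or service not known')
except OSError as e:
    caught = e
    print(type(e).__name__, e)
```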

If you also want to check that the response status is 200, put your if inside the try as well; you can use the reason attribute to get more information:

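    async def fetch(session, url):
        # 10-second total timeout for this request (aiohttp 3.3+)
        timeout = aiohttp.ClientTimeout(total=10)
        try:
            async with session.get(url, timeout=timeout) as response:
                if response.status != 200:
                    print(response.reason)
                # the body is still returned for non-200 responses
                return await response.text()
        except OSError as e:
            print(e.strerror)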
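One possible variation, if printing is not enough, is to return the error details alongside the url so the caller can decide what to do. This is only a sketch: FakeResponse and FakeSession are hypothetical stand-ins that mimic the status, reason, and text() parts of aiohttp used above so the example runs without a network, and returning no body for a non-200 response is my own choice, not what the code above does.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeResponse:
    # Hypothetical stand-in exposing only status, reason and text().
    def __init__(self, status, reason, body):
        self.status = status
        self.reason = reason
        self._body = body

    async def text(self):
        return self._body


class FakeSession:
    # Hypothetical stand-in for aiohttp.ClientSession.
    @asynccontextmanager
    async def get(self, url):
        if 'invalid' in url:
            raise OSError(-2, 'Name or service not known')
        if 'missing' in url:
            yield FakeResponse(404, 'Not Found', '')
        else:
            yield FakeResponse(200, 'OK', '<html>ok</html>')


async def fetch(session, url):
    # Return (url, body, error) instead of printing the problem.
    try:
        async with session.get(url) as response:
            if response.status != 200:
                return url, None, response.reason
            return url, await response.text(), None
    except OSError as e:
        return url, None, e.strerror


async def main():
    session = FakeSession()
    urls = ['http://example.invalid',
            'http://example.com/missing',
            'http://example.com']
    return await asyncio.gather(*[fetch(session, u) for u in urls])


results = asyncio.run(main())
for row in results:
    print(row)
```

With a real ClientSession the fetch function would stay the same; only the fake classes would go away.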