Unable to fetch all the links from a webpage using requests


If you want to do it with requests, consider replaying the XHR/Ajax HTTP requests that the page sends when it lazy-loads more content. You can inspect these requests in the Network tab of your browser's developer tools.

You then send the same queries to the instagram.com server yourself, similar to the approach described in the post "Scrape a JS Lazy load page by Python requests".
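As a rough illustration of replaying lazy-load requests, here is a minimal sketch of following a cursor-paginated JSON endpoint. The function name `collect_links`, the `after` query parameter, and the `items` / `link` / `next_cursor` response fields are all assumptions made for the example; the real names have to be copied from the requests your browser actually sends, and Instagram's are likely to differ.

```python
from typing import Any, List, Optional


def collect_links(session: Any, url: str, max_pages: int = 5) -> List[str]:
    """Follow a cursor-paginated JSON endpoint and gather item links.

    `session` is any object with a requests-style
    `get(url, params=..., timeout=...)` method, e.g. `requests.Session()`.
    """
    links: List[str] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        # Hypothetical query parameter: use the one seen in the real XHR call.
        params = {"after": cursor} if cursor else {}
        response = session.get(url, params=params, timeout=10)
        response.raise_for_status()
        payload = response.json()
        # Hypothetical response shape:
        # {"items": [{"link": "..."}], "next_cursor": "..." or None}
        links.extend(item["link"] for item in payload.get("items", []))
        cursor = payload.get("next_cursor")
        if not cursor:
            break  # the server reports no further pages
    return links
```

With the requests library installed, this could be called as `collect_links(requests.Session(), url)`, where `url` is the endpoint you found while watching the page load more content.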

Disclaimer

You might not be able to complete this task because of dynamic cookie values or other anti-scraping measures imposed by Instagram.


The Instagram web page uses lazy loading to load the images. You can overcome this in 2 ways:

  1. Use the Instagram API as mentioned in the comments
  2. Use a tool like Selenium to scroll to the bottom of the page so that all the images load, and then fetch the links

The first method is the better way to do it.
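For the second method, the scrolling could be sketched roughly as follows. The function name `scroll_and_collect_links`, the pause length, and the scroll limit are assumptions for illustration; `driver` is expected to be a Selenium WebDriver that has already opened the page. Links are gathered after every scroll because some pages remove off-screen elements from the DOM as you scroll.

```python
import time
from typing import Any, List

# JavaScript that returns the href of every anchor currently in the DOM.
JS_HREFS = "return Array.from(document.querySelectorAll('a'), a => a.href);"


def scroll_and_collect_links(driver: Any, pause: float = 2.0,
                             max_scrolls: int = 20) -> List[str]:
    """Scroll to the bottom repeatedly and collect unique links in order."""
    seen = {}  # dict keeps insertion order and drops duplicates
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        for href in driver.execute_script(JS_HREFS):
            seen.setdefault(href, None)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazy loader time to fetch more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page stopped growing, so assume nothing more will load
        last_height = new_height
    # Pick up anything that appeared after the final scroll.
    for href in driver.execute_script(JS_HREFS):
        seen.setdefault(href, None)
    return list(seen)
```

A fixed pause is the simplest choice but not the most reliable one; on a slow connection the page height may not have changed yet when it is checked, so the pause may need tuning.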


If you are building a commercial product, I suggest using the Instagram Graph API, since using Instagram public data requires consent under GDPR. This API will make your work easier, but it comes with limits: for example, you can run only 30 searches within a 7-day period per user token.

If you are building a non-commercial tool, you have two approaches.

  1. Scrape the Instagram web page directly. As mentioned in the answers above, you can use Selenium to automate page interactions, since the web page uses JavaScript to generate image URLs. The disadvantage of this method is that Instagram and Facebook use anti-scraping measures, such as wrapping HTML elements in dynamically generated class names and changing XPaths frequently. You might have to spend a lot of time writing the code and fixing it later when these things change.

  2. Use a third-party library built to scrape Instagram data. There are many open-source libraries on GitHub, and instaloader is my favorite. You can download all the results of a hashtag search with a single command. The library downloads not only the images but also the JSON data of the post each image belongs to. Since the library has maintainers, you don't have to worry about keeping up with later changes to the Instagram web page. I recommend this method in your case.
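On the command line, the single command mentioned above is along the lines of `instaloader "#somehashtag"`. If you would rather drive the library from Python and cap the number of posts, a small helper might look like the sketch below. The helper name `download_first_posts` and its parameters are assumptions for this example; it relies only on a loader object exposing `download_post(post, target=...)`, which instaloader's `Instaloader` class provides.

```python
from itertools import islice
from typing import Any, Iterable


def download_first_posts(loader: Any, posts: Iterable[Any],
                         target: str, limit: int = 10) -> int:
    """Download at most `limit` posts and return how many were newly saved.

    `loader` is expected to behave like `instaloader.Instaloader()`, whose
    `download_post` returns True when something was actually downloaded.
    """
    downloaded = 0
    for post in islice(posts, limit):
        if loader.download_post(post, target=target):
            downloaded += 1
    return downloaded
```

With instaloader installed, the pieces could be wired together by creating `loader = instaloader.Instaloader()`, fetching `hashtag = instaloader.Hashtag.from_name(loader.context, "somehashtag")`, and calling `download_first_posts(loader, hashtag.get_posts(), target="#somehashtag")`. Hashtag queries may require a logged-in session, depending on what Instagram currently allows.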