How to scrape a website which requires login using python and beautifulsoup?  [python]



You can use mechanize:

```python
import mechanize
import http.cookiejar as cookielib  # named 'cookielib' on Python 2
from bs4 import BeautifulSoup

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")
br.select_form(nr=0)  # select the first form on the page
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()
print(br.response().read())
```

Or use urllib, as in Login to website using urllib2


There is a simpler way, from my point of view, that gets you there without selenium or mechanize or any other third-party tool, albeit it is semi-automated.

Basically, when you log into a site in the normal way, you identify yourself uniquely with your credentials; that same identity is then used for every subsequent interaction, and it is stored in cookies and headers for a brief period of time.

What you need to do is send those same cookies and headers with your HTTP requests, and you'll be in.

To replicate that, follow these steps:

  1. In your browser, open the developer tools
  2. Go to the site, and login
  3. After the login, go to the network tab, and then refresh the page
    At this point, you should see a list of requests; the top one is the actual site, and that will be our focus, because it carries the identity data that Python and BeautifulSoup can reuse for scraping
  4. Right click the site request (the top one), hover over copy, and then copy as cURL

  5. Then go to this site which converts cURL into python requests: https://curl.trillworks.com/
  6. Take the python code and use the generated cookies and headers to proceed with the scraping
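The code that comes out of the converter can be sketched roughly as follows. The cookie and header names below are placeholders; your own "copy as cURL" output will contain the real values from your logged-in browser session:

```python
import requests

# Hypothetical values -- replace with the cookies and headers copied
# from your own logged-in session in the browser's network tab.
cookies = {
    'sessionid': 'abc123',   # example cookie name/value, not real
    'csrftoken': 'def456',
}
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'http://example.com/login',
}

# Attach the identity to a session so every request carries it.
session = requests.Session()
session.headers.update(headers)
session.cookies.update(cookies)

# Inspect a prepared request to confirm the cookies ride along.
prepared = session.prepare_request(
    requests.Request('GET', 'http://example.com/account'))
print(prepared.headers['Cookie'])
```

Note that the copied identity only lasts as long as the server keeps the session alive, so you may need to repeat the copy step when the cookies expire.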


If you go for selenium, then you can do something like below:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open Chrome (use webdriver.Firefox() instead if you prefer Firefox)
driver = webdriver.Chrome()

# Navigate to the login page before locating its elements
driver.get("http://example.com/login")

username = driver.find_element(By.ID, "username")
password = driver.find_element(By.ID, "password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element(By.ID, "submit_btn").click()
```

However, if you're adamant that you're only going to use BeautifulSoup, you can do that with a library like requests or urllib. Basically, all you have to do is POST the credentials as a payload to the login URL.

```python
import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # Post through the session so the login cookies persist
    response = s.post(login_url, data=data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)
```