How to scrape a website that requires login using Python and BeautifulSoup?
You can use mechanize:
```python
import mechanize
import http.cookiejar  # cookielib in Python 2
from bs4 import BeautifulSoup

cj = http.cookiejar.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()
print(br.response().read())
```
Or urllib - Login to website using urllib2
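With urllib, the same idea works using the standard library only: a cookie jar attached to an opener keeps the session cookie set by the login response. A minimal sketch, where the login URL and form field names are hypothetical placeholders:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical login URL and field names; inspect the real login form
login_url = 'http://example.com/login'
payload = urllib.parse.urlencode({'username': 'your_username',
                                  'password': 'your_password'}).encode()

# The cookie jar stores whatever cookies the login response sets
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Supplying data makes this a POST request
req = urllib.request.Request(login_url, data=payload)
print(req.get_method())  # POST
```

Calling `opener.open(req)` would perform the login; subsequent `opener.open()` calls on the same opener reuse the cookies stored in `cj`.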
There is a simpler way, from my point of view, that gets you there without selenium or mechanize, or other third-party tools, albeit it is semi-automated.
Basically, when you log into a site in the normal way, you identify yourself in a unique way using your credentials, and the same identity is then used for every subsequent interaction; it is stored in cookies and headers for a brief period of time.

What you need to do is use those same cookies and headers when you make your HTTP requests, and you'll be in.
To replicate that, follow these steps:
- In your browser, open the developer tools
- Go to the site and log in
- After the login, go to the Network tab and refresh the page
- At this point, you should see a list of requests, the top one being the actual site. That one will be our focus, because it contains the data with the identity we can use for Python and BeautifulSoup to scrape it
- Right-click the site request (the top one), hover over Copy, and then choose Copy as cURL
- Then go to this site, which converts cURL into Python requests code: https://curl.trillworks.com/
- Take the generated Python code and use its cookies and headers to proceed with the scraping
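The converted code will contain your session's cookies and headers; a minimal sketch of how you would then use them with requests (the cookie and header values below are hypothetical placeholders for whatever you copied from your browser):

```python
import requests

# Hypothetical values; paste the real ones copied from your browser session
cookies = {'sessionid': 'abc123'}
headers = {'User-Agent': 'Mozilla/5.0'}

s = requests.Session()
s.headers.update(headers)
s.cookies.update(cookies)

# Every request prepared on this session carries the copied identity
req = s.prepare_request(requests.Request('GET', 'http://example.com/profile'))
print(req.headers['Cookie'])  # sessionid=abc123
```

From there, `s.get(...)` calls send the same cookies and headers, so the site treats the script as your logged-in browser session, at least until the cookies expire.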
If you go for selenium, then you can do something like below:
```python
from selenium import webdriver

# Open Chrome; use webdriver.Firefox() instead if you prefer Firefox
driver = webdriver.Chrome()

# Navigate to the login page first, otherwise there is nothing to find
driver.get("http://example.com/login")

# The element IDs depend on the page; inspect the login form for the real ones
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element_by_id("submit_btn").click()
```
However, if you're adamant that you're only going to use BeautifulSoup, you can do that together with a library like requests or urllib. Basically, all you have to do is POST the credentials as a payload to the login URL.
```python
import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # Post through the session so the login cookies persist
    response = s.post(login_url, data=data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)
```