How to scrape data which is in HTML table format?

python pandas selenium web-scraping beautifulsoup

This script goes through every commodity listed on the page and saves the results to a standard CSV file and to a text file delimited by ~|~:

    import requests
    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup

    url = 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'
    detail_url = 'https://www.msamb.com/ApmcDetail/DataGridBind?commodityCode={code}&apmcCode=null'
    headers = {'Referer': 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'}

    # collect (code, name) pairs from the commodity drop-down, skipping the empty option
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    values = [(o['value'], o.text) for o in soup.select('#CommoditiesId option') if o['value']]

    all_data = []
    for code, code_name in values:
        print('Getting info for code {} {}'.format(code, code_name))
        soup = BeautifulSoup(requests.get(detail_url.format(code=code), headers=headers).content, 'html.parser')
        current_date = ''
        for row in soup.select('tr'):
            # a row with a colspan cell holds the date for the rows that follow it
            if row.select_one('td[colspan]'):
                current_date = row.get_text(strip=True)
            else:
                row = [td.get_text(strip=True) for td in row.select('td')]
                all_data.append({
                    'Date': current_date,
                    'Commodity': code_name,
                    'APMC': row[0],
                    'Variety': row[1],
                    'Unit': row[2],
                    'Quantity': row[3],
                    'Lrate': row[4],
                    'Hrate': row[5],
                    'Modal': row[6],
                })

    df = pd.DataFrame(all_data)
    print(df)

    df.to_csv('data.csv')                                       # <-- saves standard csv
    np.savetxt('data.txt', df, delimiter='~|~', fmt='%s')       # <-- saves .txt file with '~|~' delimiter

Prints:

    ...
    Getting info for code 08071 TOMATO
    Getting info for code 10006 TURMERIC
    Getting info for code 08075 WAL BHAJI
    Getting info for code 08076 WAL PAPDI
    Getting info for code 08077 WALVAD
    Getting info for code 07011 WATER MELON
    Getting info for code 02009 WHEAT(HUSKED)
    Getting info for code 02012 WHEAT(UNHUSKED)
                Date        Commodity          APMC Variety     Unit Quantity Lrate Hrate Modal
    0     18/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
    1     16/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
    2     15/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG      100     9     9     9
    3     13/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       16     7     7     7
    4     13/07/2020      AMBAT CHUKA          PUNE   LOCAL      NAG     2400     4     7     5
    ...          ...              ...           ...     ...      ...      ...   ...   ...   ...
    4893  12/07/2020    WHEAT(HUSKED)        SHIRUR   No. 2  QUINTAL        2  1400  1400  1400
    4894  17/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      863  4000  4600  4300
    4895  16/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      475  4000  4500  4250
    4896  15/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      680  3900  4400  4150
    4897  13/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL     1589  3900  4450  4175

    [4898 rows x 9 columns]

Saves data.txt:

    0~|~18/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
    1~|~16/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
    2~|~15/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~100~|~9~|~9~|~9
    3~|~13/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~16~|~7~|~7~|~7
    4~|~13/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~2400~|~4~|~7~|~5
    5~|~12/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~1700~|~3~|~8~|~5
    6~|~19/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~3~|~9000~|~14000~|~11500
    7~|~18/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~12~|~8500~|~15000~|~11750
    8~|~18/07/2020~|~APPLE~|~NASHIK~|~DILICIOUS- No.1~|~QUINTAL~|~110~|~9000~|~16000~|~13000
    9~|~18/07/2020~|~APPLE~|~SANGLI-PHALE BHAJIPALAM~|~LOCAL~|~QUINTAL~|~8~|~12000~|~16000~|~14000
    10~|~17/07/2020~|~APPLE~|~MUMBAI-FRUIT MARKET~|~----~|~QUINTAL~|~264~|~9000~|~12000~|~10500
    ...
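The np.savetxt call in the script passes no header argument, so data.txt carries no column names. If a header line is wanted, one possible approach is to build the lines with plain string joins. This is only a sketch: the helper name to_delimited_lines and the two sample records are made up for illustration (the values are copied from the output above), and this variant leaves out the row index.

```python
DELIMITER = '~|~'
COLUMNS = ['Date', 'Commodity', 'APMC', 'Variety', 'Unit',
           'Quantity', 'Lrate', 'Hrate', 'Modal']


def to_delimited_lines(records, columns=COLUMNS, delimiter=DELIMITER):
    """Return a header line followed by one delimited line per record.

    Hypothetical helper, not part of the original script.
    """
    lines = [delimiter.join(columns)]
    for rec in records:
        # str() so that numeric values can be joined as well
        lines.append(delimiter.join(str(rec[col]) for col in columns))
    return lines


# Two sample records in the same shape as the dicts appended to all_data.
records = [
    {'Date': '18/07/2020', 'Commodity': 'AMBAT CHUKA', 'APMC': 'PANDHARPUR',
     'Variety': '----', 'Unit': 'NAG', 'Quantity': '50',
     'Lrate': '5', 'Hrate': '5', 'Modal': '5'},
    {'Date': '13/07/2020', 'Commodity': 'AMBAT CHUKA', 'APMC': 'PUNE',
     'Variety': 'LOCAL', 'Unit': 'NAG', 'Quantity': '2400',
     'Lrate': '4', 'Hrate': '7', 'Modal': '5'},
]

lines = to_delimited_lines(records)
print('\n'.join(lines))
```

In the real script, all_data could be passed in place of the sample records and the joined text written to a file with open(...).write(...).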

[Screenshot: data.csv opened in LibreOffice]
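The date-carrying loop in the script can be tried offline before hitting the site. The sketch below runs the same logic on a small hand-written table. The markup in SAMPLE_HTML is an assumption about what the DataGridBind endpoint returns (one colspan cell per date, followed by data rows), and parse_rows is a hypothetical helper name. Unlike the script, it also skips rows that do not have exactly seven cells, such as a header row.

```python
from bs4 import BeautifulSoup

# Hand-written sample; assumed to resemble the real response, not copied from it.
SAMPLE_HTML = """
<table>
  <tr><th>APMC</th><th>Variety</th><th>Unit</th><th>Quantity</th>
      <th>Lrate</th><th>Hrate</th><th>Modal</th></tr>
  <tr><td colspan="7">18/07/2020</td></tr>
  <tr><td>PANDHARPUR</td><td>----</td><td>NAG</td><td>50</td>
      <td>5</td><td>5</td><td>5</td></tr>
  <tr><td colspan="7">13/07/2020</td></tr>
  <tr><td>PANDHARPUR</td><td>----</td><td>NAG</td><td>16</td>
      <td>7</td><td>7</td><td>7</td></tr>
  <tr><td>PUNE</td><td>LOCAL</td><td>NAG</td><td>2400</td>
      <td>4</td><td>7</td><td>5</td></tr>
</table>
"""

COLUMNS = ['APMC', 'Variety', 'Unit', 'Quantity', 'Lrate', 'Hrate', 'Modal']


def parse_rows(html, commodity):
    """Turn date-grouped table rows into flat dicts (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    current_date = ''
    for tr in soup.select('tr'):
        # A cell with a colspan attribute marks a date row; remember its text.
        if tr.select_one('td[colspan]'):
            current_date = tr.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if len(cells) != len(COLUMNS):
            continue  # header row (<th> only) or malformed row
        record = {'Date': current_date, 'Commodity': commodity}
        record.update(zip(COLUMNS, cells))
        rows.append(record)
    return rows


records = parse_rows(SAMPLE_HTML, 'AMBAT CHUKA')
for record in records:
    print(record)
```

Each data row picks up the most recent date row above it, which is why the two rows under 13/07/2020 share that date.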

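You can save the rows to a text file and read it back with something like df = pd.read_csv("out.txt", sep=r"~\|~", engine="python"). A separator longer than one character is treated as a regular expression, so the | has to be escaped. Individual columns are then available as

    date = df['Date']
    commodity = df['Commodity']

Alternatively, you can append each APMC row to a list and build the DataFrame from that list at the end.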
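The read_csv idea can be sketched as follows, assuming the text file is laid out like the data.txt shown earlier: a leading index column and no header line. The column names passed via names are therefore supplied by hand, and the in-memory sample stands in for the real file. The separator is written as an escaped regular expression because pandas interprets multi-character separators as regexes, and an unescaped ~|~ would split on every ~.

```python
import io

import pandas as pd

# A few lines in the layout of data.txt; io.StringIO stands in for the file.
sample = (
    "0~|~18/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5\n"
    "1~|~16/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5\n"
    "4~|~13/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~2400~|~4~|~7~|~5\n"
)

# The file has no header, so the names are given explicitly (assumed layout).
names = ['Index', 'Date', 'Commodity', 'APMC', 'Variety', 'Unit',
         'Quantity', 'Lrate', 'Hrate', 'Modal']

df = pd.read_csv(
    io.StringIO(sample),   # replace with the path to the text file
    sep=r'~\|~',           # escaped: a bare '|' would mean regex alternation
    engine='python',       # regex separators need the python parser engine
    header=None,
    names=names,
    index_col='Index',
)

date = df['Date']
commodity = df['Commodity']
print(df)
```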

CodeHunter
