How to scrape data that is in HTML table format?
This script goes through all commodity pages and saves the data both to a standard CSV file and to a ~|~-delimited text file:
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'
detail_url = 'https://www.msamb.com/ApmcDetail/DataGridBind?commodityCode={code}&apmcCode=null'
headers = {'Referer': 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'}

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# collect (code, name) pairs from the commodity <select> dropdown
values = [(o['value'], o.text) for o in soup.select('#CommoditiesId option') if o['value']]

all_data = []
for code, code_name in values:
    print('Getting info for code {} {}'.format(code, code_name))

    soup = BeautifulSoup(requests.get(detail_url.format(code=code), headers=headers).content, 'html.parser')

    current_date = ''
    for row in soup.select('tr'):
        # rows with a td[colspan] are date headers for the rows below them
        if row.select_one('td[colspan]'):
            current_date = row.get_text(strip=True)
        else:
            row = [td.get_text(strip=True) for td in row.select('td')]
            all_data.append({
                'Date': current_date,
                'Commodity': code_name,
                'APMC': row[0],
                'Variety': row[1],
                'Unit': row[2],
                'Quantity': row[3],
                'Lrate': row[4],
                'Hrate': row[5],
                'Modal': row[6],
            })

df = pd.DataFrame(all_data)
print(df)

df.to_csv('data.csv')                                  # <-- saves standard csv
np.savetxt('data.txt', df, delimiter='~|~', fmt='%s')  # <-- saves .txt file with '~|~' delimiter
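The key trick in the loop is that the site's table interleaves date header rows (a single `td` with a `colspan`) with ordinary data rows, so the script carries the last seen date forward. A minimal, self-contained sketch of that technique on an inline HTML snippet (the snippet itself is illustrative, not the site's exact markup):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td colspan="7">18/07/2020</td></tr>
  <tr><td>PANDHARPUR</td><td>----</td><td>NAG</td><td>50</td><td>5</td><td>5</td><td>5</td></tr>
  <tr><td colspan="7">16/07/2020</td></tr>
  <tr><td>PUNE</td><td>LOCAL</td><td>NAG</td><td>2400</td><td>4</td><td>7</td><td>5</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
records = []
current_date = ''
for row in soup.select('tr'):
    if row.select_one('td[colspan]'):          # date header row
        current_date = row.get_text(strip=True)
    else:                                      # data row grouped under that date
        cells = [td.get_text(strip=True) for td in row.select('td')]
        records.append([current_date] + cells)

print(records)
# [['18/07/2020', 'PANDHARPUR', '----', 'NAG', '50', '5', '5', '5'],
#  ['16/07/2020', 'PUNE', 'LOCAL', 'NAG', '2400', '4', '7', '5']]
```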
Prints:
...
Getting info for code 08071 TOMATO
Getting info for code 10006 TURMERIC
Getting info for code 08075 WAL BHAJI
Getting info for code 08076 WAL PAPDI
Getting info for code 08077 WALVAD
Getting info for code 07011 WATER MELON
Getting info for code 02009 WHEAT(HUSKED)
Getting info for code 02012 WHEAT(UNHUSKED)

            Date        Commodity          APMC Variety     Unit Quantity Lrate Hrate Modal
0     18/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
1     16/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
2     15/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG      100     9     9     9
3     13/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       16     7     7     7
4     13/07/2020      AMBAT CHUKA          PUNE   LOCAL      NAG     2400     4     7     5
...          ...              ...           ...     ...      ...      ...   ...   ...   ...
4893  12/07/2020    WHEAT(HUSKED)        SHIRUR   No. 2  QUINTAL        2  1400  1400  1400
4894  17/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      863  4000  4600  4300
4895  16/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      475  4000  4500  4250
4896  15/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      680  3900  4400  4150
4897  13/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL     1589  3900  4450  4175

[4898 rows x 9 columns]
Saves data.txt:
0~|~18/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
1~|~16/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
2~|~15/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~100~|~9~|~9~|~9
3~|~13/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~16~|~7~|~7~|~7
4~|~13/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~2400~|~4~|~7~|~5
5~|~12/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~1700~|~3~|~8~|~5
6~|~19/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~3~|~9000~|~14000~|~11500
7~|~18/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~12~|~8500~|~15000~|~11750
8~|~18/07/2020~|~APPLE~|~NASHIK~|~DILICIOUS- No.1~|~QUINTAL~|~110~|~9000~|~16000~|~13000
9~|~18/07/2020~|~APPLE~|~SANGLI-PHALE BHAJIPALAM~|~LOCAL~|~QUINTAL~|~8~|~12000~|~16000~|~14000
10~|~17/07/2020~|~APPLE~|~MUMBAI-FRUIT MARKET~|~----~|~QUINTAL~|~264~|~9000~|~12000~|~10500
...
Screenshot of the CSV file opened in LibreOffice (image not reproduced here):
You can read the saved text file back into a DataFrame with df = pd.read_csv('out.txt', sep=r'~\|~', engine='python'). Note the pipe must be escaped: pandas treats multi-character separators as regular expressions, and an unescaped '~|~' would mean "~ or ~". You can then access individual columns, e.g. date = df['Date'] or commodity = df['Commodity']. Alternatively, append each record to a list while scraping and build the DataFrame once at the end, which is what the script above already does with all_data.
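A small round-trip sketch of writing and re-reading the ~|~-delimited file (the tiny DataFrame and the file name out.txt are illustrative; the escaped separator and engine='python' are needed because pandas treats multi-character separators as regular expressions):

```python
import numpy as np
import pandas as pd

# Write a small frame the same way the script writes data.txt.
df = pd.DataFrame({'Date': ['18/07/2020'], 'Commodity': ['AMBAT CHUKA'], 'Modal': [5]})
np.savetxt('out.txt', df, delimiter='~|~', fmt='%s')  # no header, no index

# Read it back; r'~\|~' matches the literal three-character delimiter.
back = pd.read_csv('out.txt', sep=r'~\|~', engine='python',
                   header=None, names=['Date', 'Commodity', 'Modal'])
print(back['Commodity'][0])  # AMBAT CHUKA
```

Since np.savetxt writes neither header nor index, header=None with an explicit names= list restores the column labels on the way back in.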