How to scrape data which is in HTML table format?


This script goes through all the pages and saves the data to a standard CSV file and to a text file delimited by ~|~:

    import requests
    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup

    url = 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'
    detail_url = 'https://www.msamb.com/ApmcDetail/DataGridBind?commodityCode={code}&apmcCode=null'
    headers = {'Referer': 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'}

    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    values = [(o['value'], o.text) for o in soup.select('#CommoditiesId option') if o['value']]

    all_data = []
    for code, code_name in values:
        print('Getting info for code {} {}'.format(code, code_name))
        soup = BeautifulSoup(requests.get(detail_url.format(code=code), headers=headers).content, 'html.parser')

        current_date = ''
        for row in soup.select('tr'):
            if row.select_one('td[colspan]'):
                current_date = row.get_text(strip=True)
            else:
                row = [td.get_text(strip=True) for td in row.select('td')]
                all_data.append({
                    'Date': current_date,
                    'Commodity': code_name,
                    'APMC': row[0],
                    'Variety': row[1],
                    'Unit': row[2],
                    'Quantity': row[3],
                    'Lrate': row[4],
                    'Hrate': row[5],
                    'Modal': row[6],
                })

    df = pd.DataFrame(all_data)
    print(df)

    df.to_csv('data.csv')                                       # <-- saves standard csv
    np.savetxt('data.txt', df, delimiter='~|~', fmt='%s')       # <-- saves .txt file with '~|~' delimiter
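The row-handling part of the script can be tried offline. The sketch below is only an illustration: SAMPLE_HTML is made-up markup that assumes the same layout the script relies on (a date in a td with a colspan attribute, followed by seven-cell data rows), and parse_rows and COLUMNS are hypothetical names, not part of the original answer. It also skips any row that does not have exactly seven cells, which the original script does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for illustration; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td colspan="7">18/07/2020</td></tr>
  <tr><td>PANDHARPUR</td><td>----</td><td>NAG</td><td>50</td><td>5</td><td>5</td><td>5</td></tr>
  <tr><td colspan="7">13/07/2020</td></tr>
  <tr><td>PUNE</td><td>LOCAL</td><td>NAG</td><td>2400</td><td>4</td><td>7</td><td>5</td></tr>
</table>
"""

COLUMNS = ['APMC', 'Variety', 'Unit', 'Quantity', 'Lrate', 'Hrate', 'Modal']


def parse_rows(html, commodity):
    """Return one dict per data row, tagged with the last date header seen."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    current_date = ''
    for row in soup.select('tr'):
        # A row whose cell spans columns carries the date for the rows below it.
        if row.select_one('td[colspan]'):
            current_date = row.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in row.select('td')]
        # Added guard (not in the original script): ignore rows of another shape.
        if len(cells) != len(COLUMNS):
            continue
        record = {'Date': current_date, 'Commodity': commodity}
        record.update(zip(COLUMNS, cells))
        records.append(record)
    return records


records = parse_rows(SAMPLE_HTML, 'AMBAT CHUKA')
for record in records:
    print(record)
```

Each date row only updates current_date; the data rows that follow inherit it, which is how the Date column gets filled in.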

Prints:

    ...
    Getting info for code 08071 TOMATO
    Getting info for code 10006 TURMERIC
    Getting info for code 08075 WAL BHAJI
    Getting info for code 08076 WAL PAPDI
    Getting info for code 08077 WALVAD
    Getting info for code 07011 WATER MELON
    Getting info for code 02009 WHEAT(HUSKED)
    Getting info for code 02012 WHEAT(UNHUSKED)
                Date        Commodity          APMC Variety     Unit Quantity Lrate Hrate Modal
    0     18/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
    1     16/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
    2     15/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG      100     9     9     9
    3     13/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       16     7     7     7
    4     13/07/2020      AMBAT CHUKA          PUNE   LOCAL      NAG     2400     4     7     5
    ...          ...              ...           ...     ...      ...      ...   ...   ...   ...
    4893  12/07/2020    WHEAT(HUSKED)        SHIRUR   No. 2  QUINTAL        2  1400  1400  1400
    4894  17/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      863  4000  4600  4300
    4895  16/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      475  4000  4500  4250
    4896  15/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      680  3900  4400  4150
    4897  13/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL     1589  3900  4450  4175

    [4898 rows x 9 columns]

Saves data.txt:

    0~|~18/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
    1~|~16/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
    2~|~15/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~100~|~9~|~9~|~9
    3~|~13/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~16~|~7~|~7~|~7
    4~|~13/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~2400~|~4~|~7~|~5
    5~|~12/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~1700~|~3~|~8~|~5
    6~|~19/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~3~|~9000~|~14000~|~11500
    7~|~18/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~12~|~8500~|~15000~|~11750
    8~|~18/07/2020~|~APPLE~|~NASHIK~|~DILICIOUS- No.1~|~QUINTAL~|~110~|~9000~|~16000~|~13000
    9~|~18/07/2020~|~APPLE~|~SANGLI-PHALE BHAJIPALAM~|~LOCAL~|~QUINTAL~|~8~|~12000~|~16000~|~14000
    10~|~17/07/2020~|~APPLE~|~MUMBAI-FRUIT MARKET~|~----~|~QUINTAL~|~264~|~9000~|~12000~|~10500
    ...

[Screenshot: the CSV file opened in LibreOffice]


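You can save the rows to a text file and load it back with something like

    df = pd.read_csv("out.txt", delimiter=r'~\|~', engine='python')

(pandas treats a delimiter longer than one character as a regular expression, so the | has to be escaped). You can then pick out individual columns:

    date = df['Date']
    commodity = df['Commodity']

Alternatively, you can append the APMC values to a list and build the DataFrame from it at the end.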
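As a rough sketch of reading such a file back, the example below parses two lines copied from the data.txt sample shown earlier. It assumes the file has no header row and that the first field is the row index, as in that sample; SAMPLE_TEXT, COLUMNS and the 'Index' label are names chosen here for illustration. An in-memory buffer stands in for the file so the snippet runs without touching disk.

```python
import io

import pandas as pd

# Two lines in the shape of the data.txt sample above (index first, no header).
SAMPLE_TEXT = (
    "0~|~18/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5\n"
    "4~|~13/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~2400~|~4~|~7~|~5\n"
)

COLUMNS = ['Date', 'Commodity', 'APMC', 'Variety', 'Unit',
           'Quantity', 'Lrate', 'Hrate', 'Modal']

df = pd.read_csv(
    io.StringIO(SAMPLE_TEXT),    # replace with a file path for a real file
    sep=r'~\|~',                 # multi-character separators are regexes; escape the pipe
    engine='python',             # regex separators need the Python parsing engine
    header=None,                 # assumption: the file has no header line
    names=['Index'] + COLUMNS,   # assumption: first field is the row index
    index_col='Index',
)

# Individual columns can then be selected by name.
date = df['Date']
commodity = df['Commodity']
print(df)
```

Without the escape, the pattern '~|~' would match a single ~ on either side of the alternation, and every | would end up as a column of its own.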