How to extract JSON from a script tag using Beautiful Soup in Python?


This should work, though I am sure there is a more elegant approach:

import json
from bs4 import BeautifulSoup

html = '''<script type="application/json" data-initial-state="review-filter">{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}</script>'''

soup = BeautifulSoup(html, 'html.parser')
res = soup.find('script')
json_object = json.loads(res.contents[0])
for language in json_object['languages']:
    print('{}: {}'.format(language['displayName'], language['reviewCount']))

Output:

Toutes les langues: 573
français: 567
English: 6
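
If the page might not contain the tag, or the tag might not hold valid JSON, a more defensive variant can help. This is only a minimal sketch: the helper name `extract_json_from_script` and its `attrs` parameter are hypothetical (they are not part of Beautiful Soup), the sample HTML is a trimmed copy of the one above, and it assumes the JSON sits directly inside the matched tag.

```python
import json
from bs4 import BeautifulSoup


def extract_json_from_script(html, attrs):
    """Return parsed JSON from the first <script> matching attrs, or None.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", attrs=attrs)
    if tag is None or not tag.string:
        return None  # tag missing or empty
    try:
        return json.loads(tag.string)
    except json.JSONDecodeError:
        return None  # tag content was not valid JSON


# Trimmed sample data (assumption: same shape as the HTML above).
html = '<script type="application/json" data-initial-state="review-filter">{"languages":[{"isoCode":"fr","displayName":"français","reviewCount":"567"}]}</script>'

data = extract_json_from_script(html, {"type": "application/json"})
# The counts are stored as strings in the JSON, so convert them explicitly.
counts = {lang["isoCode"]: int(lang["reviewCount"]) for lang in data["languages"]}
print(counts)
```

Matching on `type="application/json"` instead of taking the first `script` tag avoids picking up an unrelated JavaScript block that happens to appear earlier in the page.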


Import json, parse the script tag's text with json.loads, and then iterate over the languages to get each reviewCount.

import json
from bs4 import BeautifulSoup

html = '''<script type="application/json" data-initial-state="review-filter">{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}</script>'''

soup = BeautifulSoup(html, "html.parser")
# Select the script tag by its data attribute and take its text content.
item = soup.select_one('script[data-initial-state="review-filter"]').text
jsondata = json.loads(item)
for language in jsondata['languages']:
    print(language['reviewCount'])

Output:

573
567
6


import re

html = '''<script type="application/json" data-initial-state="review-filter">{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}</script>'''

match = [item.group(1) for item in re.finditer('reviewCount":"(.+?)"', html)]
print(match)

Output:

['573', '567', '6']
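
Note that the pattern above depends on the exact layout of the JSON text: it would stop matching if the page emitted whitespace after the colon. A possible middle ground, sketched here under the assumption that the tag keeps its `data-initial-state="review-filter"` attribute, is to use the regex only to cut out the script body and let `json.loads` do the parsing. The sample HTML is a trimmed, re-spaced copy of the one above.

```python
import json
import re

# Trimmed sample with spaces after the colons, which the
# 'reviewCount":"(.+?)"' pattern would not match.
html = '''<script type="application/json" data-initial-state="review-filter">{"languages": [{"isoCode": "all", "reviewCount": "573"}, {"isoCode": "fr", "reviewCount": "567"}]}</script>'''

# Capture everything between the matching <script ...> and </script>.
pattern = re.compile(
    r'<script[^>]*data-initial-state="review-filter"[^>]*>(.*?)</script>',
    re.DOTALL,
)

counts = []
found = pattern.search(html)
if found:
    data = json.loads(found.group(1))
    counts = [language["reviewCount"] for language in data["languages"]]
print(counts)
```

For this sample it prints `['573', '567']`. Parsing HTML with a regex is still fragile in general, so the Beautiful Soup versions above remain the safer choice for real pages.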