How to parse ld+json using Python
You should parse the JSON text with json.loads to convert it into a dictionary.
    import json
    import requests
    from bs4 import BeautifulSoup

    def get_ld_json(url: str) -> dict:
        parser = "html.parser"
        req = requests.get(url)
        soup = BeautifulSoup(req.text, parser)
        return json.loads("".join(soup.find("script", {"type": "application/ld+json"}).contents))
The join/contents combination removes the script tags, leaving only the JSON text inside them.
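To see what the join/contents step does without making a network request, here is a minimal sketch that runs the same lookup on an inline HTML string. The sample page and the parse_ld_json name are made up for illustration, and the None check is an addition that the answer above does not have.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Example"}
</script>
</head><body></body></html>
"""

def parse_ld_json(html: str) -> dict:
    """Return the first ld+json block of an HTML string as a dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", {"type": "application/ld+json"})
    if tag is None:
        # soup.find returns None when no matching tag exists.
        raise ValueError("no application/ld+json script tag found")
    # .contents is the list of the tag's children (here a single text node).
    # Joining them gives the raw JSON text without the <script> wrapper.
    return json.loads("".join(tag.contents))

data = parse_ld_json(SAMPLE_HTML)
print(data["@type"])
```

json.loads tolerates the leading and trailing whitespace around the JSON, so the text does not need to be stripped first.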
You should read the HTML first, then parse it:
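    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # url is the address of the page you want to parse
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type': 'application/ld+json'})
    print(p.contents)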
The comments above didn't help (thanks though)
In the end I used:
p = str(soup.find('script', {'type':'application/ld+json'}))
I forced it into a string which isn't really pretty, but it did the job. I know there's probably a better way out there, but this worked for me.
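One possible alternative to the str() approach, offered as a sketch rather than the definitive fix: calling str() on the tag keeps the surrounding script tags in the result, whereas the tag's .string attribute gives only the text inside it, which can be passed to json.loads directly. The example below also collects every ld+json block on the page, since a page may contain more than one. The function name and sample HTML are hypothetical.

```python
import json
from bs4 import BeautifulSoup

def extract_all_ld_json(html: str) -> list:
    """Return every parseable ld+json block in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script", {"type": "application/ld+json"}):
        raw = tag.string  # text inside the tag, or None if the tag is empty
        if not raw:
            continue
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON instead of failing outright.
            continue
    return results

# Hypothetical sample page with one valid, one broken, and one empty block.
sample = """
<script type="application/ld+json">{"@type": "Person", "name": "Ada"}</script>
<script type="application/ld+json">{not valid json}</script>
<script type="application/ld+json"></script>
<script type="text/javascript">var x = 1;</script>
"""

blocks = extract_all_ld_json(sample)
print(len(blocks), blocks[0]["name"])
```

Whether to skip malformed blocks silently or to raise is a design choice. Skipping is used here only to keep the sketch short.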