Parse HTML Using AWK
Based on the samples and attempts you have shown, please try the following awk
code.
awk -F"[><]" '{gsub(/\r/,"")} /^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{print $3}' Input_file
Explanation: Here is an annotated version of the code above. It is for explanation only; to actually run the code, please use the one-liner above.
awk -F"[><]" '          ##Start the awk program and set the field separator to > or <.
{gsub(/\r/,"")}         ##Remove Control-M (carriage return) characters from the end of lines.
/^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{   ##Check whether the line starts with whitespace followed by <div class="product-price"> and runs to the closing div tag.
  print $3              ##Print the 3rd field here.
}
' Input_file            ##Mention the Input_file name here.
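To see why the 3rd field holds the price, here is a small Python sketch that mimics the same two steps (strip carriage returns, then split on > or <). The sample line is hypothetical, since the question's real input is not reproduced here; it only illustrates the splitting and is not a replacement for the awk code.

```python
import re

# Hypothetical sample line; the real Input_file contents are not shown here.
line = '    <div class="product-price">$19.99</div>\r'

# Same steps as the awk program: drop carriage returns, then split on > or <.
fields = re.split(r"[><]", line.replace("\r", ""))

# awk fields are 1-based, so the $3 of awk is fields[2] in Python.
price = fields[2]

print(fields)
print(price)
```

The leading whitespace before the opening tag becomes the 1st field and the tag itself the 2nd, which is why the text between the tags lands in the 3rd.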
Changed the regex to start with /^[ \t]+<div[ \t]+class
as per Ed's suggestion in the comments. Also, experts always recommend using xmlstarlet or another XML-aware tool if one is available on your system.
If someone is looking for a Python solution, I would suggest the BeautifulSoup library. The following was written and tested with Python 3.8. To keep it separate from my previous answer, I am adding it as another answer here.
#!/bin/python3
##Import the library here.
from bs4 import BeautifulSoup
##Read Input_file and get all of its contents.
with open('Input_file', 'r') as f:
    contents = f.read()
##Parse the contents into the soup variable with the lxml parser.
soup = BeautifulSoup(contents, 'lxml')
##Get only the values the OP needs: div elements with class product-price.
vals = soup.find_all("div", {"class": "product-price"})
##Print the actual text inside the tags.
for val in vals:
    print(val.text)
NOTE:
- BeautifulSoup and lxml must both be installed, using pip3 or pip depending on your system (the pip package names are beautifulsoup4 and lxml).
- Input_file is the file from which the program reads all of your data.
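If installing BeautifulSoup and lxml is not possible, a similar result can be sketched with only the html.parser module from the Python standard library. This is a minimal sketch, not a tested drop-in replacement: PriceParser and extract_prices are hypothetical names chosen for illustration, the sample markup is made up, and unlike val.text above it strips surrounding whitespace from each value.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <div class="product-price"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a matching div
        self.prices = []  # one string per matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # A div nested inside a matching div: track it so its
            # closing tag does not end the capture too early.
            self.depth += 1
        elif "product-price" in classes:
            self.depth = 1
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.prices[-1] += data


def extract_prices(html):
    """Return the stripped text of each product-price div in html."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.prices]


# Hypothetical sample markup, not taken from the question.
sample = (
    '<div class="item"><div class="product-price">$19.99</div></div>'
    '<div class="product-price sale">$5.00</div>'
)
print(extract_prices(sample))
```

To run it against a real file, read the file contents as in the BeautifulSoup version and pass the resulting string to extract_prices.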