
Parse HTML Using AWK


The result of a quick Google search for "xmlstarlet print div contents", followed by a few seconds of trial and error:

$ xmlstarlet sel -t -m "//*[@class='product-price']" -v "." -n file
100,56
200,56
300,56
400,56

For an explanation - ask google :-).


With your shown samples/attempts, please try the following awk code.

awk -F"[><]" '{gsub(/\r/,"")} /^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{print $3}' Input_file

Explanation: Here is a detailed, commented version of the above. It is for explanation purposes only; to actually run the code, please use the one-liner above.

awk -F"[><]" '      ##Starting awk program from here and setting field separator to > or <.
{gsub(/\r/,"")}     ##Removing control M (carriage return) chars from the line.
/^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{ ##Checking condition if line starts with whitespace
                    ##followed by <div class="product-price"> till div close tag.
  print $3          ##Printing 3rd field here.
}
' Input_file        ##Mentioning Input_file name here.

Changed the regex to /^[ \t]+<div[ \t]+class as per Ed's suggestion in the comments. Also, experts always recommend using xmlstarlet or other XML-aware tools if they are available on your system.
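To make the field-splitting logic easier to follow, here is a minimal Python sketch that mirrors what the awk one-liner does step by step: strip carriage returns, test the line against the same regex, split on > or <, and take the third field. The SAMPLE text and the extract_prices name are assumptions made up for illustration, since the original Input_file is not shown here.

```python
import re

# Hypothetical sample input (the real Input_file is not shown in this thread).
SAMPLE = (
    '<div class="product">\r\n'
    '  <div class="product-name">Widget</div>\r\n'
    '  <div class="product-price">100,56</div>\r\n'
    '</div>\r\n'
    '  <div class="product-price">200,56</div>\n'
)

# Same pattern as the awk condition: leading whitespace, then the opening
# div with class="product-price", then anything up to a closing div tag.
LINE_RE = re.compile(r'^[ \t]+<div[ \t]+class="product-price">.*</div>')


def extract_prices(text):
    """Return the third ><-separated field of every matching line."""
    prices = []
    for line in text.split("\n"):
        line = line.replace("\r", "")         # like gsub(/\r/,"")
        if LINE_RE.match(line):
            fields = re.split(r"[><]", line)  # like -F"[><]"
            prices.append(fields[2])          # awk's $3 is index 2 here
    return prices


print(extract_prices(SAMPLE))
```

Like the awk version, this only works when each price div sits on a single line with leading whitespace; it is a line-oriented shortcut, not an HTML parser.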


If someone is looking for a Python solution, I would suggest using Python's BeautifulSoup library; the following was written and tested in Python 3.8. To keep it separate from my previous answer, I am adding it as another answer here.

#!/usr/bin/env python3
##Import library here.
from bs4 import BeautifulSoup

##Read Input_file and get all of its contents (the with block closes the file).
with open('Input_file', 'r') as f:
    contents = f.read()

##Parse the contents with the lxml parser into the soup variable here.
soup = BeautifulSoup(contents, 'lxml')

##Get only those values which are specifically needed by OP, by div class.
vals = soup.find_all("div", {"class": "product-price"})

##Print actual values out of tags.
for val in vals:
    print(val.text)

NOTE:

  • You need both BeautifulSoup (pip package name beautifulsoup4) and lxml installed; install them with pip3 or pip, depending on your system.
  • Input_file is the file from which the program reads all your data.
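If installing BeautifulSoup and lxml is not an option, roughly the same extraction can be sketched with only the standard library's html.parser module. This is a swapped-in technique, not the BeautifulSoup approach above: the PriceParser class and prices_from_html function are hypothetical names for illustration, and unlike val.text this sketch strips surrounding whitespace from each value.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every div whose class list has product-price."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a matching div (tracks nested divs)
        self.buffer = []  # text pieces of the div currently being read
        self.prices = []  # finished values

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a matching one
        elif "product-price" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.prices.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def prices_from_html(html):
    """Hypothetical helper: parse a string and return the collected values."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


if __name__ == "__main__":
    # Input_file is the same placeholder file name used above.
    with open("Input_file", "r") as f:
        for price in prices_from_html(f.read()):
            print(price)
```

Because it tracks tags rather than lines, this version does not depend on each div sitting on its own line, which is the main weakness of the awk approach.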