Parse HTML with CURL in Shell Script

Using xmllint:

a='<div class="tracklistInfo"><p class="artist">Diplo - Justin Bieber - Skrillex</p><p>Where Are U Now</p></div>'
xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' - <<<"$a"

You obtain:

Diplo - Justin Bieber - Skrillex#Where Are U Now

The two fields can then be split easily on the # separator.
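As a minimal sketch of that splitting step, assuming the xmllint output has already been captured as a string (the variable names are illustrative, not from the answer), it could look like this in Python:

```python
# The line printed by the xmllint command, captured as a string.
line = "Diplo - Justin Bieber - Skrillex#Where Are U Now"

# Split on the first "#" only, so a "#" inside the title would stay intact.
artist, title = line.split("#", 1)

print(artist)
print(title)
```

Splitting only once is a design choice: the artist comes first, so any later "#" is treated as part of the title.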

html shell curl

Don't. Use an HTML parser. For example, BeautifulSoup for Python is easy to use and handles this with very little code.

That being said, remember that grep works on lines. The pattern is matched for every line, not for the entire string.
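To illustrate the line-by-line behaviour, here is a small Python sketch. It assumes the HTML arrives spread over several lines, with the div and each p on its own line; the pattern is made up for the demonstration. Testing each line separately mimics grep, while searching the whole string does not:

```python
import re

# Assumed layout: the <div> and each <p> sit on their own line.
html = (
    '<div class="tracklistInfo">\n'
    '<p class="artist">Diplo - Justin Bieber - Skrillex</p>\n'
    '<p>Where Are U Now</p>\n'
    '</div>\n'
)

# A pattern that needs the <div> and the artist <p> together.
pattern = r'<div class="tracklistInfo">.*class="artist">'

# grep-like behaviour: every line is tested on its own.
per_line = [line for line in html.splitlines() if re.search(pattern, line)]

# Whole-string behaviour: re.DOTALL lets "." cross line breaks.
whole = re.search(pattern, html, re.DOTALL)

print(per_line)
print(whole is not None)
```

No single line contains both parts of the pattern, so the per-line search finds nothing, while the whole-string search succeeds.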

You can use -A to also print the lines that follow the match:

grep -A2 -E -m 1 '<div class="tracklistInfo">'

Should output:

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>

You can then get the last or second-last line by piping it to tail:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1
<p>Where Are U Now</p>
$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n2 | head -n1
<p class="artist">Diplo - Justin Bieber - Skrillex</p>

And strip the HTML with sed:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g'
Where Are U Now
$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n2 | head -n1 | sed 's/<[^>]*>//g'
Diplo - Justin Bieber - Skrillex

But as said, this is fragile, likely to break, and not very pretty. Here's the same with BeautifulSoup, by the way:

from bs4 import BeautifulSoup

html = '''
<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')
for track in soup.find_all(class_='tracklistInfo'):
    print(track.find_all('p')[0].text)
    print(track.find_all('p')[1].text)

This also works with multiple rows of tracklistInfo, whereas adding that to the shell command requires more work ;-)
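If installing BeautifulSoup is not an option, roughly the same result can be sketched with Python's standard-library html.parser instead. This is a swapped-in technique, not the approach shown above. The class name TracklistParser, the second sample row, and the assumption that a tracklistInfo block contains no nested div are all illustrative:

```python
from html.parser import HTMLParser


class TracklistParser(HTMLParser):
    """Collect the text of every <p> inside <div class="tracklistInfo">."""

    def __init__(self):
        super().__init__()
        self.tracks = []       # one list of <p> texts per tracklistInfo div
        self.in_track = False  # currently inside a tracklistInfo div?
        self.in_p = False      # currently inside a <p> within that div?

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "tracklistInfo" in classes:
            self.in_track = True
            self.tracks.append([])
        elif tag == "p" and self.in_track:
            self.in_p = True
            self.tracks[-1].append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag == "div":
            # Assumption: no nested <div> inside a tracklistInfo block.
            self.in_track = False

    def handle_data(self, data):
        if self.in_p:
            self.tracks[-1][-1] += data


# Sample input with two rows; the second row is made-up test data.
html = '''
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
<div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>
'''

parser = TracklistParser()
parser.feed(html)
parser.close()

for artist, title in parser.tracks:
    print(f"artist : {artist}")
    print(f"title : {title}")
```

In a shell script, the page fetched by curl could be piped to such a script and read from standard input instead of the hard-coded string.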


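# Create a sample file with two tracklistInfo blocks.
cat - > file.html << EOF
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
<div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>
EOF

# Join everything into one line, put each </div> block on its own line,
# then print the artist and title captured from each block.
# The \n in the sed replacement text relies on GNU sed.
cat file.html | tr -d '\n' | sed -e "s/<\/div>/<\/div>\n/g" | sed -n 's/^.*class="artist">\([^<]*\)<\/p> *<p>\([^<]*\)<.*$/artist : \1\ntitle : \2\n/p'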
