Native shell command set to extract node value from XML
I've managed to solve it for the time being with this rather unwieldy script using xmllint --shell:
echo "cat //project/parent/version" | xmllint --shell pom.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g'
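To see what the two sed filters do without running xmllint, here is a small sketch that feeds them a hand-written sample. The sample text is only an assumption about what xmllint --shell prints for the cat command (prompt lines starting with "/ >" around the matched node); the exact output may differ between libxml2 versions.

```shell
#!/bin/sh
# Hypothetical sample of `xmllint --shell` output for the `cat` command;
# real output may vary between libxml2 versions.
sample='/ >  -------
<version>1.2.3</version>
/ > '

# 1st sed: drop the shell prompt lines (those starting with "/ >").
# 2nd sed: strip every <tag>, leaving only the text content.
version=$(printf '%s\n' "$sample" | sed '/^\/ >/d' | sed 's/<[^>]*.//g')

echo "$version"
```

The first filter is what removes the interactive prompt noise; the second only works cleanly when the matched node contains plain text and no nested elements spread over several lines.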
If the XML nodes carry namespace attributes, as in my pom.xml, things get heavier, because each node has to be matched by its local name:
echo "cat //*[local-name()='project']/*[local-name()='parent']/*[local-name()='version']" | xmllint --shell pom.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g'
Hope it helps. If anyone can simplify these expressions, I'd be grateful.
I came here looking for a nice way to scrape a value from a website. The following example may be useful to those (unlike the poster) who have a version of xmllint which supports --xpath.
I needed to pull the most recent stable version of the elasticsearch .deb file and install it. The maintainers have helpfully put the version number in a span with the class "version".
version=`curl -s http://www.elasticsearch.org/download/ | xmllint --html --xpath '//span[@class="version"]/text()' 2>/dev/null - `;
What goes on:
We use the curl -s (silent) option.
curl -s http://www.elasticsearch.org/download/
We use the xmllint --html and --xpath switches. The XPath argument (in single quotes)
'//span[@class="version"]/text()'
... looks for a <span> node with the class attribute (@class) "version", and extracts the text value (/text()).
Since xmllint is (surprise!) a linter, it will squawk about the inevitable garbage in your html stream. We direct the stderr to /dev/null in the usual way:
2>/dev/null
Finally, note the " - " at the end of the xmllint command, which tells xmllint the stream is coming from stdin.
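Putting the pieces together, the same pipeline can be tried offline against a small HTML snippet instead of the live page. This is only a sketch: the function name extract_version and the sample markup are made up for illustration, and it assumes an xmllint build that supports --xpath.

```shell
#!/bin/sh
# extract_version (hypothetical helper): read HTML on stdin and print the
# text of the <span class="version"> element.
extract_version() {
    # --html      : use the lenient HTML parser
    # --xpath     : evaluate the expression and print the result
    # -           : read the document from stdin
    # 2>/dev/null : hide parser complaints about sloppy markup
    xmllint --html --xpath '//span[@class="version"]/text()' - 2>/dev/null
}

# Made-up sample standing in for the downloaded page.
sample_html='<html><body><p>Latest: <span class="version">1.2.3</span></p></body></html>'

version=$(printf '%s\n' "$sample_html" | extract_version)
echo "version=$version"
```

With a real page, the printf line would be swapped for the curl -s call. If the page ever contains more than one matching span, the XPath would likely need tightening, for example with a [1] predicate.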