How to parse XML in Bash?

This is really just an explanation of Yuzem's answer, but I didn't feel that this much editing should be done to someone else's answer, and comments don't allow formatting, so...

rdom () { local IFS=\> ; read -d \< E C ;}

Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data, instead of automatically being split on space, tab, or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:

<tag>value</tag>

The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. read assigns ENTITY=/tag and CONTENT=, but it also returns a non-zero status this time, because it hit end-of-file before finding another '<'. That non-zero status is what eventually terminates a while loop over the input.
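To see those splits concretely, here is a minimal, self-contained demo (bash-only, since read -d is not POSIX sh; the printf format is just for display):

```shell
#!/usr/bin/env bash
# Demo of the splitting described above. Note that the final '/tag' chunk
# is read, but read returns non-zero at end-of-file, so the loop stops
# before printing it.
read_dom () { local IFS=\> ; read -d \< ENTITY CONTENT ;}

while read_dom; do
    printf 'ENTITY=[%s] CONTENT=[%s]\n' "$ENTITY" "$CONTENT"
done <<< '<tag>value</tag>'
```

This prints `ENTITY=[] CONTENT=[]` for the leading empty chunk, then `ENTITY=[tag] CONTENT=[value]`.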

Now here is his while loop, cleaned up a bit to match the above:

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo "$CONTENT"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The third line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats at the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).

Now given the following (similar to what you get from listing a bucket on S3) for input.xml:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>item-apple-iso@2x.png</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

and the following loop:

while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml

You should get:

 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" => 
Name => sth-items
/Name => 
IsTruncated => false
/IsTruncated => 
Contents => 
Key => item-apple-iso@2x.png
/Key => 
LastModified => 2011-07-25T22:23:04.000Z
/LastModified => 
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag => 
Size => 1785
/Size => 
StorageClass => STANDARD
/StorageClass => 
/Contents => 

So if we wrote a while loop like Yuzem's:

while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo "$CONTENT"
    fi
done < input.xml

We'd get a listing of all the files in the S3 bucket.

EDIT: If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function, like:

read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}

Otherwise, any line splitting you do later in the script will be messed up.
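A quick made-up illustration of what goes wrong, in case you're curious: with a global IFS of '>', unquoted expansions no longer split on whitespace.

```shell
#!/usr/bin/env bash
# Demo: after IFS is globally set to '>', unquoted expansions no longer
# split on whitespace, which silently breaks later word splitting.
IFS=\>
words="one two three"
set -- $words
echo "$#"        # 1: the whole string is a single field now

IFS=$' \t\n'     # restore the default
set -- $words
echo "$#"        # 3: splitting on whitespace works again
```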

EDIT 2: To split out attribute name/value pairs, you can augment read_dom() like so:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}

Then write your function to parse and get the data you want like this:

parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}

Then, while you read_dom, call parse_dom:

while read_dom; do
    parse_dom
done

Then given the following example markup:

<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>

You should get this output:

$ cat example.xml | ./bash_xml.sh
bar type is: metal
foo size is: 1789
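One caveat about the eval above: it executes whatever text appears in the tag's attributes, so it is unsafe on input you don't control. If that matters, here is a sketch of an eval-free alternative (the regex and variable names are my own, not part of the answer itself), assuming simple, well-formed attributes with double-quoted values:

```shell
#!/usr/bin/env bash
# Pull name="value" pairs out of an attribute string without eval.
# Assumes double-quoted values with no embedded quotes.
attrs='size="1789" type="unknown"'
re='([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"'
while [[ $attrs =~ $re ]]; do
    echo "${BASH_REMATCH[1]} = ${BASH_REMATCH[2]}"
    attrs=${attrs#*"${BASH_REMATCH[0]}"}   # drop the pair we just handled
done
```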

EDIT 3: Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}

I don't see any reason why that shouldn't work.


You can do that very easily using only bash. You only have to add this function:

rdom () { local IFS=\> ; read -d \< E C ;}

Now you can use rdom like read, but for HTML documents. When called, rdom will assign the element to the variable E and the content to the variable C.

For example, to do what you wanted to do:

while rdom; do
    if [[ $E = title ]]; then
        echo "$C"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


Command-line tools that can be called from shell scripts include:

  • 4xpath - command-line wrapper around Python's 4Suite package

  • XMLStarlet

  • xpath - command-line wrapper around Perl's XPath library

    sudo apt-get install libxml-xpath-perl
  • Xidel - Works with URLs as well as files. Also works with JSON
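As a taste of the XPath tools above, here is a hypothetical XMLStarlet one-liner (the sample document is made up; `sel -t -v` means "select, as a template, the value of this XPath"):

```shell
#!/usr/bin/env bash
# Select the text of //title using XMLStarlet's "sel" (select) mode.
echo '<root><title>Page title</title></root>' | xmlstarlet sel -t -v '//title'
```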

I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
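For example, here is the kind of xmllint one-liner I mean; the sample file is invented for the demo, and local-name() sidesteps having to register the XHTML namespace:

```shell
#!/usr/bin/env bash
# Sketch: pull the <title> out of an XHTML file with xmllint (libxml2).
cat > /tmp/sample.xhtml <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Page title</title></head>
  <body></body>
</html>
EOF
xmllint --xpath 'string(//*[local-name()="title"])' /tmp/sample.xhtml
```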