Delete a SPECIFIC duplicate line from XML file in place


The following script takes an XML file as its first argument, uses xmlstarlet (xml in the script) to parse the XML tree, and uses an associative array (requires Bash 4) to store the unique <upath> node values.

    #!/bin/bash

    input_file=$1

    # XPath to retrieve <upath> node value.
    xpath_upath_value='//package/userinterface/upath/text()'

    # XPath to print XML tree excluding <userinterface> part.
    xpath_exclude_userinterface_tree='//package/*[not(self::userinterface)]'

    # Associative array to help us remove duplicated <upath> node values.
    declare -A arr

    print_userinterface_no_dup() {
        printf '%s\n' "<userinterface>"
        printf '<upath>%s</upath>\n' "${arr[@]}"
        printf '%s\n' "</userinterface>"
    }

    # Iterate over each <upath> node value, lower-case it and use it as a key in the associative array.
    while read -r upath; do
        key="${upath,,}"
        # We can remove this 'if' statement and simply arr[$key]="$upath"
        # if it doesn't matter whether we remove <upath>foo</upath> or <upath>FOO</upath>
        if [[ ! "${arr[$key]}" ]]; then
            arr[$key]="$upath"
        fi
    done < <(xml sel -t -m "$xpath_upath_value" -c \. -n "$input_file")

    printf '%s\n' "<package>"

    # Print XML tree excluding <userinterface> part.
    xml sel -t -m "$xpath_exclude_userinterface_tree" -c \. "$input_file"

    # Print <userinterface> tree without duplicates.
    print_userinterface_no_dup

    printf '%s\n' "</package>"

Test ( script name is sof ):

    $ ./sof xml_file
    <package>
        <id>1523456789</id>
        <models>
          <model type="A">
            <start>2016-04-20</start>
            <end>2017-04-20</end>
          </model>
          <model type="B">
            <start>2016-04-20</start>
            <end>2017-04-20</end>
          </model>
        </models>
        <userinterface>
            <upath>/Example/Dir/Here2</upath>
            <upath>/Example/Dir/Here</upath>
        </userinterface>
    </package>

If my comments are not making the code clear enough for you, please ask and I'll answer and edit this solution accordingly.


My xmlstarlet version is 1.6.1, compiled against libxml2 2.9.2 and libxslt 1.1.28.
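
The script above prints the cleaned tree to standard output and leaves the input file untouched. Since the question asks for an in-place change, one way to get there is to write the output to a temporary file and move it over the original only if the command succeeded. The sketch below is an assumption layered on top of the tested script, not part of it: rewrite_in_place is a hypothetical helper name, and it relies on mktemp, which is widely available but not specified by POSIX.

```shell
#!/bin/sh
# Hypothetical helper: run "COMMAND... FILE", capture its stdout in a
# temporary file next to FILE, and replace FILE only if the command succeeds.
rewrite_in_place() {
    file=$1
    shift
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if "$@" "$file" > "$tmp"; then
        mv -- "$tmp" "$file"
    else
        # Leave the original untouched when the command fails.
        rm -f -- "$tmp"
        return 1
    fi
}

# Example (assumes the script is saved as ./sof, as in the test above):
#   rewrite_in_place xml_file ./sof
```

Because mktemp creates the temporary file with restrictive permissions, the replaced file may end up with a different mode than the original. If that matters, copying the content back (cat "$tmp" > "$file") instead of using mv keeps the original file's mode.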


If you're parsing XML, you really should use a parser. There are multiple options for this - but DON'T use regular expressions, because they're a route to really brittle code - for all the reasons you're finding.

See: parsing XML with regex.

But the long and short of it is: XML is a contextual language, and regular expressions aren't. There are also some perfectly valid variations in XML, all semantically identical, that a regex won't handle.

For example: unary (self-closing) tags, variable indentation, tags at different paths in the tree, and line wrapping.

I could format your source XML a bunch of different ways, all of which would be valid XML saying the same thing, but which would break regex-based parsing. That's something to be avoided: one day, mysteriously, your script will break for no apparent reason, as the result of an upstream change that's valid within the XML spec.
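
As a small illustration of that brittleness, the sketch below feeds two layouts of the same <userinterface> fragment to a line-oriented pattern. The sample data and the count_upath_lines helper are made up for this example. grep -c counts matching lines, so the answer changes with the formatting even though the XML content does not.

```shell
#!/bin/sh
# The same XML fragment in two layouts (made-up sample data).
one_per_line='<userinterface>
  <upath>/Example/Dir/Here</upath>
  <upath>/example/dir/here</upath>
</userinterface>'

single_line='<userinterface><upath>/Example/Dir/Here</upath><upath>/example/dir/here</upath></userinterface>'

# Hypothetical helper: a pattern that assumes one <upath> element per line.
count_upath_lines() {
    printf '%s\n' "$1" | grep -c '<upath>.*</upath>'
}

count_upath_lines "$one_per_line"   # prints 2
count_upath_lines "$single_line"    # prints 1, although both elements are still there
```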

Which is why you should use a parser:

I like XML::Twig, which is a Perl module. You can do what you want with something like this:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;

    my %seen;

    # A subroutine to process any "upath" tags.
    sub process_upath {
        my ( $twig, $upath ) = @_;
        my $text = lc $upath -> trimmed_text;
        $upath -> delete if $seen{$text}++;
    }

    # Instantiate the parser, and configure what to 'handle'.
    my $twig = XML::Twig -> new ( twig_handlers => { 'upath' => \&process_upath } );

    # Parse from our data block - but you'd probably use a file handle here.
    $twig -> parse ( \*DATA );

    # Set output formatting.
    $twig -> set_pretty_print ( 'indented_a' );

    # Print to STDOUT.
    $twig -> print;

    __DATA__
    <package>
      <id>1523456789</id>
      <models>
        <model type="A">
          <start>2016-04-20</start>
          <end>2017-04-20</end>
        </model>
        <model type="B">
          <start>2016-04-20</start>
          <end>2017-04-20</end>
        </model>
      </models>
      <userinterface>
        <upath>/Example/Dir/Here</upath>
        <upath>/Example/Dir/Here2</upath>
        <upath>/example/dir/here</upath>
      </userinterface>
    </package>

This is the long form, to illustrate the concept, and it outputs:

    <package>
      <id>1523456789</id>
      <models>
        <model type="A">
          <start>2016-04-20</start>
          <end>2017-04-20</end>
        </model>
        <model type="B">
          <start>2016-04-20</start>
          <end>2017-04-20</end>
        </model>
      </models>
      <userinterface>
        <upath>/Example/Dir/Here</upath>
        <upath>/Example/Dir/Here2</upath>
      </userinterface>
    </package>

It can be reduced considerably, though, via the parsefile_inplace method, which parses a file and writes the result back to that same file.


If you only want to drop duplicate lines that directly follow each other, you can store the previous line and compare against it. To ignore case, apply tolower() to both sides of the comparison:

    awk '{ if (NR == 1 || tolower(prev) != tolower($0)) print; prev = $0 }'
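
To make the behaviour concrete, here is a small sketch that wraps a case-insensitive comparison of adjacent lines (tolower() on both sides) in a shell function and runs it on made-up sample lines. dedupe_adjacent is a hypothetical name. Note that only neighbouring duplicates are dropped: the last line survives because a different line sits between it and its earlier copy.

```shell
#!/bin/sh
# Hypothetical helper: drop a line when it equals the previous line, ignoring case.
dedupe_adjacent() {
    awk 'NR == 1 || tolower($0) != tolower(prev) { print } { prev = $0 }'
}

# Made-up sample input: lines 1 and 2 are adjacent duplicates,
# line 4 repeats them but is not adjacent to either.
printf '%s\n' \
    '<upath>/Example/Dir/Here</upath>' \
    '<upath>/example/dir/here</upath>' \
    '<upath>/Example/Dir/Here2</upath>' \
    '<upath>/example/dir/here</upath>' | dedupe_adjacent
# Output:
#   <upath>/Example/Dir/Here</upath>
#   <upath>/Example/Dir/Here2</upath>
#   <upath>/example/dir/here</upath>
```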