Replacing HTML ascii codes via a bash script?


$ echo '&#33;' | recode html/..
!
$ echo '&lt;&infin;&gt;' | recode html/..
<∞>


I don't know of an easy way; here is what I suppose I would do...

You might be able to script a browser into reading the file in and then saving it as text. If lynx supports HTML character entities, it might be worth looking into. If that doesn't work out...

The general solution to something like this uses sed. You need a "higher order" edit for this: start with an entity table, then edit that table into an edit script itself with a multiple-step procedure. Something like:

. . .
s/&amp;Dagger;/&Dagger;/g<br />
s/&amp;#8221;/&#8221;/g<br />
. . .

Then, encapsulate this as HTML, read it into a browser, and save it as text in the character set you are targeting. If you get it to produce lines like:

s/&lt;/</g

then you win. A bash script that calls sed or ex can be driven by the substitute commands in the file.
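As a rough illustration of that last step, here is a minimal sketch of sed being driven by a file of substitute commands. The file name entities.sed, the function name decode_entities, and the handful of rules are assumptions made for the example; a real script would be generated from a full entity table as described above.

```shell
#!/bin/sh
# Hypothetical location for the generated substitute commands.
rules="${TMPDIR:-/tmp}/entities.sed"

# A hand-written sample, not a complete entity table.
# &amp; is handled last so that "&amp;lt;" is not decoded twice.
# In a sed replacement a bare & means "the matched text", hence \&.
cat > "$rules" <<'EOF'
s/&lt;/</g
s/&gt;/>/g
s/&quot;/"/g
s/&Dagger;/‡/g
s/&amp;/\&/g
EOF

# Apply every substitution in the file to standard input.
decode_entities() {
    sed -f "$rules"
}
```

Something like `decode_entities < page.txt` (page.txt being a placeholder) would then run all the rules over a file in one pass.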


Here is my solution with the standard Linux toolbox.

$ foo="This is a line feed&#010;And e acute:&#233; with a grinning face &#128512;."
$ echo "$foo"
This is a line feed&#010;And e acute:&#233; with a grinning face &#128512;.
$ eval "$(printf '%s' "$foo" | sed 's/^/printf "/;s/&#0*\([0-9]*\);/\$( [ \1 -lt 128 ] \&\& printf "\\\\$( printf \"%.3o\\201\" \1)" || \$(which printf) \\\\U\$( printf \"%.8x\" \1) )/g;s/$/\\n"/')" | sed "s/$(printf '\201')//g"
This is a line feed
And e acute:é with a grinning face 😀.

As you can see, it handles decimal numeric escapes of every size: line feed, e acute (é), which takes 2 bytes in UTF-8, and even the newer emoji, which live outside the Basic Multilingual Plane and take 4 bytes in UTF-8.
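The one-liner is dense, so it may help to look at the two number conversions it relies on in isolation. This is only a sketch of those two printf formats; the variable names are made up for the example.

```shell
#!/bin/sh
# Code points below 128 are turned into a three-digit octal escape.
octal=$(printf '%.3o' 10)        # 10 is line feed

# Larger code points are turned into eight hex digits for a \U escape.
hex=$(printf '%.8x' 128512)      # 128512 is the grinning face

# Show the escape sequences that get handed to the outer printf.
printf '%s\n' "\\$octal \\U$hex"
```

Run as is, this prints the escapes `\012` and `\U0001f600` as literal text; in the full command they are passed to another printf, which expands them into the actual characters.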

This command also works with dash, a trimmed-down shell (the default /bin/sh on Ubuntu), and is compatible with bash and with shells such as ash, used on Synology devices.

If you don't mind sticking with bash and dropping the compatibility, you can make it much simpler.

Bits used should be in any decent Linux box (or OS X?):
- which
- printf (GNU and builtin)
- GNU sed
- eval (shell builtin)

The bash-only version needs neither which nor GNU printf.
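The bash-only version is not shown in the answer. The following is one possible sketch of it, not the author's code. It assumes bash 4.2 or later (whose printf builtin understands \U) and a UTF-8 locale, and the function name decode_numeric_entities is invented for the example. Like the command above, it only handles decimal references such as &#233;.

```shell
#!/usr/bin/env bash
# Sketch only: decode decimal references like &#233; with bash builtins.
# decode_numeric_entities is a hypothetical name, not from the answer.
decode_numeric_entities() {
    local s=$1 out='' ch
    local re='&#0*([0-9]+);'
    while [[ $s =~ $re ]]; do
        # Copy the text before the leftmost reference unchanged.
        out+=${s%%"${BASH_REMATCH[0]}"*}
        # Force base 10 so leftover leading zeros are never read as octal,
        # then build an eight-digit \U escape and let printf expand it.
        printf -v ch '%08x' "$((10#${BASH_REMATCH[1]}))"
        printf -v ch "\\U$ch"
        out+=$ch
        # Continue with whatever follows the reference just decoded.
        s=${s#*"${BASH_REMATCH[0]}"}
    done
    printf '%s\n' "$out$s"
}
```

With the variable from the example above, `decode_numeric_entities "$foo"` would be the call to try. No eval or external printf is involved, which is the simplification the bash-only route buys.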