Change from HTML character references to UTF-8 in a bash script, i.e. &#257; becomes ā


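If you have access to Perl, it's relatively simple. This one-liner replaces each decimal character reference (such as &#257;) with the corresponding UTF-8 character: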

    perl -ne 'binmode STDOUT, ":utf8"; s/&#([0-9]+);/pack("U",$1)/eg; print' \
      document.html

Example:

    #!/bin/bash

    # Read HTML on stdin, write it to stdout with decimal references decoded.
    html2utf8() {
      perl -ne 'binmode STDOUT, ":utf8"; s/&#([0-9]+);/pack("U",$1)/eg; print'
    }

    echo 'testing 1 &#257; 2 &#300; 3 &#275;' | html2utf8

Produces:

testing 1 ā 2 Ĭ 3 ē


If you're looking for a bash-only way of doing this, there appear to be some solutions in this thread: http://forums.gentoo.org/viewtopic-t-820377-view-previous.html?sid=b35246f20410ba95ee048970d01ac6b3
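
For reference, here is one way a bash-only version could look. This is a minimal sketch written for this answer, not taken from the linked thread. It assumes decimal references of at most seven digits, and the function names codepoint_to_utf8 and html2utf8_bash as well as the utf8_char variable are invented for the example. It computes the UTF-8 bytes arithmetically instead of relying on printf's \u escape, so the result should not depend on the current locale.

```bash
#!/bin/bash

# Hypothetical helper: store the UTF-8 encoding of a decimal code point
# in the variable utf8_char (no subshell, so newlines are preserved).
codepoint_to_utf8() {
  local cp=$1 esc
  if (( cp < 0x80 )); then
    printf -v esc '\\x%02x' "$cp"
  elif (( cp < 0x800 )); then
    printf -v esc '\\x%02x\\x%02x' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif (( cp < 0x10000 )); then
    printf -v esc '\\x%02x\\x%02x\\x%02x' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    printf -v esc '\\x%02x\\x%02x\\x%02x\\x%02x' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
  # %b turns the \xHH escapes built above into raw bytes.
  printf -v utf8_char '%b' "$esc"
}

# Hypothetical filter: read stdin line by line and decode &#NNN; references.
html2utf8_bash() {
  local re='&#([0-9]{1,7});' line rest out match cp utf8_char
  while IFS= read -r line || [[ -n $line ]]; do
    rest=$line out=''
    while [[ $rest =~ $re ]]; do
      match=${BASH_REMATCH[0]}
      cp=$(( 10#${BASH_REMATCH[1]} ))   # 10# keeps leading zeros from meaning octal
      out+=${rest%%"$match"*}           # text before the first reference
      if (( cp >= 1 && cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF) )); then
        codepoint_to_utf8 "$cp"
        out+=$utf8_char
      else
        out+=$match                     # not a valid code point: keep as-is
      fi
      rest=${rest#*"$match"}            # continue after the reference
    done
    printf '%s\n' "$out$rest"
  done
}

echo 'testing 1 &#257; 2 &#300; 3 &#275;' | html2utf8_bash
```

The loop consumes the line from left to right rather than substituting in place, so a reference that is left untouched (out of range, for instance) cannot cause it to spin forever. It will be noticeably slower than the Perl version on large files.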