Why does sed fail with International characters and how to fix?


I think the error occurs when the encoding of the input file differs from the encoding that your environment's locale (LANG, LC_ALL, LC_CTYPE) tells sed to expect.
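One way to check for such a mismatch is sketched below. The sample file and its contents are hypothetical; the check relies on iconv exiting with a non-zero status when the input is not valid in the source encoding given with -f.

```shell
# Hypothetical sample: one ISO-8859-1 line; \366 is the byte for "ö".
tmp=$(mktemp)
printf 'From M\366rker | Y\n' > "$tmp"

# Character encoding that the current locale expects.
locale charmap

# Converting UTF-8 to UTF-8 only succeeds if the input is valid UTF-8.
if iconv -f UTF-8 -t UTF-8 "$tmp" > /dev/null 2>&1; then
    is_utf8=yes
else
    is_utf8=no
fi
echo "valid UTF-8: $is_utf8"

rm -f "$tmp"
```

If the charmap says UTF-8 but the file fails this check, the file and the locale disagree.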

Example: the input file "in" is UTF-8

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y

UTF-8 can safely be interpreted as ISO-8859-1: every byte is a valid character in a single-byte encoding, so you'll get strange characters for non-ASCII text, but apart from that everything works.

Example: the input file "in" is ISO-8859-1

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y

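ISO-8859-1 cannot be interpreted as UTF-8 in general: a lone high byte such as the ISO-8859-1 ö (0xF6) is not a valid UTF-8 sequence, so decoding the input file fails. The strange match is probably due to sed trying to recover rather than fail completely: the pattern appears to match only from just after the undecodable byte, so everything up to and including that byte survives in the output.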

The answer is based on Debian Lenny/Sid and sed 4.1.5.
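Two common workarounds follow from this. They are sketched here on a hypothetical sample file and were not tested on the sed version named above. The first forces the byte-oriented C locale, so no decoding takes place. The second converts the input to UTF-8 with iconv before sed sees it, assuming you know the file's real encoding.

```shell
# Hypothetical sample: two ISO-8859-1 lines; \366 is the byte for "ö".
tmp=$(mktemp)
printf 'A | X\nFrom M\366rker | Y\n' > "$tmp"

# Option 1: in the C locale sed works on raw bytes, so "." matches the
# non-ASCII byte and ".*" can span it.
bytewise=$(LC_ALL=C sed 's/.*| //' < "$tmp")

# Option 2: convert the input to UTF-8 first, then run sed as usual.
# Adjust -f to the file's actual encoding.
converted=$(iconv -f ISO-8859-1 -t UTF-8 < "$tmp" | sed 's/.*| //')

printf '%s\n' "$bytewise" "$converted"
rm -f "$tmp"
```

Option 1 is the simpler of the two when the pattern itself is pure ASCII, as it is here. Option 2 is the safer choice when the pattern or the replacement contains non-ASCII characters.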


sed is not very well set up for non-ASCII text. However, you can use (almost) the same code in perl and get the result you want; note that | must be escaped in perl, where it otherwise means alternation:

perl -pe 's/.*\| //' x