Why does sed fail with international characters, and how can I fix it?

linux internationalization sed character

I think the error occurs if the input encoding of the file is different from the preferred encoding of your environment.
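To check whether you are in that situation, you can compare what your locale expects with what the file actually contains. This is only a minimal sketch: the helper name is_valid_utf8 and the sample file are made up for illustration, and it assumes iconv and locale are installed (they are on typical glibc systems).

```shell
# Hypothetical sample file; \344 is the ISO-8859-1 byte for "ä".
tmp=$(mktemp -d)
printf 'Tr\344d | X\n' > "$tmp/in"

# Encoding the current locale expects, e.g. UTF-8 or ISO-8859-1.
locale charmap

# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

if is_valid_utf8 "$tmp/in"; then
    echo "file is valid UTF-8"
else
    echo "file is not valid UTF-8"
fi
```

If locale charmap reports UTF-8 but the file is not valid UTF-8, you have the mismatch described above.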

Example: the input file "in" is UTF-8

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y

UTF-8 can safely be interpreted as ISO-8859-1, because every byte is a valid ISO-8859-1 character. You'll get strange characters, but apart from that everything is fine.

Example: the input file "in" is ISO-8859-1

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y

ISO-8859-1 text cannot be interpreted as UTF-8 in the same way: decoding the input file fails at the first non-ASCII byte. The strange partial match is probably because sed tries to recover rather than fail completely.

The answer is based on Debian Lenny/Sid and sed 4.1.5.
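If that diagnosis is right, two workarounds follow from it: either convert the input to the encoding your locale expects, or tell sed to work on raw bytes. The sketch below assumes the input is ISO-8859-1 and the locale is UTF-8; the sample file and the variable names fix_convert and fix_bytes are made up for illustration.

```shell
# Hypothetical sample input: two lines, the second containing the
# ISO-8859-1 byte \344 ("ä"), which is not valid UTF-8 on its own.
tmp=$(mktemp -d)
printf 'plain | X\nTr\344d | Y\n' > "$tmp/in"

# Fix 1: convert the input to the locale's encoding (UTF-8 assumed here)
# before sed sees it.
fix_convert=$(iconv -f ISO-8859-1 -t UTF-8 < "$tmp/in" | sed 's/.*| //')

# Fix 2: run sed in the C locale, so it treats the input as single bytes
# and never has to decode multibyte sequences.
fix_bytes=$(LC_ALL=C sed 's/.*| //' < "$tmp/in")

printf '%s\n' "$fix_convert" "$fix_bytes"
rm -r "$tmp"
```

With GNU sed both variants should leave just X and Y. The first is the better choice if the replacement text or the pattern itself contains non-ASCII characters, since the second only compares bytes.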


sed is not very well set up for non-ASCII text. However, you can use (almost) the same code in perl and get the result you want:

perl -pe 's/.*\| //' x
