ruby 1.9: invalid byte sequence in UTF-8
In Ruby 1.9.3 it is possible to use String.encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (iconv) and 1.9 (String#encode) :
require 'iconv' unless String.method_defined?(:encode)if String.method_defined?(:encode) file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)else ic = Iconv.new('UTF-8', 'UTF-8//IGNORE') file_contents = ic.iconv(file_contents)end
or if you have really troublesome input you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:
require 'iconv' unless String.method_defined?(:encode)if String.method_defined?(:encode) file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '') file_contents.encode!('UTF-8', 'UTF-16')else ic = Iconv.new('UTF-8', 'UTF-8//IGNORE') file_contents = ic.iconv(file_contents)end