
How to avoid tripping over UTF-8 BOM when reading files


Since Ruby 1.9.2 you can use the mode r:bom|utf-8:

text_without_bom = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8") { |file| text_without_bom = file.read }

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn't matter whether the BOM is present in the file or not.
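A minimal sketch of that behaviour: bom|utf-8 strips a leading BOM when one is present and does nothing when it is absent, so both files below read identically (the file names are made up for the example).

```ruby
# Write one file with a UTF-8 BOM and one without.
File.binwrite('with_bom.txt',    "\xEF\xBB\xBFhello")
File.binwrite('without_bom.txt', "hello")

# Reading with 'r:bom|utf-8' yields the same content either way.
a = File.read('with_bom.txt',    mode: 'r:bom|utf-8')
b = File.read('without_bom.txt', mode: 'r:bom|utf-8')

puts a == b
```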


You may also use the encoding option with other commands:

text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(You get an array with all lines).

Or with CSV:

require 'csv'

CSV.open(@filename, 'r:bom|utf-8') { |csv|
  csv.each { |row| p row }
}


I wouldn't blindly skip the first three bytes; what if the producer stops adding the BOM? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8. I prefer to deal with it before trying to decode the stream, because BOM handling is so inconsistent from one language/tool/framework to the next.
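A sketch of that manual approach: inspect the first three bytes and drop them only when they are exactly 0xEF 0xBB 0xBF (the method name read_skipping_bom is invented for this example).

```ruby
# Read a file as raw bytes, strip a UTF-8 BOM only if it is actually there,
# then tag the remainder as UTF-8.
def read_skipping_bom(path)
  bytes = File.binread(path)
  if bytes.start_with?("\xEF\xBB\xBF".b)
    bytes = bytes.byteslice(3, bytes.bytesize - 3)
  end
  bytes.force_encoding(Encoding::UTF_8)
end
```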

In fact, that's how you're supposed to deal with a BOM. If a file has been served as UTF-16, you have to examine the first two bytes before you start decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order, it's just there to let you know that the encoding is UTF-8, in case you didn't already know that.
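The UTF-16 case can be sketched like this: the first two bytes decide between big-endian (FE FF) and little-endian (FF FE). The function name detect_utf16 is made up for the example.

```ruby
# Pick the UTF-16 byte order from the leading BOM, if any.
def detect_utf16(bytes)
  case bytes.byteslice(0, 2)
  when "\xFE\xFF".b then Encoding::UTF_16BE
  when "\xFF\xFE".b then Encoding::UTF_16LE
  end # nil when there is no BOM: the byte order must come from elsewhere
end
```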


I wouldn't "trust" a file to be encoded as UTF-8 just because a BOM of 0xEF 0xBB 0xBF is present; the decode can still fail. Usually, when the UTF-8 BOM is detected, the file really is UTF-8 encoded. But if, for example, someone has simply prepended a UTF-8 BOM to an ISO-8859-1 file, decoding will fail badly wherever the file contains bytes above 0x7F. You can trust the file if it contains only bytes up to 0x7F, because in that case it is a plain ASCII file, and every ASCII file is at the same time a valid UTF-8 file.

If the file (after the BOM) contains bytes above 0x7F, then to be sure it is properly UTF-8 encoded you have to check that:

- every byte sequence is a valid UTF-8 sequence of at most 4 bytes,
- each codepoint is encoded with the shortest possible sequence (no overlong encodings),
- no codepoint falls in the high- or low-surrogate range (U+D800 to U+DFFF),
- no codepoint exceeds the maximum of U+10FFFF.

The U+10FFFF limit also means that in a 4-byte sequence the start byte's payload bits may not exceed 0x4 (so the start byte is at most 0xF4) and, when the start byte is 0xF4, the first continuation byte's payload may not exceed 0xF (so that byte is at most 0x8F). If all the mentioned checks pass, your UTF-8 BOM tells the truth.
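The checks listed above (valid sequences, shortest form, no surrogates, nothing beyond U+10FFFF) are exactly what Ruby's String#valid_encoding? performs, so a sketch of the whole test can lean on it; the method name trustworthy_utf8? is invented for the example.

```ruby
# Tag the raw bytes as UTF-8 and let Ruby validate the sequences.
def trustworthy_utf8?(raw_bytes)
  raw_bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end
```

For instance, the overlong encoding "\xC0\x80", the surrogate "\xED\xA0\x80", and the out-of-range "\xF4\x90\x80\x80" are all rejected, while any well-formed UTF-8 byte string passes.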