How to detect if a file has a UTF-8 BOM in Bash?


First, let's demonstrate that head is actually working correctly:

    $ printf '\xef\xbb\xbf' >file
    $ head -c 3 file
    $ head -c 3 file | hexdump -C
    00000000  ef bb bf                                          |...|
    00000003

Now, let's create a working function has_bom. If your grep supports -P, then one option is:

    $ has_bom() { head -c3 "$1" | LC_ALL=C grep -qP '\xef\xbb\xbf'; }
    $ has_bom file && echo yes
    yes

The -P option (Perl-compatible regular expressions) is not specified by POSIX; currently, only GNU grep supports it.

Another option is to use bash's $'...':

    $ has_bom() { head -c3 "$1" | grep -q $'\xef\xbb\xbf'; }
    $ has_bom file && echo yes
    yes

ksh and zsh also support $'...' but this construct is not POSIX and dash does not support it.
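For shells such as dash that lack $'...', one possible workaround is to compare a hex dump of the first three bytes instead of matching the raw bytes. This is only a sketch, assuming a POSIX od and tr are available; it reuses the name has_bom for convenience and is not one of the variants shown above.

```shell
# Sketch of a has_bom variant that avoids $'...' and grep -P.
# Assumption: POSIX od and tr are on PATH.
has_bom() {
    # od -An  : suppress the offset column
    #    -tx1 : print each byte as two hex digits
    #    -N3  : read at most three bytes
    # tr strips the whitespace od puts around the bytes.
    [ "$(od -An -tx1 -N3 "$1" 2>/dev/null | tr -d ' \t\n')" = "efbbbf" ]
}
```

Because nothing here depends on how the shell or grep interprets bytes above 0x7f, the result should not vary with the locale.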

Notes:

  1. The use of an explicit return $? is optional. The function will, by default, return with the exit code of the last command run.

  2. I have used the POSIX form for defining functions. This is equivalent to the bash form but gives you one less problem to deal with if you ever have to run the function under another shell.

  3. bash does accept the use of the character - in a function name but this is a controversial feature. I replaced it with _ which is more widely accepted. (For more on this issue, see this answer.)

  4. The -q option to grep makes it quiet, meaning that it still sets a proper exit code but it does not send any characters to stdout.
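Notes 1 and 4 can be seen together in a short sketch: because grep -q prints nothing and the function returns grep's exit status, has_bom can be used directly as the condition of an if. The helper report_bom is hypothetical, and the pattern is built with printf octal escapes (an assumption made here so that the sketch also runs in shells without $'...').

```shell
# has_bom without an explicit "return $?": the exit status of the pipeline
# (that is, of grep -q) becomes the function's exit status.
has_bom() {
    head -c3 "$1" | LC_ALL=C grep -q "$(printf '\357\273\277')"
}

# report_bom is a hypothetical helper that prints one line per file argument.
# Note: f is a global variable here, since POSIX sh has no "local".
report_bom() {
    for f in "$@"; do
        if has_bom "$f"; then
            echo "$f: BOM"
        else
            echo "$f: no BOM"
        fi
    done
}
```

Calling report_bom with several file names prints each name followed by "BOM" or "no BOM".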


I applied the following to the first line read:

    read c
    if (( "$(printf "%d" "'${c:0:1}")" == 65279 )) ; then c="${c:1}" ; fi

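This simply removes the BOM from the variable, if one is present. 65279 is the decimal value of U+FEFF, the code point that the UTF-8 BOM encodes, so the test compares the first character of the line against it.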
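The numeric comparison relies on the shell decoding the first character as U+FEFF, which I would expect to hold only under a UTF-8 locale. A byte-oriented alternative, sketched here with a hypothetical helper named strip_bom, is to remove the three BOM bytes as a literal prefix using POSIX parameter expansion.

```shell
# strip_bom is a hypothetical helper: it prints its argument without a
# leading UTF-8 BOM, and prints the argument unchanged if there is none.
strip_bom() {
    # The three BOM bytes EF BB BF, written as POSIX octal escapes.
    bom=$(printf '\357\273\277')
    # ${1#pattern} removes a matching prefix; quoting "$bom" makes it literal.
    printf '%s\n' "${1#"$bom"}"
}

# Usage sketch for the first line of a file:
#   IFS= read -r c < file
#   c=$(strip_bom "$c")
```

Only a single leading BOM is removed, and the rest of the string is left untouched.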


In pure bash, a solution could be:

    function has_bom() {
        local bom
        LANG=C read -r -N 3 bom < "$1"
        [[ "$bom" == $'\xef\xbb\xbf' ]]
    }

Test with a file with BOM:

    $ F=test.with-bom
    $ head -c 5 $F | hd
    00000000  ef bb bf c3 a9                                    |.....|
    $ has_bom "$F" && echo "$F has a BOM" || echo "$F has no BOM"
    test.with-bom has a BOM

Test when no BOM:

    $ F=test.utf8
    $ head -c 5 "$F" | hd
    00000000  c3 a9 6c c3 a9                                    |..l..|
    $ has_bom "$F" && echo "$F has a BOM" || echo "$F has no BOM"
    test.utf8 has no BOM