How to make sure all my source files stay UTF-8 with Unix line endings?


The solution I ended up with uses two Sublime Text 2 plugins, "EncodingHelper" and "LineEndings". I now get both the file encoding and the line endings in the status bar:

[Screenshot: Sublime Text 2 status bar]

If the encoding is wrong, I can File->Save with Encoding. If the line endings are wrong, the latter plugin comes with commands for changing the line endings:

[Screenshot: Sublime Text 2 commands]


If a file has no BOM and no 'interesting characters' within the amount of text that file looks at, file concludes that it is ASCII (ISO 646), which is a strict subset of UTF-8. You might find that putting BOMs on all your files encourages all these Windows tools to behave; the convention of a BOM on a UTF-8 file originated on Windows. Or it might make things worse. As for x/c++, well, that's just file tryin' to be helpful, and failing. Your JavaScript has something in it that looks like C++.
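To see why a sampling detector can call a UTF-8 file ASCII, here is a rough Python sketch of the idea. It is not how file is actually implemented: the function name guess_from_sample and the 1024-byte default are assumptions made up for illustration, and the real tool applies its own limits and many more heuristics.

```python
import codecs


def guess_from_sample(data: bytes, sample_size: int = 1024) -> str:
    """Guess an encoding label from only the first sample_size bytes.

    Hypothetical helper: it mimics the idea of a sampling detector,
    not the actual behaviour of the file utility.
    """
    sample = data[:sample_size]
    if sample.startswith(codecs.BOM_UTF8):
        return "UTF-8 (with BOM)"
    if all(byte < 0x80 for byte in sample):
        # No 'interesting characters' in the sample, so the data is
        # indistinguishable from plain ASCII as far as we looked.
        return "ASCII"
    # Only treat a trailing partial sequence as an error when the sample
    # is the whole input; otherwise the cut may have split a character.
    whole_input = len(data) <= sample_size
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=whole_input)
    except UnicodeDecodeError:
        return "unknown"
    return "UTF-8"


# A non-ASCII character that sits beyond the sample goes unnoticed.
late = b"a" * 2000 + "\u00e9".encode("utf-8")
print(guess_from_sample(late))                    # ASCII
print(guess_from_sample(late, sample_size=4096))  # UTF-8
```

A BOM at the start settles the question immediately, which is why adding one can make such tools behave.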

Apache Tika has an encoding detector; you could even use the command-line driver that comes with it as an alternative to file. It will stick to MIME types and not wander off to C++.


Instead of file, try a custom program to check just the things you want. Here is a quick hack, mainly based on some Google hits, which were incidentally written by @ikegami.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw( decode );

    @ARGV > 0 or die "Usage: $0 files ...\n";

    for my $filename (@ARGV)
    {
        my $terminator = 'CRLF';
        my $charset = 'UTF-8';
        local $/;    # slurp mode: read the whole file in one go
        my $file;
        # Read raw bytes so no I/O layer translates the line endings
        if (open (my $fh, "<:raw", $filename))
        {
            $file = <$fh>;
            close $fh;
            # Don't print bogus data e.g. for directories
            unless (defined $file)
            {
                warn "$0: Skipping $filename: $!\n";
                next;
            }
        }
        else
        {
            warn "$0: Could not open $filename: $!\n";
            next;
        }
        # Record which kinds of line terminator occur at least once
        my $have_crlf = ($file =~ /\r\n/) ? 1 : 0;
        my $have_cr = ($file =~ /\r(?!\n)/) ? 1 : 0;      # CR not followed by LF
        my $have_lf = ($file =~ /(?<!\r)\n/) ? 1 : 0;     # LF not preceded by CR
        my $sum = $have_crlf + $have_cr + $have_lf;
        if ($sum == 0)
        {
            $terminator = "no";
        }
        elsif ($sum > 1)
        {
            $terminator = "mixed";
        }
        elsif ($have_cr)
        {
            $terminator = "CR";
        }
        elsif ($have_lf)
        {
            $terminator = "LF";
        }
        # Pure 7-bit data is ASCII; anything else must decode as strict UTF-8
        $charset = 'ASCII' unless ($file =~ /[^\000-\177]/);
        $charset = 'unknown'
            unless eval { decode('UTF-8', $file, Encode::FB_CROAK); 1 };
        print "$filename: charset $charset, $terminator line endings\n";
    }

Note that this has no concept of legacy 8-bit encodings - it will simply report unknown if the file is neither pure 7-bit ASCII nor valid UTF-8.
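To make that limitation concrete, here is a small Python sketch of the same three-way charset decision. The function name classify_charset is made up for this example; it only mirrors the ASCII / UTF-8 / unknown outcomes of the script above and does no line-ending detection.

```python
def classify_charset(data: bytes) -> str:
    """Return 'ASCII', 'UTF-8' or 'unknown' for a byte string (illustrative helper)."""
    if data.isascii():
        return "ASCII"
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return "unknown"
    return "UTF-8"


text = "na\u00efve caf\u00e9"
print(classify_charset(b"plain text"))            # ASCII
print(classify_charset(text.encode("utf-8")))     # UTF-8
print(classify_charset(text.encode("latin-1")))   # unknown
```

The Latin-1 bytes are perfectly good text in their own encoding, but because they are not valid UTF-8 the check can only say unknown; it makes no attempt to guess which 8-bit encoding was used.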