How to make sure all my source files stay UTF-8 with Unix line endings?
The solution I ended up with is the two Sublime Text 2 plugins "EncodingHelper" and "LineEndings". I now get both the file encoding and line endings in the status bar.
If the encoding is wrong, I can File->Save with Encoding. If the line endings are wrong, the latter plugin comes with commands for changing the line endings.
If a file has no BOM, and no 'interesting characters' within the amount of text that `file` looks at, `file` concludes that it is ASCII (ISO 646), a strict subset of UTF-8. You might find that putting BOMs on all your files encourages all these Windows tools to behave; the convention of a BOM on a UTF-8 file originated on Windows. Or it might make things worse. As for x/c++, well, that's just `file` tryin' to be helpful, and failing. Your JavaScript has something in it that looks like C++.
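To make the BOM and ASCII-subset reasoning concrete, here is a minimal Perl sketch. The helper name `describe_bytes` and the sample strings are made up for illustration, and this is not how `file` itself is implemented; it only shows why a BOM-less file containing nothing but 7-bit bytes is indistinguishable from ASCII while still being valid UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not part of file or any library): classify a byte string.
sub describe_bytes {
    my ($bytes) = @_;

    # Only 7-bit bytes: valid ASCII and, by definition, also valid UTF-8.
    return 'ASCII' unless $bytes =~ /[^\x00-\x7F]/;

    # decode() with a CHECK value may modify its argument, so work on a copy.
    my $copy = $bytes;
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return 'unknown' unless $ok;

    # EF BB BF is the UTF-8 encoding of U+FEFF, the byte order mark.
    return $bytes =~ /\A\xEF\xBB\xBF/ ? 'UTF-8 with BOM' : 'UTF-8 without BOM';
}

print describe_bytes("var x = 1;\n"), "\n";               # ASCII
print describe_bytes("\xEF\xBB\xBFvar x = 1;\n"), "\n";   # UTF-8 with BOM
print describe_bytes("var s = 'caf\xC3\xA9';\n"), "\n";   # UTF-8 without BOM
print describe_bytes("var s = 'caf\xE9';\n"), "\n";       # unknown
```

The first and second inputs carry the same text; only the three leading BOM bytes let a tool tell that the author meant UTF-8.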
Apache Tika has an encoding detector; you could even use the command-line driver that comes with it as an alternative to `file`. It will stick to MIME types and not wander off to C++.
Instead of `file`, try a custom program to check just the things you want. Here is a quick hack, mainly based on some Google hits, which were incidentally written by @ikegami.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw( decode );

    @ARGV > 0 or die "Usage: $0 files ...\n";

    for my $filename (@ARGV) {
        my $terminator = 'CRLF';
        my $charset    = 'UTF-8';
        local $/;    # slurp mode: read the whole file in one go
        my $file;
        # :raw keeps the bytes untouched (no CRLF translation on Windows)
        if (open(my $fh, '<:raw', $filename)) {
            $file = <$fh>;
            close $fh;
            # Don't print bogus data, e.g. for directories
            unless (defined $file) {
                warn "$0: Skipping $filename: $!\n";
                next;
            }
        }
        else {
            warn "$0: Could not open $filename: $!\n";
            next;
        }

        # Each flag is 1 if that kind of line ending occurs, 0 otherwise
        my $have_crlf = ($file =~ /\r\n/)      ? 1 : 0;
        my $have_cr   = ($file =~ /\r(?!\n)/)  ? 1 : 0;
        my $have_lf   = ($file =~ /(?<!\r)\n/) ? 1 : 0;
        my $sum = $have_crlf + $have_cr + $have_lf;
        if    ($sum == 0) { $terminator = "no"; }
        elsif ($sum > 1)  { $terminator = "mixed"; }    # more than one kind present
        elsif ($have_cr)  { $terminator = "CR"; }
        elsif ($have_lf)  { $terminator = "LF"; }

        $charset = 'ASCII' unless $file =~ /[^\000-\177]/;
        $charset = 'unknown'
            unless eval { decode('UTF-8', $file, Encode::FB_CROAK); 1 };
        print "$filename: charset $charset, $terminator line endings\n";
    }
Note that this has no concept of legacy 8-bit encodings; it will simply report `unknown` if the file is neither pure 7-bit ASCII nor proper UTF-8.
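As a small illustration of that limitation, the sketch below encodes the same word once as ISO-8859-1 and once as UTF-8 and runs both through the same strict decode used in the script. The helper name `is_valid_utf8` and the sample word are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper mirroring the strict UTF-8 check in the script above.
sub is_valid_utf8 {
    my ($bytes) = @_;    # a copy: decode() with a CHECK value may modify its argument
    my $ok = eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

my $text   = "na\x{EF}ve";                  # character string containing U+00EF
my $latin1 = encode('ISO-8859-1', $text);   # bytes 6E 61 EF 76 65
my $utf8   = encode('UTF-8', $text);        # bytes 6E 61 C3 AF 76 65

printf "ISO-8859-1 bytes: %s\n", is_valid_utf8($latin1) ? 'UTF-8' : 'unknown';
printf "UTF-8 bytes:      %s\n", is_valid_utf8($utf8)   ? 'UTF-8' : 'unknown';
```

The lone `EF` byte in the ISO-8859-1 version is not a complete UTF-8 sequence, so the decode croaks and the file would be labelled `unknown` rather than identified as Latin-1.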