Getting readable diff displays in Mercurial on Unicode files (MS Windows)


This may not be relevant to you; read the last paragraph if it doesn't sound like it is.

I'm not sure whether this is what you need, but I've wanted diffs of UTF-16LE content that show more than just "binary files are different". When I searched for this some months ago, I found a thread and a bug discussing it; here's part of it. I can't find the original source of this mini-extension now (though it does just what that patch does), but what I ended up with was an extension, BOM.py:

    #!/usr/bin/env python
    from mercurial import hg, util
    import codecs

    boms = [
        codecs.BOM_UTF8,
        codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
        codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE
        ]

    def binary(s):
        if s:
            for bom in boms:
                if s.startswith(bom):
                    return False
            return '\0' in s
        return False

    def reposetup(ui, repo):
        util.binary = binary

This gets loaded in the .hgrc (or your users\username\mercurial.ini) like this:

    [extensions]
    bom = ~/.hgexts/BOM.py

Note that the path will vary between Windows and Linux; on my Windows copy I put the path as \...\whatever (it's on a USB disk where the drive letter can change). Unfortunately, relative paths are resolved against the current working directory rather than the repository root, but if you are saving the extension on your C: drive, you can just put the full path.

In Linux (my main development environment), this works well; in Command Prompt (which I still use regularly), it generally works well. I've never tried it in PowerShell, but I would expect it to be better than Command Prompt in its support for arbitrary null bytes in the command line.
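If you want to see what the check in BOM.py does without loading it into Mercurial, here is a minimal standalone sketch of the same idea. It assumes Python 3 and byte strings, which the extension above does not; the name looks_binary is made up for the example and is not a Mercurial function.

```python
import codecs

# Byte-order marks that we treat as "this is text", as in BOM.py.
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE,
)


def looks_binary(data: bytes) -> bool:
    """Hypothetical helper: True if data has a NUL byte and no leading BOM."""
    if not data:
        return False
    if data.startswith(BOMS):  # bytes.startswith accepts a tuple of prefixes
        return False
    return b"\0" in data


utf16_with_bom = codecs.BOM_UTF16_LE + "hello\n".encode("utf-16-le")
utf16_without_bom = "hello\n".encode("utf-16-le")

print(looks_binary(utf16_with_bom))     # False: the BOM marks it as text
print(looks_binary(utf16_without_bom))  # True: NUL bytes and no BOM
print(looks_binary(b"plain ascii\n"))   # False: no NUL bytes at all
```

The second case is the one to watch: a UTF-16 file saved without a BOM would still be treated as binary by this kind of check.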

I'm not sure this is what you want at all. From the way you've said "binary diffs", I suspect you may already have this, or be running hg diff -a, which achieves the same thing. In that case, all I can think of is writing another extension that takes the UTF-16LE content and decodes it to UTF-8. I'm not sure how such an extension would be written, but I might try it out.

Edit: having now trawled through the Mercurial source (commands.py, cmdutil.py, patch.py and mdiff.py), I see that binary diffs are produced with a base85 encoding (patch.b85diff) rather than the normal diff. I wasn't aware of that; I thought it just forced a normal diff. In that case, perhaps this text is relevant after all. I await a response to see if it is!
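To give a feel for why a base85 binary diff is no help when you want to read the change, here is a small sketch using base64.b85encode from the Python 3 standard library as a stand-in. It does not call Mercurial's own encoder, and whatever framing Mercurial's patch code adds around the encoded data is not reproduced; the sample text is made up.

```python
import base64
import codecs

# A small UTF-16LE "file" with a BOM, like the ones Mercurial treats as binary.
text = "first line\nsecond line\n"
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Base85 turns arbitrary bytes into printable ASCII; it is lossless,
# but the result is not something a person can review.
encoded = base64.b85encode(utf16)
print(encoded)

# Decoding gives back exactly the original bytes.
decoded = base64.b85decode(encoded)
print(decoded == utf16)
```

So the encoded form is fine for transporting a patch, but it does not show which lines changed.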


I have worked around this by creating a new file with Notepad++ and saving it as a PowerShell file (.ps1 extension). Notepad++ creates the file as plain-text ANSI. Once the file exists, I can open it in the PowerShell editor and make any changes I need without the editor modifying the file encoding.

Disclaimer: I encountered this just moments ago, so I am not sure if there are any repercussions, but so far my scripts appear to work as normal and my diffs are showing up nicely.
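If you want to confirm that an editor has left the encoding alone, one option is to look at the first few bytes of the file. The following is a rough Python 3 sketch, not anything Notepad++ or PowerShell provides; file_bom and the demo file names are made up for the example, and cp1252 is only assumed here as a typical "ANSI" code page.

```python
import codecs
import os
import tempfile

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def file_bom(path):
    """Hypothetical helper: name of the BOM at the start of the file, or None."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


def demo():
    """Write one ANSI and one UTF-16 LE script, then report their BOMs."""
    script = "Write-Host 'hello'\r\n"
    with tempfile.TemporaryDirectory() as tmp:
        ansi = os.path.join(tmp, "ansi.ps1")
        utf16 = os.path.join(tmp, "utf16.ps1")
        with open(ansi, "wb") as f:
            f.write(script.encode("cp1252"))  # assumed ANSI code page
        with open(utf16, "wb") as f:
            f.write(codecs.BOM_UTF16_LE + script.encode("utf-16-le"))
        return file_bom(ansi), file_bom(utf16)


print(demo())  # (None, 'UTF-16 LE')
```

A result of None means no BOM was found, which is what you would expect from a file saved as ANSI.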


If my other answer does not do what you want, I think this one may; I haven't tested it on Windows at all yet, but it's working well in Linux. It does something potentially nasty: it wraps mercurial.mdiff.unidiff with a new function that converts UTF-16LE to UTF-8. This will not affect hg st, but will affect hg diff. One potential pitfall is that the BOM will also be changed from the UTF-16LE BOM to the UTF-8 BOM.

Anyway, I think it may be useful to you, so here it is.

Extension file utf16decodediff.py:

    import codecs
    from mercurial import mdiff

    unidiff = mdiff.unidiff

    def new_unidiff(a, ad, b, bd, fn1, fn2, r=None, opts=mdiff.defaultopts):
        """
        A simple wrapper around mercurial.mdiff.unidiff which first decodes
        UTF-16LE text.
        """
        if a.startswith(codecs.BOM_UTF16_LE):
            try:
                # Gets reencoded as utf-8 to be a str rather than a unicode; some
                # extensions may expect a str and may break if it's wrong.
                a = a.decode('utf-16le').encode('utf-8')
            except UnicodeDecodeError:
                pass
        if b.startswith(codecs.BOM_UTF16_LE):
            try:
                b = b.decode('utf-16le').encode('utf-8')
            except UnicodeDecodeError:
                pass
        return unidiff(a, ad, b, bd, fn1, fn2, r, opts)

    mdiff.unidiff = new_unidiff

In .hgrc:

    [extensions]
    utf16decodediff = ~/.hgexts/utf16decodediff.py

(Or equivalent paths.)
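The transcoding step in utf16decodediff.py can be tried on its own, outside Mercurial. This is only a sketch under a few assumptions: it targets Python 3, it uses difflib.unified_diff from the standard library as a stand-in for mercurial.mdiff.unidiff, and the names utf16le_to_utf8 and readable_diff are made up for the example.

```python
import codecs
import difflib


def utf16le_to_utf8(data: bytes) -> bytes:
    """Re-encode BOM-prefixed UTF-16LE bytes as UTF-8; leave anything else alone."""
    if data.startswith(codecs.BOM_UTF16_LE):
        try:
            return data.decode("utf-16-le").encode("utf-8")
        except UnicodeDecodeError:
            pass  # not valid UTF-16LE after all, so fall through unchanged
    return data


def readable_diff(old: bytes, new: bytes, name: str = "file.txt") -> str:
    """Hypothetical helper: unified diff of two byte strings after transcoding."""
    # "utf-8-sig" drops the leading BOM so it does not show up in the output.
    old_lines = utf16le_to_utf8(old).decode("utf-8-sig", "replace").splitlines(keepends=True)
    new_lines = utf16le_to_utf8(new).decode("utf-8-sig", "replace").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, "a/" + name, "b/" + name))


old = codecs.BOM_UTF16_LE + "one\ntwo\n".encode("utf-16-le")
new = codecs.BOM_UTF16_LE + "one\n2\n".encode("utf-16-le")

# The BOM pitfall: the UTF-16LE BOM comes out as the UTF-8 BOM.
print(utf16le_to_utf8(old).startswith(codecs.BOM_UTF8))  # True

print(readable_diff(old, new))
```

Input that starts with the UTF-16LE BOM but does not decode cleanly (an odd number of bytes, say) is passed through untouched, which mirrors the try/except in the extension.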