Can I use a plaintext diff algorithm for tracking XML changes?

I'm the author of the plain-text diff/match/patch library from Google.

The key question is whether your patches are exact. In an ideal world:

  diff(old_text, new_text) -> edits  patch(edits, old_text) -> new_text

Notice that the base text (old_text) is the same in both operations. In this ideal case, then a simple plain-text diff and patch will work perfectly, regardless of the type of the content. If this case applies to you, then you're done.

The issue lies with fuzzy patching. Here's the corresponding example:

  diff(old_text, new_text) -> edits  patch(edits, old_forked_text) -> new_forked_text

Notice that the base text is not the same in both operations. They should be similar, but the patch operation now has to use "judgement" about what it should do. Some patches may fit perfectly as specified in the edit, others may need to be tweaked for position, others may need to be tweaked for altered context, others may not fit at all and should be dropped. If your patching algorithm is not aware of the structure of XML when making its decisions, you may very well end up with malfromed XML. Here's a sample:

  old_text = Jabberwock<SPAN>Hello<SPAN>World</SPAN></SPAN>  new_text = Jabberwock<DIV>Hello<SPAN>World</SPAN></DIV>  diff(old_text, new_text) -> edits  edits = ["SPAN" -> "DIV" @ character 11,           "SPAN" -> "DIV" @ character 41]  old_forked_text = <SPAN>Hello<SPAN>World</SPAN></SPAN>  patch(edits, old_forked_text) -> new_forked_text  new_forked_text = <SPAN>Hello<DIV>World</SPAN></DIV>

Let's look at this one carefully. The original diff returned two edits, change the outermost SPAN to a DIV. Simple change. Unfortunately the text this edit is being applied to has changed from the original. The word "Jabberwock" has been removed. Now the first SPAN->DIV change matches up with the second SPAN tag, not the first one. Since the patch algorithm is not aware of the rules of XML, it results in illegally nested tags.

There are some hacks which allow you to guarantee valid XML when using a plain-text patch, but they result in some loss of flexibility (the original question already has a link to the wiki page I wrote about this). The ultimate solution for patching XML is of course to use an XML-aware diff and patch algorithm. These are significantly more complicated and expensive, but they exist. Google the names Tancred Lindholm and Sebastian Rönnau for the great work that they have done in the XML field (particularly with regards to DocEng).

Let me know if there is anything else I can add.

-- Neil Fraser

xml algorithm diff lcs

I use Beyond Compare all the time to compare XML documents. It understands XML, to a certain degree.

You may need to pre-process the two documents in order for textual comparison to do the best job possible. For instance, in some XML documents, the order of some elements may not matter. It will certainly matter to your diff tool! You may need to pre-process the XML using an XML Transform that sorts these elements into a common order in both files, before comparing the two sorted files.

You're also going to want to use the same indentation for both documents. I find it useful to start each element on a new line, and to use the same amount of indentation, with spaces, for each level. If your document gets very deep, you would want to use only one or two spaces per level, so that the compare fits on the screen. You may even want to use one attribute per line (and to sort the attributes into a common order).

xml algorithm diff lcs

If you are the sole "owner" of the data between your undo/redo points then of course you can use plaintext diff for them. As you point out, it amounts to a set of transformations.

Depending on the operations you provide, however, plaintext diff may not be remotely near optimal for recording undo/redo and you may need to specialise certain cases. Imagine just recording a ReplaceAll command which might be only a few bytes overhead plus the search and replace string. That could generate massive plaintext diffs.

In the broader context, if you allow external editing of these documents, and you're thinking more about how to store deltas on the server, you're mimicking git or other version control systems. You have to use some kind of diff algorithm because just recording your commands is obviously not the only source of transformation. At this point you're starting to mix undo/redo with version control and you may want to think hard about confusing those concepts for your users.

I would keep undo/redo as within an editing session and ban external editing whilst the file is open. That allows you to optimise your command recording for broad cases as I said above.

Beyond that, either use conventional version control (consider wrapping git) or implement your own way of coping with files being changed outside your editor.

CodeHunter

Can I use a plaintext diff algorithm for tracking XML changes?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last