Find what has been changed and upload only changes Find what has been changed and upload only changes ajax ajax

Find what has been changed and upload only changes


There is already an accepted form for transmitting this kind of "differences" information. It's called Unified Diff.

The google-diff-match-patch provides implementations in Java, JavaScript, C++, C#, Lua and Python.

You should be able to just keep the "original text" and the "modified text" in variables on the client, then generate the diff in javascript (via diff-match-patch), send it to the server, along with a hash, and re-construct it (either using diff-match-patch or the unix "patch" program) on the server.

You might also want to consider including a "version" (or a modified date) when you send the original text to the client in the first place. Then include the same version (or date) in the "diff request" that the client sends up to the server. Verify the version on the server prior to applying the diff, so as to be sure that the server's copy of the text has not diverged from the client's copy while the modification was being made. (of course, in order for this to work, you'll need to update the version number on the server every time the master copy is updated).


You have a really interesting approach. But if the text files are really so large that it would need too much time to upload them every time, why do you have the send the whole thing to the client? Does the client really have to receive the whole 5mb text file? Wouldn't it be possible to send him only what he needs?

Anyway, to your question:The first thing that comes to my mind when hearing "large text files" and modification detection is diff. For the algorithm, read here. This could be an approach to commit the changes, and it specifies a format for it. You'd just have to rebuild diff (or a part of it) in javascript. This will be not easy, but possible, as I guess. If the algorithm doesn't help you, possibly at least the definition of the diff file format does.

To the corruption issue: You don't have to fear that your date gets corrupted on the way, because the TCP protocol, on which HTTP is based, looks that everything arrives without being corrupted. What you should fear is the connection reset. Might be you can do something like a handshake? When the client sends an update to the server, the server applies the modifications and keeps one old version of the file. To ensure that the client has received the ratification from the server that the modification went fine (that's where the conneciton reset happens), the client sends back another ajax request to the server. If this one doesn't come to the server within sone definied time, the file gets reset on the server side.

Another thing: I don't know if javascript likes it to handle such gigantic files/data...


This sounds like a problem that versioning systems (CVS, SVN, Git, Bazaar) already solve very well.

They're all reasonably easy to set up on a server, and you can communicate with them through PHP.

After the setup, you'd get for free: versioning, log, rollback, handling of concurrent changes, proper diff syntax, tagging, branches...

You wouldn't get the 'send just the updates' functionality that you asked for. I'm not sure how important that is to you. Pure texts are really very cheap to send as far as bandwidth is concerned.

Personally, I would probably make a compromise similar to what Wikis do. Break down the whole text into smaller semantically coherent chunks (chapters, or even paragraphs), determine on the client side just which chunks have been edited (without going down to the character level), and send those.

The server could then answer with a diff, generated by your versioning system, which is something they do very efficiently. If you want to allow concurrent changes, you might run into cases where editors have to do manual merges, anyway.

Another general hint might be to look at what Google did with Wave. I have to remain general here, because I haven't really studied it in detail myself, but I seem to remember that there have been a few articles about how they've solved the real-time concurrent editing problem, which seems to be exactly what you'd like to do.

In summary, I believe the problem you're planning to tackle is far from trivial, there are tools that address many of the associated problems already, and I personally would compromise and reformulate the approach in favor of much less workload.