XML to TeX or how to get a beautiful PDF from XHTML-like source


I've done something like this in the past (that is, maintaining master versions of documents in XML, and wanting to produce LaTeX output from them).

I've used PassiveTeX, but I found creating stylesheets to be hard work -- the usual result of writing two languages at once. I got it to work, and the result looked very good, but it was probably more effort than it was worth. That said, if the amount of styling you need to add is small, this might be a good route, because it's a single step.

The most successful route (read: the most flexible, with the most attractive results) was to use XSLT to transform the document into structural LaTeX, which matches the intended structure of the result document but doesn't attempt more than minimal formatting. Depending on your document, that might be normal-looking LaTeX, or it might have bespoke structures. Then write or adapt a LaTeX stylesheet or class file which formats that output into something attractive. That way, you're using XSLT to its strengths (and not going beyond them, which rapidly becomes very frustrating), using LaTeX to its strengths, and not confusing yourself.

That is, this more-or-less matches the approach of your first two alternatives, and whether you go with them, or write/customise a LaTeX stylesheet with bespoke output, is a function of how comfortable you feel with LaTeX stylesheets, and how much complicated or specialised formatting you need to do.
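To make the "structural LaTeX" idea concrete, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree rather than XSLT, purely so the example is self-contained; the element names (h1, h2, p, em) and the macro names (\doctitle, \docsection, \term) are assumptions for illustration, not anything from the original setup. The point is only that the converter emits semantic macros and leaves all visual formatting to a class file.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from XHTML-like elements to structural LaTeX macros.
# The macro names are invented for this sketch; a class file would define them.
BLOCK_COMMANDS = {"h1": "doctitle", "h2": "docsection"}
INLINE_COMMANDS = {"em": "term"}

# Characters that LaTeX treats specially and that must be escaped in text.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape(text):
    """Escape LaTeX special characters; None (no text) becomes ''."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text or "")


def to_latex(elem):
    """Recursively turn an element into structural LaTeX, with no styling."""
    # An element's content is its own text, then each child followed by
    # the text that trails that child (its "tail").
    inner = escape(elem.text) + "".join(
        to_latex(child) + escape(child.tail) for child in elem
    )
    if elem.tag in BLOCK_COMMANDS:
        return "\\%s{%s}\n\n" % (BLOCK_COMMANDS[elem.tag], inner)
    if elem.tag in INLINE_COMMANDS:
        return "\\%s{%s}" % (INLINE_COMMANDS[elem.tag], inner)
    if elem.tag == "p":
        return inner.strip() + "\n\n"
    # Unknown wrappers (e.g. the root element) just pass their content through.
    return inner


source = "<body><h1>Caf\u00e9 &amp; bar</h1><p>Uses <em>XSLT</em> 100%.</p></body>"
latex = to_latex(ET.fromstring(source))
print(latex)
```

The non-ASCII character passes through untouched, which is why a Unicode-aware engine matters on the LaTeX side. An XSLT version would have the same shape: one template per element, each emitting a macro call around the result of applying templates to its children.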

Since you say you need to handle Unicode characters in the input, then yes, XeLaTeX would be a good choice for the LaTeX part of the pipeline.


You might want to check questions tagged with XML on TeX.sx, especially this one. I suggest you use ConTeXt; the current version has no problems with Unicode and can handle OpenType perfectly - and it's programmable in Lua. The most often used alternative with LaTeX is XMLTeX, but that needs a lot of TeX-fu.

If your documents can be handled by pandoc, use that: You'll have multiple output options, more than from any TeX-based system.


In the end, I've decided to go with Pandoc, which seems to have a very polished and solid code base. One potential drawback is that you are limited to the markup features available in Pandoc's internal representation, which maps basically one-to-one to its extended Markdown.

Because I didn't think generating Markdown from my XHTML-like source was a good idea, I succeeded in initiating a Pandoc component that reads DocBook, which is currently in the master branch of Pandoc's development repo. So now I have a simple XSLT stylesheet that converts from my XHTML dialect to DocBook (which is also XML), and then I use Pandoc to export to a host of other formats, including PDF via ConTeXt.
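For readers wondering what the XHTML-to-DocBook step amounts to, the following is a rough sketch of that kind of mapping. It is written in Python with xml.etree.ElementTree instead of XSLT so that it runs on its own, and the input element names and the renaming table are assumptions, since the actual XHTML dialect and stylesheet aren't shown here. It also ignores DocBook namespaces and attributes.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from XHTML-like element names to DocBook element names.
# A real dialect would need a longer table and some structural rules.
RENAME = {
    "body": "article",
    "h1": "title",
    "p": "para",
    "em": "emphasis",
    "ul": "itemizedlist",
}


def to_docbook(elem):
    """Return a new element tree with XHTML-like names mapped to DocBook."""
    if elem.tag == "li":
        # DocBook list items hold block content, so wrap the text in <para>.
        out = ET.Element("listitem")
        target = ET.SubElement(out, "para")
    else:
        out = ET.Element(RENAME.get(elem.tag, elem.tag))
        target = out
    target.text = elem.text
    for child in elem:
        converted = to_docbook(child)
        converted.tail = child.tail  # keep text that follows the child
        target.append(converted)
    return out


source = (
    "<body><h1>Title</h1><p>Some <em>text</em>.</p>"
    "<ul><li>one</li></ul></body>"
)
docbook = ET.tostring(to_docbook(ET.fromstring(source)), encoding="unicode")
print(docbook)
```

With a Pandoc build that includes the DocBook reader, the result could then be converted with something like pandoc -f docbook -t context -o doc.tex doc.xml (the file names are placeholders).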