
Central way to filter invalid unicode chars in lxml?


Just filter the string before you pass it to lxml, for example with the gist "cleaning invalid characters from XML" by lawlesst.

I tried it with your code and it seems to work, except that you need to edit the gist so that it imports re and sys.

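    from lxml import etree
    # cleaner.py is the gist saved locally, with the re and sys imports added
    from cleaner import invalid_xml_remove

    root = etree.Element("root")
    root.text = u'\uffff'
    root.text += u'\ud800'
    print(etree.tostring(root))
    # Python 2: decode the byte string to unicode, then strip invalid characters
    root.text += invalid_xml_remove('\x02'.decode("utf-8"))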
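
If you would rather not depend on the gist, the same idea can be sketched in a few lines of Python 3. This is not the gist's code: the name strip_invalid_xml_chars is made up for illustration, and the allowed ranges are taken from the Char production of the XML 1.0 specification. Treat it as a starting point and check it against your own data.

```python
import re

# Matches every character that falls OUTSIDE the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with all characters that XML 1.0 forbids removed.

    Hypothetical helper name, not part of lxml or of the gist.
    """
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Control character, U+FFFF and a lone surrogate are all dropped.
    print(repr(strip_invalid_xml_chars("a\x02b\uffffc\ud800d")))
```

You would then call it at every point where untrusted text reaches the tree, for example root.text = strip_invalid_xml_chars(value). Note that this is still a filter applied by the caller, not a hook inside lxml.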