Central way to filter invalid unicode chars in lxml?
Just filter the string before you pass it to lxml; see "cleaning invalid characters from XML" (gist by lawlesst).
I tried it with your code and it seems to work, except that you need to add imports for re and sys to the gist!
from lxml import etree
from cleaner import invalid_xml_remove

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800'

print(etree.tostring(root))
root.text += invalid_xml_remove('\x02'.decode("utf-8"))
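If you would rather not depend on the gist, a filter along the same lines can be sketched in a few lines of Python 3. This is not the gist's code; the name strip_invalid_xml_chars is made up for illustration, and the character ranges are taken from the Char production of the XML 1.0 specification. Treat it as a minimal sketch under those assumptions.

```python
import re

# Matches every character NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with all characters that XML 1.0 forbids removed.

    Hypothetical helper for illustration; not part of lxml or the gist.
    """
    return _INVALID_XML_CHARS.sub("", text)


# Control characters, U+FFFF and lone surrogates are dropped;
# tabs, newlines and ordinary text pass through unchanged.
cleaned = strip_invalid_xml_chars("a\x02b\uffffc\ud800d\te")
print(repr(cleaned))
```

The cleaned string can then be assigned to root.text (or any other lxml text or attribute value) in place of the raw input.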