Central way to filter invalid unicode chars in lxml?

python xml unicode lxml invalid-characters

Just filter the string before you parse it in LXML: cleaning invalid characters from XML (gist by lawlesst).

I tried it with your code; it seems to work, save the fact that you need to change the gist to import re and sys!

from lxml import etreefrom cleaner import invalid_xml_removeroot = etree.Element("root")root.text = u'\uffff'root.text += u'\ud800' print(etree.tostring(root))root.text += invalid_xml_remove('\x02'.decode("utf-8"))

CodeHunter

Central way to filter invalid unicode chars in lxml?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last