Replace non-ASCII characters with a single space

python unicode encoding ascii

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

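Note the + there: it matches a run of one or more non-ASCII characters, so the whole run is replaced by a single space.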
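To make the difference between the two approaches concrete, here is a minimal sketch for Python 3 strings. The function names `replace_each` and `replace_runs` are made up for illustration and are not part of the original question.

```python
import re

# Matches one or more consecutive characters outside the ASCII range.
_NON_ASCII_RUN = re.compile(r'[^\x00-\x7F]+')

def replace_each(text):
    """Replace every non-ASCII character with its own space."""
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)

def replace_runs(text):
    """Replace each run of consecutive non-ASCII characters with one space."""
    return _NON_ASCII_RUN.sub(' ', text)

sample = 'ABC\u9a6c\u514bdef'      # 'ABC马克def'
print(repr(replace_each(sample)))  # 'ABC  def'  (two spaces)
print(repr(replace_runs(sample)))  # 'ABC def'   (one space)
```

Only the regular expression with + gives a single space for a run of several non-ASCII characters.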


To get the closest ASCII representation of your original string, I recommend the unidecode module:

# python 2.x:
from unidecode import unidecode

def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

Then you can call it on a string:

remove_non_ascii("Ceñía")
Cenia
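If installing a third-party module is not an option, a rough approximation is possible with the standard library alone. This is a different technique from unidecode, not a reimplementation of it: it only strips accents that Unicode can decompose, and it silently drops characters such as CJK ideographs that unidecode would transliterate. The function name `strip_accents_to_ascii` is hypothetical, and the sketch assumes Python 3.

```python
import unicodedata

def strip_accents_to_ascii(text):
    # Decompose accented letters into base letter + combining marks (NFKD),
    # then drop every code point that does not fit in ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents_to_ascii('Ce\u00f1\u00eda'))  # Cenia
```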


For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
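One possible way around the decomposed-character problem is to normalize to NFC before substituting, so that a base letter and its combining mark are merged into one code point where a precomposed form exists. The sketch below assumes Python 3, and the helper name `replace_non_ascii_nfc` is made up for illustration. Sequences that have no precomposed form would still leave the base letter behind.

```python
import re
import unicodedata

def replace_non_ascii_nfc(text):
    # Compose base letter + combining mark into one code point where possible,
    # then replace each remaining non-ASCII code point with a space.
    composed = unicodedata.normalize('NFC', text)
    return re.sub(r'[^\x00-\x7f]', ' ', composed)

decomposed = unicodedata.normalize('NFD', 'ma\u00f1ana')  # 7 code points
print(repr(replace_non_ascii_nfc(decomposed)))            # 'ma ana'
```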

