Replace non-ASCII characters with a single space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.
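As a minimal sketch of that per-character approach, assuming Python 3 (the function name `replace_non_ascii` and the sample string are only illustrative), note how two adjacent non-ASCII characters yield two spaces:

```python
def replace_non_ascii(text):
    """Return text with every non-ASCII character replaced by one space."""
    # ord(ch) < 128 is true exactly for the ASCII range U+0000..U+007F.
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)


print(repr(replace_non_ascii('ABC马克def')))  # 'ABC  def' (one space per character)
```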

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

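Note the + there: it makes the pattern match a whole run of non-ASCII characters, so each run is replaced by a single space.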
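To make the difference concrete, here is a small comparison, assuming Python 3; the sample string and variable names are only illustrative:

```python
import re

text = 'ABC马克def'

# Without +, each non-ASCII character is replaced separately.
per_char = re.sub(r'[^\x00-\x7F]', ' ', text)
# With +, a whole run of non-ASCII characters becomes a single space.
per_run = re.sub(r'[^\x00-\x7F]+', ' ', text)

print(repr(per_char))  # 'ABC  def'
print(repr(per_run))   # 'ABC def'
```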

To get the closest ASCII representation of your original string, I recommend the unidecode module:

# python 2.x:
from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

Then you can use it in a string:
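As a sketch only, assuming Python 3 (where `str` is already Unicode, so the `unicode()` call is not needed) and that the third-party Unidecode package is installed, a call might look like this. The guarded import and the sample string are illustrative assumptions:

```python
try:
    from unidecode import unidecode  # third-party package: pip install Unidecode
except ImportError:
    unidecode = None  # package not installed; the example call below is skipped


def remove_non_ascii(text):
    # Transliterate each character to its closest ASCII equivalent.
    return unidecode(text)


if unidecode is not None:
    print(remove_non_ascii("Ceñía"))  # Cenia
```

Unlike the space-replacement approaches, this transliterates characters rather than blanking them out.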


For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
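One possible way to reduce this problem, sketched here under the assumption that composing to NFC before substitution is acceptable for your data, is to normalize first so that a base letter plus combining mark becomes a single code point wherever a precomposed form exists. The function name `replace_non_ascii_nfc` is hypothetical:

```python
import re
import unicodedata


def replace_non_ascii_nfc(text):
    # Compose base letter + combining mark pairs into single code points where
    # a precomposed form exists, then replace each remaining non-ASCII character.
    composed = unicodedata.normalize('NFC', text)
    return re.sub(r'[^\x00-\x7f]', ' ', composed)


decomposed = unicodedata.normalize('NFD', 'ma\u00f1ana')  # 'n' followed by U+0303
print(len(decomposed))                          # 7
print(repr(replace_non_ascii_nfc(decomposed)))  # 'ma ana'
```

Sequences that have no precomposed form will still leave the base letter behind, so this is a mitigation rather than a complete fix.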