Replacing punctuation in a data frame based on punctuation list [duplicate]
Using str.replace with the correct regex would be easier:
In [41]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df

Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]
Use a regex with the pattern [^\w\s], which matches any character that is neither a word character (letters, digits, underscore) nor whitespace:
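In [49]:
# regex=True is needed on recent pandas, where str.replace treats the pattern as a literal by default
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df

Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]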
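To see what the character class does without pandas in the way, here is a minimal sketch that applies the same pattern with the standard re module. The helper name strip_non_word is made up for this illustration and is not part of the answer above.

```python
import re

# Same character class as above: anything that is not a word character or whitespace.
NON_WORD = re.compile(r"[^\w\s]")

def strip_non_word(value: str) -> str:
    """Delete every character matched by NON_WORD (hypothetical helper)."""
    return NON_WORD.sub("", value)

samples = ["test", "%hgh&12", "abc123!!!", "porkyfries"]
cleaned = [strip_non_word(s) for s in samples]
print(cleaned)

# \w includes the underscore, so it is left in place.
print(strip_non_word("snake_case!"))
```

If underscores should go as well, a class such as [^\w\s]|_ is one option.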
To remove punctuation from a text column in your DataFrame:
In:
import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)
pattern
Out:
'[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'
In:
import pandas as pd
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df
Out:
          text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick % 
In:
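# regex=True is needed on recent pandas, where the pattern is treated as a literal by default
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df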
To substitute a character for each match instead of deleting it, pass that character as the replacement, e.g. replace(pattern, '$', regex=True).
Out:
        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick  
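One caveat with a pattern built directly from string.punctuation: inside the brackets, the sequence \] is read as an escaped ], so the backslash itself never becomes part of the character class. The sketch below compares that pattern with a variant passed through re.escape. This is a suggested hardening, not something the answer above does, and the sample string is invented for the comparison.

```python
import re
import string

# Pattern as built above: the "\]" inside the class is read as an escaped "]".
naive = re.compile("[{}]".format(string.punctuation))

# Escaped variant: every punctuation character becomes a literal class member.
escaped = re.compile("[{}]".format(re.escape(string.punctuation)))

# Invented sample containing a backslash, a comma, quotes and brackets.
sample = r'back\slash, "quoted" [x]'
naive_result = naive.sub("", sample)
escaped_result = escaped.sub("", sample)
print(naive_result)    # the backslash survives
print(escaped_result)  # the backslash is removed too
```

The escaped pattern can be passed to str.replace(..., regex=True) in the same way as the original one.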
str.translate is often considered the cleanest and fastest way to remove punctuation (source):
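import string
# Python 3 form: translate takes one table built with str.maketrans; double quotes are left out of the removal list, so they are kept
to_remove = string.punctuation.replace('"', '')
text = text.translate(str.maketrans('', '', to_remove))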
You may find that it works better to remove punctuation in 'a' before loading it into pandas.
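As a rough sketch of that idea, the strings can be cleaned with str.translate while they are still a plain Python list. The name raw_texts is a hypothetical stand-in for the data from the question, and this version removes every ASCII punctuation character, double quotes included.

```python
import string

# Translation table that deletes every ASCII punctuation character.
DROP_PUNCT = str.maketrans("", "", string.punctuation)

# Hypothetical raw input, standing in for the data that would be loaded into pandas.
raw_texts = ["book...regh", "boo,", '"rope"', "rick % "]

cleaned_texts = [t.translate(DROP_PUNCT) for t in raw_texts]
print(cleaned_texts)
```

The cleaned list can then be handed to pandas, for example pd.DataFrame({'text': cleaned_texts}).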