Conditional word frequency count in Pandas
You could use the following vectorised approach:
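    import re
    import pandas as pd

    data = {
        'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': [
            'Thank you very much and good afternoon.',
            'Let me clarify that because I want to make sure we have got everything right',
            'By now you should have some good rest',
        ],
    }
    df = pd.DataFrame(data)
    wordlist = ['much', 'good', 'right']

    # Wrap the alternation in a single group so that \b guards both ends of
    # every word; joining with r'\b|\b' leaves the first and last word unanchored.
    pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'
    df['total'] = df['speech'].str.count(pattern)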
Which gives:
    >>> df
      speaker                                             speech  total
    0    Adam            Thank you very much and good afternoon.      2
    1     Ben  Let me clarify that because I want to make sur...      1
    2   Clair              By now you should have some good rest      1
The solution below is much faster at runtime if you have a very large word list and a large data frame to search through.
I guess that is because it takes advantage of a dictionary, which takes O(N) to construct and O(1) per lookup. A regex search is slower by comparison.
    import pandas as pd
    from collections import Counter

    def occurrence_counter(target_string, search_list):
        # Counter is already a dict: word -> number of occurrences.
        # split() breaks on whitespace only, so punctuation stays attached
        # to a word (e.g. 'afternoon.').
        data = Counter(target_string.split())
        count = 0
        for key in search_list:
            if key in data:
                count += data[key]
        return count

    data = {
        'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': [
            'Thank you very much and good afternoon.',
            'Let me clarify that because I want to make sure we have got everything right',
            'By now you should have some good rest',
        ],
    }
    df = pd.DataFrame(data)
    wordlist = ['much', 'good', 'right']

    df['total'] = df['speech'].apply(lambda x: occurrence_counter(x, wordlist))
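One caveat with whitespace splitting: a word followed directly by punctuation (say `good.`) will not match `good` in the dictionary. Here is a minimal sketch of a punctuation-tolerant variant, assuming that case-insensitive matching is also acceptable. The name `count_listed_words` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def count_listed_words(text, words):
    """Count how often the listed words occur in text, ignoring case and punctuation."""
    # \w+ pulls out runs of word characters, so 'good.' and 'Good,' both yield 'good'
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so no membership test is needed
    return sum(counts[w.lower()] for w in words)

sample = "Very good. Good, right?"  # hypothetical input
total = count_listed_words(sample, ["much", "good", "right"])
print(total)
```

The dictionary lookup stays O(1) per word. Only the tokenising step changes.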
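Alternatively, split each speech into words, explode them into one row per word, and count the matches per speaker:

    import pandas as pd

    data = {
        'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': [
            'Thank you very much and good afternoon.',
            'Let me clarify that because I want to make sure we have got everything right',
            'By now you should have some good rest',
        ],
    }
    df = pd.DataFrame(data)
    wordlist = ['much', 'good', 'right']

    # One row per word, keeping the speaker alongside each word
    df["speech"] = df["speech"].str.split()
    df = df.explode("speech")

    # Keep only the words of interest and count them per speaker
    counts = df[df.speech.isin(wordlist)].groupby("speaker").size()
    print(counts)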
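Note that with the explode/groupby approach, a speaker who uses none of the listed words drops out of the result entirely. If you need a zero for them, one option is to reindex against the original speakers. This is a sketch that assumes speaker names are unique. The speaker 'Dana' and her speech are invented to show the zero case.

```python
import pandas as pd

data = {
    'speaker': ['Adam', 'Ben', 'Dana'],  # 'Dana' is a hypothetical extra row
    'speech': [
        'Thank you very much and good afternoon.',
        'Let me clarify that because I want to make sure we have got everything right',
        'No comment at all',
    ],
}
df = pd.DataFrame(data)
wordlist = ['much', 'good', 'right']

# Explode into a separate column so the original frame is left untouched
exploded = df.assign(word=df['speech'].str.split()).explode('word')

counts = (
    exploded[exploded['word'].isin(wordlist)]
    .groupby('speaker')
    .size()
    # Restore speakers with no matches, filling their count with 0
    .reindex(df['speaker'], fill_value=0)
)
print(counts)
```

The reindex step also returns the counts in the original row order of the speakers.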