Conditional word frequency count in Pandas


You could use the following vectorised approach:

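    import pandas as pd

    data = {'speaker': ['Adam', 'Ben', 'Clair'],
            'speech': ['Thank you very much and good afternoon.',
                       'Let me clarify that because I want to make sure we have got everything right',
                       'By now you should have some good rest']}
    df = pd.DataFrame(data)
    wordlist = ['much', 'good', 'right']

    # Wrap the alternation in a group so that \b applies to every word,
    # including the first and the last one in the list.
    pattern = r'\b(?:' + '|'.join(wordlist) + r')\b'
    df['total'] = df['speech'].str.count(pattern)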

Which gives:

    >>> df
      speaker                                             speech  total
    0    Adam            Thank you very much and good afternoon.      2
    1     Ben  Let me clarify that because I want to make sur...      1
    2   Clair              By now you should have some good rest      1
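When the word list comes from user input, it may be worth escaping each word and deciding explicitly whether case matters. The following is a minimal sketch under those assumptions; the sample sentences are made up for illustration, and `re.escape` plus `flags=re.IGNORECASE` are additions that are not part of the answer above.

```python
import re

import pandas as pd

# Hypothetical sample data, chosen to show case and partial-word behaviour.
df = pd.DataFrame({'speech': ['Good morning, good people.',
                              'A goodbye is not much.',
                              'Right, that is right.']})
wordlist = ['much', 'good', 'right']

# Escape each word so regex metacharacters are treated literally,
# then require a word boundary on both sides of the whole alternation.
pattern = r'\b(?:' + '|'.join(re.escape(word) for word in wordlist) + r')\b'

# flags=re.IGNORECASE makes 'Good' and 'Right' count as matches too.
df['total'] = df['speech'].str.count(pattern, flags=re.IGNORECASE)
print(df)
```

Because of the word boundaries, "goodbye" in the second sentence is not counted as "good".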


This solution is much faster at runtime if you have a very large word list and a large data frame to search through.

My guess is that this is because it builds a dictionary of word counts for each string, which takes O(N) to construct and O(1) per lookup. A regex search is slower by comparison.

    import pandas as pd
    from collections import Counter

    def occurrence_counter(target_string, search_list):
        # Count every whitespace-separated token once, then look up each search word.
        # Note: split() keeps punctuation attached, so "right." would not match "right".
        data = Counter(target_string.split())
        count = 0
        for key in search_list:
            if key in data:
                count += data[key]
        return count

    data = {'speaker': ['Adam', 'Ben', 'Clair'],
            'speech': ['Thank you very much and good afternoon.',
                       'Let me clarify that because I want to make sure we have got everything right',
                       'By now you should have some good rest']}
    df = pd.DataFrame(data)
    wordlist = ['much', 'good', 'right']
    df['speech'].apply(lambda x: occurrence_counter(x, wordlist))
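The speed difference depends heavily on the size of the word list and the length of the strings, so it is probably best measured on your own data. Below is a rough timing sketch, not a benchmark: the column is simply the three sample sentences repeated 1000 times (an assumption made for illustration), and the helper names `count_with_regex` and `count_with_counter` are hypothetical.

```python
import timeit
from collections import Counter

import pandas as pd

speeches = ['Thank you very much and good afternoon.',
            'Let me clarify that because I want to make sure we have got everything right',
            'By now you should have some good rest']
# Repeat the sample sentences to get a larger, synthetic column.
df = pd.DataFrame({'speech': speeches * 1000})
wordlist = ['much', 'good', 'right']
pattern = r'\b(?:' + '|'.join(wordlist) + r')\b'


def count_with_regex():
    # Vectorised regex count over the whole column.
    return df['speech'].str.count(pattern)


def count_with_counter():
    # Per-row token counts, then one dictionary lookup per search word.
    def occurrence_counter(text):
        counts = Counter(text.split())
        return sum(counts[word] for word in wordlist)
    return df['speech'].apply(occurrence_counter)


# Sanity check: both approaches agree on this sample.
assert count_with_regex().tolist() == count_with_counter().tolist()

print(f'regex:   {timeit.timeit(count_with_regex, number=10):.3f}s')
print(f'Counter: {timeit.timeit(count_with_counter, number=10):.3f}s')
```

With only three search words the regex may well come out ahead; the dictionary approach should gain as the word list grows.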


    import pandas as pd

    data = {'speaker': ['Adam', 'Ben', 'Clair'],
            'speech': ['Thank you very much and good afternoon.',
                       'Let me clarify that because I want to make sure we have got everything right',
                       'By now you should have some good rest']}
    df = pd.DataFrame(data)
    wordlist = ['much', 'good', 'right']

    # Split each speech into words, then give every word its own row.
    df["speech"] = df["speech"].str.split()
    df = df.explode("speech")

    # Keep only the rows whose word is in the list and count them per speaker.
    # Speakers with no matching words do not appear in the result.
    counts = df[df.speech.isin(wordlist)].groupby("speaker").size()
    print(counts)
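If you would rather keep the original frame intact and also get a 0 for speakers with no matches, one possible variation is to explode a separate Series and sum per original row label. This is a sketch, not part of the answer above; the fourth speaker, 'Dan', is a hypothetical row added only to show the zero case.

```python
import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair', 'Dan'],  # 'Dan' is a hypothetical extra row
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest',
                   'No comment']}
df = pd.DataFrame(data)
wordlist = ['much', 'good', 'right']

# explode() repeats the original row label for every token,
# so the tokens can be grouped back to the row they came from.
tokens = df['speech'].str.split().explode()

# One True/False per token, summed per original row; rows without matches sum to 0.
df['total'] = tokens.isin(wordlist).groupby(level=0).sum()
print(df[['speaker', 'total']])
```

The assignment aligns on the index, so `df` keeps one row per speaker and the `speech` column is left unchanged.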