Regroup column values in a pandas df
current attempt
In the following I have added a few lines just before the last line of your code:
```python
import pandas as pd
import numpy as np

d = {'Time': ['8:03:00', '8:17:00', '8:20:00', '10:15:00', '10:15:00', '11:48:00', '12:00:00', '12:10:00'],
     'Place': ['House 1', 'House 2', 'House 1', 'House 3', 'House 4', 'House 5', 'House 1', 'House 1'],
     'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X']}
df = pd.DataFrame(data=d)

def g(gps):
    s = gps['Place'].unique()
    d = dict(zip(s, np.arange(len(s)) // 3 + 1))
    gps['Person'] = gps['Place'].map(d)
    return gps

df = df.groupby('Area', sort=False).apply(g)
s = df['Person'].astype(str) + df['Area']

# added lines
t = s.value_counts()
df_sub = df.loc[s[s.isin(t[t < 3].index)].index].copy()
df_sub["tag"] = df_sub["Place"] + df_sub["Area"]
tags = list(df_sub.tag.unique())
f = lambda x: f'R{int(tags.index(x) / 3) + 1}'
df_sub['reassign'] = df_sub.tag.apply(f)
s[s.isin(t[t < 3].index)] = df_sub['reassign']
df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')
```
To be honest, I am not sure it works in all cases, but it gives your intended output in the test case.
Previous attempts
Let's see if I am able to help with a limited understanding of what you are trying to do.
You have sequential data (I'll call them events) and you want to assign to each event a "person" identifier. The identifier assigned to each successive event depends on previous assignments, and it seems to me it needs to be governed by the following rules, applied sequentially:
- **I know you**: I can reuse a previous identifier if the same values of "Place" and "Area" have already appeared for that identifier (does time have something to do with it?).
- **I do NOT know you**: I will create a new identifier if a new value of "Area" appears (so "Place" and "Area" play different roles?).
- **do I know you?**: I might reuse a previously used identifier if that identifier has been assigned to fewer than three events (what if this happens for multiple identifiers? I will assume the oldest one is used...).
- **nah, I don't**: if none of the preceding rules apply, I will create a new identifier.
Under those assumptions, the following is an implementation of a solution:
```python
# dict of lists of past events assigned to each person; key is person identifier
people = dict()
# new column for df (as list); it will be appended to the dataframe at the end
persons = list()

# first we define the rules
def i_know_you(people, now):
    def conditions(now, past):
        return [e for e in past if (now.Place == e.Place) and (now.Area == e.Area)]
    i_do = [person for person, past in people.items() if conditions(now, past)]
    if i_do:
        return i_do[0]
    return False

def i_do_not_know_you(people, now):
    conditions = not bool([e for past in people.values() for e in past
                           if e.Area == now.Area])
    if conditions:
        return f'Person {len(people) + 1}'
    return False

def do_i_know_you(people, now):
    i_do = [person for person, past in people.items() if len(past) < 3]
    if i_do:
        return i_do[0]
    return False

# then we process the sequential data
for event in df.itertuples():
    print('event:', event)
    for rule in [i_know_you, i_do_not_know_you, do_i_know_you]:
        person = rule(people, event)
        print('\t', rule.__name__, person)
        if person:
            break
    if not person:
        person = f'Person {len(people) + 1}'
        print('\t', "nah, I don't", person)
    if person in people:
        people[person].append(event)
    else:
        people[person] = [event]
    persons.append(person)

df['Person'] = persons
```
Output:
```
event: Pandas(Index=0, Time='8:00:00', Place='House 1', Area='X', Person='Person 1')
	 i_know_you False
	 i_do_not_know_you Person 1
event: Pandas(Index=1, Time='8:30:00', Place='House 2', Area='X', Person='Person 1')
	 i_know_you False
	 i_do_not_know_you False
	 do_i_know_you Person 1
event: Pandas(Index=2, Time='9:00:00', Place='House 1', Area='Y', Person='Person 2')
	 i_know_you False
	 i_do_not_know_you Person 2
event: Pandas(Index=3, Time='9:30:00', Place='House 3', Area='X', Person='Person 1')
	 i_know_you False
	 i_do_not_know_you False
	 do_i_know_you Person 1
event: Pandas(Index=4, Time='10:00:00', Place='House 4', Area='X', Person='Person 2')
	 i_know_you False
	 i_do_not_know_you False
	 do_i_know_you Person 2
event: Pandas(Index=5, Time='10:30:00', Place='House 5', Area='X', Person='Person 2')
	 i_know_you False
	 i_do_not_know_you False
	 do_i_know_you Person 2
event: Pandas(Index=6, Time='11:00:00', Place='House 1', Area='X', Person='Person 1')
	 i_know_you Person 1
event: Pandas(Index=7, Time='11:30:00', Place='House 6', Area='X', Person='Person 3')
	 i_know_you False
	 i_do_not_know_you False
	 do_i_know_you False
	 nah, I don't Person 3
event: Pandas(Index=8, Time='12:00:00', Place='House 7', Area='X', Person='Person 3')
	 i_know_you False
	 i_do_not_know_you False
	 do_i_know_you Person 3
event: Pandas(Index=9, Time='12:30:00', Place='House 8', Area='X', Person='Person 3')
	 i_know_you False
	 i_do_not_know_you False
	 do_i_know_you Person 3
```
and the final dataframe is, as you want:
```
       Time    Place Area    Person
0   8:00:00  House 1    X  Person 1
1   8:30:00  House 2    X  Person 1
2   9:00:00  House 1    Y  Person 2
3   9:30:00  House 3    X  Person 1
4  10:00:00  House 4    X  Person 2
5  10:30:00  House 5    X  Person 2
6  11:00:00  House 1    X  Person 1
7  11:30:00  House 6    X  Person 3
8  12:00:00  House 7    X  Person 3
9  12:30:00  House 8    X  Person 3
```
Remark: Note that I intentionally avoided groupby operations and processed the data sequentially. I think this kind of complexity (and not really understanding what you want to do...) calls for that approach. You can also adapt the rules to be more complicated (does time really play a role or not?) within the same structure above.
Updated answer for new data
Looking at the new data, it is evident I did not understand what you are trying to do (in particular, the assignment does not seem to follow sequential rules). I have a solution that works on your second dataset, but it gives a different result for the first dataset.
The solution is much simpler and will add a column (that you can drop later if you want):
```python
df["tag"] = df["Place"] + df["Area"]
tags = list(df.tag.unique())
f = lambda x: f'Person {int(tags.index(x) / 3) + 1}'
df['Person'] = df.tag.apply(f)
```
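As a side note, `tags.index(x)` performs a linear search for every row; the same grouping of three unique tags per person can be done without the lambda using `pd.factorize`. This is only a sketch (the sample frame below mimics the second dataset), not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['8:00:00', '8:30:00', '9:00:00', '9:30:00', '10:00:00', '10:30:00'],
    'Place': ['House 1', 'House 2', 'House 3', 'House 1', 'House 1', 'House 1'],
    'Area': ['X', 'X', 'X', 'Y', 'Z', 'V'],
})

# factorize assigns each unique tag an integer code in order of first
# appearance; integer-dividing the code by 3 buckets three unique tags
# into one person
codes = pd.factorize(df['Place'] + df['Area'])[0]
df['Person'] = 'Person ' + pd.Series(codes // 3 + 1, index=df.index).astype(str)
print(df)
```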
On the second dataset, it would give:
```
       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 3    X  House 3X  Person 1
3   9:30:00  House 1    Y  House 1Y  Person 2
4  10:00:00  House 1    Z  House 1Z  Person 2
5  10:30:00  House 1    V  House 1V  Person 2
```
On the first dataset it gives:
```
       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 1    Y  House 1Y  Person 1
3   9:30:00  House 3    X  House 3X  Person 2
4  10:00:00  House 4    X  House 4X  Person 2
5  10:30:00  House 5    X  House 5X  Person 2
6  11:00:00  House 1    X  House 1X  Person 1
7  11:30:00  House 6    X  House 6X  Person 3
8  12:00:00  House 7    X  House 7X  Person 3
9  12:30:00  House 8    X  House 8X  Person 3
```
This differs from your intended output at indices 2 and 3. Is this output compatible with your requirement? If not, why?
As far as I understand, you're happy with everything before the Person allocation. So here's a plug-and-play solution that "merges" Persons with fewer than 3 unique values, so that each Person ends up with 3 unique values (except the last one, obviously). It is based on the second-to-last df you posted (under "Output:"); it does not touch the Persons that already have 3 unique values and just merges the others.
EDIT: Greatly simplified code. Again, taking your df as input:
```python
n = 3
df['complete'] = df.Person.apply(lambda x: 1 if df.Person.tolist().count(x) == n else 0)
df['num'] = df.Person.str.replace('Person ', '')
# move all persons that are already complete to the top
df.sort_values(by=['num', 'complete'], ascending=True, inplace=True)

c = 0
person_numbers = []
# create the numbering [1,1,1,2,2,2,3,3,3,...], with n defining
# how often each person number is repeated
for x in range(0, 999):
    if x % n == 0:
        c += 1
    person_numbers.append(c)

df['Person_new'] = person_numbers[0:len(df)]        # add the numbering to the df
df.Person = 'Person ' + df.Person_new.astype(str)   # fill the Person column with the new numbering
df.drop(['complete', 'Person_new', 'num'], axis=1, inplace=True)
```
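The numbering loop above builds the `[1,1,1,2,2,2,...]` pattern; the same sequence can be produced directly with `numpy.repeat`, which may read more cleanly (a sketch, not from the original answer; `rows` stands in for `len(df)`):

```python
import numpy as np

n = 3      # events per person
rows = 10  # stand-in for len(df)

# repeat each person number n times, then trim to the frame length;
# the +2 in the upper bound guarantees enough numbers for a partial last group
person_numbers = np.repeat(np.arange(1, rows // n + 2), n)[:rows]
print(person_numbers)  # [1 1 1 2 2 2 3 3 3 4]
```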
First, this answer does not keep to your requirement to only reassign leftovers (so I don't expect you to accept it). That said, I'm posting it anyway, because your time-window constraint was tricky to solve within a pandas world. Perhaps my solution will not be useful to you right now, but maybe later ;) At the very least it was a learning experience for me, so perhaps others can gain from it.
```python
import pandas as pd
from datetime import datetime, time, timedelta
import random

# --- helper functions for demo

random.seed(0)

def makeRandomTimes(nHours=None, mMinutes=None):
    nHours = 10 if nHours is None else nHours
    mMinutes = 3 if mMinutes is None else mMinutes
    times = []
    for _ in range(nHours):
        hour = random.randint(8, 18)
        for _ in range(mMinutes):
            minute = random.randint(0, 59)
            times.append(datetime.combine(datetime.today(), time(hour, minute)))
    return times

def makeDf():
    times = makeRandomTimes()
    houses = [str(random.randint(1, 10)) for _ in range(30)]
    areas = [['X', 'Y'][random.randint(0, 1)] for _ in range(30)]
    df = pd.DataFrame({'Time': times, 'House': houses, 'Area': areas})
    return df.set_index('Time').sort_index()

# --- real code begins

def evaluateLookback(df, idx, dfg):
    mask = df.index >= dfg.Lookback.iat[-1]
    personTotals = df[mask].set_index('Loc')['Person'].value_counts()
    currentPeople = set(df.Person[df.Person > -1])
    noAllocations = currentPeople - set(personTotals.index)
    available = personTotals < 3
    if noAllocations or available.sum():
        # allocate to first available person
        person = min(noAllocations.union(personTotals[available].index))
    else:
        # allocate new person
        person = len(currentPeople)
    df.Person.at[idx] = person
    # debug
    df.Verbose.at[idx] = (noAllocations, available.sum())

def lambdaProxy(df, colName):
    [dff[1][colName].apply(lambda f: f(df, *dff)) for dff in df.groupby(df.index)]

lookback = timedelta(minutes=120)

df1 = makeDf()
df1['Loc'] = df1['House'] + df1['Area']
df1['Person'] = None
df1['Lambda'] = evaluateLookback
df1['Lookback'] = df1.index - lookback
df1['Verbose'] = None
lambdaProxy(df1, 'Lambda')

print(df1[[col for col in df1.columns if col != 'Lambda']])
```
And the sample output on my machine looks like this:
```
                    House Area  Loc Person            Lookback      Verbose
Time
2018-09-30 08:16:00     6    Y   6Y      0 2018-09-30 06:16:00      ({}, 0)
2018-09-30 08:31:00     4    Y   4Y      0 2018-09-30 06:31:00      ({}, 1)
2018-09-30 08:32:00    10    X  10X      0 2018-09-30 06:32:00      ({}, 1)
2018-09-30 09:04:00     4    X   4X      1 2018-09-30 07:04:00      ({}, 0)
2018-09-30 09:46:00    10    X  10X      1 2018-09-30 07:46:00      ({}, 1)
2018-09-30 09:57:00     4    X   4X      1 2018-09-30 07:57:00      ({}, 1)
2018-09-30 10:06:00     1    Y   1Y      2 2018-09-30 08:06:00      ({}, 0)
2018-09-30 10:39:00    10    X  10X      0 2018-09-30 08:39:00     ({0}, 1)
2018-09-30 10:48:00     7    X   7X      0 2018-09-30 08:48:00      ({}, 2)
2018-09-30 11:08:00     1    Y   1Y      0 2018-09-30 09:08:00      ({}, 3)
2018-09-30 11:18:00     2    Y   2Y      1 2018-09-30 09:18:00      ({}, 2)
2018-09-30 11:32:00     9    X   9X      2 2018-09-30 09:32:00      ({}, 1)
2018-09-30 12:22:00     5    Y   5Y      1 2018-09-30 10:22:00      ({}, 2)
2018-09-30 12:30:00     9    X   9X      1 2018-09-30 10:30:00      ({}, 2)
2018-09-30 12:34:00     6    X   6X      2 2018-09-30 10:34:00      ({}, 1)
2018-09-30 12:37:00     1    Y   1Y      2 2018-09-30 10:37:00      ({}, 1)
2018-09-30 12:45:00     4    X   4X      0 2018-09-30 10:45:00      ({}, 1)
2018-09-30 12:58:00     8    X   8X      0 2018-09-30 10:58:00      ({}, 1)
2018-09-30 14:26:00     7    Y   7Y      0 2018-09-30 12:26:00      ({}, 3)
2018-09-30 14:48:00     2    X   2X      0 2018-09-30 12:48:00  ({1, 2}, 1)
2018-09-30 14:50:00     8    X   8X      1 2018-09-30 12:50:00  ({1, 2}, 0)
2018-09-30 14:53:00     8    Y   8Y      1 2018-09-30 12:53:00     ({2}, 1)
2018-09-30 14:56:00     6    X   6X      1 2018-09-30 12:56:00     ({2}, 1)
2018-09-30 14:58:00     9    Y   9Y      2 2018-09-30 12:58:00     ({2}, 0)
2018-09-30 17:09:00     2    Y   2Y      0 2018-09-30 15:09:00 ({0, 1, 2}, 0)
2018-09-30 17:19:00     4    X   4X      0 2018-09-30 15:19:00  ({1, 2}, 1)
2018-09-30 17:57:00     6    Y   6Y      0 2018-09-30 15:57:00  ({1, 2}, 1)
2018-09-30 18:21:00     3    X   3X      1 2018-09-30 16:21:00  ({1, 2}, 0)
2018-09-30 18:30:00     9    X   9X      1 2018-09-30 16:30:00     ({2}, 1)
2018-09-30 18:35:00     8    Y   8Y      1 2018-09-30 16:35:00     ({2}, 1)
```
Notes:
- the `lookback` variable controls the length of time looking backwards to consider locations as allocated to a person; the `Lookback` column shows the cutoff time
- `evaluateLookback` is called repeatedly for each row in the table, with `df` being the whole DataFrame, `idx` the current index/label, and `dfg` the current row
- `lambdaProxy` controls the calling of `evaluateLookback`
- the number of locations per person is set to `3`, but that could be adjusted as needed
- arbitrarily complex requirements for the lookback period can be managed by having another function column which is first evaluated by `lambdaProxy`, with that result then stored and used within `evaluateLookback`
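To see the time-window mechanism in isolation: `evaluateLookback` keeps only the events at or after the cutoff stored in the `Lookback` column. A minimal sketch of that mask (timestamps and `Loc` values below are made up for illustration):

```python
import pandas as pd
from datetime import timedelta

times = pd.to_datetime(['2018-09-30 08:00', '2018-09-30 09:30',
                        '2018-09-30 10:45', '2018-09-30 11:00'])
df = pd.DataFrame({'Loc': ['1X', '2X', '1Y', '3X']}, index=times)

lookback = timedelta(minutes=120)
now = df.index[-1]                 # pretend we are processing the last row
mask = df.index >= now - lookback  # keep only events within the last two hours
print(df[mask])                    # the 08:00 event falls outside the window
```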
There are some interesting edge cases in the demo output: `10:39:00`, `14:48:00`, and `17:09:00`.
Aside: It would be interesting to see a "function column" in pandas, perhaps with memoize-like capability? Ideally, the 'Person' column would take a function and calculate on request, either from its own row or from some variable window view. Has anyone seen something like that?
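One rough approximation of that idea today is to cache a per-row computation keyed on the row label with `functools.lru_cache`, so values are only computed on request and repeated requests are free. This is purely a sketch of the concept; the frame and the `person_for` function are invented:

```python
from functools import lru_cache

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])

@lru_cache(maxsize=None)
def person_for(label):
    # stand-in for an expensive per-row computation;
    # cached, so asking for the same label twice computes only once
    return df.loc[label, 'x'] * 10

print(person_for('b'))  # 20
print(person_for('b'))  # served from the cache
```

Note this only works while the underlying frame is not mutated; a real "function column" would need cache invalidation tied to the data.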