
Regroup column values in a pandas df


Current attempt

In the following I have added a few lines before the last lines of your code:

import pandas as pd
import numpy as np

d = ({'Time': ['8:03:00', '8:17:00', '8:20:00', '10:15:00', '10:15:00', '11:48:00', '12:00:00', '12:10:00'],
     'Place': ['House 1', 'House 2', 'House 1', 'House 3', 'House 4', 'House 5', 'House 1', 'House 1'],
     'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X']})
df = pd.DataFrame(data=d)

def g(gps):
    s = gps['Place'].unique()
    d = dict(zip(s, np.arange(len(s)) // 3 + 1))
    gps['Person'] = gps['Place'].map(d)
    return gps

df = df.groupby('Area', sort=False).apply(g)
s = df['Person'].astype(str) + df['Area']

# added lines
t = s.value_counts()
df_sub = df.loc[s[s.isin(t[t < 3].index)].index].copy()
df_sub["tag"] = df_sub["Place"] + df_sub["Area"]
tags = list(df_sub.tag.unique())
f = lambda x: f'R{int(tags.index(x) / 3) + 1}'
df_sub['reassign'] = df_sub.tag.apply(f)
s[s.isin(t[t < 3].index)] = df_sub['reassign']

df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')

To be honest I am not so sure it works in all cases, but it gives your intended output in the test case.
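
As a quick sanity check (a small sketch of my own, not part of the code above), one can assert that no person ends up with more than three unique places within an area:

# sanity check (my own addition): every Person should hold at most
# 3 unique Place values within each Area
counts = df.groupby(['Area', 'Person'])['Place'].nunique()
assert (counts <= 3).all(), counts[counts > 3]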

Previous attempts

Let's see if I am able to help with a limited understanding of what you are trying to do.

You have sequential data (I'll call them events) and you want to assign a "person" identifier to each event. The identifier assigned to each successive event depends on previous assignments, and it seems to me it needs to be governed by the following rules, applied sequentially:

  1. I know you: I can reuse a previous identifier if the same values of "Place" and "Area" have already appeared for that identifier (does time have something to do with it?).

  2. I do NOT know you: I will create a new identifier if a new value of Area appears (so Place and Area play different roles?).

  3. do I know you?: I might reuse a previously used identifier if that identifier has not yet been assigned to at least three events (what if this happens for multiple identifiers? I will assume I use the oldest...).

  4. nah, I don't: if none of the preceding rules applies, I will create a new identifier.

Having assumed the above, the following is an implementation of a solution:

# dict of lists of past events assigned to each person; key is the person identifier
people = dict()
# new column for df (as a list); it will be appended to the dataframe at the end
persons = list()

# first we define the rules
def i_know_you(people, now):
    def conditions(now, past):
        return [e for e in past if (now.Place == e.Place) and (now.Area == e.Area)]
    i_do = [person for person, past in people.items() if conditions(now, past)]
    if i_do:
        return i_do[0]
    return False

def i_do_not_know_you(people, now):
    conditions = not bool([e for past in people.values() for e in past if e.Area == now.Area])
    if conditions:
        return f'Person {len(people) + 1}'
    return False

def do_i_know_you(people, now):
    i_do = [person for person, past in people.items() if len(past) < 3]
    if i_do:
        return i_do[0]
    return False

# then we process the sequential data
for event in df.itertuples():
    print('event:', event)
    for rule in [i_know_you, i_do_not_know_you, do_i_know_you]:
        person = rule(people, event)
        print('\t', rule.__name__, person)
        if person:
            break
    if not person:
        person = f'Person {len(people) + 1}'
        print('\t', "nah, I don't", person)
    if person in people:
        people[person].append(event)
    else:
        people[person] = [event]
    persons.append(person)

df['Person'] = persons

Output:

event: Pandas(Index=0, Time='8:00:00', Place='House 1', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you Person 1
event: Pandas(Index=1, Time='8:30:00', Place='House 2', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 1
event: Pandas(Index=2, Time='9:00:00', Place='House 1', Area='Y', Person='Person 2')
     i_know_you False
     i_do_not_know_you Person 2
event: Pandas(Index=3, Time='9:30:00', Place='House 3', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 1
event: Pandas(Index=4, Time='10:00:00', Place='House 4', Area='X', Person='Person 2')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 2
event: Pandas(Index=5, Time='10:30:00', Place='House 5', Area='X', Person='Person 2')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 2
event: Pandas(Index=6, Time='11:00:00', Place='House 1', Area='X', Person='Person 1')
     i_know_you Person 1
event: Pandas(Index=7, Time='11:30:00', Place='House 6', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you False
     nah, I don't Person 3
event: Pandas(Index=8, Time='12:00:00', Place='House 7', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 3
event: Pandas(Index=9, Time='12:30:00', Place='House 8', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 3

and the final dataframe is, as you want:

       Time    Place Area    Person
0   8:00:00  House 1    X  Person 1
1   8:30:00  House 2    X  Person 1
2   9:00:00  House 1    Y  Person 2
3   9:30:00  House 3    X  Person 1
4  10:00:00  House 4    X  Person 2
5  10:30:00  House 5    X  Person 2
6  11:00:00  House 1    X  Person 1
7  11:30:00  House 6    X  Person 3
8  12:00:00  House 7    X  Person 3
9  12:30:00  House 8    X  Person 3

Remark: note that I intentionally avoided groupby operations and processed the data sequentially. I think this kind of complexity (and not really understanding what you want to do...) calls for that approach. Also, you can adapt the rules to be more complicated (is time really playing a role or not?) using the same structure above; a sketch of one such time-aware rule follows.
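
For instance, here is a sketch (my own, not part of the answer above) of a time-aware variant of the i_know_you rule; the parse_time helper and the two-hour window are assumptions for illustration:

from datetime import datetime, timedelta

def parse_time(s):
    # hypothetical helper: the sample data stores times as 'H:MM:SS' strings
    return datetime.strptime(s, '%H:%M:%S')

def i_know_you_recently(people, now, window=timedelta(hours=2)):
    # like i_know_you, but the matching past event must also be recent
    def conditions(now, past):
        return [e for e in past
                if now.Place == e.Place and now.Area == e.Area
                and parse_time(now.Time) - parse_time(e.Time) <= window]
    i_do = [person for person, past in people.items() if conditions(now, past)]
    return i_do[0] if i_do else False

It plugs into the same rule list as the other rule functions, so the processing loop does not change.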

Updated answer for new data

Looking at the new data, it is evident that I did not understand what you are trying to do (in particular, the assignment does not seem to follow sequential rules). I have a solution that works on your second dataset, but it gives a different result for the first dataset.

The solution is much simpler and will add a column (that you can drop later if you want):

df["tag"] = df["Place"] + df["Area"]tags = list(df.tag.unique())f = lambda x: f'Person {int(tags.index(x) / 3) + 1}'df['Person'] = df.tag.apply(f)

On the second dataset, it would give:

       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 3    X  House 3X  Person 1
3   9:30:00  House 1    Y  House 1Y  Person 2
4  10:00:00  House 1    Z  House 1Z  Person 2
5  10:30:00  House 1    V  House 1V  Person 2

On the first dataset it gives:

       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 1    Y  House 1Y  Person 1
3   9:30:00  House 3    X  House 3X  Person 2
4  10:00:00  House 4    X  House 4X  Person 2
5  10:30:00  House 5    X  House 5X  Person 2
6  11:00:00  House 1    X  House 1X  Person 1
7  11:30:00  House 6    X  House 6X  Person 3
8  12:00:00  House 7    X  House 7X  Person 3
9  12:30:00  House 8    X  House 8X  Person 3

This differs from your intended output at indices 2 and 3. Does this output satisfy your requirement? If not, why not?


As far as I understand, you're happy with everything before the Person allocation. So here's a plug-and-play solution that "merges" Persons with fewer than 3 unique values, so that each Person ends up with 3 unique values (except for the last one, obviously). It is based on the second-to-last df you posted ("Output:"); Persons that already have 3 unique values are left untouched, and only the others are merged.

EDIT: Greatly simplified code. Again, taking your df as input:

n = 3
df['complete'] = df.Person.apply(lambda x: 1 if df.Person.tolist().count(x) == n else 0)
df['num'] = df.Person.str.replace('Person ', '')
df.sort_values(by=['num', 'complete'], ascending=True, inplace=True)  # get all persons that are complete to the top

c = 0
person_numbers = []
for x in range(0, 999):  # create the numbering [1,1,1,2,2,2,3,3,3,...]; n defines how often a person is 'repeated'
    if x % n == 0:
        c += 1
    person_numbers.append(c)

df['Person_new'] = person_numbers[0:len(df)]  # add the numbering to the df
df.Person = 'Person ' + df.Person_new.astype(str)  # fill the Person column with the new numbering
df.drop(['complete', 'Person_new', 'num'], axis=1, inplace=True)
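
As an aside (my own suggestion, not part of the answer above), the counting loop can be replaced by numpy, which builds the same [1,1,1,2,2,2,...] pattern directly:

import numpy as np

# equivalent to the loop above: repeat each person number n times,
# then trim to the length of the dataframe
person_numbers = np.repeat(np.arange(1, len(df) // n + 2), n)[:len(df)]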


Firstly, this answer does not keep to your requirement of only reassigning leftovers (so I don't expect you to accept it). That said, I'm posting it anyway, because your time-window constraint was tricky to solve within a pandas world. Perhaps my solution will not be useful to you right now, but maybe it will be later ;) At the very least it was a learning experience for me, so perhaps others can gain from it.

import pandas as pd
from datetime import datetime, time, timedelta
import random

# --- helper functions for demo

random.seed( 0 )

def makeRandomTimes( nHours = None, mMinutes = None ):
    nHours = 10 if nHours is None else nHours
    mMinutes = 3 if mMinutes is None else mMinutes
    times = []
    for _ in range(nHours):
        hour = random.randint(8,18)
        for _ in range(mMinutes):
            minute = random.randint(0,59)
            times.append( datetime.combine( datetime.today(), time( hour, minute ) ) )
    return times

def makeDf():
    times   = makeRandomTimes()
    houses  = [ str(random.randint(1,10)) for _ in range(30) ]
    areas   = [ ['X','Y'][random.randint(0,1)] for _ in range(30) ]
    df      = pd.DataFrame( {'Time' : times, 'House' : houses, 'Area' : areas } )
    return df.set_index( 'Time' ).sort_index()

# --- real code begins

def evaluateLookback( df, idx, dfg ):
    mask = df.index >= dfg.Lookback.iat[-1]
    personTotals = df[ mask ].set_index('Loc')['Person'].value_counts()
    currentPeople = set(df.Person[ df.Person > -1 ])
    noAllocations = currentPeople - set(personTotals.index)
    available = personTotals < 3
    if noAllocations or available.sum():
        # allocate to first available person
        person = min( noAllocations.union(personTotals[ available ].index) )
    else:
        # allocate new person
        person = len( currentPeople )
    df.Person.at[ idx ] = person
    # debug
    df.Verbose.at[ idx ] = ( noAllocations, available.sum() )

def lambdaProxy( df, colName ):
    [ dff[1][colName].apply( lambda f: f(df,*dff) ) for dff in df.groupby(df.index) ]

lookback = timedelta( minutes = 120 )

df1 = makeDf()
df1[ 'Loc' ] = df1[ 'House' ] + df1[ 'Area' ]
df1[ 'Person' ] = None
df1[ 'Lambda' ] = evaluateLookback
df1[ 'Lookback' ] = df1.index - lookback
df1[ 'Verbose' ] = None
lambdaProxy( df1, 'Lambda' )

print( df1[ [ col for col in df1.columns if col != 'Lambda' ] ] )

And the sample output on my machine looks like this:

                    House Area  Loc Person            Lookback         Verbose
Time
2018-09-30 08:16:00     6    Y   6Y      0 2018-09-30 06:16:00         ({}, 0)
2018-09-30 08:31:00     4    Y   4Y      0 2018-09-30 06:31:00         ({}, 1)
2018-09-30 08:32:00    10    X  10X      0 2018-09-30 06:32:00         ({}, 1)
2018-09-30 09:04:00     4    X   4X      1 2018-09-30 07:04:00         ({}, 0)
2018-09-30 09:46:00    10    X  10X      1 2018-09-30 07:46:00         ({}, 1)
2018-09-30 09:57:00     4    X   4X      1 2018-09-30 07:57:00         ({}, 1)
2018-09-30 10:06:00     1    Y   1Y      2 2018-09-30 08:06:00         ({}, 0)
2018-09-30 10:39:00    10    X  10X      0 2018-09-30 08:39:00        ({0}, 1)
2018-09-30 10:48:00     7    X   7X      0 2018-09-30 08:48:00         ({}, 2)
2018-09-30 11:08:00     1    Y   1Y      0 2018-09-30 09:08:00         ({}, 3)
2018-09-30 11:18:00     2    Y   2Y      1 2018-09-30 09:18:00         ({}, 2)
2018-09-30 11:32:00     9    X   9X      2 2018-09-30 09:32:00         ({}, 1)
2018-09-30 12:22:00     5    Y   5Y      1 2018-09-30 10:22:00         ({}, 2)
2018-09-30 12:30:00     9    X   9X      1 2018-09-30 10:30:00         ({}, 2)
2018-09-30 12:34:00     6    X   6X      2 2018-09-30 10:34:00         ({}, 1)
2018-09-30 12:37:00     1    Y   1Y      2 2018-09-30 10:37:00         ({}, 1)
2018-09-30 12:45:00     4    X   4X      0 2018-09-30 10:45:00         ({}, 1)
2018-09-30 12:58:00     8    X   8X      0 2018-09-30 10:58:00         ({}, 1)
2018-09-30 14:26:00     7    Y   7Y      0 2018-09-30 12:26:00         ({}, 3)
2018-09-30 14:48:00     2    X   2X      0 2018-09-30 12:48:00     ({1, 2}, 1)
2018-09-30 14:50:00     8    X   8X      1 2018-09-30 12:50:00     ({1, 2}, 0)
2018-09-30 14:53:00     8    Y   8Y      1 2018-09-30 12:53:00        ({2}, 1)
2018-09-30 14:56:00     6    X   6X      1 2018-09-30 12:56:00        ({2}, 1)
2018-09-30 14:58:00     9    Y   9Y      2 2018-09-30 12:58:00        ({2}, 0)
2018-09-30 17:09:00     2    Y   2Y      0 2018-09-30 15:09:00  ({0, 1, 2}, 0)
2018-09-30 17:19:00     4    X   4X      0 2018-09-30 15:19:00     ({1, 2}, 1)
2018-09-30 17:57:00     6    Y   6Y      0 2018-09-30 15:57:00     ({1, 2}, 1)
2018-09-30 18:21:00     3    X   3X      1 2018-09-30 16:21:00     ({1, 2}, 0)
2018-09-30 18:30:00     9    X   9X      1 2018-09-30 16:30:00        ({2}, 1)
2018-09-30 18:35:00     8    Y   8Y      1 2018-09-30 16:35:00        ({2}, 1)

Notes:

  • the lookback variable controls how far back in time to look when deciding whether a location counts as allocated to a person
  • the Lookback column shows the cutoff time
  • evaluateLookback is called repeatedly for each row in the table, with df being the whole DataFrame, idx the current index/label, and dfg the current row.
  • lambdaProxy controls the calling of evaluateLookback.
  • the number of locations per person is set to 3 but that could be adjusted as needed
  • arbitrarily complex requirements for the lookback period can be handled by adding another function column that lambdaProxy evaluates first, with the stored result then used within evaluateLookback (see the sketch after this list)
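
A minimal sketch of that last point (my own illustration, not from the answer above): a second function column computes a per-row dynamic cutoff into the Lookback column, and since each lambdaProxy call regroups the frame afresh, the main pass then picks up the stored values. The 60/120-minute rule is a made-up example:

from datetime import timedelta

def evaluateDynamicLookback(df, idx, dfg):
    # hypothetical rule: a 60-minute window in area 'X', 120 minutes elsewhere
    minutes = 60 if dfg.Area.iat[-1] == 'X' else 120
    df.Lookback.at[idx] = idx - timedelta(minutes=minutes)

df1['DynLambda'] = evaluateDynamicLookback
lambdaProxy(df1, 'DynLambda')  # evaluated first, stores the per-row cutoff
lambdaProxy(df1, 'Lambda')     # evaluateLookback then reads df.Lookback as before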

There are some interesting edge cases in the demo output, at 10:39:00, 14:48:00, and 17:09:00.


Aside: it would be interesting to see a "function column" in pandas, perhaps with memoize-like capability. Ideally, the 'Person' column would take a function and calculate on request, either on its own row or with some variable window view. Has anyone seen something like that?
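
Nothing like that exists in pandas itself as far as I know, but here is a toy sketch (entirely my own, hypothetical code) of a memoized "function cell" that computes on first access and caches the result:

import pandas as pd

class LazyCell:
    # toy 'function cell': computes on first access, then caches (memoizes)
    def __init__(self, func):
        self.func, self._value, self._done = func, None, False

    def value(self, df, idx):
        if not self._done:
            self._value, self._done = self.func(df, idx), True
        return self._value

# usage sketch: each row's Person is computed lazily from the frame and row label
df = pd.DataFrame({'Loc': ['6Y', '4Y', '10X']})
df['Person'] = [LazyCell(lambda d, i: f'Person {i // 3}') for _ in range(len(df))]
print([cell.value(df, i) for i, cell in enumerate(df['Person'])])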