Pig: Get top n values per group
One approach is
records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray,value:chararray,counter:int);grpd = GROUP records BY user;top3 = foreach grpd { sorted = order records by counter desc; top = limit sorted 2; generate group, flatten(top);};
Input is:
Alice third 5 Alice first 11 Alice second 10Alice fourth 2Bob second 20Bob third 18Bob first 21Bob fourth 8
Output is:
(Alice,Alice,first,11)(Alice,Alice,second,10(Bob,Bob,first,21)(Bob,Bob,second,20)
I have just made an observation that
top = limit sorted 2;
top is an inbuilt function and may throw an error so the only thing which I did was changed the name of the relation in this case and instead of
generate group, flatten(top);
which was giving the output
(Alice,Alice,first,11)(Alice,Alice,second,10(Bob,Bob,first,21)(Bob,Bob,second,20)
Amended that as shown below -
records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, count:int);grpd = GROUP records BY user;top2 = foreach grpd { sorted = order records by count desc; top1 = limit sorted 2; generate flatten(top1);};
which gave me the desired output as required by you -
(Alice,first,11)(Alice,second,10)(Bob,first,21)(Bob,second,20)
Hope this helps.