Grouping query into group and subgroup Grouping query into group and subgroup sqlite sqlite

Grouping query into group and subgroup


Using sqldf package:

df <- read.table(text="Area_code  Subarea_code  Company   Result10         101           A         1510         101           P         1010         101           C         410         102           A         1010         102           P         810         102           C         511         111           A         1511         111           P         2011         111           C         511         112           A         1011         112           P         511         112           C         10", header=TRUE)library(sqldf)mymax <- sqldf("select Company,                  Area_code,                  max(Subarea_code) Largest_subarea_code                  from df                  group by Company,Area_code")mymaxres <- sqldf("select d.Company,                          d.Area_code,                          m.Largest_subarea_code,                          d.Result                  from df d, mymax m                  where d.Company=m.Company and                        d.Subarea_code=m.Largest_subarea_code")mymin <- sqldf("select Company,                  Area_code,                  min(Subarea_code) Smallest_subarea_code                  from df                  group by Company,Area_code")myminres <- sqldf("select d.Company,                          d.Area_code,                          m.Smallest_subarea_code,                          d.Result                  from df d, mymin m                  where d.Company=m.Company and                        d.Subarea_code=m.Smallest_subarea_code")result <- sqldf("select a.*, b.Smallest_subarea_code,b.Result                from mymaxres a, myminres b                where a.Company=b.Company and                       a.Area_code=b.Area_code")


If you already doing it in R, why not use the much more efficient data.table instead of sqldf using SQL syntax? Assuming data is your data set, simply:

library(data.table)setDT(data)[, list(Largest_subarea_code = Subarea_code[which.max(Result)],            Resultmax = max(Result),            Smallest_subarea_code = Subarea_code[which.min(Result)],            Resultmin = min(Result)), by = list(Company, Area_code)]#    Company Area_code Largest_subarea_code Resultmax Smallest_subarea_code Resultmin# 1:       A        10                  101        15                   102        10# 2:       P        10                  101        10                   102         8# 3:       C        10                  102         5                   101         4# 4:       A        11                  111        15                   112        10# 5:       P        11                  111        20                   112         5# 6:       C        11                  112        10                   111         5


There seems to be a discrepancy between the output shown and the description. The description asks for the top 10 and bottom 10 results for each Area code/Company but the sample output shows only the top 1 and the bottom 1. For example, for area code 10 and company A subarea 101 is top with a result of 15 and and subarea 102 is 2nd largest with a result of 10 so according to the description there should be two rows for that company/area code combination. (If there were more data there would be up to 10 rows for that company/area code combination.)

We give two answers. The first assumes the top 10 and bottom 10 are wanted for each company and area code as in the question's description and the second assumes only the top and bottom for each company and area code as in the question's sample output.

1) Top/Bottom 10

Here we assume that the top 10 and bottom 10 results for each Company/Area code are wanted. If its just the top and bottom one then see (2) later on (or replace 10 with 1 in the code here). Bottom10 is all rows for which there are 10 or fewer subareas for the same area code and company with equal or smaller results. Top10 is similar.

library(sqldf)Bottom10 <- sqldf("select a.Company,                           a.Area_code,                           a.Subarea_code Bottom_Subarea,                           a.Result Bottom_Result,                          count(*) Bottom_Rank        from df a join df b          on a.Company = b.Company and            a.Area_code = B.Area_code and           b.Result <= a.Result        group by a.Company, a.Area_code, a.Subarea_code        having count(*) <= 10")Top10 <- sqldf("select a.Company,                        a.Area_code,                        a.Subarea_code Top_Subarea,                        a.Result Top_Result,                       count(*) Top_Rank        from df a join df b          on a.Company = b.Company and            a.Area_code = B.Area_code and            b.Result >= a.Result        group by a.Company, a.Area_code, a.Subarea_code        having count(*) <= 10")

The description indicated you wanted the top 10 OR the bottom 10 for each company/area code in which case just use one of the results above. If you want to combine them we show a merge below. We have added a Rank column to indicate the smallest/largest (Rank is 1), second smallest/largest (Rank is 2), etc.

sqldf("select t.Area_code,               t.Company,               t.Top_Rank Rank,              t.Top_Subarea,               t.Top_Result,              b.Bottom_Subarea,              b.Bottom_Result       from Bottom10 b join Top10 t       on t.Area_code = b.Area_code and           t.Company = b.Company and          t.Top_Rank = b.Bottom_Rank       order by t.Area_code, t.Company, t.Top_Rank")

giving:

   Area_code Company Rank Top_Subarea Top_Result Bottom_Subarea Bottom_Result1         10       A    1         101         15            102            102         10       A    2         102         10            101            153         10       C    1         102          5            101             44         10       C    2         101          4            102             55         10       P    1         101         10            102             86         10       P    2         102          8            101            107         11       A    1         111         15            112            108         11       A    2         112         10            111            159         11       C    1         112         10            111             510        11       C    2         111          5            112            1011        11       P    1         111         20            112             512        11       P    2         112          5            111            20

Note that this format makes less sense if there are ties and, in fact, could generate more than 10 rows for a Company/Area code so you might just want to use the individual Top10 and Bottom10 in that case. You could also consider jittering df$Result if this a problem:

df$Result <- jitter(df$Result)# now perform SQL statements

2) Top/Bottom Only

Here we give only the top and bottom results and the corresponding subareas for each company/area code. Note that this uses an extension to SQL supported by sqlite and the SQL code is substantially simpler:

Bottom1 <- sqldf("select Company,                           Area_code,                           Subarea_code Bottom_Subarea,                           min(Result) Bottom_Result        from df        group by Company, Area_code")Top1 <- sqldf("select Company,                       Area_code,                       Subarea_code Top_Subarea,                       max(Result) Top_Result        from df        group by Company, Area_code")sqldf("select a.Company,               a.Area_code,               Top_Subarea,               Top_Result,              Bottom_Subarea              Bottom_Result        from Top1 a join Bottom1 b          on a.Company = b.Company and            a.Area_code = b.Area_code        order by a.Area_code, a.Company")

This gives:

  Company Area_code Top_Subarea Top_Result Bottom_Result1       A        10         101         15           1022       C        10         102          5           1013       P        10         101         10           1024       A        11         111         15           1125       C        11         112         10           1116       P        11         111         20           112

Update Correction and added (2).