Grouping query into group and subgroup
Using sqldf
package:
df <- read.table(text="Area_code Subarea_code Company Result10 101 A 1510 101 P 1010 101 C 410 102 A 1010 102 P 810 102 C 511 111 A 1511 111 P 2011 111 C 511 112 A 1011 112 P 511 112 C 10", header=TRUE)library(sqldf)mymax <- sqldf("select Company, Area_code, max(Subarea_code) Largest_subarea_code from df group by Company,Area_code")mymaxres <- sqldf("select d.Company, d.Area_code, m.Largest_subarea_code, d.Result from df d, mymax m where d.Company=m.Company and d.Subarea_code=m.Largest_subarea_code")mymin <- sqldf("select Company, Area_code, min(Subarea_code) Smallest_subarea_code from df group by Company,Area_code")myminres <- sqldf("select d.Company, d.Area_code, m.Smallest_subarea_code, d.Result from df d, mymin m where d.Company=m.Company and d.Subarea_code=m.Smallest_subarea_code")result <- sqldf("select a.*, b.Smallest_subarea_code,b.Result from mymaxres a, myminres b where a.Company=b.Company and a.Area_code=b.Area_code")
If you already doing it in R, why not use the much more efficient data.table
instead of sqldf
using SQL syntax? Assuming data
is your data set, simply:
library(data.table)setDT(data)[, list(Largest_subarea_code = Subarea_code[which.max(Result)], Resultmax = max(Result), Smallest_subarea_code = Subarea_code[which.min(Result)], Resultmin = min(Result)), by = list(Company, Area_code)]# Company Area_code Largest_subarea_code Resultmax Smallest_subarea_code Resultmin# 1: A 10 101 15 102 10# 2: P 10 101 10 102 8# 3: C 10 102 5 101 4# 4: A 11 111 15 112 10# 5: P 11 111 20 112 5# 6: C 11 112 10 111 5
There seems to be a discrepancy between the output shown and the description. The description asks for the top 10 and bottom 10 results for each Area code/Company but the sample output shows only the top 1 and the bottom 1. For example, for area code 10 and company A subarea 101 is top with a result of 15 and and subarea 102 is 2nd largest with a result of 10 so according to the description there should be two rows for that company/area code combination. (If there were more data there would be up to 10 rows for that company/area code combination.)
We give two answers. The first assumes the top 10 and bottom 10 are wanted for each company and area code as in the question's description and the second assumes only the top and bottom for each company and area code as in the question's sample output.
1) Top/Bottom 10
Here we assume that the top 10 and bottom 10 results for each Company/Area code are wanted. If its just the top and bottom one then see (2) later on (or replace 10 with 1 in the code here). Bottom10
is all rows for which there are 10 or fewer subareas for the same area code and company with equal or smaller results. Top10
is similar.
library(sqldf)Bottom10 <- sqldf("select a.Company, a.Area_code, a.Subarea_code Bottom_Subarea, a.Result Bottom_Result, count(*) Bottom_Rank from df a join df b on a.Company = b.Company and a.Area_code = B.Area_code and b.Result <= a.Result group by a.Company, a.Area_code, a.Subarea_code having count(*) <= 10")Top10 <- sqldf("select a.Company, a.Area_code, a.Subarea_code Top_Subarea, a.Result Top_Result, count(*) Top_Rank from df a join df b on a.Company = b.Company and a.Area_code = B.Area_code and b.Result >= a.Result group by a.Company, a.Area_code, a.Subarea_code having count(*) <= 10")
The description indicated you wanted the top 10 OR the bottom 10 for each company/area code in which case just use one of the results above. If you want to combine them we show a merge below. We have added a Rank column to indicate the smallest/largest (Rank is 1), second smallest/largest (Rank is 2), etc.
sqldf("select t.Area_code, t.Company, t.Top_Rank Rank, t.Top_Subarea, t.Top_Result, b.Bottom_Subarea, b.Bottom_Result from Bottom10 b join Top10 t on t.Area_code = b.Area_code and t.Company = b.Company and t.Top_Rank = b.Bottom_Rank order by t.Area_code, t.Company, t.Top_Rank")
giving:
Area_code Company Rank Top_Subarea Top_Result Bottom_Subarea Bottom_Result1 10 A 1 101 15 102 102 10 A 2 102 10 101 153 10 C 1 102 5 101 44 10 C 2 101 4 102 55 10 P 1 101 10 102 86 10 P 2 102 8 101 107 11 A 1 111 15 112 108 11 A 2 112 10 111 159 11 C 1 112 10 111 510 11 C 2 111 5 112 1011 11 P 1 111 20 112 512 11 P 2 112 5 111 20
Note that this format makes less sense if there are ties and, in fact, could generate more than 10 rows for a Company/Area code so you might just want to use the individual Top10
and Bottom10
in that case. You could also consider jittering df$Result
if this a problem:
df$Result <- jitter(df$Result)# now perform SQL statements
2) Top/Bottom Only
Here we give only the top and bottom results and the corresponding subareas for each company/area code. Note that this uses an extension to SQL supported by sqlite and the SQL code is substantially simpler:
Bottom1 <- sqldf("select Company, Area_code, Subarea_code Bottom_Subarea, min(Result) Bottom_Result from df group by Company, Area_code")Top1 <- sqldf("select Company, Area_code, Subarea_code Top_Subarea, max(Result) Top_Result from df group by Company, Area_code")sqldf("select a.Company, a.Area_code, Top_Subarea, Top_Result, Bottom_Subarea Bottom_Result from Top1 a join Bottom1 b on a.Company = b.Company and a.Area_code = b.Area_code order by a.Area_code, a.Company")
This gives:
Company Area_code Top_Subarea Top_Result Bottom_Result1 A 10 101 15 1022 C 10 102 5 1013 P 10 101 10 1024 A 11 111 15 1125 C 11 112 10 1116 P 11 111 20 112
Update Correction and added (2).