Median calculation over pre-aggregated / already-grouped by metrics
If I understand you correctly, I can see a fairly straightforward way, and I think I can describe it clearly. I'm pretty sure I can't express it in SQL today, but I'll leave this tab open in my browser, and try to get to it tomorrow if nobody else has contributed.
╔═══════╦═════════════╦═══════════════╗║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║╠═══════╬═════════════╬═══════════════╣║ RED ║ 4 ║ 5 ║║ RED ║ 5 ║ 3 ║║ RED ║ 6 ║ 1 ║║ BLUE ║ 5 ║ 4 ║║ BLUE ║ 6 ║ 5 ║╚═══════╩═════════════╩═══════════════╝
First, calculate which measurement represents the median. You can do that based solely on the counts. For example, for red, there are nine measurements altogether. The median measurement will be the 5th measurement. SQL for this should be straightforward.
Second, I think you can determine which row the median measurement is in with analytic functions. For red, you determine which row the 5th measurement is in; it's in the first row. This is a little like a "running balance" problem. The value of the "measurement" column in that row is the value you're trying to determine.
The wall of code (in standard SQL, I think)
"Unwinding" the aggregates is expensive. So this might not be useful to you. I relied on common table expressions to reduce the load on my brain.
with measurements as ( select 'red' color, 4 measurement, 5 measure_count union all select 'red' color, 5 measurement, 3 measure_count union all select 'red' color, 6 measurement, 1 measure_count union all select 'blue' color, 5 measurement, 4 measure_count union all select 'blue' color, 6 measurement, 5 measure_count union all -- Added green, even number of measurements, median should be 5.5. select 'green' color, 5 measurement, 4 measure_count union all select 'green' color, 6 measurement, 4 measure_count union all -- Added yellow, extreme differences in measurements, median should be 6. select 'yellow' color, 6 measurement, 2 measure_count union all select 'yellow' color, 100 measurement, 1 measure_count ), measurement_starts as ( select *, sum(measure_count) over (partition by color order by measurement) total_rows_so_far from measurements), extended_measurements as ( select color, measurement, measure_count, coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + 1 measure_start_row, coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + measure_count measure_end_row from measurement_starts), median_row_range as ( select color, sum(measure_count) num_measurements, ceiling(sum(measure_count)/2.0) start_measurement, case sum(measure_count) % 2 = 0 when true then ceiling(sum(measure_count)/2.0)+1 else ceiling(sum(measure_count)/2.0) end end_measurement from measurements group by color), median_row_values as ( select m.color, c.measurement from median_row_range m inner join extended_measurements c on c.color = m.color and m.start_measurement between c.measure_start_row and c.measure_end_row union all select m.color, c.measurement from median_row_range m inner join extended_measurements c on c.color = m.color and m.end_measurement between c.measure_start_row and c.measure_end_row)select color, avg(measurement)from median_row_valuesgroup by colororder by color;blue 6.00green 5.50red 4.00yellow 6.00
The CTE "extended_measurements" expands the table of measurements to include the starting "row" number and the ending "row" number that you'd find with unaggregated data.
color measurement measure_count measure_start_row measure_end_row--blue 5 4 1 4blue 6 5 5 9green 5 4 1 4green 6 4 5 8red 4 5 1 5red 5 3 6 8red 6 1 4 4yellow 6 2 1 2yellow 100 1 3 3
The CTE "median_row_range" determines the starting "row" and ending "row" for the median.
color num_measurements start_measurement end_measurement--blue 9 5 5green 8 4 5red 9 5 5yellow 3 2 2
This means that the median for 'blue' can be calculated as the average of 5th "row" and the 5th "row". That is, the median for 'blue' is simply the 5th value. The median for green is the average of the 4th "row" and 5th "row".
The idea behind this answer is same as that of Mike's, but varies in execution.
- First CTE extended_measurements, finds the cumulative sum of counts and the midpoint for each color. If the sum of counts is even, then you should take average of two values. So floor and ceil will give you those points.
- Second CTE extended_measurements2, tries to find the measurement to which the midpoint corresponds to, by comparing with cumulative sum. This is done for both floor_midpoint and ceil_midpodint. Rank is assigned, since we are interested only in the first record that matches.
- Final query, selects only the measurement with least ranks and finds the average, which is the MEDIAN value.
Query:
--get the midpoint and cumulative sum of measure_countwith extended_measurements as( select color, measurement, floor((sum(measure_count) over (partition by color) + 1) * 0.5) floor_midpoint, ceil((sum(measure_count) over (partition by color) + 1) * 0.5) ceil_midpoint, sum(measure_count) over (partition by color order by measurement) cumltv_sum from measurements),--assign rank to the measure_count where median liesextended_measurements2 as( select color, measurement, case when floor_midpoint <= cumltv_sum then row_number() over (partition by color order by measurement) else null end r1, case when ceil_midpoint <= cumltv_sum then row_number() over (partition by color order by measurement) else null end r2 from extended_measurements)--get the average of measurements that have least rankselect color, 0.5 * ( max(measurement) keep (dense_rank first order by r1) + max(measurement) keep (dense_rank first order by r2) ) medianfrom extended_measurements2group by colororder by color
| COLOR | MEDIAN ||--------|--------|| blue | 6 || green | 5.5 || red | 4 || white | 8 || yellow | 6 |
Another fiddle for verifying the result of non-aggregated and aggregated data.