Median calculation over pre-aggregated / already-grouped by metrics Median calculation over pre-aggregated / already-grouped by metrics oracle oracle

Median calculation over pre-aggregated / already-grouped by metrics


If I understand you correctly, I can see a fairly straightforward way, and I think I can describe it clearly. I'm pretty sure I can't express it in SQL today, but I'll leave this tab open in my browser, and try to get to it tomorrow if nobody else has contributed.

╔═══════╦═════════════╦═══════════════╗║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║╠═══════╬═════════════╬═══════════════╣║ RED   ║           4 ║             5 ║║ RED   ║           5 ║             3 ║║ RED   ║           6 ║             1 ║║ BLUE  ║           5 ║             4 ║║ BLUE  ║           6 ║             5 ║╚═══════╩═════════════╩═══════════════╝

First, calculate which measurement represents the median. You can do that based solely on the counts. For example, for red, there are nine measurements altogether. The median measurement will be the 5th measurement. SQL for this should be straightforward.

Second, I think you can determine which row the median measurement is in with analytic functions. For red, you determine which row the 5th measurement is in; it's in the first row. This is a little like a "running balance" problem. The value of the "measurement" column in that row is the value you're trying to determine.

The wall of code (in standard SQL, I think)

"Unwinding" the aggregates is expensive. So this might not be useful to you. I relied on common table expressions to reduce the load on my brain.

with measurements as (  select 'red'   color, 4 measurement, 5 measure_count union all  select 'red'   color, 5 measurement, 3 measure_count union all  select 'red'   color, 6 measurement, 1 measure_count union all  select 'blue'  color, 5 measurement, 4 measure_count union all  select 'blue'  color, 6 measurement, 5 measure_count union all  -- Added green, even number of measurements, median should be 5.5.  select 'green' color, 5 measurement, 4 measure_count union all  select 'green' color, 6 measurement, 4 measure_count union all  -- Added yellow, extreme differences in measurements, median should be 6.  select 'yellow' color, 6 measurement, 2 measure_count union all  select 'yellow' color, 100 measurement, 1 measure_count ), measurement_starts as (  select     *,    sum(measure_count) over (partition by color order by measurement) total_rows_so_far  from measurements), extended_measurements as (  select     color, measurement, measure_count,    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + 1 measure_start_row,    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + measure_count measure_end_row   from measurement_starts), median_row_range as (  select color,     sum(measure_count) num_measurements,     ceiling(sum(measure_count)/2.0) start_measurement,     case       sum(measure_count) % 2 = 0      when true then ceiling(sum(measure_count)/2.0)+1      else ceiling(sum(measure_count)/2.0)    end    end_measurement  from measurements  group by color), median_row_values as (  select m.color, c.measurement  from median_row_range m  inner join extended_measurements c           on c.color = m.color          and m.start_measurement between c.measure_start_row and c.measure_end_row  union all  select m.color, c.measurement  from median_row_range m  inner join extended_measurements c           on c.color = m.color          and m.end_measurement between c.measure_start_row and c.measure_end_row)select color, avg(measurement)from median_row_valuesgroup by colororder by color;blue    6.00green   5.50red     4.00yellow  6.00

The CTE "extended_measurements" expands the table of measurements to include the starting "row" number and the ending "row" number that you'd find with unaggregated data.

color  measurement  measure_count  measure_start_row  measure_end_row--blue   5            4              1                  4blue   6            5              5                  9green  5            4              1                  4green  6            4              5                  8red    4            5              1                  5red    5            3              6                  8red    6            1              4                  4yellow 6            2              1                  2yellow 100          1              3                  3

The CTE "median_row_range" determines the starting "row" and ending "row" for the median.

color  num_measurements  start_measurement  end_measurement--blue   9                 5                  5green  8                 4                  5red    9                 5                  5yellow 3                 2                  2

This means that the median for 'blue' can be calculated as the average of 5th "row" and the 5th "row". That is, the median for 'blue' is simply the 5th value. The median for green is the average of the 4th "row" and 5th "row".


The idea behind this answer is same as that of Mike's, but varies in execution.

  • First CTE extended_measurements, finds the cumulative sum of counts and the midpoint for each color. If the sum of counts is even, then you should take average of two values. So floor and ceil will give you those points.
  • Second CTE extended_measurements2, tries to find the measurement to which the midpoint corresponds to, by comparing with cumulative sum. This is done for both floor_midpoint and ceil_midpodint. Rank is assigned, since we are interested only in the first record that matches.
  • Final query, selects only the measurement with least ranks and finds the average, which is the MEDIAN value.

SQL Fiddle

Query:

--get the midpoint and cumulative sum of measure_countwith extended_measurements as(    select color, measurement,           floor((sum(measure_count) over                     (partition by color) + 1) * 0.5)            floor_midpoint,           ceil((sum(measure_count) over                     (partition by color) + 1) * 0.5)            ceil_midpoint,           sum(measure_count) over                     (partition by color order by measurement)   cumltv_sum    from measurements),--assign rank to the measure_count where median liesextended_measurements2 as(    select color, measurement,           case when floor_midpoint <= cumltv_sum                    then row_number() over (partition by color order by measurement)                else null           end r1,           case when ceil_midpoint <= cumltv_sum                    then row_number() over (partition by color order by measurement)                else null           end r2    from extended_measurements)--get the average of measurements that have least rankselect color, 0.5 * (                        max(measurement) keep (dense_rank first order by r1) +                         max(measurement) keep (dense_rank first order by r2)                     )  medianfrom extended_measurements2group by colororder by color

Result:

|  COLOR | MEDIAN ||--------|--------||   blue |      6 ||  green |    5.5 ||    red |      4 ||  white |      8 || yellow |      6 |

Another fiddle for verifying the result of non-aggregated and aggregated data.