SQL ranking query to compute ranks and median in sub groups
I suggest doing the computing in your programming language:
for each group: for each record_in_group: append y to array median of array
But if you are stuck with SQLite, you can order each group by y
and select the records in the middle like this http://sqlfiddle.com/#!5/d4c68/55/0:
UPDATE: only bigger "median" value is importand for even nr. of rows, so no avg()
is needed:
select groups.gid, ids.y medianfrom ( -- get middle row number in each group (bigger number if even nr. of rows) -- note the integer divisions and modulo operator select round(x) gid, count(*) / 2 + 1 mid_row_right from xy_table group by round(x)) groupsjoin ( -- for each record get equivalent of -- row_number() over(partition by gid order by y) select round(a.x) gid, a.x, a.y, count(*) rownr_by_y from xy_table a left join xy_table b on round(a.x) = round (b.x) and a.y >= b.y group by a.x) ids on ids.gid = groups.gidwhere ids.rownr_by_y = groups.mid_row_right
OK, this relies on a temporary table:
create temporary table tmp (x float, y float);insert into tmp select * from xy_table order by round(x), y
But you could potentially create this for a range of data you were interested in. Another way would be to ensure the xy_table
had this sort order, instead of just ordering on x
. The reason for this is SQLite's lack of row numbering capability.
Then:
select tmp4.x as gid, t.* from ( select tmp1.x, round((tmp2.y + coalesce(tmp3.y, tmp2.y)) / 2) as y -- <- for larger of the two, change to: (case when tmp2.y > coalesce(tmp3.y, 0) then tmp2.y else tmp3.y end) from ( select round(x) as x, min(rowid) + (count(*) / 2) as id1, (case when count(*) % 2 = 0 then min(rowid) + (count(*) / 2) - 1 else 0 end) as id2 from ( select *, rowid from tmp ) t group by round(x) ) tmp1 join tmp tmp2 on tmp1.id1 = tmp2.rowid left join tmp tmp3 on tmp1.id2 = tmp3.rowid) tmp4join xy_table t on tmp4.x = round(t.x) and tmp4.y = t.y
If you wanted to treat the median as the larger of the two middle values, which doesn't fit the definition as @Aprillion already pointed out, then you would simply take the larger of the two y
values, instead of their average, on the third line of the query.