Rolling median algorithm in C
I have looked at R's src/library/stats/src/Trunmed.c
a few times as I wanted something similar too in a standalone C++ class / C subroutine. Note that this are actually two implementations in one, see src/library/stats/man/runmed.Rd
(the source of the help file) which says
\details{ Apart from the end values, the result \code{y = runmed(x, k)} simply has \code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very efficiently. The two algorithms are internally entirely different: \describe{ \item{"Turlach"}{is the Härdle-Steiger algorithm (see Ref.) as implemented by Berwin Turlach. A tree algorithm is used, ensuring performance \eqn{O(n \log k)}{O(n * log(k))} where \code{n <- length(x)} which is asymptotically optimal.} \item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation which makes use of median \emph{updating} when one observation enters and one leaves the smoothing window. While this performs as \eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is considerably faster for small \eqn{k} or \eqn{n}.} }}
It would be nice to see this re-used in a more standalone fashion. Are you volunteering? I can help with some of the R bits.
Edit 1: Besides the link to the older version of Trunmed.c above, here are current SVN copies of
Srunmed.c
(for the Stuetzle version)Trunmed.c
(for the Turlach version)runmed.R
for the R function calling these
Edit 2: Ryan Tibshirani has some C and Fortran code on fast median binning which may be a suitable starting point for a windowed approach.
I couldn't find a modern implementation of a c++ data structure with order-statistic so ended up implementing both ideas in top coders link suggested by MAK ( Match Editorial: scroll down to FloatingMedian).
Two multisets
The first idea partitions the data into two data structures (heaps, multisets etc) with O(ln N) per insert/delete does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75% but not both at the same time.
Segment tree
The second idea uses a segment tree which is O(ln N) for insert/deletions/queries but is more flexible. Best of all the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!!
The c++ code is similar to what Denis posted above ("Here's a simple algorithm for quantized data")
GNU Order Statistic Trees
Just before giving up, I found that stdlibc++ contains order statistic trees!!!
These have two critical operations:
iter = tree.find_by_order(value)order = tree.order_of_key(value)
See libstdc++ manual policy_based_data_structures_test (search for "split and join").
I have wrapped the tree for use in a convenience header for compilers supporting c++0x/c++11 style partial typedefs:
#if !defined(GNU_ORDER_STATISTIC_SET_H)#define GNU_ORDER_STATISTIC_SET_H#include <ext/pb_ds/assoc_container.hpp>#include <ext/pb_ds/tree_policy.hpp>// A red-black tree table storing ints and their order// statistics. Note that since the tree uses// tree_order_statistics_node_update as its update policy, then it// includes its methods by_order and order_of_key.template <typename T>using t_order_statistic_set = __gnu_pbds::tree< T, __gnu_pbds::null_type, std::less<T>, __gnu_pbds::rb_tree_tag, // This policy updates nodes' metadata for order statistics. __gnu_pbds::tree_order_statistics_node_update>;#endif //GNU_ORDER_STATISTIC_SET_H
I've done a C implementation here. A few more details are in this question: Rolling median in C - Turlach implementation.
Sample usage:
int main(int argc, char* argv[]){ int i, v; Mediator* m = MediatorNew(15); for (i=0; i<30; i++) { v = rand() & 127; printf("Inserting %3d \n", v); MediatorInsert(m, v); v = MediatorMedian(m); printf("Median = %3d.\n\n", v); ShowTree(m); }}