Compute 2,3 quartile average in SQL
Look at the answer and comment by @Richard aka cyberkiwi in this question:
SELECT *
FROM (
    SELECT tbl.*, @counter := @counter + 1 AS counter
    FROM (SELECT @counter := 0) initvar, tbl
    ORDER BY ordcolumn
) X
WHERE counter >= (25/100 * @counter)
  AND counter <= (75/100 * @counter)
ORDER BY ordcolumn;
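To make the row selection in that query easier to follow, here is a minimal sketch of the same arithmetic in plain Python rather than SQL. The function name `middle_half_average` and the sample values are made up for illustration. It assumes the goal is the average of the rows kept by the 25%-75% filter.

```python
def middle_half_average(values):
    """Average the rows whose 1-based rank lies between 25% and 75% of the row count."""
    ordered = sorted(values)      # plays the role of ORDER BY ordcolumn
    total = len(ordered)          # final value of @counter after the inner query
    kept = [
        v
        for counter, v in enumerate(ordered, start=1)
        if 0.25 * total <= counter <= 0.75 * total
    ]
    # No row qualifies for very small inputs (for example a single row).
    return sum(kept) / len(kept) if kept else None


print(middle_half_average([10, 20, 30, 40, 50, 60, 70, 80]))  # prints 40.0
```

Both bounds are inclusive, so with 8 rows the filter keeps ranks 2 through 6, which is five rows rather than exactly half.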
You can create the quartile values by using IF to set them to zero if in the wrong quartile:
Let's assume the raw data table is created by
DROP TABLE IF EXISTS `rawdata`;
CREATE TABLE `rawdata` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `url` varchar(250) NOT NULL DEFAULT '',
    `time` int(11) NOT NULL,
    PRIMARY KEY (`id`),
    KEY `time` (`time`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
(and of course populated).
Let's also assume the quartiles table is created by
DROP TABLE IF EXISTS `quartiles`;
CREATE TABLE `quartiles` (
    `url` varchar(250) NOT NULL,
    `Q1` float DEFAULT '0',
    `Q2` float DEFAULT '0',
    `Q3` float DEFAULT '0',
    `Q4` float DEFAULT '0',
    PRIMARY KEY (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
(and left empty).
Then a procedure to populate quartiles from rawdata would look like
DELIMITER ;;
CREATE PROCEDURE `ComputeQuartiles`()
    MODIFIES SQL DATA  -- the procedure writes to `quartiles` via REPLACE
BEGIN
    DECLARE numrows int DEFAULT 0;
    DECLARE qrows int DEFAULT 0;
    DECLARE rownum int DEFAULT 0;
    DECLARE done int DEFAULT 0;
    DECLARE currenturl VARCHAR(250) CHARACTER SET utf8;
    DECLARE Q1,Q2,Q3,Q4 float DEFAULT 0.0;
    DECLARE allurls CURSOR FOR SELECT DISTINCT url FROM rawdata;
    -- When the cursor runs out of rows, an empty URL ends the loop.
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET currenturl='';

    OPEN allurls;
    FETCH allurls INTO currenturl;
    WHILE currenturl<>'' DO
        SELECT COUNT(*) INTO numrows FROM rawdata WHERE url=currenturl;
        SET qrows=FLOOR(numrows/4);
        IF qrows>0 THEN
            -- Only session parameters can be recalculated inside a query,
            -- so @rownum:=@rownum+1 will work, but rownum:=rownum+1 will not.
            SET @rownum=0;
            SELECT
                SUM(IFNULL(QA,0))/qrows,
                SUM(IFNULL(QB,0))/qrows,
                SUM(IFNULL(QC,0))/qrows,
                SUM(IFNULL(QD,0))/qrows
            FROM (
                SELECT
                    if(@rownum<qrows,time,0) AS QA,
                    if(@rownum>=qrows AND @rownum<2*qrows,time,0) AS QB,
                    -- the middle 0-3 rows are left out
                    if(@rownum>=(numrows-2*qrows) AND @rownum<(numrows-qrows),time,0) AS QC,
                    if(@rownum>=(numrows-qrows),time,0) AS QD,
                    @rownum:=@rownum+1 AS dummy
                FROM rawdata
                WHERE url=currenturl
                ORDER BY time
            ) AS baseview
            INTO Q1,Q2,Q3,Q4;
            REPLACE INTO quartiles VALUES (currenturl,Q1,Q2,Q3,Q4);
        END IF;
        FETCH allurls INTO currenturl;
    END WHILE;
    CLOSE allurls;
END ;;
DELIMITER ;
The main points being:
- Use a cursor to cycle through the URLs (or adapt the sample to accept the URL as a parameter)
- For every URL find the total number of rows
- Do some trivial math to leave out the middle rows, if (rowcount % 4) != 0
- Select all raw rows for the URL, assigning the value of time to one of QA-QD, depending on the row number, and assigning the other Qx the value 0
- Use this query as a subquery to another one, which sums up and normalizes the values
- Use the results of this superquery to update quartiles table
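The bucketing arithmetic in these steps can be sketched in plain Python as a rough mirror of the procedure's row-number logic for a single URL, not a replacement for the SQL. The function name `quartile_means` and the sample data are hypothetical.

```python
def quartile_means(times):
    """Mirror the procedure's arithmetic for one URL: mean of each quartile bucket."""
    ordered = sorted(times)          # ORDER BY time
    numrows = len(ordered)           # SELECT COUNT(*)
    qrows = numrows // 4             # FLOOR(numrows/4)
    if qrows == 0:
        return None                  # the procedure skips URLs with fewer than 4 rows

    # Zero-based row ranges, matching the @rownum comparisons in the subquery.
    buckets = (
        ordered[:qrows],                                  # QA
        ordered[qrows:2 * qrows],                         # QB
        ordered[numrows - 2 * qrows:numrows - qrows],     # QC (middle 0-3 rows skipped)
        ordered[numrows - qrows:],                        # QD
    )
    return tuple(sum(bucket) / qrows for bucket in buckets)


print(quartile_means(range(1, 11)))  # prints (1.5, 3.5, 7.5, 9.5)
```

With 10 rows, qrows is 2, so the two middle values (5 and 6) belong to no bucket. That is the effect of leaving out the middle rows when the row count is not a multiple of 4.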
I tested this with 18432 raw rows, url=concat('http://.../',floor(rand()*10)), time=round(rand()*10000), on an 8x1.9GHz machine, and it finished consistently in 0.50-0.54 sec.
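How about this?
-- PREPARE accepts only a string literal or a user variable, so build the statement text first.
-- @of is the offset and @len the row count; LIMIT needs a comma between them.
SET @sql = CONCAT('select * from test where a="a" LIMIT ', @of, ',', @len);
PREPARE stmt FROM @sql;
EXECUTE stmt;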