Average over a timeframe with missing data
Your difficulty (most costly step) will be to fill in the blanks. If it is not possible to "fill in the blanks" in your source data, you probably want to have a template to join on, then use correlated-sub-queries to find the data associated with that template.
This is often best with real tables, but here is an example with hard-coded in-line-views instead...
SELECT `room`.`uid` `uid` , AVG(`data`.`users`) `average_users`FROM (SELECT 1 `UID` UNION ALL SELECT 2 `UID` UNION ALL SELECT 3 `UID` UNION ALL SELECT 4 `UID`) `room`CROSS JOIN (SELECT '2012-08-03 14:00:00' `datetime` UNION ALL SELECT '2012-08-03 14:15:00' `datetime` UNION ALL SELECT '2012-08-03 14:30:00' `datetime` UNION ALL SELECT '2012-08-03 14:45:00' `datetime`) `checkin`LEFT JOIN data ON `data`.`uid` = `room`.`uid` AND `data`.`datetime` = (SELECT MAX(`datetime`) FROM `data` WHERE `uid` = `room`.`uid` AND `datetime` <= `checkin`.`datetime`)GROUP BY `room`.`uid`
- The CROSS JOIN
creates the template to ensure that you always have a record for every checkin slot for every room.
- The correlated sub-query
searches back through time to find the most recent checkin for that room at that time.
You can use this solution:
SELECT b.Name, AVG(b.Users) avg_usersFROM ( SELECT a.UID, MAX(c.Datetime) last_date FROM (SELECT DISTINCT UID FROM tbl) a CROSS JOIN ( SELECT '14:00:00' intrvl UNION ALL SELECT '14:15:00' UNION ALL SELECT '14:30:00' UNION ALL SELECT '14:45:00' ) b JOIN tbl c ON a.UID = c.UID AND TIME(b.intrvl) >= TIME(c.Datetime) GROUP BY a.UID, b.intrvl ) aJOIN tbl b ON a.UID = b.UID AND a.last_date = b.DatetimeGROUP BY b.UID, b.Name
Query Breakdown:
Step 1:
The first thing we need to do is associate each room with each time-interval. For example, in your example data, Room 4
does not have an association with intervals 14:15:00
and 14:30:00
, but we still need to somehow represent those associations.
We accomplish this by creating a Cartesian product of each distinct room with the relevant time-intervals:
SELECT a.UID, b.intrvlFROM (SELECT DISTINCT UID FROM tbl) aCROSS JOIN ( SELECT '14:00:00' intrvl UNION ALL SELECT '14:15:00' UNION ALL SELECT '14:30:00' UNION ALL SELECT '14:45:00' ) bORDER BY b.intrvl, a.UID DESC --Ordering for display purposes
Renders:
UID | intrvl--------------4 | 14:00:003 | 14:00:002 | 14:00:001 | 14:00:004 | 14:15:003 | 14:15:002 | 14:15:001 | 14:15:004 | 14:30:003 | 14:30:002 | 14:30:001 | 14:30:004 | 14:45:003 | 14:45:002 | 14:45:001 | 14:45:00
Step 2:
Then once we have those associations, we join the result back onto the main table (tbl
) on the condition that the main table's time part of its Datetime
field is less than the Cartesian-joined time for each UID
. What this will do is for each UID
-> intrvl
association, it will show all entries that have occurred on or before the intrvl
time.
So for example, since Room 3
doesn't have an entry for the 14:30:00
intrvl, only two entries will join with that intrvl: the ones on 14:15:00
and 14:00:00
since they both occurred either on or before the intrvl time.
You can now see where we are going with this. The result of this step will give us access to the most recent entry for each intrvl.
SELECT a.UID, b.intrvl, c.*FROM (SELECT DISTINCT UID FROM tbl) aCROSS JOIN ( SELECT '14:00:00' intrvl UNION ALL SELECT '14:15:00' UNION ALL SELECT '14:30:00' UNION ALL SELECT '14:45:00' ) bJOIN tbl c ON a.UID = c.UID AND TIME(b.intrvl) >= TIME(c.Datetime)ORDER BY b.intrvl, a.UID DESC, c.Datetime --Ordering for display purposes
Renders (excluding the Name
column):
UID | intrvl | Datetime | Users---------------- --------------------------------4 | 14:00:00 | 2012-08-03 14:00:00 | 3 <-- Most recent entry up until 14:00:003 | 14:00:00 | 2012-08-03 14:00:00 | 1 <-- Most recent entry up until 14:00:002 | 14:00:00 | 2012-08-03 14:00:00 | 3 <-- Most recent entry up until 14:00:001 | 14:00:00 | 2012-08-03 14:00:00 | 2 <-- Most recent entry up until 14:00:004 | 14:15:00 | 2012-08-03 14:00:00 | 3 <-- Most recent entry up until 14:15:003 | 14:15:00 | 2012-08-03 14:00:00 | 13 | 14:15:00 | 2012-08-03 14:15:00 | 1 <-- Most recent entry up until 14:15:002 | 14:15:00 | 2012-08-03 14:00:00 | 32 | 14:15:00 | 2012-08-03 14:15:00 | 4 <-- Most recent entry up until 14:15:001 | 14:15:00 | 2012-08-03 14:00:00 | 21 | 14:15:00 | 2012-08-03 14:15:00 | 3 <-- Most recent entry up until 14:15:004 | 14:30:00 | 2012-08-03 14:00:00 | 3 <-- Most recent entry up until 14:30:003 | 14:30:00 | 2012-08-03 14:00:00 | 1 3 | 14:30:00 | 2012-08-03 14:15:00 | 1 <-- Most recent entry up until 14:30:002 | 14:30:00 | 2012-08-03 14:00:00 | 32 | 14:30:00 | 2012-08-03 14:15:00 | 4 <-- Most recent entry up until 14:30:001 | 14:30:00 | 2012-08-03 14:00:00 | 21 | 14:30:00 | 2012-08-03 14:15:00 | 31 | 14:30:00 | 2012-08-03 14:30:00 | 6 <-- Most recent entry up until 14:30:004 | 14:45:00 | 2012-08-03 14:00:00 | 34 | 14:45:00 | 2012-08-03 14:45:00 | 4 <-- Most recent entry up until 14:45:003 | 14:45:00 | 2012-08-03 14:00:00 | 13 | 14:45:00 | 2012-08-03 14:15:00 | 13 | 14:45:00 | 2012-08-03 14:45:00 | 8 <-- Most recent entry up until 14:45:002 | 14:45:00 | 2012-08-03 14:00:00 | 32 | 14:45:00 | 2012-08-03 14:15:00 | 42 | 14:45:00 | 2012-08-03 14:45:00 | 7 <-- Most recent entry up until 14:45:001 | 14:45:00 | 2012-08-03 14:00:00 | 21 | 14:45:00 | 2012-08-03 14:15:00 | 31 | 14:45:00 | 2012-08-03 14:30:00 | 61 | 14:45:00 | 2012-08-03 14:45:00 | 3 <-- Most recent entry up until 14:45:00
Step 3:
Our next step is to take the result-set above and pull only the most recent joined Datetime
for each intrvl. We can accomplish this by using GROUP BY
in conjunction with the MAX()
aggregate function.
Unfortunately, we can't also correctly pull the value of Users
along with each of the selected Datetime
s due to how GROUP BY
behaves.
SELECT a.UID, b.intrvl, MAX(c.Datetime) last_dateFROM (SELECT DISTINCT UID FROM tbl) aCROSS JOIN ( SELECT '14:00:00' intrvl UNION ALL SELECT '14:15:00' UNION ALL SELECT '14:30:00' UNION ALL SELECT '14:45:00' ) bJOIN tbl c ON a.UID = c.UID AND TIME(b.intrvl) >= TIME(c.Datetime)GROUP BY a.UID, b.intrvlORDER BY b.intrvl, a.UID DESC --Again, for display purposes
Renders:
UID | intrvl | last_date---------------------------------------4 | 14:00:00 | 2012-08-03 14:00:003 | 14:00:00 | 2012-08-03 14:00:002 | 14:00:00 | 2012-08-03 14:00:001 | 14:00:00 | 2012-08-03 14:00:004 | 14:15:00 | 2012-08-03 14:00:003 | 14:15:00 | 2012-08-03 14:15:002 | 14:15:00 | 2012-08-03 14:15:001 | 14:15:00 | 2012-08-03 14:15:004 | 14:30:00 | 2012-08-03 14:00:003 | 14:30:00 | 2012-08-03 14:15:002 | 14:30:00 | 2012-08-03 14:15:001 | 14:30:00 | 2012-08-03 14:30:004 | 14:45:00 | 2012-08-03 14:45:003 | 14:45:00 | 2012-08-03 14:45:002 | 14:45:00 | 2012-08-03 14:45:001 | 14:45:00 | 2012-08-03 14:45:00
Step 4
Now we have to grab the value of Users
for each last_date
so we can take the average of those values. We do this by wrapping our query in the last step as a subselect inside the FROM
clause and joining once again back onto the main table on the condition that for each matching UID
-> last_date
association, grab the value of Users
.
SELECT a.UID, a.last_date, b.UsersFROM ( SELECT a.UID, MAX(c.Datetime) last_date FROM (SELECT DISTINCT UID FROM tbl) a CROSS JOIN ( SELECT '14:00:00' intrvl UNION ALL SELECT '14:15:00' UNION ALL SELECT '14:30:00' UNION ALL SELECT '14:45:00' ) b JOIN tbl c ON a.UID = c.UID AND TIME(b.intrvl) >= TIME(c.Datetime) GROUP BY a.UID, b.intrvl ) aJOIN tbl b ON a.UID = b.UID AND a.last_date = b.DatetimeORDER BY a.UID DESC --Display purposes again
Renders:
UID | last_date | Users---------------------------------4 | 2012-08-03 14:00:00 | 34 | 2012-08-03 14:00:00 | 34 | 2012-08-03 14:00:00 | 34 | 2012-08-03 14:45:00 | 43 | 2012-08-03 14:00:00 | 13 | 2012-08-03 14:15:00 | 13 | 2012-08-03 14:15:00 | 13 | 2012-08-03 14:45:00 | 82 | 2012-08-03 14:00:00 | 32 | 2012-08-03 14:15:00 | 42 | 2012-08-03 14:15:00 | 42 | 2012-08-03 14:45:00 | 71 | 2012-08-03 14:00:00 | 21 | 2012-08-03 14:15:00 | 31 | 2012-08-03 14:30:00 | 61 | 2012-08-03 14:45:00 | 3
Step 5
Now it's just a simple matter of grouping on each room and averaging the Users
column:
SELECT b.Name, AVG(b.Users) avg_usersFROM ( SELECT a.UID, MAX(c.Datetime) last_date FROM (SELECT DISTINCT UID FROM tbl) a CROSS JOIN ( SELECT '14:00:00' intrvl UNION ALL SELECT '14:15:00' UNION ALL SELECT '14:30:00' UNION ALL SELECT '14:45:00' ) b JOIN tbl c ON a.UID = c.UID AND TIME(b.intrvl) >= TIME(c.Datetime) GROUP BY a.UID, b.intrvl ) aJOIN tbl b ON a.UID = b.UID AND a.last_date = b.DatetimeGROUP BY b.UID, b.Name
Renders:
Name | avg_users------------------Room 1 | 3.5Room 2 | 4.5Room 3 | 2.75Room 4 | 3.25
I just played around a bit with MySQL variables and came up with the following idea:
Just calculate the (discrete) integral of users over time, and then divide by the total time.
SET @avgSum := @lastValue := @lastTime := @firstTime := 0;SELECT *, @firstTime := IF(@firstTime = 0, UNIX_TIMESTAMP(`DateTime`), @firstTime), @avgSum := @avgSum + (UNIX_TIMESTAMP(`DateTime`) - @lastTime) * @lastValue, @lastValue, @lastTime, @lastValue := `Users`, @lastTime := UNIX_TIMESTAMP(`DateTime`), @avgSum / (UNIX_TIMESTAMP(`DateTime`) - @firstTime) AS `average`FROM `table`WHERE `UID` = 1 AND UNIX_TIMESTAMP(`DateTime`) >= … AND UNIX_TIMESTAMP(`DateTime`) < …ORDER BY UNIX_TIMESTAMP(`DateTime`) ASC;
@firstTime
is the timestamp of the first user record, @avgSum
the sum of users over time (the integral). @lastValue
and @lastTime
are value and time of the previous record. The column average
is the total sum of users divides by the whole interval (don't mind the NULL
due to division by zero for the first record).
Two restrictions are still present: The first and the last record for the given interval must be present. Without, the average "ends" at the last available record.