Average over a timeframe with missing data Average over a timeframe with missing data sql sql

Average over a timeframe with missing data


Your difficulty (most costly step) will be to fill in the blanks. If it is not possible to "fill in the blanks" in your source data, you probably want to have a template to join on, then use correlated-sub-queries to find the data associated with that template.

This is often best with real tables, but here is an example with hard-coded in-line-views instead...

SELECT  `room`.`uid`           `uid` ,  AVG(`data`.`users`)    `average_users`FROM  (SELECT 1 `UID`  UNION ALL   SELECT 2 `UID`  UNION ALL   SELECT 3 `UID`  UNION ALL   SELECT 4 `UID`)                                     `room`CROSS JOIN  (SELECT '2012-08-03 14:00:00' `datetime`  UNION ALL   SELECT '2012-08-03 14:15:00' `datetime`  UNION ALL   SELECT '2012-08-03 14:30:00' `datetime`  UNION ALL   SELECT '2012-08-03 14:45:00' `datetime`)            `checkin`LEFT JOIN  data    ON  `data`.`uid`      = `room`.`uid`    AND `data`.`datetime` = (SELECT MAX(`datetime`)                               FROM `data`                              WHERE `uid`       = `room`.`uid`                                AND `datetime` <= `checkin`.`datetime`)GROUP BY  `room`.`uid`

- The CROSS JOIN creates the template to ensure that you always have a record for every checkin slot for every room.

- The correlated sub-query searches back through time to find the most recent checkin for that room at that time.


You can use this solution:

SELECT   b.Name,          AVG(b.Users) avg_usersFROM     (         SELECT     a.UID,                     MAX(c.Datetime) last_date         FROM       (SELECT DISTINCT UID FROM tbl) a         CROSS JOIN (                    SELECT '14:00:00' intrvl UNION ALL                    SELECT '14:15:00'        UNION ALL                    SELECT '14:30:00'        UNION ALL                    SELECT '14:45:00'                    ) b         JOIN       tbl c ON a.UID           = c.UID                         AND TIME(b.intrvl) >= TIME(c.Datetime)         GROUP BY   a.UID,                    b.intrvl         ) aJOIN     tbl b ON a.UID       = b.UID              AND a.last_date = b.DatetimeGROUP BY b.UID,         b.Name

Query Breakdown:


Step 1:

The first thing we need to do is associate each room with each time-interval. For example, in your example data, Room 4 does not have an association with intervals 14:15:00 and 14:30:00, but we still need to somehow represent those associations.

We accomplish this by creating a Cartesian product of each distinct room with the relevant time-intervals:

SELECT     a.UID,            b.intrvlFROM       (SELECT DISTINCT UID FROM tbl) aCROSS JOIN (           SELECT '14:00:00' intrvl UNION ALL           SELECT '14:15:00'        UNION ALL           SELECT '14:30:00'        UNION ALL           SELECT '14:45:00'           ) bORDER BY   b.intrvl, a.UID DESC --Ordering for display purposes

Renders:

UID | intrvl--------------4   | 14:00:003   | 14:00:002   | 14:00:001   | 14:00:004   | 14:15:003   | 14:15:002   | 14:15:001   | 14:15:004   | 14:30:003   | 14:30:002   | 14:30:001   | 14:30:004   | 14:45:003   | 14:45:002   | 14:45:001   | 14:45:00

SQLFiddle Demo


Step 2:

Then once we have those associations, we join the result back onto the main table (tbl) on the condition that the main table's time part of its Datetime field is less than the Cartesian-joined time for each UID. What this will do is for each UID -> intrvl association, it will show all entries that have occurred on or before the intrvl time.

So for example, since Room 3 doesn't have an entry for the 14:30:00 intrvl, only two entries will join with that intrvl: the ones on 14:15:00 and 14:00:00 since they both occurred either on or before the intrvl time.

You can now see where we are going with this. The result of this step will give us access to the most recent entry for each intrvl.

SELECT     a.UID,            b.intrvl,           c.*FROM       (SELECT DISTINCT UID FROM tbl) aCROSS JOIN (           SELECT '14:00:00' intrvl UNION ALL           SELECT '14:15:00'        UNION ALL           SELECT '14:30:00'        UNION ALL           SELECT '14:45:00'           ) bJOIN       tbl c ON a.UID           = c.UID                AND TIME(b.intrvl) >= TIME(c.Datetime)ORDER BY   b.intrvl, a.UID DESC, c.Datetime --Ordering for display purposes

Renders (excluding the Name column):

UID |  intrvl    |  Datetime             |  Users---------------- --------------------------------4   |  14:00:00  |  2012-08-03 14:00:00  |  3   <-- Most recent entry up until 14:00:003   |  14:00:00  |  2012-08-03 14:00:00  |  1   <-- Most recent entry up until 14:00:002   |  14:00:00  |  2012-08-03 14:00:00  |  3   <-- Most recent entry up until 14:00:001   |  14:00:00  |  2012-08-03 14:00:00  |  2   <-- Most recent entry up until 14:00:004   |  14:15:00  |  2012-08-03 14:00:00  |  3   <-- Most recent entry up until 14:15:003   |  14:15:00  |  2012-08-03 14:00:00  |  13   |  14:15:00  |  2012-08-03 14:15:00  |  1   <-- Most recent entry up until 14:15:002   |  14:15:00  |  2012-08-03 14:00:00  |  32   |  14:15:00  |  2012-08-03 14:15:00  |  4   <-- Most recent entry up until 14:15:001   |  14:15:00  |  2012-08-03 14:00:00  |  21   |  14:15:00  |  2012-08-03 14:15:00  |  3   <-- Most recent entry up until 14:15:004   |  14:30:00  |  2012-08-03 14:00:00  |  3   <-- Most recent entry up until 14:30:003   |  14:30:00  |  2012-08-03 14:00:00  |  1   3   |  14:30:00  |  2012-08-03 14:15:00  |  1   <-- Most recent entry up until 14:30:002   |  14:30:00  |  2012-08-03 14:00:00  |  32   |  14:30:00  |  2012-08-03 14:15:00  |  4   <-- Most recent entry up until 14:30:001   |  14:30:00  |  2012-08-03 14:00:00  |  21   |  14:30:00  |  2012-08-03 14:15:00  |  31   |  14:30:00  |  2012-08-03 14:30:00  |  6   <-- Most recent entry up until 14:30:004   |  14:45:00  |  2012-08-03 14:00:00  |  34   |  14:45:00  |  2012-08-03 14:45:00  |  4   <-- Most recent entry up until 14:45:003   |  14:45:00  |  2012-08-03 14:00:00  |  13   |  14:45:00  |  2012-08-03 14:15:00  |  13   |  14:45:00  |  2012-08-03 14:45:00  |  8   <-- Most recent entry up until 14:45:002   |  14:45:00  |  2012-08-03 14:00:00  |  32   |  14:45:00  |  2012-08-03 14:15:00  |  42   |  14:45:00  |  2012-08-03 14:45:00  |  7   <-- Most recent entry up until 14:45:001   |  14:45:00  |  2012-08-03 14:00:00  |  21   |  14:45:00  |  2012-08-03 14:15:00  |  31   |  14:45:00  |  2012-08-03 14:30:00  |  61   |  14:45:00  |  2012-08-03 14:45:00  |  3   <-- Most recent entry up until 14:45:00

SQLFiddle Demo


Step 3:

Our next step is to take the result-set above and pull only the most recent joined Datetime for each intrvl. We can accomplish this by using GROUP BY in conjunction with the MAX() aggregate function.

Unfortunately, we can't also correctly pull the value of Users along with each of the selected Datetimes due to how GROUP BY behaves.

SELECT     a.UID,            b.intrvl,           MAX(c.Datetime) last_dateFROM       (SELECT DISTINCT UID FROM tbl) aCROSS JOIN (           SELECT '14:00:00' intrvl UNION ALL           SELECT '14:15:00'        UNION ALL           SELECT '14:30:00'        UNION ALL           SELECT '14:45:00'           ) bJOIN       tbl c ON a.UID           = c.UID                AND TIME(b.intrvl) >= TIME(c.Datetime)GROUP BY   a.UID,           b.intrvlORDER BY   b.intrvl, a.UID DESC --Again, for display purposes

Renders:

UID |  intrvl    |  last_date---------------------------------------4   |  14:00:00  |  2012-08-03 14:00:003   |  14:00:00  |  2012-08-03 14:00:002   |  14:00:00  |  2012-08-03 14:00:001   |  14:00:00  |  2012-08-03 14:00:004   |  14:15:00  |  2012-08-03 14:00:003   |  14:15:00  |  2012-08-03 14:15:002   |  14:15:00  |  2012-08-03 14:15:001   |  14:15:00  |  2012-08-03 14:15:004   |  14:30:00  |  2012-08-03 14:00:003   |  14:30:00  |  2012-08-03 14:15:002   |  14:30:00  |  2012-08-03 14:15:001   |  14:30:00  |  2012-08-03 14:30:004   |  14:45:00  |  2012-08-03 14:45:003   |  14:45:00  |  2012-08-03 14:45:002   |  14:45:00  |  2012-08-03 14:45:001   |  14:45:00  |  2012-08-03 14:45:00

SQLFiddle Demo


Step 4

Now we have to grab the value of Users for each last_date so we can take the average of those values. We do this by wrapping our query in the last step as a subselect inside the FROM clause and joining once again back onto the main table on the condition that for each matching UID -> last_date association, grab the value of Users.

SELECT   a.UID,         a.last_date,         b.UsersFROM     (         SELECT     a.UID,                     MAX(c.Datetime) last_date         FROM       (SELECT DISTINCT UID FROM tbl) a         CROSS JOIN (                    SELECT '14:00:00' intrvl UNION ALL                    SELECT '14:15:00'        UNION ALL                    SELECT '14:30:00'        UNION ALL                    SELECT '14:45:00'                    ) b         JOIN       tbl c ON a.UID           = c.UID                         AND TIME(b.intrvl) >= TIME(c.Datetime)         GROUP BY   a.UID,                    b.intrvl         ) aJOIN     tbl b ON a.UID       = b.UID              AND a.last_date = b.DatetimeORDER BY a.UID DESC --Display purposes again

Renders:

UID | last_date           | Users---------------------------------4   | 2012-08-03 14:00:00 | 34   | 2012-08-03 14:00:00 | 34   | 2012-08-03 14:00:00 | 34   | 2012-08-03 14:45:00 | 43   | 2012-08-03 14:00:00 | 13   | 2012-08-03 14:15:00 | 13   | 2012-08-03 14:15:00 | 13   | 2012-08-03 14:45:00 | 82   | 2012-08-03 14:00:00 | 32   | 2012-08-03 14:15:00 | 42   | 2012-08-03 14:15:00 | 42   | 2012-08-03 14:45:00 | 71   | 2012-08-03 14:00:00 | 21   | 2012-08-03 14:15:00 | 31   | 2012-08-03 14:30:00 | 61   | 2012-08-03 14:45:00 | 3

SQLFiddle Demo


Step 5

Now it's just a simple matter of grouping on each room and averaging the Users column:

SELECT   b.Name,          AVG(b.Users) avg_usersFROM     (         SELECT     a.UID,                     MAX(c.Datetime) last_date         FROM       (SELECT DISTINCT UID FROM tbl) a         CROSS JOIN (                    SELECT '14:00:00' intrvl UNION ALL                    SELECT '14:15:00'        UNION ALL                    SELECT '14:30:00'        UNION ALL                    SELECT '14:45:00'                    ) b         JOIN       tbl c ON a.UID           = c.UID                         AND TIME(b.intrvl) >= TIME(c.Datetime)         GROUP BY   a.UID,                    b.intrvl         ) aJOIN     tbl b ON a.UID       = b.UID              AND a.last_date = b.DatetimeGROUP BY b.UID,         b.Name

Renders:

Name   | avg_users------------------Room 1 | 3.5Room 2 | 4.5Room 3 | 2.75Room 4 | 3.25

SQLFiddle Demo of Final Result


I just played around a bit with MySQL variables and came up with the following idea:

Just calculate the (discrete) integral of users over time, and then divide by the total time.

SET @avgSum := @lastValue := @lastTime := @firstTime := 0;SELECT  *,  @firstTime := IF(@firstTime = 0, UNIX_TIMESTAMP(`DateTime`), @firstTime),  @avgSum := @avgSum + (UNIX_TIMESTAMP(`DateTime`) - @lastTime) * @lastValue,  @lastValue,  @lastTime,  @lastValue := `Users`,  @lastTime := UNIX_TIMESTAMP(`DateTime`),  @avgSum / (UNIX_TIMESTAMP(`DateTime`) - @firstTime) AS `average`FROM  `table`WHERE  `UID` = 1 AND  UNIX_TIMESTAMP(`DateTime`) >= … AND  UNIX_TIMESTAMP(`DateTime`) < …ORDER BY  UNIX_TIMESTAMP(`DateTime`) ASC;

@firstTime is the timestamp of the first user record, @avgSum the sum of users over time (the integral). @lastValue and @lastTime are value and time of the previous record. The column average is the total sum of users divides by the whole interval (don't mind the NULL due to division by zero for the first record).

Two restrictions are still present: The first and the last record for the given interval must be present. Without, the average "ends" at the last available record.