Method of finding gaps in time series data in MySQL? Method of finding gaps in time series data in MySQL? mysql mysql

Method of finding gaps in time series data in MySQL?


To start with, let us summarize the number of entries by hour in your table.

SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,       COUNT(*) samplecount  FROM table GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)

Now, if you log something every six minutes (ten times an hour) all your samplecount values should be ten. This expression: CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) looks hairy but it simply truncates your timestamps to the hour in which they occur by zeroing out the minute and second.

This is reasonably efficient, and will get you started. It's very efficient if you can put an index on your entry_time column and restrict your query to, let's say, yesterday's samples as shown here.

SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,       COUNT(*) samplecount  FROM table WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY   AND entry_time < CURRENT_DATE GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)

But it isn't much good at detecting whole hours that go by with missing samples. It's also a little sensitive to jitter in your sampling. That is, if your top-of-the-hour sample is sometimes a half-second early (10:59:30) and sometimes a half-second late (11:00:30) your hourly summary counts will be off. So, this hour summary thing (or day summary, or minute summary, etc) is not bulletproof.

You need a self-join query to get stuff perfectly right; it's a bit more of a hairball and not nearly as efficient.

Let's start by creating ourselves a virtual table (subquery) like this with numbered samples. (This is a pain in MySQL; some other expensive DBMSs make it easier. No matter.)

  SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value    FROM (        SELECT entry_time, value      FROM table         ORDER BY entry_time    ) C,    (SELECT @sample:=0) s

This little virtual table gives entry_num, entry_time, value.

Next step, we join it to itself.

SELECT one.entry_num, one.entry_time, one.value,        TIMEDIFF(two.value, one.value) interval  FROM (     /* virtual table */  ) ONE  JOIN (     /* same virtual table */  ) TWO ON (TWO.entry_num - 1 = ONE.entry_num)

This lines up the tables next two each other offset by a single entry, governed by the ON clause of the JOIN.

Finally we choose the values from this table with an interval larger than your threshold, and there are the times of the samples right before the missing ones.

The over all self join query is this. I told you it was a hairball.

SELECT one.entry_num, one.entry_time, one.value,        TIMEDIFF(two.value, one.value) interval  FROM (    SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value      FROM (          SELECT entry_time, value            FROM table           ORDER BY entry_time      ) C,      (SELECT @sample:=0) s  ) ONE  JOIN (    SELECT @sample2:=@sample2+1 AS entry_num, c.entry_time, c.value      FROM (          SELECT entry_time, value            FROM table           ORDER BY entry_time      ) C,      (SELECT @sample2:=0) s  ) TWO ON (TWO.entry_num - 1 = ONE.entry_num)

If you have to do this in production on a large table you may want to do it for a subset of your data. For example, you could do it each day for the previous two days' samples. This would be decently efficient, and would also make sure you didn't overlook any missing samples right at midnight. To do this your little rownumbered virtual tables would look like this.

  SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value    FROM (        SELECT entry_time, value      FROM table         ORDER BY entry_time         WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY           AND entry_time < CURRENT_DATE /*yesterday but not today*/    ) C,    (SELECT @sample:=0) s


A very efficient way to do this is with a stored procedure using cursors. I think this is simpler and more efficient than the other answers.

This procedure creates a cursor and iterates it through the datetime records that you are checking. If there is ever a gap of more than what you specify, it will write the gap's begin and end to a table.

    CREATE PROCEDURE findgaps()    BEGIN        DECLARE done INT DEFAULT FALSE;    DECLARE a,b DATETIME;    DECLARE cur CURSOR FOR SELECT dateTimeCol FROM targetTable                           ORDER BY dateTimeCol ASC;    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;         OPEN cur;           FETCH cur INTO a;           read_loop: LOOP        SET b = a;        FETCH cur INTO a;           IF done THEN            LEAVE read_loop;        END IF;             IF DATEDIFF(a,b) > [range you specify] THEN            INSERT INTO tmp_table (gap_begin, gap_end)            VALUES (a,b);        END IF;    END LOOP;               CLOSE cur;          END;

In this case it is assumed that 'tmp_table' exists. You could easily define this as a TEMPORARY table in the procedure, but I left it out of this example.


I'm trying this on MariaDB 10.3.27 so this procedure may not work, but I'm getting an error creating the procedure and I can't figure out why! I have a table called electric_use with a field Intervaldatetime DATETIME that I want to find gaps in. I created a target table electric_use_gaps with fields of gap_begin datetime and gap_end datetime

The data are taken every hour and I want to know if I'm missing even an hour's worth of data across 5 years.

 DELIMITER $$    CREATE PROCEDURE findgaps()    BEGIN        DECLARE done INT DEFAULT FALSE;    DECLARE a,b DATETIME;    DECLARE cur CURSOR FOR SELECT Intervaldatetime FROM electric_use                           ORDER BY Intervaldatetime ASC;    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;         OPEN cur;           FETCH cur INTO a;           read_loop: LOOP        SET b = a;        FETCH cur INTO a;           IF done THEN            LEAVE read_loop;        END IF;             IF TIMESTAMPDIFF(MINUTE,a,b) > [60] THEN            INSERT INTO electric_use_gaps(gap_begin, gap_end)            VALUES (a,b);        END IF;    END LOOP;               CLOSE cur;          END&&        DELIMITER ;

This is the error:

Query: CREATE PROCEDURE findgaps() BEGIN DECLARE done INT DEFAULT FALSE; DECLARE a,b DATETIME; DECLARE cur CURSOR FOR SELECT Intervalda...Error Code: 1064You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '[60] THEN            INSERT INTO electric_use_gaps(gap_begin, gap_end)   ...' at line 16