Time based data analysis with Python


My first stab at this would be to create multi-column indexes on "Sensor Data Table", along these lines:

sensor->timestamp->value1 //Index 1
sensor->timestamp->value2 //Index 2
sensor->timestamp->value3 //Index 3
sensor->timestamp->value4 //Index 4

See if your SQL queries are fast enough. You could query it via eventlets or cron. From a performance perspective it doesn't matter which you use as long as this query is fast enough; the query is most likely to be your bottleneck.
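As a rough way to check this, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever database you actually run. The table name sensor_data, the column names, and the index names are assumptions for illustration; it builds one composite index per value column, as in the list above, and asks the query planner whether a time-window query uses an index.

```python
import sqlite3

# In-memory SQLite as a stand-in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_data ("
    " sensor TEXT, timestamp INTEGER,"
    " value1 REAL, value2 REAL, value3 REAL, value4 REAL)"
)

# One composite index per value column: (sensor, timestamp, valueN).
for col in ("value1", "value2", "value3", "value4"):
    conn.execute(
        f"CREATE INDEX idx_sensor_ts_{col} "
        f"ON sensor_data (sensor, timestamp, {col})"
    )

# Hypothetical sample data: 100 readings for one sensor.
rows = [("s1", t, float(t), 0.0, 0.0, 0.0) for t in range(100)]
conn.executemany("INSERT INTO sensor_data VALUES (?, ?, ?, ?, ?, ?)", rows)

query = (
    "SELECT timestamp, value1 FROM sensor_data "
    "WHERE sensor = ? AND timestamp BETWEEN ? AND ? ORDER BY timestamp"
)
recent = conn.execute(query, ("s1", 90, 99)).fetchall()

# EXPLAIN QUERY PLAN reports how SQLite intends to run the query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("s1", 90, 99)).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

On MySQL or PostgreSQL the equivalent check would be EXPLAIN on the same query; the exact plan output differs per database.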

Another suggestion is to try MySQL Memory tables, or the PostgreSQL equivalent (In-memory table in PostgreSQL).

Yet another suggestion is to try Redis. You can store "Sensor Data" as a collection of sorted sets: one sorted set per sensor id and value field, with the entries sorted by timestamp.

 ZADD sensor_id:value1 timestamp value
 ZADD sensor_id:value2 timestamp value

Redis will require some application logic to accumulate the data, but it will be very fast if it all fits in RAM.
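To give an idea of what that application logic might look like, here is a pure-Python stand-in for the layout, not Redis itself. The class SensorStore and its method names are hypothetical. It keeps one sorted list per "sensor_id:field" key and stores (timestamp, value) pairs, since a Redis sorted set keeps only one entry per distinct member and repeated readings with the same value would otherwise collapse.

```python
import bisect
from collections import defaultdict


class SensorStore:
    """In-memory stand-in for one sorted set per (sensor id, value field)."""

    def __init__(self):
        # "sensor_id:field" -> list of (timestamp, value), kept sorted
        self._sets = defaultdict(list)

    def add(self, sensor_id, field, timestamp, value):
        # Rough analogue of: ZADD sensor_id:field timestamp value
        bisect.insort(self._sets[f"{sensor_id}:{field}"], (timestamp, value))

    def range_by_time(self, sensor_id, field, start, end):
        # Rough analogue of: ZRANGEBYSCORE sensor_id:field start end
        entries = self._sets[f"{sensor_id}:{field}"]
        lo = bisect.bisect_left(entries, (start, float("-inf")))
        hi = bisect.bisect_right(entries, (end, float("inf")))
        return entries[lo:hi]

    def mean(self, sensor_id, field, start, end):
        # Accumulation happens in the application, not in the store.
        window = self.range_by_time(sensor_id, field, start, end)
        if not window:
            return None
        return sum(value for _, value in window) / len(window)


store = SensorStore()
for ts, val in [(100, 1.0), (105, 3.0), (110, 5.0), (120, 9.0)]:
    store.add("sensor42", "value1", ts, val)

# Average of value1 for sensor42 over the window [100, 110]
window_mean = store.mean("sensor42", "value1", 100, 110)
```

With a real Redis client, add and range_by_time would become ZADD and ZRANGEBYSCORE calls, and only the averaging would stay in Python.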

Re: MongoDB. You can get good performance as long as your queryable data and indexes fit in RAM and there aren't too many write locks, although it's an administrative (and coding) burden to run two heavyweight databases that provide overlapping features. That said, compaction is not really an issue: you can create TTL indexes on the sensor data, and MongoDB will delete older data in a background thread. The file size will remain constant after a while.

Hope this helps


If your rules are simple or few, you could try using SQL triggers to update stored views, which can then be queried quickly. For example, assuming you want to detect that a certain sensor has been active for a given amount of time, you could keep a table containing the activation times of the sensors that are currently active. Whenever you store a raw event, a trigger would update that table.
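A minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in database. The table names raw_events and active_sensors, the active flag column, and the helper active_longer_than are all assumptions made for illustration; your schema and trigger syntax will differ on MySQL or PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (sensor TEXT, timestamp INTEGER, active INTEGER);
CREATE TABLE active_sensors (sensor TEXT PRIMARY KEY, active_since INTEGER);

-- Sensor switched on: remember when, unless it is already recorded.
CREATE TRIGGER sensor_on AFTER INSERT ON raw_events
WHEN NEW.active = 1
BEGIN
    INSERT OR IGNORE INTO active_sensors VALUES (NEW.sensor, NEW.timestamp);
END;

-- Sensor switched off: drop it from the currently-active table.
CREATE TRIGGER sensor_off AFTER INSERT ON raw_events
WHEN NEW.active = 0
BEGIN
    DELETE FROM active_sensors WHERE sensor = NEW.sensor;
END;
""")

# Hypothetical raw events: (sensor, timestamp, active flag)
events = [("door", 100, 1), ("fan", 110, 1), ("door", 130, 1), ("fan", 150, 0)]
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", events)


def active_longer_than(conn, now, min_seconds):
    """Sensors that have been continuously active for at least min_seconds."""
    rows = conn.execute(
        "SELECT sensor FROM active_sensors "
        "WHERE ? - active_since >= ? ORDER BY sensor",
        (now, min_seconds),
    )
    return [row[0] for row in rows]


long_running = active_longer_than(conn, now=200, min_seconds=60)
```

The rule check then only reads the small active_sensors table instead of scanning the raw events.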

It would be more difficult for rules of type 3, unless either there are few of them and you can set up a set of triggers and views for each one, or the allowed time periods are known up front.


Option #4. The relational database is the clear bottleneck. Arrange for data delivery in some simpler form (files in a directory, named by sensor name or whatever the unique key is). You can process these much more quickly, checking timestamps and reading, then push the data to the RDB at the back end after you analyze.
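One way this could look, as a sketch only: it assumes one CSV file per sensor, named after the sensor, with "timestamp,value" rows. The file layout and the function name fresh_readings are assumptions, not something the setup above prescribes.

```python
import csv
import tempfile
from pathlib import Path


def fresh_readings(data_dir, since):
    """Return {sensor_name: [(timestamp, value), ...]} for rows at or after `since`."""
    result = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="") as fh:
            rows = [(int(ts), float(val)) for ts, val in csv.reader(fh)]
        recent = [row for row in rows if row[0] >= since]
        if recent:
            # The file stem doubles as the sensor's unique key.
            result[path.stem] = recent
    return result


# Demo with a throwaway directory and two hypothetical sensors.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sensor_a.csv").write_text("100,1.5\n200,2.5\n")
    Path(tmp, "sensor_b.csv").write_text("50,9.0\n")
    fresh = fresh_readings(tmp, since=150)

print(fresh)
```

After the analysis step, the same rows can be bulk-inserted into the relational database in one batch.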