
How to efficiently store and query a billion rows of sensor data


So you're going to have 3 billion records by the end of this year (which has just begun). Each record is 4 bytes for the ID + 4 bytes for the datetime + 8 bytes for a double value, which totals 3*10^9 * (4+4+8) bytes == 48GB.
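As a quick back-of-the-envelope check of that number (plain Python, using only the record layout described above):

    # Rough size estimate for the dataset described above.
    records = 3 * 10**9           # ~3 billion records expected this year
    bytes_per_record = 4 + 4 + 8  # 4-byte ID + 4-byte datetime + 8-byte double
    total_bytes = records * bytes_per_record
    print(total_bytes / 10**9)    # -> 48.0, i.e. 48GB (decimal gigabytes)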

You can easily store and process these 48GB in an in-memory database like Redis, Couchbase, Tarantool, or Aerospike. All of them are open source, so you don't need to pay a license fee.
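For example, with Redis one possible layout is a sorted set per sensor, with the timestamp as the score. This is only a sketch using the redis-py client; the key naming (sensor:<id>) and member format are assumptions for illustration, not something from the question:

    import time
    import redis  # redis-py client

    r = redis.Redis(host="localhost", port=6379)

    # Store one reading: sorted set per sensor, score = unix timestamp,
    # member = "<timestamp>:<value>" so equal values at different times are kept.
    def store_reading(sensor_id, ts, value):
        r.zadd(f"sensor:{sensor_id}", {f"{ts}:{value}": ts})

    # Query all readings of one sensor inside a time range.
    def readings_between(sensor_id, ts_from, ts_to):
        return r.zrangebyscore(f"sensor:{sensor_id}", ts_from, ts_to)

    store_reading(42, int(time.time()), 21.5)
    print(readings_between(42, 0, int(time.time())))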

There might be some additional memory overhead of 10-30%, so 48GB can grow to 64GB or slightly more. You should feed those databases with your real data to pick the most economical one for your case.

Only one physical machine should be enough for the whole workload, because in-memory databases are able to handle 100K-1M queries/updates per second per node (the real number depends on your specific workload pattern). For the sake of better availability I would set up two servers - a master and a slave.

In my experience, the price of a physical server with 64GB of RAM on board is $2-3K. Notice that you don't even need an SSD. A spinning disk should be fine, because all the reads hit RAM and all the writes only append to the transaction log. This is how in-memory databases work. I can elaborate on this if you have any questions.
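To illustrate that read/write split, here is a toy sketch of the pattern (an in-memory dict plus an append-only log file). It is only a conceptual illustration of how such databases persist writes, not code from any of the products above:

    import json

    class TinyInMemoryStore:
        """Toy illustration: reads are served from RAM, writes append to a log."""

        def __init__(self, log_path="store.log"):
            self.data = {}                  # all reads hit this in-memory dict
            self.log = open(log_path, "a")  # sequential appends, fine on a spinning disk

        def put(self, key, value):
            # Durability first: append the write to the log, then apply it in memory.
            self.log.write(json.dumps({"k": key, "v": value}) + "\n")
            self.log.flush()
            self.data[key] = value

        def get(self, key):
            return self.data.get(key)       # no disk access on the read path

    store = TinyInMemoryStore()
    store.put("sensor:42:1700000000", 21.5)
    print(store.get("sensor:42:1700000000"))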


So I have used all the technologies you listed in some manner or another. What kind of queries do you need to perform? Depending on that, you could rule out some of the solutions.

If you don't need to query in a lot of different ways, Table Storage could work out well for you. It's going to scale really well if you follow the guidelines, and it is cheap. But if you can't just do a point query for the data you need, then it might not work so well, or it might be too complicated to be a good option.

OpenTSDB is great if you want a time series database. It will limit you to time-series-type queries. There are a lot of time series DBs out there, and there are a lot of applications built on top of them, like Bosun and Grafana, to list two that I use.

For the last option, HDI, I would store the data in Parquet format (or some columnar format), create a Hive table on top of the data, and query it with Spark SQL (a sketch of that follows below). You don't really need to use Spark; you could use Hive as well. But what you should stay away from is traditional MapReduce: that paradigm is basically dead nowadays, and you should not write new code in it. On top of that, if you don't know it, there is a steep learning curve around it.

I use all of these technologies; we use them for different parts of our system, and it really depends on the read and write requirements of the application. I would look at using Spark and Parquet if I were you, but it's a lot of new tooling that might not be needed.
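A minimal sketch of that Parquet + Spark SQL approach, using PySpark. The paths, column names, and partitioning scheme here are assumptions for illustration, not from the question:

    from pyspark.sql import SparkSession

    # Hive support lets the Parquet data be exposed as a table other tools can query too.
    spark = SparkSession.builder.appName("sensor-data").enableHiveSupport().getOrCreate()

    # Assume the raw readings arrive as CSV with columns: sensor_id, ts, value.
    df = spark.read.csv("/data/raw/readings.csv", header=True, inferSchema=True)

    # Write a columnar copy, partitioned by sensor_id so point lookups prune most files.
    df.write.mode("overwrite").partitionBy("sensor_id").parquet("/data/parquet/readings")

    # Register it as a table and query with Spark SQL.
    spark.read.parquet("/data/parquet/readings").createOrReplaceTempView("readings")
    spark.sql("""
        SELECT sensor_id, avg(value) AS avg_value
        FROM readings
        WHERE ts BETWEEN '2016-01-01' AND '2016-01-31'
        GROUP BY sensor_id
    """).show()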


3 billion data points per year is a pretty low number for modern time series databases such as VictoriaMetrics. It can persist this amount of data in less than 3 minutes at an ingestion speed of 19 million samples per second on a computer with 64 vCPUs. See this article for details.

There are VictoriaMetrics production setups with up to 10 trillion data points per single node. And it scales to multiple nodes.
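As a rough sketch of what ingestion and querying can look like against a single-node VictoriaMetrics instance, assuming it listens on its default port 8428 and that the InfluxDB line protocol and Prometheus-compatible query endpoints behave as I recall; the metric and label names are made up for illustration:

    import time
    import requests

    VM = "http://localhost:8428"  # assumed single-node VictoriaMetrics address

    # Ingest a few samples via the InfluxDB line protocol endpoint.
    now_ns = int(time.time() * 1e9)
    lines = "\n".join(
        f"sensor_value,sensor_id={i} value={20.0 + i} {now_ns}" for i in range(3)
    )
    requests.post(f"{VM}/write", data=lines).raise_for_status()

    # Query it back through the Prometheus-compatible HTTP API.
    # By default the Influx measurement and field are joined into the metric name,
    # and very recent samples may take a short while to become visible to queries.
    resp = requests.get(
        f"{VM}/api/v1/query",
        params={"query": 'sensor_value_value{sensor_id="1"}'},
    )
    print(resp.json())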