How to tackle a BIG DATA Data Mart / Fact Table? (240 million rows)

Tagged: hadoop



Have you checked out Google BigQuery (a paid premium service), which will suit your needs? It is as simple as:

  1. Load the data as CSV (records delimited by newlines, fields by a configurable character). The file can be gzip-compressed, and you can also append to an existing table.

  2. Start querying using SQL statements (a limited SQL dialect, though); results over multi-million-row tables come back in seconds (a sketch follows this list).

  3. Extract the data into a CSV file or another table (similar to an aggregation layer).
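For illustration, a query against a hypothetical sales fact table loaded into BigQuery might look like the following. The dataset, table and column names (mart.sales_fact, sale_year, product_id, revenue) are made up for the sketch, and what is available depends on BigQuery's (limited) SQL dialect:

  -- Roll up a hypothetical 240m-row fact table by year and product.
  SELECT
    sale_year,
    product_id,
    SUM(revenue) AS total_revenue,
    COUNT(*)     AS row_count
  FROM mart.sales_fact
  GROUP BY sale_year, product_id
  ORDER BY total_revenue DESC
  LIMIT 100;

The result set can then be saved to another table or exported to CSV, which is effectively step 3 above.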

Check it out here: https://developers.google.com/bigquery/

The first 100 GB of data processing is free, so you can get started now. It also integrates with Google Spreadsheets, which will allow you to create visualizations like charts and graphs for management. You can export the Google spreadsheet as Microsoft Excel / PDF.

Google states it can scale to multiple terabytes and provides real-time querying (a few seconds' response time).


First up, I'll assume it's 240m rows, not 2400m.

Firstly, take a look at ssd.analytical-labs.com

The FCC demo has a 150m-record fact table running on Infobright; I would suspect on VW it would be even faster.

The key is keeping it simple: there will be queries that make it slow down, but largely it is pretty responsive.

I would suggest you think about aggregates, the way you are querying, and, importantly, what you are querying.

For example, split it into marts for performance: by product, by brand, by year, etc. If the user just wants to query less than one year's worth of data (which is more often the case than most people would like to think), they can then use a much smaller fact table.
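As a rough sketch of what that could look like in SQL (table and column names such as sales_fact, sales_fact_2012 and sale_date are assumptions, not your actual schema):

  -- Carve a single-year mart out of the full fact table so that
  -- "current year" queries never have to scan all 240m rows.
  CREATE TABLE sales_fact_2012 AS
  SELECT *
  FROM sales_fact
  WHERE sale_date >= DATE '2012-01-01'
    AND sale_date <  DATE '2013-01-01';

The same pattern works for per-product or per-brand marts; you just change the filter.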

Storage is cheap, so it doesn't particularly matter if you duplicate data, as long as it keeps things responsive.

Of course, if you are doing OLAP you can also make use of inline aggregate tables to make sure most of the queries run at a far more acceptable speed, assuming they can be rolled up.
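A minimal sketch of such an aggregate table, rolled up to month and product (all names here are invented; Mondrian-style aggregate tables also conventionally carry a fact_count column):

  -- Pre-aggregated rollup of the base fact table to month x product.
  CREATE TABLE agg_month_product_sales AS
  SELECT
    month_id,
    product_id,
    SUM(revenue)  AS revenue,
    SUM(quantity) AS quantity,
    COUNT(*)      AS fact_count   -- number of base rows folded into each cell
  FROM sales_fact
  GROUP BY month_id, product_id;

Queries at month/product grain or above can then hit this table instead of the 240m-row fact table.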

Hardware is also very important: make sure you have fast disks, as that is nearly always the bottleneck. The faster you can get the data off the disks, generally the faster it will be displayed to the end user.

Schema design is also important: modern column-store databases much prefer a denormalised table with zero joins where possible. I have found in the past that having one denormalised table for 90% of queries, plus a few joined tables (a date dimension, for example) for special cases, covers most use cases.
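As a sketch of that layout (column names and types are purely illustrative):

  -- One wide, denormalised table: dimension attributes folded into the fact
  -- rows so that ~90% of queries need no joins at all. A small date_dim can
  -- still be joined on sale_date for the calendar-heavy special cases.
  CREATE TABLE sales_denorm (
    sale_date     DATE,
    product_name  VARCHAR(100),   -- folded in from the product dimension
    brand_name    VARCHAR(100),   -- folded in from the brand dimension
    store_region  VARCHAR(50),    -- folded in from the store dimension
    quantity      INTEGER,
    revenue       DECIMAL(12,2)
  );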

Anyway, that's my 2 cents. Ping me on Twitter if you want a Skype about it or something.

Tom

Edit:

Also, here's a non-scientific benchmark to back up what JVD was saying:

  • SSD on physical box: 175.67 MB/sec
  • SATA on physical box: 113.52 MB/sec
  • EC2: 75.65 MB/sec
  • EC2 EBS RAID: 89.36 MB/sec

As you can see there is a large difference in read speed.


I think there are a couple of approaches here:

1) You should try aggregate tables in Mondrian. The downside of agg tables is that you need to know beforehand the use cases for the most recurrent queries; if you don't, it's not so easy to tune them, and you will end up with long response times for the queries whose aggregate tables you didn't optimize.

2) Another option is to partition the data of the fact table, maybe by year: create a different schema for every year and a virtual cube for the whole history. If you have the right software, you could also create a materialized view (if you have Oracle) or an indexed view (if you have MS SQL Server); a sketch follows below.

The latter approach has worked very well for me, with noticeable improvements in query times. Besides, my ETL process wasn't affected (with option 1 you would need an extra process to build and maintain the aggregate tables), since the RDBMS takes care of updating the data in every partition.
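For reference, here is a hedged sketch of the indexed-view variant on SQL Server (Oracle's materialized view with query rewrite plays the same role). All object names are invented, and SQL Server imposes further restrictions (SCHEMABINDING, two-part names, COUNT_BIG with GROUP BY, and so on):

  -- View definition: an aggregate the optimizer can later answer from directly.
  CREATE VIEW dbo.v_sales_by_year_product
  WITH SCHEMABINDING
  AS
  SELECT
      sale_year,
      product_id,
      SUM(revenue)  AS total_revenue,
      COUNT_BIG(*)  AS row_count      -- required when the view has GROUP BY
  FROM dbo.sales_fact
  GROUP BY sale_year, product_id;
  GO

  -- Materialise it: the unique clustered index is what makes it an indexed view,
  -- and the engine keeps it up to date as the fact table changes.
  CREATE UNIQUE CLUSTERED INDEX ix_v_sales_by_year_product
      ON dbo.v_sales_by_year_product (sale_year, product_id);
  GO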