Are RDBMS that bad as described in Hadoop: The definitive guide? Are RDBMS that bad as described in Hadoop: The definitive guide? postgresql postgresql

Are RDBMS that bad as described in Hadoop: The definitive guide?


Their assertion that an RDMBS query planner takes that solution to the query is incorrect, at least for Postgresql 9.0, and I should imagine for other platforms too. I did a quick test with a similar query:

explain select * from client_attribute where client_attribute_type_code in ('UAG', 'RFR', 'IPA', 'FVD') order by client_attribute_id desc limit 10;                                                      QUERY PLAN----------------------------------------------------------------------------------------------------------------------- Limit  (cost=0.00..0.93 rows=10 width=85)   ->  Index Scan Backward using client_attribute_pkey on client_attribute  (cost=0.00..15516.47 rows=167234 width=85)         Filter: (client_attribute_type_code = ANY ('{UAG,RFR,IPA,FVD}'::bpchar[]))(3 rows)

Here client_attribute_id is indexed, so it does exactly as desired- walks back through the index, applies the filter and stops when the output hits the limit.

If the ordering column is not indexed, a table scan and sort is requierd, but only one table scan:

explain analyze select * from client_attribute where client_attribute_type_code in ('UAG', 'RFR', 'IPA', 'FVD') order by updated desc limit 10;                                                              QUERY PLAN--------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=13647.00..13647.03 rows=10 width=85) (actual time=180.961..180.964 rows=10 loops=1)   ->  Sort  (cost=13647.00..14065.09 rows=167234 width=85) (actual time=180.960..180.961 rows=10 loops=1)         Sort Key: updated         Sort Method:  top-N heapsort  Memory: 26kB         ->  Seq Scan on client_attribute  (cost=0.00..10033.14 rows=167234 width=85) (actual time=0.010..106.791 rows=208325 loops=1)               Filter: (client_attribute_type_code = ANY ('{UAG,RFR,IPA,FVD}'::bpchar[]))

This uses a heapsort to maintain the top 10 results through the course of the sequential scan, which sounds exactly like the solution they wrote themselves.


I don't think that Tom White is saying that relational databases are "bad"; they aren't optimal for non-relational, non-set based data.

It's been well known for a long time that deep object graphs don't lend themselves well to relational databases. They're typically found in problems like CAD representations of geometric data, where assemblies are made up of assemblies of assemblies of parts. The reference chains are very long, indeed.

Object and graph databases have been the solutions to that kind of problems since I was aware of them in the early 90s.

Relational databases are terrific for relational, set-based data. But all data doesn't fall into that category. That's why NoSQL is gaining mind share.

I think that's what the example you cite is saying.


RDBMS is for the queries you haven't thought of. Once you are certain exactly what you want, you can then apply the most optimum solution.