why is scaling writes to a relational database virtually impossible?

mysql


The slowness of physical disk subsystems is usually the single greatest challenge to overcome when trying to scale a database to service a very large number of concurrent writers. But it is not "virtually impossible" to optimize writes to a relational database; it can be done. There is a trade-off, though: when you optimize for writes, selects of large subsets of logically related data are usually slower.

The writes of the primary data to disk and the rebalancing of index trees can be disk-intensive. The maintenance of clustered indexes, whereby rows that belong logically together are stored physically contiguously on disk, is also disk-intensive. Such indexes make selects (reads) quicker while slowing writes. A heavily indexed table therefore does not scale well, and the lower the cardinality of the index, the less well it scales.
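To put a rough number on that, here is a toy Python calculation; the one-page-write-per-index figure is an illustrative assumption, not a MySQL measurement:

    # Rough illustration: every secondary index is another structure that
    # must be touched on each INSERT, so write cost grows with index count.
    # (Assumes ~1 extra page write per index per row -- a simplification.)

    def page_writes_per_insert(num_secondary_indexes: int) -> int:
        data_page_write = 1                         # the row itself
        index_page_writes = num_secondary_indexes   # one B-tree touched per index
        return data_page_write + index_page_writes

    for k in (0, 2, 5, 10):
        print(f"{k} secondary indexes -> ~{page_writes_per_insert(k)} page writes per insert")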

One optimization aimed at improving the speed of concurrent writers is to use sparse tables with hashed primary keys and minimal indexing. This approach eliminates the need for an index on the primary key value and permits an immediate seek to the disk location where a row lives, 'immediate' in the sense that the intermediary of an index read is not required. The hashed primary key algorithm returns the physical address of the row using the primary key value itself: a simple computation that requires no disk access.
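A minimal Python sketch of that computation, assuming a fixed-size file of equally sized row slots (the bucket count and row size are invented for illustration):

    import hashlib

    BUCKETS = 1_000_003   # fixed number of slots in the sparse table (illustrative)
    ROW_SIZE = 256        # fixed-width row slots, so slot -> byte offset is arithmetic

    def row_location(primary_key: str) -> tuple[int, int]:
        """Map a primary key straight to (slot, byte offset) -- no index read."""
        h = int.from_bytes(hashlib.sha1(primary_key.encode()).digest()[:8], "big")
        slot = h % BUCKETS
        return slot, slot * ROW_SIZE

    print(row_location("order-829431"))   # the same key always hashes to the same slot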

The sparse table is exactly the opposite of storing logically related data so that they are physically contiguous. In a sparse table, writers do not step on each other's toes, so to speak. Writes are like raindrops falling on a large field, not like a crowd of people on a subway platform trying to step into the train through a few open doors. The sparse table helps to eliminate write bottlenecks.

However, because logically related data are not physically contiguous, but scattered, the act of gathering all rows in a certain zipcode, say, is expensive. This sparse-table hashed-pk optimization is therefore optimal only when the predominant activity is the insertion of records, the update of individual records, and the lookup of data relating to a single entity at a time rather than to a large set of entities, as in, say, an order-entry system. A company that sold merchandise on TV and had to service tens of thousands of simultaneous callers placing orders would be well served by a system that used sparse tables with hashed primary keys. A national security database that relied upon linked lists would also be well served by this approach. Many social networking applications could also use it to advantage.
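Here is a toy illustration of that trade-off, using an in-memory dict to stand in for the hashed file: a lookup by primary key is a single probe, while a query on any other column has to visit every occupied slot.

    # Toy sparse table: keys hash to slots, so a key lookup is one probe,
    # but a query on a non-key column has to scan every occupied slot.
    table = {}   # slot -> row; stands in for the hashed file on disk

    def put(key, row):
        table[hash(key) % 1_000_003] = row

    def get(key):                       # single-probe lookup by primary key
        return table.get(hash(key) % 1_000_003)

    def rows_in_zip(zipcode):           # full scan: related rows are scattered
        return [r for r in table.values() if r["zip"] == zipcode]

    put("cust-1", {"name": "Ada", "zip": "02139"})
    put("cust-2", {"name": "Bob", "zip": "02139"})
    print(get("cust-1"))
    print(rows_in_zip("02139"))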


A sharded database is actually quite different from a normal SQL database. In a lot of ways it is more like a custom NoSQL system that just happens to use a database for storage. Unless your dataset consists of a lot of completely disconnected subsets, most queries more complex than a get by ID won't work the same way as they do on a single-node database.
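A toy Python router makes the point; the four in-memory dicts stand in for separate database servers, and the city query is a hypothetical example of something "more complex than get by ID":

    # Toy router over a hash-sharded cluster: a get-by-ID touches one shard,
    # anything else has to fan out to every shard and merge results client-side.
    SHARDS = [dict() for _ in range(4)]           # stand-ins for four database servers

    def shard_for(user_id):
        return SHARDS[user_id % len(SHARDS)]

    def save_user(user_id, row):
        shard_for(user_id)[user_id] = row         # single-shard write

    def get_user(user_id):
        return shard_for(user_id).get(user_id)    # single-shard read

    def find_users_by_city(city):                 # scatter-gather across every shard
        return [r for shard in SHARDS for r in shard.values() if r["city"] == city]

    save_user(42, {"name": "Ada", "city": "London"})
    print(get_user(42))
    print(find_users_by_city("London"))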

The other reason is that SQL writes tend to be fairly expensive due to the requirement for immediate consistency: the indexes that are required for decent read performance on a large database get updated as part of the write operation, and various constraints are checked. In systems designed for horizontal scalability, these additional operations are usually either skipped entirely or performed separately from the write.
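A loose Python sketch of that contrast (the queue-and-drain scheme here is only a stand-in for whatever a particular horizontally scaled store actually does):

    from collections import deque

    # Contrast a consistent write path with an eventually consistent one:
    # the first updates the secondary index and checks a uniqueness constraint
    # inside the write, the second just queues that work for later.

    rows, email_index, index_backlog = {}, {}, deque()

    def write_consistent(user_id, email):
        if email in email_index:                   # constraint checked up front
            raise ValueError("email already taken")
        rows[user_id] = email
        email_index[email] = user_id               # index updated in the same step

    def write_eventual(user_id, email):
        rows[user_id] = email                      # fast path: just the row
        index_backlog.append((email, user_id))     # index/constraint work deferred

    def drain_backlog():                           # runs later, e.g. in a background job
        while index_backlog:
            email, user_id = index_backlog.popleft()
            email_index[email] = user_id

    write_consistent(1, "ada@example.com")
    write_eventual(2, "bob@example.com")           # returns before the index is updated
    drain_backlog()
    print(email_index)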


Obviously this is their opinion, and StackOverflow itself is easy proof that you can scale relational writes for busy sites effectively.

NoSQL providers like Cassandra do make it much easier to scale to multiple servers, but this is not impossible with traditional databases, and scaling to multiple db servers is rarely necessary.