
Why would you append a shard ID to a generated ID?


Maybe I can break it down for you a bit better, and it's not just because the user-id won't fit.

They're using Twitter's Snowflake ID. It was designed to generate unique IDs across multiple servers, across multiple data centers, in parallel. For instance, two "items" in two "places" might each need a guaranteed unique ID at the same instant, less than a millisecond apart, maybe even at the same nanosecond... This unique ID has to be extremely fast to produce, efficient, built in a logical way that can be parsed efficiently, able to fit within 64 bits, and the method of generating it needs to handle a HUGE number of IDs over many people's lifetimes. That means they can't do DB lookups to find a unique ID that isn't already taken, they can't verify after the fact that a generated ID is unique, and they couldn't use existing methods that might generate duplicates, even very rarely, like UUID. So they devised a way..

They set a custom common epoch, such as today, in a long integer as a base point. With this they have a 42-bit long integer that starts at 0 + the time since that epoch (in milliseconds).

Then they also added a sequence as a 12-bit long integer, for the case where a single process on a single machine has to generate 2 or more IDs in the same millisecond. Now they have 42+12=54 bits in use, and when you consider multiple processes on multiple machines (normally only one machine per data center providing IDs, but there could be more, and normally only one worker/process per machine), you realize that you need more than just 42+12..
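Keeping that sequence sane is the caller's job. As a rough sketch (my own illustration, with hypothetical names like last_timestamp, not from Twitter's code), a worker would do something like this:

import time

last_timestamp = -1  # last millisecond this worker issued an ID in
sequence = 0         # 12-bit counter within the current millisecond

def next_timestamp_and_sequence():
    """Return (timestamp_ms, sequence) to feed into ID generation."""
    global last_timestamp, sequence
    timestamp = int(time.time() * 1000)
    if timestamp == last_timestamp:
        # Same millisecond as the last ID: bump the 12-bit sequence
        sequence = (sequence + 1) & 0xFFF
        if sequence == 0:
            # All 4096 IDs for this millisecond are used; wait for the next one
            while timestamp <= last_timestamp:
                timestamp = int(time.time() * 1000)
    else:
        sequence = 0
    last_timestamp = timestamp
    return timestamp, sequence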

So they also have to encode a data center ID and a "worker" (process) ID. This covers multiple data centers with multiple workers in each. These two IDs are both 5-bit long integers. All these integers are unsigned, so each 5-bit integer can go up to 31, which gives each of these partial IDs 32 possibilities including 0. So: 32 data centers, with up to 32 workers in each data center. Now we're at 42+12+5+5=64 bits, with up to 32x32=1024 distributed workers producing these IDs.
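By the way, the "-1 ^ (-1 << bits)" expression you'll see in the code further down is just a quick way to compute the biggest value a field of a given width can hold:

# All-ones value for a given bit width
print(-1 ^ (-1 << 5))    # 31   (max 5-bit value, so 32 possibilities with 0)
print(-1 ^ (-1 << 12))   # 4095 (max 12-bit sequence value)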

So.. With a lifetime of up to 139 years fitting in the 42-bit portion... 10 bits for a node ID (or data center + worker IDs)... a sequence of 12 bits (4096 IDs per millisecond per worker)... you come up with a 64-bit guaranteed-unique ID system/formula that scales amazingly well over those 139 years, doesn't rely on a database in any way, but can be efficiently produced and stored in a database.
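If you want to sanity-check the 139 years yourself, it falls straight out of the 42-bit timestamp portion:

ms = 2 ** 42                              # 4,398,046,511,104 milliseconds
print(ms / 1000 / 60 / 60 / 24 / 365.25)  # ~139.4 years of unique timestamps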

So, this ID system works out to 42+12+10, and you can divide those 10 bits up, or not, however you like, without ever going beyond storing a 64-bit unsigned long integer anywhere. Very flexible, and works great.

Again, it's called a Snowflake ID, and Twitter came up with it. Those 10 bits can be called a shard ID, a node ID, or a combination of data center ID and worker ID; it really depends on your needs. But by tying that shard/node ID to processes rather than to a user, and being able to use that ID across multiple "things", you won't have to worry about a lot of problems, and you can span multiple databases full of multiple things, and so on..

The one thing that does matter is that the shard/node ID can only hold 1024 different values, and no user ID or other unique ID they could use is conveniently going to run from 0 to 1023 unless they assign it themselves.

So you see, those 10 bits have to be something that's static, assignable, and easily parseable for them regardless.

Here's a simple Python function that'll generate a snowflake ID:

def genSnowflakeId(worker_id, data_center_id, ids_generated):
    """Returns a snowflake ID.

    Generates a unique ID that fits in a 64-bit unsigned number and scales
    for multiple workers running in multiple data centers. You must manage
    timestamp and sequence sanity with ids_generated (i.e. increment it when
    IDs are less than 1 millisecond apart, rolling over to 0 past 4095).
    Ultimately this lets you efficiently generate unique IDs across multiple
    locations for 139 years that fit in a bigint(20) database field and can
    be parsed for the created timestamp, worker ID, and data center ID.
    See https://github.com/twitter-archive/snowflake/tree/snowflake-2010
    """
    import time

    # Custom epoch: Mon Jul  8 05:07:56 EDT 2019
    twepoch = 1562576876131

    # Bit widths: 42 timestamp + 5 data center + 5 worker + 12 sequence = 64
    worker_id_bits = 5
    data_center_id_bits = 5
    sequence_bits = 12

    # Max value for each field: -1 ^ (-1 << bits) is all ones for that width
    max_worker_id = -1 ^ (-1 << worker_id_bits)            # 31
    max_data_center_id = -1 ^ (-1 << data_center_id_bits)  # 31
    max_ids_generated = -1 ^ (-1 << sequence_bits)         # 4095

    # How far left each field sits in the final 64-bit ID
    worker_id_shift = sequence_bits
    data_center_id_shift = sequence_bits + worker_id_bits
    timestamp_left_shift = sequence_bits + worker_id_bits + data_center_id_bits

    # Sanity checks for input
    if worker_id > max_worker_id or worker_id < 0:
        raise ValueError("worker id can't be greater than %i or less than 0" % max_worker_id)
    if data_center_id > max_data_center_id or data_center_id < 0:
        raise ValueError("data center id can't be greater than %i or less than 0" % max_data_center_id)
    if ids_generated > max_ids_generated or ids_generated < 0:
        raise ValueError("ids generated can't be greater than %i or less than 0" % max_ids_generated)

    timestamp = int(time.time() * 1000)

    return ((timestamp - twepoch) << timestamp_left_shift) \
        | (data_center_id << data_center_id_shift) \
        | (worker_id << worker_id_shift) \
        | ids_generated
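And going the other way, since the format is parseable by design, here's a quick sketch of decoding (my own addition, not from Twitter's code, but it follows directly from the bit layout above):

def parseSnowflakeId(snowflake_id, twepoch=1562576876131):
    """Split a snowflake ID back into its parts using the same bit layout."""
    sequence = snowflake_id & 0xFFF                # low 12 bits
    worker_id = (snowflake_id >> 12) & 0x1F        # next 5 bits
    data_center_id = (snowflake_id >> 17) & 0x1F   # next 5 bits
    timestamp = (snowflake_id >> 22) + twepoch     # top 42 bits + custom epoch
    return timestamp, data_center_id, worker_id, sequence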

Hope this answer satisfies ya :)


They need an image ID that is 64 bits long.

41 bits for milliseconds since epoch + 13 bits for the shard-id + 10 bits for the autoincrement value.

They took the shard-id instead of the user-id simply because the shard-id fits in 13 bits, whereas the user-id would require more bits.
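As a quick sketch of that layout (my own illustration; the function name and inputs are hypothetical, with the shard-id coming from the user's shard and the autoincrement from a per-shard sequence):

def make_image_id(ms_since_epoch, shard_id, autoincrement):
    # 41 bits of time + 13 bits of shard-id + 10 bits of autoincrement = 64
    assert 0 <= shard_id < (1 << 13)
    return ((ms_since_epoch & ((1 << 41) - 1)) << 23) \
        | (shard_id << 10) \
        | (autoincrement % 1024)   # keep only the low 10 bits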