
Clustering ~100,000 Short Strings in Python


100,000 × 100,000 entries × 32 bits each = 40 GB, which would be a lot of RAM, so yes, you need to find another way. (And even if you could fit that much data into memory, the computation would take too long.)

One common and easy shortcut is to cluster a small random subset of the data, and after you find the clusters of this subset, just put the rest of the points into the clusters where they fit best.
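A minimal numpy-only sketch of that idea: run plain k-means on a small random subset, then assign every remaining point to its nearest centroid. The synthetic `data` array, the cluster count, and the subset size are all placeholder assumptions; real use would embed the strings as vectors (e.g. q-gram counts) first.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means on an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

def assign(points, centroids):
    """Put points into the clusters where they fit best."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Hypothetical data: stand-in vectors for the 100,000 strings.
rng = np.random.default_rng(42)
data = rng.normal(size=(100_000, 8)).astype(np.float32)

# Cluster a 2,000-point random subset, then place everything else.
subset = data[rng.choice(len(data), size=2_000, replace=False)]
centroids, _ = kmeans(subset, k=10)
labels = assign(data, centroids)
```

This never materialises a pairwise distance matrix: the only large intermediate is n × k distances, which for 100,000 points and 10 clusters is tiny compared to 40 GB.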


10 billion elements is an awful lot. I don't know much about q-grams, but if that matrix is sparse, you could use a dict holding only the 200,000-ish nonzero entries.
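A sketch of that sparse-dict idea: store only the pairs whose distance falls below a cutoff, and treat every missing pair as "far". The `qgram_distance` metric (symmetric difference of bigram multisets) and the sample words are my own assumptions for illustration, not anything from the question.

```python
from collections import Counter

def qgram_distance(a, b, q=2):
    """Crude q-gram distance: size of the symmetric difference of the
    two strings' q-gram multisets. An assumption, not the OP's metric."""
    ga = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    gb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(((ga - gb) + (gb - ga)).values())

def sparse_distances(strings, metric, cutoff):
    """Keep only the 'near' pairs; everything else stays implicit."""
    near = {}
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            d = metric(strings[i], strings[j])
            if d < cutoff:
                near[(i, j)] = d
    return near

def lookup(near, i, j, default=float("inf")):
    """Distance lookup that treats missing pairs as 'far'."""
    if i == j:
        return 0
    key = (i, j) if i < j else (j, i)
    return near.get(key, default)

words = ["night", "nacht", "nicht", "zebra"]
near = sparse_distances(words, qgram_distance, cutoff=6)
```

If only ~200,000 of the 10 billion pairs are near each other, the dict costs megabytes instead of tens of gigabytes. (Building it naively is still O(n²) metric calls, so in practice you'd pair this with an index or blocking scheme to avoid comparing everything to everything.)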


Do you need the matrix? I assume you want to use a matrix for speed?

I have a k-means clustering algorithm (rather than a hierarchical one), and it calculates node distances as required. It's probably only viable for fast distance metrics, though. And you have more data than I do, but you are bound by memory limitations either way.
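The compute-on-demand idea can be sketched like this: memoize a pairwise string distance and only ever evaluate the pairs the clustering step actually asks for, instead of filling a matrix up front. The edit-distance metric, the tiny `strings` list, and the medoid-style assignment are illustrative assumptions, not the answerer's actual code.

```python
from functools import lru_cache

# Hypothetical sample data standing in for the 100,000 strings.
strings = ["apple", "apply", "ample", "spark", "spork", "stork"]

@lru_cache(maxsize=None)
def dist(i, j):
    """Edit distance between strings[i] and strings[j], computed lazily
    and cached, so each pair is evaluated at most once."""
    if i > j:
        i, j = j, i
    a, b = strings[i], strings[j]
    prev = list(range(len(b) + 1))  # classic DP edit distance
    for r, ca in enumerate(a, 1):
        cur = [r]
        for c, cb in enumerate(b, 1):
            cur.append(min(prev[c] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[c - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def nearest_medoid(i, medoids):
    """Assign string i to the closest representative; distances are
    computed only as required."""
    return min(medoids, key=lambda m: dist(i, m))
```

With k representatives, one assignment pass costs n × k distance evaluations rather than the n² needed for a full matrix, which is why this stays feasible where the 40 GB matrix does not.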