Word2vec training using gensim starts swapping after 100K sentences


As a first principle, you should always get more RAM if your budget and machine can manage it. It saves a great deal of time and trouble.

Second, it's unclear if you mean that on a dataset of more than 100K sentences, training starts to slow down after the first 100K sentences are encountered, or if you mean that training on any dataset larger than 100K sentences experiences the slowdown. I suspect it's the latter, because...

The Word2Vec memory usage is a function of the vocabulary size (the number of unique tokens the model tracks), not of the total amount of data used to train. So you may want to use a larger min_count, to slim the number of tracked words and cap the RAM usage during training. (Words not tracked by the model will be silently dropped during training, as if they weren't there. Doing that for rare words doesn't hurt much and sometimes even helps, by putting the remaining words closer to each other.)
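To make the dependence on vocabulary size concrete, here is a rough back-of-envelope sketch. It is not gensim's own accounting: it assumes float32 values, one input vector and one output-weight row per word, and a per-word bookkeeping overhead of about 500 bytes (that figure is an assumption chosen for illustration). The function name and parameters are hypothetical.

```python
def estimate_word2vec_ram(vocab_size, vector_size=100,
                          bytes_per_float=4, per_word_overhead=500):
    """Very rough RAM estimate in bytes for a Word2Vec-style model.

    Assumptions (not taken from gensim): float32 weights, two matrices of
    shape (vocab_size, vector_size), and ~500 bytes of bookkeeping per word.
    """
    input_vectors = vocab_size * vector_size * bytes_per_float
    output_weights = vocab_size * vector_size * bytes_per_float
    vocab_bookkeeping = vocab_size * per_word_overhead
    return input_vectors + output_weights + vocab_bookkeeping


# The corpus length never appears in the formula, only the vocabulary size.
for vocab_size in (100_000, 1_000_000, 5_000_000):
    gib = estimate_word2vec_ram(vocab_size) / 2**30
    print(f"{vocab_size:>9,} unique words -> roughly {gib:.2f} GiB")
```

Under these assumptions the estimate grows linearly with the number of unique words, which is why raising min_count is the most direct lever on RAM.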

Finally, you may wish to avoid providing the corpus sentences in the constructor, which automatically scans and trains, and instead explicitly call the build_vocab() and train() steps yourself after model construction, so you can examine the state/size of the model and adjust your parameters as needed.
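A minimal sketch of that two-step flow follows. It assumes gensim 4.x attribute names (key_to_index in particular; 3.x exposed model.wv.vocab instead), and the helper name train_in_steps is hypothetical. Adjust to whatever version you have installed.

```python
def train_in_steps(sentences, min_count=5, workers=2):
    """Build the vocabulary and train as two explicit steps (gensim 4.x assumed)."""
    # Third-party dependency (pip install gensim), imported only when needed.
    from gensim.models import Word2Vec

    # No corpus is passed to the constructor, so nothing is scanned or trained yet.
    model = Word2Vec(min_count=min_count, workers=workers)

    # Step 1: scan the corpus and build the vocabulary.
    model.build_vocab(sentences)

    # Inspect the model before committing to the expensive training step.
    print("vocabulary size:", len(model.wv.key_to_index))

    # Step 2: train. `sentences` must be re-iterable (a list, not a generator),
    # because it is read once for the vocabulary and again for every epoch.
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

If the printed vocabulary size looks too large for your RAM, you can stop before train(), construct a new model with a larger min_count, and repeat step 1.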

In particular, in the latest versions of gensim (as of this writing), you can also split the build_vocab(corpus) step up into three steps: scan_vocab(corpus), scale_vocab(...), and finalize_vocab().

The scale_vocab(...) step can be called with a dry_run=True parameter that previews how large your vocabulary, subsampled corpus, and expected memory-usage will be after trying different values of the min_count and sample parameters. When you find values that seem manageable, you can call scale_vocab(...) with those chosen parameters, and without dry_run, to apply them to your model (and then finalize_vocab() to initialize the large arrays).
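If you want a feel for what such a preview reports before touching the model, the idea can be approximated in plain Python. The sketch below is not gensim's scale_vocab(); it is a stand-in that only counts words, ignores the sample parameter entirely, and uses a hypothetical function name and toy corpus.

```python
from collections import Counter


def preview_min_count(sentences, min_counts=(1, 2, 5, 10)):
    """For each candidate min_count, report the surviving vocabulary size and
    the fraction of corpus tokens that would still be trained on."""
    counts = Counter(word for sentence in sentences for word in sentence)
    total_tokens = sum(counts.values())
    report = {}
    for min_count in min_counts:
        kept = {w: c for w, c in counts.items() if c >= min_count}
        report[min_count] = {
            "vocab_size": len(kept),
            "kept_token_fraction": (sum(kept.values()) / total_tokens
                                    if total_tokens else 0.0),
        }
    return report


# Toy corpus, purely illustrative.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["a", "rare", "word", "appears", "once"],
]
for min_count, stats in preview_min_count(corpus, min_counts=(1, 2, 3)).items():
    print(min_count, stats["vocab_size"], round(stats["kept_token_fraction"], 2))
```

On real text the vocabulary typically shrinks much faster than the kept-token fraction as min_count rises, because most unique words are rare.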


"Does it look like my setup isn't configured properly (or my code is inefficient)?"

1) In general, I would say no. However, given that you only have a tiny amount of RAM, I would use a lower number of workers. It will slow down the training, but maybe you can avoid the swap this way.

2) You can try stemming or, better, lemmatization. You will reduce the number of words since, for example, singular and plural forms will be counted as the same word.

3) However, I think 4 GB of RAM is probably your main problem here (aside from your OS, you probably only have 1-2 GB that can actually be used by the processes/threads). I would really think about investing in more RAM. For example, nowadays you can get good 16 GB RAM kits for < $100. However, if you have some money to invest in decent RAM for common ML/"data science" tasks, I'd recommend > 64 GB.
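To illustrate point 2, here is a toy sketch of how merging singular and plural forms shrinks the vocabulary. The crude_stem function is a hypothetical stand-in written for this example, not a real stemmer or lemmatizer; in practice you would use a proper one from an NLP library.

```python
def crude_stem(word):
    """Toy normalizer that strips a couple of plural endings.

    Illustration only: real stemmers and lemmatizers handle far more cases.
    """
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # cities -> city
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # cats -> cat, but glass stays glass
    return word


tokens = ["cat", "cats", "dog", "dogs", "city", "cities", "glass"]
raw_vocab = set(tokens)
stemmed_vocab = {crude_stem(t) for t in tokens}
print(len(raw_vocab), "unique words before,", len(stemmed_vocab), "after")
```

Since the model's memory depends on the number of unique words, every pair of forms merged this way is one fewer vector to store.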