
Speed up processing from CSV file


It looks to me like you're I/O bound. It doesn't help that your data is over a network. I suspect that if you just add more machines then your performance will go DOWN because of the extra contention. Remember that there's still just one spindle and just one HD head reading your data. For the MPI solution I'd suggest making multiple copies of the data and putting them on the servers themselves.

For MySQL, I hear what you're saying. I found MySQL to be very inefficient with joins. It looks to me like it does full-table scans when it could get away without them. I remember MySQL taking over a minute on a query that Oracle finished in under a second. Maybe try PostgreSQL? I'm not sure if it's any better.

Another approach could be to have the DB sort the data for you, so that you can then do the scan without a HashMap.
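A rough sketch of that sort-then-scan idea in Java, assuming a PostgreSQL JDBC driver on the classpath and a table pairs(k, v); the connection URL, credentials, table and column names are all made up for illustration. Because the database returns the rows sorted, identical pairs arrive next to each other and can be counted in one streaming pass, with no HashMap:

    import java.sql.*;
    import java.util.Objects;

    public class SortedScan {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/mydb", "user", "password"); // placeholder URL and credentials
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT k, v FROM pairs ORDER BY k, v")) {
                long distinct = 0;
                String prevK = null, prevV = null;
                while (rs.next()) {
                    String k = rs.getString(1), v = rs.getString(2);
                    // The rows are sorted, so a pair is new exactly when it differs from the previous row.
                    if (!(Objects.equals(k, prevK) && Objects.equals(v, prevV))) {
                        distinct++;
                        prevK = k;
                        prevV = v;
                    }
                }
                System.out.println("distinct pairs: " + distinct);
            }
        }
    }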

Unless your records are ginormous, 100M records shouldn't be that bad.


If you read the data from the CSV, I assume it won't change too often. So instead of loading it into a generic database product, you could also construct your own index over the CSV data. Or do you need to have full SQL support?

Apart from that, you mention that you want to return the NUMBER of different K,V pairs, yet your code actually computes the pairs themselves. I don't know if you need them for some other purpose, but you could also get that number as #distinctKeys x #distinctValues without actually building a HashMap.
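A minimal sketch of that counting idea, assuming the key is in column 0 and the value in column 1 of a comma-separated file called data.csv (all assumptions about your layout): two HashSets instead of a HashMap of pairs.

    import java.io.BufferedReader;
    import java.nio.file.*;
    import java.util.*;

    public class DistinctCounts {
        public static void main(String[] args) throws Exception {
            Set<String> keys = new HashSet<>();
            Set<String> values = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("data.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(",");   // assumes at least two comma-separated columns per line
                    keys.add(cols[0]);
                    values.add(cols[1]);
                }
            }
            // The count described above: #distinctKeys x #distinctValues
            System.out.println((long) keys.size() * values.size());
        }
    }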

Assuming you build an index for each column of the form

value -> {r | r is a byteOffset of a row that has "value" in the index column}

you could answer many, many queries, and determining the number of distinct pairs in particular should only take a couple of milliseconds.
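As a rough illustration of such an index, here is a sketch that records, for one column, the byte offset of every row containing each value; the file name, column positions, comma delimiter, and the looked-up key "someKey" are assumptions made for the example.

    import java.io.RandomAccessFile;
    import java.util.*;

    public class ColumnIndex {

        // Build: value -> list of byte offsets of rows that have "value" in the given column.
        static Map<String, List<Long>> build(RandomAccessFile raf, int column) throws Exception {
            Map<String, List<Long>> index = new HashMap<>();
            raf.seek(0);
            long offset = 0;
            String line;
            while ((line = raf.readLine()) != null) { // readLine() on RandomAccessFile is unbuffered and slow; fine for a sketch
                String value = line.split(",")[column];
                index.computeIfAbsent(value, k -> new ArrayList<>()).add(offset);
                offset = raf.getFilePointer();
            }
            return index;
        }

        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("data.csv", "r")) {
                Map<String, List<Long>> keyIndex = build(raf, 0);   // index over the key column
                Map<String, List<Long>> valueIndex = build(raf, 1); // index over the value column

                // The count suggested above: #distinctKeys x #distinctValues, no pair HashMap needed.
                System.out.println((long) keyIndex.size() * valueIndex.size());

                // Any stored offset lets you jump straight back to its row.
                List<Long> offsets = keyIndex.get("someKey");       // "someKey" is a placeholder
                if (offsets != null) {
                    raf.seek(offsets.get(0));
                    System.out.println(raf.readLine());
                }
            }
        }
    }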

I hope this answer is helpful, since I am not sure what other requirements have to be met. This solution is significantly less powerful than a DB supporting SQL (inserts in particular will make things a lot more complicated), but at least determining the number of distinct pairs should be faster by several orders of magnitude.


Divide and conquer: a hundred small databases should be WAY faster. You decide how to break it up - use split() or slice(). I am currently using the first character of the first word of each line, so where there once was one huge, slow DB there are now 62 (A-Z + a-z + 0-9) small, faster databases. Another advantage is that a laptop can now do the job that only a powerful, expensive PC could do before.
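Here is a rough sketch of that kind of split in Java, writing to per-bucket files rather than per-bucket databases; the input file name data.csv is an assumption, and files are named by character code so that 'A' and 'a' don't collide on case-insensitive filesystems.

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class SplitByFirstChar {
        public static void main(String[] args) throws IOException {
            Map<Character, BufferedWriter> buckets = new HashMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("data.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Route each line by the first character of its first word (A-Z, a-z, 0-9);
                    // anything else falls into a single catch-all bucket '_'.
                    char c = line.isEmpty() ? '_' : line.charAt(0);
                    char bucket = Character.isLetterOrDigit(c) ? c : '_';
                    BufferedWriter out = buckets.get(bucket);
                    if (out == null) {
                        out = Files.newBufferedWriter(Paths.get("bucket_" + (int) bucket + ".csv"));
                        buckets.put(bucket, out);
                    }
                    out.write(line);
                    out.newLine();
                }
            }
            for (BufferedWriter out : buckets.values()) {
                out.close();
            }
        }
    }

Each bucket file (or small DB) can then be loaded and queried independently, which is also what makes it easy to spread the work across cores or machines.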