SQL: Inner joining two massive tables SQL: Inner joining two massive tables sql-server sql-server

SQL: Inner joining two massive tables


I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.

You could try adding an integer column to each table and computing a hash on the NAME field that should get the possible matches to a reasonable number before the engine has to look at the actual VARCHAR data.


For huge joins, sometimes explicitly choosing a loop join speeds things up:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUXFROM EGM INNER LOOP JOIN BioEntity     ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId

As always, posting your estimated execution plan could help us provide better answers.

EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUXFROM EGM INNER JOIN BioEntity     ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeIdOPTION (MERGE JOIN)


First, 100M-row joins are not at all unreasonable or uncommon.

However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.

One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).