Find matching records based on dynamic columns
The use of cursor, loops and dynamic SQL can be avoided by treating the fields used for comparison as an entry in a bit set for each pet. A bit set (FieldSetRank column) is calculated for each priority based on a bit entry (FieldRank rank column).
The Pets and Owner tables has to be unpivoted so that the fields and their associated values can be compared. Each of the fields and value that has been matched is assigned to a corresponding FieldRank. A new bit set is then calculated based on the matched values (MatchSetRank). Only records where the matched sets (MatchSetRank) are equal to the desired sets (FieldSetRank) are returned.
The query performs one final ranking to return records with the highest MatchSetRank (records that matched on the highest number of columns while maintaining priority criteria. The following T-SQL demonstrates the concept.
;WITH CTE_Builder AS ( SELECT [PetID] ,[Field] ,[Priority] ,[MatchTo] ,POWER(2, [Priority] - 1) AS [FieldRank] -- Define the field ranking as bit set numbered item. ,SUM(POWER(2, [Priority] - 1)) OVER (PARTITION BY [PetID] ORDER BY [Priority] ROWS UNBOUNDED PRECEDING) FieldSetRank -- Sum all the bit set IDs to define what constitutes a completed field set ordered by priority. FROM temp_builder ),CTE_PetsUnpivotedAS( -- Unpivot pets table and assign Field Rank and Field Set Rank. SELECT [PetsUnPivot].[PetID] ,[PetsUnPivot].[Field] ,[Builder].[MatchTo] ,[PetsUnPivot].[FieldValue] ,[Builder].[Priority] ,[Builder].[FieldRank] ,[Builder].[FieldSetRank] FROM ( SELECT [PetID], [Address], CAST([Zip] AS VARCHAR(100)) AS [Zip], [Country], [Document], [OwnerName] FROM temp_pets ) [Pets] UNPIVOT (FieldValue FOR Field IN ([Address], [Zip], [Country], [Document], [OwnerName]) ) AS [PetsUnPivot] INNER JOIN [CTE_Builder] [Builder] ON [PetsUnPivot].PetID = [Builder].PetID AND [PetsUnPivot].Field = [Builder].Field),CTE_OwnersAS( -- Unpivot Owners table and join with unpivoted Pets table on field name and field value. -- Next assign Pets field rank then calculated the field set rank (MatchSetRank) based on actual matches made. SELECT [OwnersUnPivot].[OwnerID] ,[Pets].[PetID] ,[OwnersUnPivot].[Field] ,[Pets].Field AS [PetField] ,[Pets].FieldValue as PetFieldValue ,[OwnersUnPivot].[FieldValue] ,[Pets].[Priority] ,[Pets].[FieldRank] ,[Pets].[FieldSetRank] ,SUM([FieldRank]) OVER (PARTITION BY [Pets].[PetID], [OwnersUnPivot].[OwnerID] ORDER BY [Pets].[Priority] ROWS UNBOUNDED PRECEDING) MatchSetRank FROM ( SELECT [OwnerID], [Addr], CAST([Zip] AS VARCHAR(100)) AS [Zip], [Country], [Document], [Name] FROM temp_owners ) [Owners] UNPIVOT (FieldValue FOR Field IN ([Addr], [Zip], [Country], [Document], [Name]) ) AS [OwnersUnPivot] INNER JOIN [CTE_PetsUnpivoted] [Pets] ON [OwnersUnPivot].[Field] = [Pets].[MatchTo] AND [OwnersUnPivot].[FieldValue] = [Pets].[FieldValue]),CTE_FinalRankingAS( SELECT [PetID] ,[OwnerID] -- -- Calculate final rank, if multiple matches have the same rank then multiple rows will be returned per pet. -- Change the “RANK()” function to "ROW_NUMBER()" to only return on result per pet. ,RANK() OVER (PARTITION BY [PetID] ORDER BY [MatchSetRank] DESC) AS [FinalRank] FROM CTE_Owners WHERE [FieldSetRank] = [MatchSetRank] -- Only return records where the field sets calculated based on -- actual matches is equal to desired field set ranks. This will -- eliminate matches where the number of fields that meets the -- criteria is the same but does not meet priority requirements. )SELECT [PetID] ,[OwnerID]FROM CTE_FinalRankingWHERE [FinalRank] = 1
I will say right away to save your time:
- My solution uses dynamic SQL. Michał Turczyn correctly noted that you can't avoid it while names of compared column are stored in the DB.
- My solution uses loops. And I strongly believe that you will not solve this problem with the pure SQL query which will work fast enough on the data size you declared (tables have > 1M records). The logic you described implies iterations by its nature - from bigger set of matching fields to the lower set. SQL as a query language was not designed to cover such tricky scenarios. You could try to solve your problem with pure SQL query, but even if you manage to build such query, it will be very tricky, complex and unclear. I'm not a fan of such solutions. That's why I have not even dig into this direction.
- On the other hand my solution does not require creation of temporary tables which is an advantage.
Given this, my approach is fair straightforward:
There is an outer loop that iterates from the biggest set of matchers (all matching fields) to the smallest set of matchers (one field). On the first iteration, when we don't know yet how many matchers are stored in the database for the pet, we read and use them all. On the next iterations, we decrease number of used matchers by 1 (removing the one with highest priority).
The inner loop iterates over current set of matchers and builds the
WHERE
clause that compares fields betweenPets
andOwners
table.Current query is executed and if some owner matches given criteria, we break from the outer loop.
Here is the code that implements this logic:
DECLARE @PetId INT = 2;DECLARE @MatchersLimit INT;DECLARE @OwnerID INT;WHILE (@MatchersLimit IS NULL OR @MatchersLimit > 0) AND @OwnerID IS NULLBEGIN DECLARE @CurrMatchFilter VARCHAR(max) = '' DECLARE @Field VARCHAR(30) DECLARE @MatchTo VARCHAR(30) DECLARE @CurrMatchersNumber INT = 0; DECLARE @GetMatchers CURSOR; IF @MatchersLimit IS NULL SET @GetMatchers = CURSOR FOR SELECT Field, MatchTo FROM temp_builder WHERE PetID = @PetId ORDER BY Priority ASC; ELSE SET @GetMatchers = CURSOR FOR SELECT TOP (@MatchersLimit) Field, MatchTo FROM temp_builder WHERE PetID = @PetId ORDER BY Priority ASC; OPEN @GetMatchers; FETCH NEXT FROM @GetMatchers INTO @Field, @MatchTo; WHILE @@FETCH_STATUS = 0 BEGIN IF @CurrMatchFilter <> '' SET @CurrMatchFilter = @CurrMatchFilter + ' AND '; SET @CurrMatchFilter = @CurrMatchFilter + ('temp_pets.' + @Field + ' = ' + 'temp_owners.' + @MatchTo); FETCH NEXT FROM @GetMatchers INTO @field, @matchTo; SET @CurrMatchersNumber = @CurrMatchersNumber + 1; END CLOSE @GetMatchers; DEALLOCATE @GetMatchers; IF @CurrMatchersNumber = 0 BREAK; DECLARE @CurrQuery nvarchar(max) = N'SELECT @id = temp_owners.OwnerID FROM temp_owners INNER JOIN temp_pets ON (' + CAST(@CurrMatchFilter AS NVARCHAR(MAX)) + N') WHERE temp_pets.PetID = ' + CAST(@PetId AS NVARCHAR(MAX)); EXECUTE sp_executesql @CurrQuery, N'@id int OUTPUT', @id=@OwnerID OUTPUT; IF @MatchersLimit IS NULL SET @MatchersLimit = @CurrMatchersNumber - 1; ELSE SET @MatchersLimit = @MatchersLimit - 1;ENDSELECT @OwnerID AS OwnerID, @MatchersLimit + 1 AS Matched;
Performance considerations
There are basically 2 queries that are executed in this approach:
SELECT Field, MatchTo FROM temp_builder WHERE PetID = @PetId;
You should add an index on
PetID
field intemp_builder
table and this query will be executed very fast.SELECT @id = temp_owners.OwnerID FROM temp_owners INNER JOIN temp_pets ON (temp_pets.Document = temp_owners.Document AND temp_pets.OwnerName = temp_owners.Name AND temp_pets.Zip = temp_owners.Zip AND ...) WHERE temp_pets.PetID = @PetId;
This query looks scary because it joins two big tables -
temp_owners
andtemp_pets
. Howevertemp_pets
table is filtered byPetID
column that should result in just one record. So if you have an index ontemp_pets.PetID
column (and you should as this column seems like a primary key), the query will result into scan oftemp_owners
table. Such scan will not take the ages even for table with over 1M rows. If the query is still too slow, you could consider adding indexes for columns oftemp_owners
table that are used in the matchers (Addr
,Zip
, etc.). Adding indexes has downsides, like bigger database and slower insert/update operations. So before adding the indexes ontemp_owners
columns, check the query speed on table without indexes.
I'm not sure if I've got the end result correctly or not, But I suggest using a couple of common table expressions to generate a batch of update statements using dynamic SQL (I'm afraid it can't be done without dynamic SQL), and then execute them using Exec(sql)
.
The benefit of this approach is that it involves no loops or cursors.
Every update statement I generate is using an inner join
between the pets and the owners tables, updating the pets table's owner id with the owners table owner id, using the mappings from the builder table as the basis to the on
clause.
The first cte is responsible of generating the on
clause from the builder table, and the second one is responsible to generate the update statements.
Finally, I'm selecting all the SQL statements from the second CTE into a single nvarchar(max)
variable and execute it.
The way I've tackled the priority issue is to generate a single update statement for each group of priorities, starting with including all priorities, and excluding values from the next SQL statement, highest priority gets excluded first, until I'm left with an on
clause mapping only a single set of columns.
So, the first thing is to declare a variable to hold the generated update statements:
DECLARE @Sql nvarchar(max) = ''
Now, the first CTE is using cross apply
with stuff
and for xml
to generate the on
clause for each pair of petId
and Priority
:
;WITH OnClauseCTE AS(SELECT DISTINCT PetId, Priority, OnClauseFROM temp_builder t0CROSS APPLY( SELECT STUFF ( ( SELECT ' AND p.'+ Field +' = o.'+ MatchTo FROM temp_builder t1 WHERE PetID = t0.PetId AND Priority <= t0.Priority FOR XML PATH('') ) , 1, 5, '') As OnClause) onClauseGenerator)
The second CTE generates a single UPDATE
statement for every petId
and Priority
combination:
, UpdateStatementCTE AS( SELECT PetId, Priority, 'UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON ' + OnClause + ' WHERE p.PetId = '+ CAST(PetId as varchar(10)) +' AND p.OwnerID IS NULL; -- THIS IS CRITICAL! ' AS SQL FROM OnClauseCTE)
And Finally, Generating a single batch of update statements from the UpdateStatementCTE:
SELECT @Sql = @Sql + SQLFROM UpdateStatementCTE ORDER BY PetId, Priority DESC -- ORDER BY Priority is CRITICAL!
The order by PetId
is strictly to help the readability, when you print out the content of @Sql
. However, the Priority DESC
part of the order by
clause is critical, since we want to execute the highest priority first and the lowest priority last.
Now, @Sql
contains this (shortened):
UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON p.Address = o.Addr AND p.Zip = o.Zip AND p.Country = o.Country AND p.OwnerName = o.Name WHERE p.PetId = 1AND p.OwnerID IS NULL;...UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON p.OwnerName = o.Name WHERE p.PetId = 1AND p.OwnerID IS NULL;...UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON p.OwnerName = o.Name AND p.Document = o.Document WHERE p.PetId = 2AND p.OwnerID IS NULL;...UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON p.Country = o.Country WHERE p.PetId = 3AND p.OwnerID IS NULL;
As you can see, every update statement is represented in the builder table, and will only change the Owner Id if the previous update statement haven't already, because of the AND p.OwnerID IS NULL
part of the where
clause.
After running the batch of update statements, your temp_pets table looks like this:
PetID Address Zip Country Document OwnerName OwnerID Field1 Field21 123 5th st 12345 US test.csv John 5 NULL NULL2 234 6th st 23456 US a.csv Alex 6 NULL NULL3 345 7th st 34567 US b.csv Mike 1 NULL NUL
You can see a live demo on rextester.
However, please note that the less conditions you have the more records might return back from the join, making the update more likely to be inaccurate.For instance, For PetId 3 I've got OwnerId 1 since the only thing I've had to match the records was the Country
column, meaning it could actually be every OwnerId
in this sample data, since everyone have the same value US
in the Country
column.
Under the following rules, there is not much I can do about that.