Find matching records based on dynamic columns

sql sql-server tsql sql-server-2016

The use of cursor, loops and dynamic SQL can be avoided by treating the fields used for comparison as an entry in a bit set for each pet. A bit set (FieldSetRank column) is calculated for each priority based on a bit entry (FieldRank rank column).

The Pets and Owner tables has to be unpivoted so that the fields and their associated values can be compared. Each of the fields and value that has been matched is assigned to a corresponding FieldRank. A new bit set is then calculated based on the matched values (MatchSetRank). Only records where the matched sets (MatchSetRank) are equal to the desired sets (FieldSetRank) are returned.

The query performs one final ranking to return records with the highest MatchSetRank (records that matched on the highest number of columns while maintaining priority criteria. The following T-SQL demonstrates the concept.

;WITH CTE_Builder AS (     SELECT  [PetID]            ,[Field]            ,[Priority]            ,[MatchTo]            ,POWER(2, [Priority] - 1) AS [FieldRank] -- Define the field ranking as bit set numbered item.            ,SUM(POWER(2, [Priority] - 1)) OVER (PARTITION BY [PetID] ORDER BY [Priority] ROWS UNBOUNDED PRECEDING) FieldSetRank -- Sum all the bit set IDs to define what constitutes a completed field set ordered by priority.     FROM   temp_builder ),CTE_PetsUnpivotedAS(   -- Unpivot pets table and assign Field Rank and Field Set Rank.    SELECT   [PetsUnPivot].[PetID]            ,[PetsUnPivot].[Field]            ,[Builder].[MatchTo]            ,[PetsUnPivot].[FieldValue]            ,[Builder].[Priority]            ,[Builder].[FieldRank]            ,[Builder].[FieldSetRank]    FROM        (            SELECT [PetID], [Address], CAST([Zip] AS VARCHAR(100)) AS [Zip], [Country], [Document], [OwnerName]            FROM temp_pets        ) [Pets]    UNPIVOT       (FieldValue FOR Field IN           ([Address], [Zip], [Country], [Document], [OwnerName])    ) AS [PetsUnPivot]    INNER JOIN [CTE_Builder] [Builder] ON [PetsUnPivot].PetID = [Builder].PetID AND [PetsUnPivot].Field = [Builder].Field),CTE_OwnersAS(    -- Unpivot Owners table and join with unpivoted Pets table on field name and field value.      -- Next assign Pets field rank then calculated the field set rank (MatchSetRank) based on actual matches made.    SELECT   [OwnersUnPivot].[OwnerID]            ,[Pets].[PetID]            ,[OwnersUnPivot].[Field]            ,[Pets].Field AS [PetField]            ,[Pets].FieldValue as PetFieldValue            ,[OwnersUnPivot].[FieldValue]            ,[Pets].[Priority]            ,[Pets].[FieldRank]            ,[Pets].[FieldSetRank]            ,SUM([FieldRank]) OVER (PARTITION BY [Pets].[PetID], [OwnersUnPivot].[OwnerID] ORDER BY [Pets].[Priority] ROWS UNBOUNDED PRECEDING) MatchSetRank    FROM        (            SELECT [OwnerID], [Addr], CAST([Zip] AS VARCHAR(100)) AS [Zip], [Country], [Document], [Name]            FROM temp_owners        ) [Owners]    UNPIVOT       (FieldValue FOR Field IN           ([Addr], [Zip], [Country], [Document], [Name])    ) AS [OwnersUnPivot]    INNER JOIN [CTE_PetsUnpivoted] [Pets] ON [OwnersUnPivot].[Field] = [Pets].[MatchTo] AND [OwnersUnPivot].[FieldValue] = [Pets].[FieldValue]),CTE_FinalRankingAS(    SELECT   [PetID]            ,[OwnerID]            -- -- Calculate final rank, if multiple matches have the same rank then multiple rows will be returned per pet.             -- Change the “RANK()” function to "ROW_NUMBER()" to only return on result per pet.            ,RANK() OVER (PARTITION BY [PetID] ORDER BY [MatchSetRank] DESC) AS [FinalRank]     FROM    CTE_Owners    WHERE   [FieldSetRank] = [MatchSetRank] -- Only return records where the field sets calculated based on                                             -- actual matches is equal to desired field set ranks. This will                                             -- eliminate matches where the number of fields that meets the                                             -- criteria is the same but does not meet priority requirements. )SELECT   [PetID]        ,[OwnerID]FROM    CTE_FinalRankingWHERE   [FinalRank] = 1

sql sql-server tsql sql-server-2016

I will say right away to save your time:

My solution uses dynamic SQL. Michał Turczyn correctly noted that you can't avoid it while names of compared column are stored in the DB.
My solution uses loops. And I strongly believe that you will not solve this problem with the pure SQL query which will work fast enough on the data size you declared (tables have > 1M records). The logic you described implies iterations by its nature - from bigger set of matching fields to the lower set. SQL as a query language was not designed to cover such tricky scenarios. You could try to solve your problem with pure SQL query, but even if you manage to build such query, it will be very tricky, complex and unclear. I'm not a fan of such solutions. That's why I have not even dig into this direction.
On the other hand my solution does not require creation of temporary tables which is an advantage.

Given this, my approach is fair straightforward:

There is an outer loop that iterates from the biggest set of matchers (all matching fields) to the smallest set of matchers (one field). On the first iteration, when we don't know yet how many matchers are stored in the database for the pet, we read and use them all. On the next iterations, we decrease number of used matchers by 1 (removing the one with highest priority).
The inner loop iterates over current set of matchers and builds the WHERE clause that compares fields between Pets and Owners table.
Current query is executed and if some owner matches given criteria, we break from the outer loop.

Here is the code that implements this logic:

DECLARE @PetId INT = 2;DECLARE @MatchersLimit INT;DECLARE @OwnerID INT;WHILE (@MatchersLimit IS NULL OR @MatchersLimit > 0) AND @OwnerID IS NULLBEGIN    DECLARE @CurrMatchFilter VARCHAR(max) = ''    DECLARE @Field VARCHAR(30)    DECLARE @MatchTo VARCHAR(30)    DECLARE @CurrMatchersNumber INT = 0;    DECLARE @GetMatchers CURSOR;    IF @MatchersLimit IS NULL        SET @GetMatchers = CURSOR FOR SELECT Field, MatchTo FROM temp_builder WHERE PetID = @PetId ORDER BY Priority ASC;    ELSE        SET @GetMatchers = CURSOR FOR SELECT TOP (@MatchersLimit) Field, MatchTo FROM temp_builder WHERE PetID = @PetId ORDER BY Priority ASC;    OPEN @GetMatchers;    FETCH NEXT FROM @GetMatchers INTO @Field, @MatchTo;    WHILE @@FETCH_STATUS = 0    BEGIN        IF @CurrMatchFilter <> '' SET @CurrMatchFilter = @CurrMatchFilter + ' AND ';        SET @CurrMatchFilter = @CurrMatchFilter + ('temp_pets.' + @Field + ' = ' + 'temp_owners.' + @MatchTo);        FETCH NEXT FROM @GetMatchers INTO @field, @matchTo;        SET @CurrMatchersNumber = @CurrMatchersNumber + 1;    END    CLOSE @GetMatchers;    DEALLOCATE @GetMatchers;    IF @CurrMatchersNumber = 0 BREAK;    DECLARE @CurrQuery nvarchar(max) = N'SELECT @id = temp_owners.OwnerID FROM temp_owners INNER JOIN temp_pets ON (' + CAST(@CurrMatchFilter AS NVARCHAR(MAX)) + N') WHERE temp_pets.PetID = ' + CAST(@PetId AS NVARCHAR(MAX));    EXECUTE sp_executesql @CurrQuery, N'@id int OUTPUT', @id=@OwnerID OUTPUT;    IF @MatchersLimit IS NULL        SET @MatchersLimit = @CurrMatchersNumber - 1;    ELSE        SET @MatchersLimit = @MatchersLimit - 1;ENDSELECT @OwnerID AS OwnerID, @MatchersLimit + 1 AS Matched;

Performance considerations

There are basically 2 queries that are executed in this approach:

SELECT Field, MatchTo FROM temp_builder WHERE PetID = @PetId;
You should add an index on PetID field in temp_builder table and this query will be executed very fast.
SELECT @id = temp_owners.OwnerID FROM temp_owners INNER JOIN temp_pets ON (temp_pets.Document = temp_owners.Document AND temp_pets.OwnerName = temp_owners.Name AND temp_pets.Zip = temp_owners.Zip AND ...) WHERE temp_pets.PetID = @PetId;
This query looks scary because it joins two big tables - temp_owners and temp_pets. However temp_pets table is filtered by PetID column that should result in just one record. So if you have an index on temp_pets.PetID column (and you should as this column seems like a primary key), the query will result into scan of temp_owners table. Such scan will not take the ages even for table with over 1M rows. If the query is still too slow, you could consider adding indexes for columns of temp_owners table that are used in the matchers (Addr, Zip, etc.). Adding indexes has downsides, like bigger database and slower insert/update operations. So before adding the indexes on temp_owners columns, check the query speed on table without indexes.

sql sql-server tsql sql-server-2016

I'm not sure if I've got the end result correctly or not, But I suggest using a couple of common table expressions to generate a batch of update statements using dynamic SQL (I'm afraid it can't be done without dynamic SQL), and then execute them using Exec(sql).

The benefit of this approach is that it involves no loops or cursors.

Every update statement I generate is using an inner join between the pets and the owners tables, updating the pets table's owner id with the owners table owner id, using the mappings from the builder table as the basis to the on clause.
The first cte is responsible of generating the on clause from the builder table, and the second one is responsible to generate the update statements.
Finally, I'm selecting all the SQL statements from the second CTE into a single nvarchar(max) variable and execute it.

The way I've tackled the priority issue is to generate a single update statement for each group of priorities, starting with including all priorities, and excluding values from the next SQL statement, highest priority gets excluded first, until I'm left with an on clause mapping only a single set of columns.

So, the first thing is to declare a variable to hold the generated update statements:

DECLARE @Sql nvarchar(max) = ''

Now, the first CTE is using cross apply with stuff and for xml to generate the on clause for each pair of petId and Priority:

;WITH OnClauseCTE AS(SELECT DISTINCT PetId, Priority, OnClauseFROM temp_builder t0CROSS APPLY(    SELECT STUFF (    (          SELECT ' AND p.'+ Field +' = o.'+ MatchTo        FROM temp_builder t1        WHERE PetID = t0.PetId        AND Priority <= t0.Priority        FOR XML PATH('')      )    , 1, 5, '') As OnClause) onClauseGenerator)

The second CTE generates a single UPDATE statement for every petId and Priority combination:

, UpdateStatementCTE AS(    SELECT  PetId,            Priority,            'UPDATE p             SET OwnerID = o.OwnerID             FROM temp_pets p             INNER JOIN temp_owners o ON ' + OnClause + '             WHERE p.PetId = '+ CAST(PetId as varchar(10)) +'            AND p.OwnerID IS NULL; -- THIS IS CRITICAL!            ' AS SQL    FROM OnClauseCTE)

And Finally, Generating a single batch of update statements from the UpdateStatementCTE:

SELECT @Sql = @Sql + SQLFROM UpdateStatementCTE    ORDER BY PetId, Priority DESC -- ORDER BY Priority is CRITICAL!

The order by PetId is strictly to help the readability, when you print out the content of @Sql. However, the Priority DESC part of the order by clause is critical, since we want to execute the highest priority first and the lowest priority last.

Now, @Sql contains this (shortened):

UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON p.Address = o.Addr AND p.Zip = o.Zip AND p.Country = o.Country AND p.OwnerName = o.Name WHERE p.PetId = 1AND p.OwnerID IS NULL;...UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON p.OwnerName = o.Name WHERE p.PetId = 1AND p.OwnerID IS NULL;...UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON p.OwnerName = o.Name AND p.Document = o.Document WHERE p.PetId = 2AND p.OwnerID IS NULL;...UPDATE p SET OwnerID = o.OwnerID FROM temp_pets p INNER JOIN temp_owners o ON p.Country = o.Country WHERE p.PetId = 3AND p.OwnerID IS NULL;

As you can see, every update statement is represented in the builder table, and will only change the Owner Id if the previous update statement haven't already, because of the AND p.OwnerID IS NULL part of the where clause.

After running the batch of update statements, your temp_pets table looks like this:

PetID   Address         Zip     Country     Document    OwnerName   OwnerID     Field1  Field21       123 5th st      12345   US          test.csv    John        5           NULL    NULL2       234 6th st      23456   US          a.csv       Alex        6           NULL    NULL3       345 7th st      34567   US          b.csv       Mike        1           NULL    NUL

You can see a live demo on rextester.

However, please note that the less conditions you have the more records might return back from the join, making the update more likely to be inaccurate.For instance, For PetId 3 I've got OwnerId 1 since the only thing I've had to match the records was the Country column, meaning it could actually be every OwnerId in this sample data, since everyone have the same value US in the Country column.
Under the following rules, there is not much I can do about that.

CodeHunter

Find matching records based on dynamic columns

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last