Pipes and filters at DBMS-level: Splitting the MERGE output stream
My understanding is that the main obstacle is the limitation of the OUTPUT
clause in SQL Server. It allows one OUTPUT INTO table
and/or one OUTPUT
that returns result set to the caller.
You want to save the outcome of the MERGE
statement in two different ways:
- all rows that were affected by
MERGE
for gathering statistics - only inserted rows for
queue
Simple variant
I would use your S2 solution. At least to start with. It is easy to understand and maintain and should be quite efficient, because the most resource-intensive operation (MERGE
into Target
itself would be performed only once). There is a second variant below and it would be interesting to compare their performance on real data.
So:
- Use
OUTPUT INTO @TempTable
in theMERGE
- Either
INSERT
all rows from@TempTable
intoStats
or aggregate before inserting. If all you need is aggregated statistics, it makes sense to aggregate results of this batch and merge it into the finalStats
instead of copying all rows. INSERT
intoQueue
only "inserted" rows from@TempTable
.
I'll take sample data from the answer by @i-one.
Schema
-- I'll return to commented lines laterCREATE TABLE [dbo].[TestTarget]( -- [ID] [int] IDENTITY(1,1) NOT NULL, [foo] [varchar](10) NULL, [bar] [varchar](10) NULL);CREATE TABLE [dbo].[TestStaging]( [foo] [varchar](10) NULL, [bar] [varchar](10) NULL, [baz] [varchar](10) NULL);CREATE TABLE [dbo].[TestStats]( [MergeAction] [nvarchar](10) NOT NULL);CREATE TABLE [dbo].[TestQueue]( -- [TargetID] [int] NOT NULL, [foo] [varchar](10) NULL, [baz] [varchar](10) NULL);
Sample data
TRUNCATE TABLE [dbo].[TestTarget];TRUNCATE TABLE [dbo].[TestStaging];TRUNCATE TABLE [dbo].[TestStats];TRUNCATE TABLE [dbo].[TestQueue];INSERT INTO [dbo].[TestStaging] ([foo] ,[bar] ,[baz])VALUES ('A', 'AA', 'AAA'), ('B', 'BB', 'BBB'), ('C', 'CC', 'CCC');INSERT INTO [dbo].[TestTarget] ([foo] ,[bar])VALUES ('A', 'A_'), ('B', 'B?');
Merge
DECLARE @TempTable TABLE ( MergeAction nvarchar(10) NOT NULL, foo varchar(10) NULL, baz varchar(10) NULL);MERGE INTO TestTarget AS DstUSING TestStaging AS SrcON Dst.foo = Src.fooWHEN MATCHED THENUPDATE SET Dst.bar = Src.barWHEN NOT MATCHED BY TARGET THENINSERT (foo, bar)VALUES (Src.foo, Src.bar)OUTPUT $action AS MergeAction, inserted.foo, Src.bazINTO @TempTable(MergeAction, foo, baz);INSERT INTO [dbo].[TestStats] (MergeAction)SELECT T.MergeActionFROM @TempTable AS T;INSERT INTO [dbo].[TestQueue] ([foo] ,[baz])SELECT T.foo ,T.bazFROM @TempTable AS TWHERE T.MergeAction = 'INSERT';SELECT * FROM [dbo].[TestTarget];SELECT * FROM [dbo].[TestStats];SELECT * FROM [dbo].[TestQueue];
Result
TestTarget+-----+-----+| foo | bar |+-----+-----+| A | AA || B | BB || C | CC |+-----+-----+TestStats+-------------+| MergeAction |+-------------+| INSERT || UPDATE || UPDATE |+-------------+TestQueue+-----+-----+| foo | baz |+-----+-----+| C | CCC |+-----+-----+
Second variant
Tested on SQL Server 2014 Express.
OUTPUT
clause can send its result set to a table and to the caller. So, OUTPUT INTO
can go into the Stats
directly and if we wrap the MERGE
statement into a stored procedure, then we can use INSERT ... EXEC
into the Queue
.
If you examine execution plan you'll see that INSERT ... EXEC
creates a temporary table behind the scenes anyway (see also The Hidden Costs of INSERT EXEC by Adam Machanic), so I expect that overall performance would be similar to the first variant when you create temporary table explicitly.
One more problem to solve: Queue
table should have only "inserted" rows, not all effected rows. To achieve that you could use a trigger on the Queue
table to discard rows other than "inserted". One more possibility is to define a unique index with IGNORE_DUP_KEY = ON
and prepare the data in such a way that "non-inserted" rows would violate the unique index and would not be inserted into the table.
So, I'll add an ID IDENTITY
column to the Target
table and I'll add a TargetID
column to the Queue
table. (Uncomment them in the script above).Also, I'll add an index to the Queue
table:
CREATE UNIQUE NONCLUSTERED INDEX [IX_TargetID] ON [dbo].[TestQueue]( [TargetID] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = ON, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
Important part is UNIQUE
and IGNORE_DUP_KEY = ON
.
Here is the stored procedure for the MERGE
:
CREATE PROCEDURE [dbo].[TestMerge]ASBEGIN SET NOCOUNT ON; SET XACT_ABORT ON; MERGE INTO dbo.TestTarget AS Dst USING dbo.TestStaging AS Src ON Dst.foo = Src.foo WHEN MATCHED THEN UPDATE SET Dst.bar = Src.bar WHEN NOT MATCHED BY TARGET THEN INSERT (foo, bar) VALUES (Src.foo, Src.bar) OUTPUT $action INTO dbo.TestStats(MergeAction) OUTPUT CASE WHEN $action = 'INSERT' THEN inserted.ID ELSE 0 END AS TargetID, inserted.foo, Src.baz ;END
Usage
TRUNCATE TABLE [dbo].[TestTarget];TRUNCATE TABLE [dbo].[TestStaging];TRUNCATE TABLE [dbo].[TestStats];TRUNCATE TABLE [dbo].[TestQueue];-- Make sure that `Queue` has one special row with TargetID=0 in advance.INSERT INTO [dbo].[TestQueue] ([TargetID] ,[foo] ,[baz])VALUES (0 ,NULL ,NULL);INSERT INTO [dbo].[TestStaging] ([foo] ,[bar] ,[baz])VALUES ('A', 'AA', 'AAA'), ('B', 'BB', 'BBB'), ('C', 'CC', 'CCC');INSERT INTO [dbo].[TestTarget] ([foo] ,[bar])VALUES ('A', 'A_'), ('B', 'B?');INSERT INTO [dbo].[TestQueue]EXEC [dbo].[TestMerge];SELECT * FROM [dbo].[TestTarget];SELECT * FROM [dbo].[TestStats];SELECT * FROM [dbo].[TestQueue];
Result
TestTarget+----+-----+-----+| ID | foo | bar |+----+-----+-----+| 1 | A | AA || 2 | B | BB || 3 | C | CC |+----+-----+-----+TestStats+-------------+| MergeAction |+-------------+| INSERT || UPDATE || UPDATE |+-------------+TestQueue+----------+------+------+| TargetID | foo | baz |+----------+------+------+| 0 | NULL | NULL || 3 | C | CCC |+----------+------+------+
There will be an extra message during INSERT ... EXEC
:
Duplicate key was ignored.
if MERGE
updated some rows. This warning message is sent when unique index discards some rows during INSERT
due to IGNORE_DUP_KEY = ON
.
A warning message will occur when duplicate key values are inserted into a unique index. Only the rows violating the uniqueness constraint will fail.
Consider following two approaches to solve the problem:
- Merge data into target and output inserted into queue in a single statement, and summarize statistics in the trigger created on target. Batch identifier can be passed into trigger via temporary table.
- Merge data into target and output inserted into queue in a single statement, and summarize statistics immediately after the merge, using built-in change tracking capabilities, instead of doing it in the trigger.
Approach 1 (merge data and gather statistics in the trigger):
Sample data setup (indexes and constraints omitted for simplicity):
create table staging (foo varchar(10), bar varchar(10), baz varchar(10));create table target (foo varchar(10), bar varchar(10));create table queue (foo varchar(10), baz varchar(10));create table stats (batchID int, inserted bigint, updated bigint, deleted bigint);insert into staging values ('A', 'AA', 'AAA') ,('B', 'BB', 'BBB') ,('C', 'CC', 'CCC') ;insert into target values ('A', 'A_') ,('B', 'B?') ,('E', 'EE') ;
Trigger for gathering inserted/updated/deleted statistics:
create trigger target_onChangeon targetafter delete, update, insertasbegin set nocount on; if object_id('tempdb..#targetMergeBatch') is NULL return; declare @batchID int; select @batchID = batchID from #targetMergeBatch; merge into stats t using ( select batchID = @batchID, cntIns = count_big(case when i.foo is not NULL and d.foo is NULL then 1 end), cntUpd = count_big(case when i.foo is not NULL and d.foo is not NULL then 1 end), cntDel = count_big(case when i.foo is NULL and d.foo is not NULL then 1 end) from inserted i full join deleted d on d.foo = i.foo ) s on t.batchID = s.batchID when matched then update set t.inserted = t.inserted + s.cntIns, t.updated = t.updated + s.cntUpd, t.deleted = t.deleted + s.cntDel when not matched then insert (batchID, inserted, updated, deleted) values (s.batchID, s.cntIns, s.cntUpd, cntDel);end
Merge statements:
declare @batchID int;set @batchID = 1;-- or select @batchID = batchID from ...;create table #targetMergeBatch (batchID int);insert into #targetMergeBatch (batchID) values (@batchID);insert into queue (foo, baz)select foo, bazfrom( merge into target t using staging s on t.foo = s.foo when matched then update set t.bar = s.bar when not matched then insert (foo, bar) values (s.foo, s.bar) when not matched by source then delete output $action, inserted.foo, s.baz) m(act, foo, baz)where act = 'INSERT' ;drop table #targetMergeBatch
Check the results:
select * from target;select * from queue;select * from stats;
Target:
foo bar---------- ----------A AAB BBC CC
Queue:
foo baz---------- ----------C CCC
Stats:
batchID inserted updated deleted-------- ---------- --------- ---------1 1 2 1
Approach 2 (gather statistics, using change tracking capabilities):
Sample data setup is the same as in previous case (just drop everything incl. trigger and recreate tables from scratch), except that in this case we need to have PK on target to make sample work:
create table target (foo varchar(10) primary key, bar varchar(10));
Enable change tracking on database:
alter database Test set change_tracking = on
Enable change tracking on target table:
alter table target enable change_tracking
Merge data and grab statistics immediately after that, filtering by the change context to count only rows affected by merge:
begin transaction;declare @batchID int, @chVersion bigint, @chContext varbinary(128);set @batchID = 1;-- or select @batchID = batchID from ...;SET @chVersion = change_tracking_current_version();set @chContext = newid();with change_tracking_context(@chContext)insert into queue (foo, baz)select foo, bazfrom( merge into target t using staging s on t.foo = s.foo when matched then update set t.bar = s.bar when not matched then insert (foo, bar) values (s.foo, s.bar) when not matched by source then delete output $action, inserted.foo, s.baz) m(act, foo, baz)where act = 'INSERT' ;with ch(foo, op) as ( select foo, sys_change_operation from changetable(changes target, @chVersion) ct where sys_change_context = @chContext)insert into stats (batchID, inserted, updated, deleted)select @batchID, [I], [U], [D]from ch pivot(count_big(foo) for op in ([I], [U], [D])) pvt ;commit transaction;
Check the results:
select * from target;select * from queue;select * from stats;
They are same as in previous sample.
Target:
foo bar---------- ----------A AAB BBC CC
Queue:
foo baz---------- ----------C CCC
Stats:
batchID inserted updated deleted-------- ---------- --------- ---------1 1 2 1
I suggest extracting the stats be coding using three independent AFTER INSERT / DELETE / UPDATE
triggers along the lines of:
create trigger dbo.insert_trigger_targeton [dbo].[target]after insertasinsert into dbo.[stats] ([action],[count])select 'insert', count(1)from inserted;gocreate trigger dbo.update_trigger_targeton [dbo].[target]after updateasinsert into dbo.[stats] ([action],[count])select 'update', count(1) from inserted -- or deleted == after / before image, count will be the samegocreate trigger dbo.delete_trigger_targeton [dbo].[target]after deleteasinsert into dbo.[stats] ([action],[count])select 'delete', count(1) from deletedgo
If you need more context, put something in CONTEXT_INFO
and pluck it out from the triggers.
Now, I'm going to assert that the AFTER triggers are not that expensive, but you'll need to test that to be sure.
Having dealt with that, you'll be free to use the OUTPUT
clause (NOT OUTPUT INTO
) in the MERGE
and then use that nested inside a select to subset the data that you want to go into the queue
table.
Justification
Because of the need to access columns from both staging
and target
in order to build the data for queue
, this HAS to be done using the OUTPUT
option in MERGE
, since nothing else has access to "both sides".
Then, if we have hijacked the OUTPUT
clause for queue
, how can we re-work that functionality? I think the AFTER
triggers will work, given the requirements for stats that you have described. Indeed, the stats could be quite complex if required, given the images that are available. I'm asserting that the AFTER
triggers are "not that expensive" since the data of both before and after must always be available in order that a transaction can be both COMMITTED OR ROLLED BACK - yes, the data needs to be scanned (even to get the count) but that doesn't seem like too much of a cost.
In my own analysis that scan added about 5% to the execution plan's base cost
Sound like a solution?