Trouble using ROW_NUMBER() OVER (PARTITION BY ...) Trouble using ROW_NUMBER() OVER (PARTITION BY ...) sql-server sql-server

Trouble using ROW_NUMBER() OVER (PARTITION BY ...)


I would do something like this:

;WITH x  AS (SELECT *,             Row_number()               OVER(                 partition BY employeeid                 ORDER BY datestart) rn      FROM   employeehistory) SELECT * FROM   x x1    LEFT OUTER JOIN x x2                 ON x1.rn = x2.rn + 1 

Or maybe it would be x2.rn - 1. You'll have to see. In any case, you get the idea. Once you have the table joined on itself, you can filter, group, sort, etc. to get what you need.


A bit involved. Easiest would be to refer to this SQL Fiddle I created for you that produces the exact result. There are ways you can improve it for performance or other considerations, but this should hopefully at least be clearer than some alternatives.

The gist is, you get a canonical ranking of your data first, then use that to segment the data into groups, then find an end date for each group, then eliminate any intermediate rows. ROW_NUMBER() and CROSS APPLY help a lot in doing it readably.


EDIT 2019:

The SQL Fiddle does in fact seem to be broken, for some reason, but it appears to be a problem on the SQL Fiddle site. Here's a complete version, tested just now on SQL Server 2016:

CREATE TABLE Source(  EmployeeID int,  DateStarted date,  DepartmentID int)INSERT INTO SourceVALUES(10001,'2013-01-01',001),(10001,'2013-09-09',001),(10001,'2013-12-01',002),(10001,'2014-05-01',002),(10001,'2014-10-01',001),(10001,'2014-12-01',001)SELECT *,   ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS EntryRank,  newid() as GroupKey,  CAST(NULL AS date) AS EndDateINTO #RankedDataFROM Source;UPDATE #RankedDataSET GroupKey = beginDate.GroupKeyFROM #RankedData sup  CROSS APPLY   (    SELECT TOP 1 GroupKey    FROM #RankedData sub     WHERE sub.EmployeeID = sup.EmployeeID AND      sub.DepartmentID = sup.DepartmentID AND      NOT EXISTS         (          SELECT *           FROM #RankedData bot           WHERE bot.EmployeeID = sup.EmployeeID AND            bot.EntryRank BETWEEN sub.EntryRank AND sup.EntryRank AND            bot.DepartmentID <> sup.DepartmentID        )      ORDER BY DateStarted ASC    ) beginDate (GroupKey);UPDATE #RankedDataSET EndDate = nextGroup.DateStartedFROM #RankedData sup  CROSS APPLY   (    SELECT TOP 1 DateStarted    FROM #RankedData sub    WHERE sub.EmployeeID = sup.EmployeeID AND      sub.DepartmentID <> sup.DepartmentID AND      sub.EntryRank > sup.EntryRank    ORDER BY EntryRank ASC  ) nextGroup (DateStarted);SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY GroupKey ORDER BY EntryRank ASC) AS GroupRank FROM #RankedData) FinalRankingWHERE GroupRank = 1ORDER BY EntryRank;DROP TABLE #RankedDataDROP TABLE Source


It looks like a common gaps-and-islands problem. The difference between two sequences of row numbers rn1 and rn2 give the "group" number.

Run this query CTE-by-CTE and examine intermediate results to see how it works.

Sample data

I expanded sample data from the question a little.

DECLARE @Source TABLE(    EmployeeID int,    DateStarted date,    DepartmentID int)INSERT INTO @SourceVALUES(10001,'2013-01-01',001),(10001,'2013-09-09',001),(10001,'2013-12-01',002),(10001,'2014-05-01',002),(10001,'2014-10-01',001),(10001,'2014-12-01',001),(10005,'2013-05-01',001),(10005,'2013-11-09',001),(10005,'2013-12-01',002),(10005,'2014-10-01',001),(10005,'2016-12-01',001);

Query for SQL Server 2008

There is no LEAD function in SQL Server 2008, so I had to use self-join via OUTER APPLY to get the value of the "next" row for the DateEnd.

WITHCTEAS(    SELECT        EmployeeID        ,DateStarted        ,DepartmentID        ,ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS rn1        ,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY DateStarted) AS rn2    FROM @Source),CTE_GroupsAS(    SELECT        EmployeeID        ,MIN(DateStarted) AS DateStart        ,DepartmentID    FROM CTE    GROUP BY        EmployeeID        ,DepartmentID        ,rn1 - rn2)SELECT    CTE_Groups.EmployeeID    ,CTE_Groups.DepartmentID    ,CTE_Groups.DateStart    ,A.DateEndFROM    CTE_Groups    OUTER APPLY    (        SELECT TOP(1) G2.DateStart AS DateEnd        FROM CTE_Groups AS G2        WHERE            G2.EmployeeID = CTE_Groups.EmployeeID            AND G2.DateStart > CTE_Groups.DateStart        ORDER BY G2.DateStart    ) AS AORDER BY    EmployeeID    ,DateStart;

Query for SQL Server 2012+

Starting with SQL Server 2012 there is a LEAD function that makes this task more efficient.

WITHCTEAS(    SELECT        EmployeeID        ,DateStarted        ,DepartmentID        ,ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS rn1        ,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY DateStarted) AS rn2    FROM @Source),CTE_GroupsAS(    SELECT        EmployeeID        ,MIN(DateStarted) AS DateStart        ,DepartmentID    FROM CTE    GROUP BY        EmployeeID        ,DepartmentID        ,rn1 - rn2)SELECT    CTE_Groups.EmployeeID    ,CTE_Groups.DepartmentID    ,CTE_Groups.DateStart    ,LEAD(CTE_Groups.DateStart) OVER (PARTITION BY CTE_Groups.EmployeeID ORDER BY CTE_Groups.DateStart) AS DateEndFROM    CTE_GroupsORDER BY    EmployeeID    ,DateStart;

Result

+------------+--------------+------------+------------+| EmployeeID | DepartmentID | DateStart  |  DateEnd   |+------------+--------------+------------+------------+|      10001 |            1 | 2013-01-01 | 2013-12-01 ||      10001 |            2 | 2013-12-01 | 2014-10-01 ||      10001 |            1 | 2014-10-01 | NULL       ||      10005 |            1 | 2013-05-01 | 2013-12-01 ||      10005 |            2 | 2013-12-01 | 2014-10-01 ||      10005 |            1 | 2014-10-01 | NULL       |+------------+--------------+------------+------------+