Eliminating duplicate values based on only one column of the table Eliminating duplicate values based on only one column of the table sql-server sql-server

Eliminating duplicate values based on only one column of the table


This is where the window function row_number() comes in handy:

SELECT s.siteName, s.siteIP, h.dateFROM sites s INNER JOIN     (select h.*, row_number() over (partition by siteName order by date desc) as seqnum      from history h     ) h    ON s.siteName = h.siteName and seqnum = 1ORDER BY s.siteName, h.date


From your example it seems reasonable to assume that the siteIP column is determined by the siteName column (that is, each site has only one siteIP). If this is indeed the case, then there is a simple solution using group by:

select  sites.siteName,  sites.siteIP,  max(history.date)from sitesinner join history on  sites.siteName=history.siteNamegroup by  sites.siteName,  sites.siteIPorder by  sites.siteName;

However, if my assumption is not correct (that is, it is possible for a site to have multiple siteIP), then it is not clear from you question which siteIP you want the query to return in the second column. If just any siteIP, then the following query will do:

select  sites.siteName,  min(sites.siteIP),  max(history.date)from sitesinner join history on  sites.siteName=history.siteNamegroup by  sites.siteNameorder by  sites.siteName;


I solve such queries using this pattern:

SELECT *FROM tWHERE t.field=(  SELECT MAX(t.field)   FROM t AS t0   WHERE t.group_column1=t0.group_column1    AND t.group_column2=t0.group_column2 ...)

That is it will select records where the value of a field is at its max value. To apply it to your query I used the common table expression so that I don't have to repeat the JOIN twice:

WITH site_history AS (  SELECT sites.siteName, sites.siteIP, history.date  FROM sites  JOIN history USING (siteName))SELECT *FROM site_history hWHERE date=(  SELECT MAX(date)   FROM site_history h0   WHERE h.siteName=h0.siteName)ORDER BY siteName

It's important to note that it works only if the field we're calculating the maximum for is unique. In your example the date field should be unique for each siteName, that is if the IP can't be changed multiple times per millisecond. In my experience this is commonly the case otherwise you don't know which record is the newest anyway. If the history table has an unique index for (site, date), this query is also very fast, index range scan on the history table scanning just the first item can be used.