SQL - When would an empty OVER clause be used?
OVER() is part of analytic function and define partitions in your recordset. OVER() is just one partition and applied to the whole dataset
i.e. COUNT(*) OVER() will return in each row how many records in your dataset.
look to this http://msdn.microsoft.com/en-us/library/ms189461.aspx
Say our table is employees
:
+-----------+-------+---------+| badge_num | name | surname |+-----------+-------+---------+| 1 | John | Smith || 2 | Mark | Pence || 3 | Steve | Smith || 4 | Bob | Smith |+-----------+-------+---------+
Running
SELECT surname, COUNT(*)FROM employeesGROUP BY surname;
we'll get:
+---------+----------+| surname | COUNT(*) |+---------+----------+| Smith | 3 || Pence | 1 |+---------+----------+
While running
SELECT surname, COUNT(*) OVER()FROM employeesGROUP BY surname;
we'll get:
+---------+-----------------+| surname | COUNT(*) OVER() |+---------+-----------------+| Smith | 2 || Pence | 2 |+---------+-----------------+
In the second case, in each row we are just counting the number of rows of the whole select (not the single partition).
To summarize things, the OVER clause can be used with Ranking Functions(Rank, Row_Number, Dense_Rank..), Aggregate Functions like (AVG, Max, Min, SUM...etc) and Analytics Functions like (First_Value, Last_Value, and few others).
Let's See basic syntax of OVER clause
OVER ( [ <PARTITION BY clause> ] [ <ORDER BY clause> ] [ <ROW or RANGE clause> ] )
PARTITION BY:It is used to partition data and perform operations on groups with the same data.
ORDER BY:It is used to define the logical order of data in Partitions. When we don't specify Partition, entire resultset is considered as a single partition
:This can be used to specify what rows are supposed to be considered in a partition when performing the operation.
Let's take an example:
Here is my dataset:
Id Name Gender Salary----------- -------------------------------------------------- ---------- -----------1 Mark Male 50002 John Male 45003 Pavan Male 50004 Pam Female 55005 Sara Female 40006 Aradhya Female 35007 Tom Male 55008 Mary Female 50009 Ben Male 650010 Jodi Female 700011 Tom Male 550012 Ron Male 5000
So let me execute different scenarios and see how data is impacted and I'll come from difficult syntax to simple one
Select *,SUM(salary) Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employeesId Name Gender Salary sum_sal----------- -------------------------------------------------- ---------- ----------- -----------6 Aradhya Female 3500 35005 Sara Female 4000 75002 John Male 4500 120003 Pavan Male 5000 320001 Mark Male 5000 320008 Mary Female 5000 3200012 Ron Male 5000 3200011 Tom Male 5500 485007 Tom Male 5500 485004 Pam Female 5500 485009 Ben Male 6500 5500010 Jodi Female 7000 62000
Just observe the sum_sal part. Here I am using order by Salary and using "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW".In this case, we are not using partition so entire data will be treated as one partition and we are ordering on salary.And the important thing here is UNBOUNDED PRECEDING AND CURRENT ROW. This means when we are calculating the sum, from starting row to the current row for each row.But if we see rows with salary 5000 and name="Pavan", ideally it should be 17000 and for salary=5000 and name=Mark, it should be 22000. But as we are using RANGE and in this case, if it finds any similar elements then it considers them as the same logical group and performs an operation on them and assigns value to each item in that group. That is the reason why we have the same value for salary=5000. The engine went up to salary=5000 and Name=Ron and calculated sum and then assigned it to all salary=5000.
Select *,SUM(salary) Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees Id Name Gender Salary sum_sal----------- -------------------------------------------------- ---------- ----------- -----------6 Aradhya Female 3500 35005 Sara Female 4000 75002 John Male 4500 120003 Pavan Male 5000 170001 Mark Male 5000 220008 Mary Female 5000 2700012 Ron Male 5000 3200011 Tom Male 5500 375007 Tom Male 5500 430004 Pam Female 5500 485009 Ben Male 6500 5500010 Jodi Female 7000 62000
So with ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW The difference is for same value items instead of grouping them together, It calculates SUM from starting row to current row and it doesn't treat items with same value differently like RANGE
Select *,SUM(salary) Over(order by salary) as sum_sal from employeesId Name Gender Salary sum_sal----------- -------------------------------------------------- ---------- ----------- -----------6 Aradhya Female 3500 35005 Sara Female 4000 75002 John Male 4500 120003 Pavan Male 5000 320001 Mark Male 5000 320008 Mary Female 5000 3200012 Ron Male 5000 3200011 Tom Male 5500 485007 Tom Male 5500 485004 Pam Female 5500 485009 Ben Male 6500 5500010 Jodi Female 7000 62000
These results are the same as
Select *, SUM(salary) Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees
That is because Over(order by salary) is just a short cut of Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)So wherever we simply specify Order by without ROWS or RANGE it is taking RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW as default.
Note: This is applicable only to Functions that actually accept RANGE/ROW. For example, ROW_NUMBER and few others don't accept RANGE/ROW and in that case, this doesn't come into the picture.
Till now we saw that Over clause with an order by is taking Range/ROWS and syntax looks something like this RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWAnd it is actually calculating up to the current row from the first row. But what If it wants to calculate values for the entire partition of data and have it for each column (that is from 1st row to last row). Here is the query for that
Select *,sum(salary) Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum_sal from employeesId Name Gender Salary sum_sal----------- -------------------------------------------------- ---------- ----------- -----------1 Mark Male 5000 620002 John Male 4500 620003 Pavan Male 5000 620004 Pam Female 5500 620005 Sara Female 4000 620006 Aradhya Female 3500 620007 Tom Male 5500 620008 Mary Female 5000 620009 Ben Male 6500 6200010 Jodi Female 7000 6200011 Tom Male 5500 6200012 Ron Male 5000 62000
Instead of CURRENT ROW, I am specifying UNBOUNDED FOLLOWING which instructs the engine to calculate till the last record of partition for each row.
Now coming to your point on what is OVER() with empty braces?
It is just a short cut for Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Here we are indirectly specifying to treat all my resultset as a single partition and then perform calculations from the first record to the last record of each partition.
Select *,Sum(salary) Over() as sum_sal from employeesId Name Gender Salary sum_sal----------- -------------------------------------------------- ---------- ----------- -----------1 Mark Male 5000 620002 John Male 4500 620003 Pavan Male 5000 620004 Pam Female 5500 620005 Sara Female 4000 620006 Aradhya Female 3500 620007 Tom Male 5500 620008 Mary Female 5000 620009 Ben Male 6500 6200010 Jodi Female 7000 6200011 Tom Male 5500 6200012 Ron Male 5000 62000
I did create a video on this and if you are interested you can visit it.https://www.youtube.com/watch?v=CvVenuVUqto&t=1177s
Thanks,Pavan Kumar AryasomayajuluHTTP://xyzcoder.github.io