Monday, May 23, 2011

Sorting NULL columns in SQL Server.

Recently I was working with a report developer who was struggling with NULL date-time values in their report output.  The problem was this BI programmer was used to working with another RDBMS that had control settings specifying the sort order for NULLs.  By SQL convention NULL are ordered first.

Unfortunately the case in the data had “end dates” for when policies come up for renewal. The desired output was to have policies without valid end dates(NULL) to be last in the output.  For this article I’ll use Microsoft’s sample database AdventWorks. The table we’ll use for our report is ProductCostHistory. The EndDate column has NULLs that we want to be listed last in our report.
If we simply query the data with an ORDER-BY we’ll get the following result:
SELECT *
  FROM [AdventureWorks].[Production].[ProductCostHistory]
  ORDER BY [EndDate]

ProductID   StartDate               EndDate                 StandardCost       ModifiedDate
----------- ----------------------- ----------------------- ------------------ -----------------------
707         2003-07-01 00:00:00.000 NULL                    13.0863            2003-06-17 00:00:00.000
708         2003-07-01 00:00:00.000 NULL                    13.0863            2003-06-17 00:00:00.000
              --Data edited out for clarity---
860         2002-07-01 00:00:00.000 2003-06-30 00:00:00.000 9.7136             2003-06-30 00:00:00.000
860         2003-07-01 00:00:00.000 NULL                    9.1593             2003-06-17 00:00:00.000

As expected the NULL EndDate values come first in the list.
My report writer buddy came up some pretty clever work around involving temp tables and self-joins.  Not very clear and poorly performing to say the least.  So I showed him the following as the first  replacement solution to placing NULLs at the end.

SELECT *
  FROM [AdventureWorks].[Production].[ProductCostHistory]
  ORDER BY isnull([EndDate],'12/31/9999')
ProductID   StartDate               EndDate                 StandardCost       ModifiedDate
----------- ----------------------- ----------------------- ------------------ -----------------------
707         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
708         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
              --Data edited out for clarity---
860         2002-07-01 00:00:00.000 2003-06-30 00:00:00.000 9.7136             2003-06-30 00:00:00.000
860         2003-07-01 00:00:00.000 NULL                    9.1593             2003-06-17 00:00:00.000

Exactly what we wanted!  BUT, this assumes that the date value ‘12/31/9999’ is beyond any other date in the table. Many times this solution is a safe assumption. This may NOT always be the case. I’ve seen databases where high values are sometimes used to indicate “out of range” or “present but invalid” etc.
The next solution addresses this condition by finding the MAX value in the column then adding an appropriate delta value to place the column after this maximum value. Here is the solution for our product cost history example:

DECLARE @maxEndDate DATETIME

SELECT @maxEndDate = MAX(EndDate)+1
     FROM [AdventureWorks].[Production].[ProductCostHistory]
SELECT *
  FROM [AdventureWorks].[Production].[ProductCostHistory]
  ORDER BY  isnull([EndDate],@maxEndDate)
ProductID   StartDate               EndDate                 StandardCost       ModifiedDate
----------- ----------------------- ----------------------- ------------------ -----------------------
707         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
708         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
              --Data edited out for clarity---
860         2002-07-01 00:00:00.000 2003-06-30 00:00:00.000 9.7136             2003-06-30 00:00:00.000
860         2003-07-01 00:00:00.000 NULL                    9.1593             2003-06-17 00:00:00.000

As expected the data is sorted in the correct order with the NULL values following the valid dates.
Now for what I think is a better solution and does not require hard coding a MAX(or Min) value or scanning through the data finding the last value. Here is my preferred solution to our date problem:
SELECT *
  FROM [AdventureWorks].[Production].[ProductCostHistory]
  ORDER BY (case when EndDate IS NULL then 1 else 0 end),EndDate
ProductID   StartDate               EndDate                 StandardCost       ModifiedDate
----------- ----------------------- ----------------------- ------------------ -----------------------
707         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
708         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
              --Data edited out for clarity---
860         2002-07-01 00:00:00.000 2003-06-30 00:00:00.000 9.7136             2003-06-30 00:00:00.000
860         2003-07-01 00:00:00.000 NULL                    9.1593             2003-06-17 00:00:00.000

Look at this for a moment.  It is obvious but has a subtle requirement; the CASE statement must precede the column that contains NULL values. The benefit is that this will always work and requires no additional consideration for the range or contents of the column.

Conclusion:  While this may be a simple little trick you’d be surprised how many report writers and data consumers  struggle with exactly this problem.  The CASE solution can also be expanded to provide table sort orders based on column ranges instead of simply NULL versus non-NULL.
-Dave Winters

1 comment:

tongchen@seattle said...

Greetings from USA! Your blog is really cool.
Are you living in Liechtenstein?
You are welcomed to visit me at:
http://blog.sina.com.cn/usstamps
Thanks!

Post a Comment