Monday, May 23, 2011

Sorting NULL columns in SQL Server.

Recently I was working with a report developer who was struggling with NULL date-time values in their report output.  The problem was this BI programmer was used to working with another RDBMS that had control settings specifying the sort order for NULLs.  By SQL convention NULL are ordered first.

Unfortunately the case in the data had “end dates” for when policies come up for renewal. The desired output was to have policies without valid end dates(NULL) to be last in the output.  For this article I’ll use Microsoft’s sample database AdventWorks. The table we’ll use for our report is ProductCostHistory. The EndDate column has NULLs that we want to be listed last in our report.
If we simply query the data with an ORDER-BY we’ll get the following result:
SELECT *
  FROM [AdventureWorks].[Production].[ProductCostHistory]
  ORDER BY [EndDate]

ProductID   StartDate               EndDate                 StandardCost       ModifiedDate
----------- ----------------------- ----------------------- ------------------ -----------------------
707         2003-07-01 00:00:00.000 NULL                    13.0863            2003-06-17 00:00:00.000
708         2003-07-01 00:00:00.000 NULL                    13.0863            2003-06-17 00:00:00.000
              --Data edited out for clarity---
860         2002-07-01 00:00:00.000 2003-06-30 00:00:00.000 9.7136             2003-06-30 00:00:00.000
860         2003-07-01 00:00:00.000 NULL                    9.1593             2003-06-17 00:00:00.000

As expected the NULL EndDate values come first in the list.
My report writer buddy came up some pretty clever work around involving temp tables and self-joins.  Not very clear and poorly performing to say the least.  So I showed him the following as the first  replacement solution to placing NULLs at the end.

SELECT *
  FROM [AdventureWorks].[Production].[ProductCostHistory]
  ORDER BY isnull([EndDate],'12/31/9999')
ProductID   StartDate               EndDate                 StandardCost       ModifiedDate
----------- ----------------------- ----------------------- ------------------ -----------------------
707         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
708         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
              --Data edited out for clarity---
860         2002-07-01 00:00:00.000 2003-06-30 00:00:00.000 9.7136             2003-06-30 00:00:00.000
860         2003-07-01 00:00:00.000 NULL                    9.1593             2003-06-17 00:00:00.000

Exactly what we wanted!  BUT, this assumes that the date value ‘12/31/9999’ is beyond any other date in the table. Many times this solution is a safe assumption. This may NOT always be the case. I’ve seen databases where high values are sometimes used to indicate “out of range” or “present but invalid” etc.
The next solution addresses this condition by finding the MAX value in the column then adding an appropriate delta value to place the column after this maximum value. Here is the solution for our product cost history example:

DECLARE @maxEndDate DATETIME

SELECT @maxEndDate = MAX(EndDate)+1
     FROM [AdventureWorks].[Production].[ProductCostHistory]
SELECT *
  FROM [AdventureWorks].[Production].[ProductCostHistory]
  ORDER BY  isnull([EndDate],@maxEndDate)
ProductID   StartDate               EndDate                 StandardCost       ModifiedDate
----------- ----------------------- ----------------------- ------------------ -----------------------
707         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
708         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
              --Data edited out for clarity---
860         2002-07-01 00:00:00.000 2003-06-30 00:00:00.000 9.7136             2003-06-30 00:00:00.000
860         2003-07-01 00:00:00.000 NULL                    9.1593             2003-06-17 00:00:00.000

As expected the data is sorted in the correct order with the NULL values following the valid dates.
Now for what I think is a better solution and does not require hard coding a MAX(or Min) value or scanning through the data finding the last value. Here is my preferred solution to our date problem:
SELECT *
  FROM [AdventureWorks].[Production].[ProductCostHistory]
  ORDER BY (case when EndDate IS NULL then 1 else 0 end),EndDate
ProductID   StartDate               EndDate                 StandardCost       ModifiedDate
----------- ----------------------- ----------------------- ------------------ -----------------------
707         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
708         2001-07-01 00:00:00.000 2002-06-30 00:00:00.000 12.0278            2002-06-30 00:00:00.000
              --Data edited out for clarity---
860         2002-07-01 00:00:00.000 2003-06-30 00:00:00.000 9.7136             2003-06-30 00:00:00.000
860         2003-07-01 00:00:00.000 NULL                    9.1593             2003-06-17 00:00:00.000

Look at this for a moment.  It is obvious but has a subtle requirement; the CASE statement must precede the column that contains NULL values. The benefit is that this will always work and requires no additional consideration for the range or contents of the column.

Conclusion:  While this may be a simple little trick you’d be surprised how many report writers and data consumers  struggle with exactly this problem.  The CASE solution can also be expanded to provide table sort orders based on column ranges instead of simply NULL versus non-NULL.
-Dave Winters

Friday, June 4, 2010

VoltDB, the newest database idea?

Why VoltDB? The new database without a problem to solve….

I have been tracking the development and launch of the VoltDB effort for the past year and listened very closely to Michael Stonebreaker’s presentation on “Urban SQL Myth” on June 3rd. This event was scheduled to be a presentation to debunk some of the ideas that the growing NoSQL community is touting as to why NOT to use SQL DBMS platforms. Here are my take-aways on VoltDB and Stonebreaker’s latest DB effort.

On the whole- the presentation was very effective in putting a lot of the FUD from NoSQL pundits to rest. Yes, if your application is purely a K-V(key value) type of data application, then NoSQL products like Cassandra may be exactly what you need. But how many applications are purely and narrowly confined to K-V data models? There are folks that have passionate arguments that RDBMS platforms are at the core K-V oriented … but that is a mute point if the solutions require home-brew wrappers in the application or data tier of the application to support a K-V only engine. A point that Stonebreaker correctly pointed out.

As the presentation unfolded it became very clear that the VoltDB solution and counter to NoSQL is that it can perform Insert/Update/Delete operations at speeds that approach Warp 9! Michael gave a presentation that dissected the overhead and load that consume time in traditional RDBMS offerings. His statement, and I tend to agree, is that only 12% of the work done in systems like Oracle, MS SQL, or MySQL are really “useful” work! The rest is dedicated to log processing for recovery, lock management, buffer latching, and thread or process marshalling. He described these as the “four deadly horsemen” of performance. In his opinion, the only way to get a truly high-speed SQL database is to eliminate ALL of these from the database.

How does VoltDB do this? What is the secret sauce? Well very simply … it is a single threaded engine that processes SQL in a time stamped sequential order. OK given that condition eliminating all the spin locks, buffer management, and housekeeping overhead makes perfect sense. Of course it will process SQL inserts, updates, and deletes at superfast speed!

So what????

Really how many applications are narrowly focused on this problem? Not many! If you look at past RDBMS platforms that are in-memory solutions you’ll find very few that are still independent products used by wide audience of developers and applications. Remember TimesTen or Solid? The first was purchased by Oracle and the latter by IBM. Both have been recast in the roles of auxiliary cache like appliances to help offload the main database. The only remaining independent memory optimized platform is AntsDB/GeneroDB from 4Js.com. Sybase licensed its non-locking technology for use in its platform offerings.

What important point did Dr. Stonebreaker miss?

Query load!! If you analyze typical web or human facing applications they tend to be query (Select) intensive with a small percentage of updates to the database. It is the “window shopping” or browse versus buy scenario. How many queries are required to support a shopper trying to book an airline ticket, hotel room, or rental car? Literally the ratio of selects to updates approaches 100-to-1. So just to be safe let’s say the ratio is 10-to-1 for query versus updates. Well if you accept this premise, and most web app developers will, you’ll see the real problem in trying to speed up and scale up is handling time consuming Select loads. VoltDB makes no claims for performance advantage in this area.

Couple the browse-buy load with the fact that a high percentage of this load is repetitive over mostly static data sets … you quickly understand why many large sites implement query caching architectures based on custom application and data layer software. Look at Facebook, Twitter, Ebay, Google, and Sabre(Travelocity).

So I conclude that VoltDB is targeted a very very narrow segment of the market. A market that already has entries that went down the “memory” database path with mixed success. TimesTen, Solid, and AntsDB to name a few.

Stonebreaker kept saying that legacy RDBMS platforms should be sent to “the land of tired old databases” and VoltDB is something new. Well, I suggest that we ought to send single threaded memory databases to “the land of tired and already tried ideas.”

Dave Winters

Thursday, April 8, 2010

TRUE and FALSE values in SQL Server, BIT datatypes

I had an interesting discussion with my office mate today about return values and the what is TRUE and FALSE. He was wondering what value to put as 'success' on a RETURN statement inside of a stored procedure.

My recommendation to him was to follow the @@ERROR convention of 0 (zero) being the value for success. He rightly asked if 0 did also mean 'false' by convention. That got me into a discussion of the BIT data type in SQL Server and the use of 0 or 1 to indicate true or false.

Well it turns out that SQL Server recoginizes two strings; 'TRUE' and 'FALSE' as valid values that can be assigned and used for comparison of BIT variables. Any other strings will cause a conversion error.

To illustrate the point see the code at the bottom of this entry. Why would you ever use the strings 'TRUE' and 'FALSE' versus 1 or 0?

The answer CLARITY!

Regards, Dave

--
--
declare @theBit bit
set @theBit = 'TRUE'
if @theBit = 'TRUE' print 'It is true'
if @theBit = 'FALSE' print 'It is false'
--
print '--Next case--'
set @theBit = 'FALSE'
if @theBit = 'TRUE' print 'It is true'
if @theBit = 'FALSE' print 'It is false'
--
--This will cause a conversion error
--
set @theBit = 'BLAH'
--

Wednesday, February 10, 2010

Sign a SQL Stored Procedure to track changes.

Several times I've run into situations where I need to deploy a SQL stored procedure to clients that maybe complex and critical to correct functioning of the database.

Things go along fine and then I get a call/Email after about 6 months with a bug or performance issue. After a little detective work I think I detect that a critical procedure as been changed either by the client or consultant working on their site. Verifying that the stored procedure is the orginal unaltered version I placed on the server can lead to some long conversations.

So I came up with a solution that allows me to "sign" or finger print any SP after I've installed it. The technique uses SQL 2005 and above ability to hash data with the HashBytes function. The hash contains date information, object definition, object IDs, and a "secret" word.

The secret word allows you to publish the signing procedures without fear that the end-user will fake a signature. At least as long as you don't tell them the secret word.

The resulting binary hash signature is stored in the extended properties of the procedure and cannot be modified using SSMS.

I've found this very helpful with some of my clients that love to tinker with critical procs.

Dave Winters
Terminus
--------------------------------------------------------------------------
--
--
if object_id('USP_Sign_Object') is not NULL
drop procedure USP_Sign_Object
GO
--------------------------------------------------------------
create procedure USP_Sign_Object
@objectName sysname
,@secretWord nvarchar(20)='My Software'
as
--
-- This procedure builds a hash of the contents of a stored procedure for "finger printing"
-- and detecting changes. The hash/signature is stored in the extended properties of the object.
-- Here are some implementatio details:
-- A "secret" word should be supplied to prevent spoofing.
-- The has includes the object ID, create date, modify date, and first 4000 bytes of the DDL
-- To resign the procedure, the "SIGNED" property must be dropped via SSMS
--
-- Dave Winters, January 2010
-- TheSeldonVault.blogspot.com
--
SET NOCOUNT ON
Begin --Top

declare @objid nvarchar(20),
@crdate nvarchar(20),
@modate nvarchar(20),
@signHash sql_variant

select @objid=cast(object_id as nvarchar(20))
,@crdate=cast(create_date as nvarchar(20))
,@modate=cast(modify_date as nvarchar(20)) from sys.objects where [name]=@objectName

if @@rowcount = 0 return -1

SELECT @signHash = cast( HashBytes(
'SHA1',isnull(substring(
@secretWord+@objid+@crdate+@modate
+OBJECT_DEFINITION(OBJECT_ID (@objectName))
,1,4000 ),'EnCrYpTeD' )
) as sql_variant)

EXEC sys.sp_addextendedproperty @name=N'SIGNED'
,@value=@signHash
,@level0type=N'SCHEMA'
,@level0name=N'dbo'
,@level1type=N'PROCEDURE'
,@level1name=@objectName

End --Bottom
GO
--
--exec USP_Sign_Object 'asp_benefit_warnings','MySeCrEtE'
--
------------------------------------------------------------------------
if object_id('USP_Validate_Object') is not NULL
drop procedure USP_Validate_Object
GO
--
create procedure USP_Validate_Object
@objectName sysname
,@secretWord nvarchar(20)='My Software'
as
SET NOCOUNT ON
Begin --Top

declare @objid nvarchar(20),
@crdate nvarchar(20),
@modate nvarchar(20),
@signHash sql_variant,
@currHash sql_variant

select @objid=cast(object_id as nvarchar(20))
,@crdate=cast(create_date as nvarchar(20))
,@modate=cast(modify_date as nvarchar(20)) from sys.objects where [name]=@objectName

if @@rowcount = 0 return -1

SELECT @signHash = cast( HashBytes('SHA1',isnull(substring(
@secretWord+@objid+@crdate+@modate
+OBJECT_DEFINITION(OBJECT_ID (@objectName))
,1,4000 ),'EnCrYpTeD' )
)
as sql_variant)
--
SELECT @currHash = p.value
FROM sys.all_objects AS sp
INNER JOIN sys.extended_properties AS p ON p.major_id=sp.object_id AND p.minor_id=0 AND p.class=1
where sp.name = @objectName--N'asp_benefit_warnings'
and p.name = 'SIGNED'

--select @currHash as CurrHash, @signHash as CompHash

if @currHash <> @signHash
begin
print '!!!Signature Invalid!!!'
return -1
end

Print '---Signature Valid---'
return 0

End --Bottom

--
--
--exec USP_Validate_Object 'asp_benefit_warnings','MySeCrEtE'
--
--