Back when I was saw SQL Azure was producing an @@version string that mentioned version 11 (that's SQL Server 2012's major version number), I'd started looking at the version number every week or so. And looking for SQL Server 2012 functionality in SQL Azure. See the blog post "SQL Denali T-SQL features in SQL Azure now" for a list of the first functions supported. I'm not going to list them all again here. And Ed Katibah's (a.k.a Spatial Ed) latest list of SQL Server 2012 spatial functionality in SQL Azure.

This week, I'd thought to look at the SQL Azure version again. I get:

Microsoft SQL Azure (RTM) - 11.0.1831.30
 Jan 27 2012 23:11:55
 Copyright (c) Microsoft Corporation

This is different (newer) than when I previously looked, so its time to look for new T-SQL 2012 functions again. And to find some. 13 of the 14 new T-SQL scalar functions (all of them except TRY_CONVERT, I wonder why that one didn't make it) are in. That would be: FORMAT, PARSE, TRY_PARSE, IIF, CHOOSE, CONCAT, EOMONTH, and the new "date-from-parts" series.

In addition, the T-SQL 2012 updated version of the T-SQL LOG function that allows a logarithm's base as a second argument works in SQL Azure as well.

There's also one more new function that isn't mentioned in the books online (I hit it by mistake when I thought I was somehow mistaken about there being a TRY_CONVERT and it works in SQL Server 2012 RC0 as well), that is TRY_CAST. It's analogous to TRY_CONVERT, as CAST is analogous to CONVERT, modulo CONVERT's expression and style arguments.

This isn't a complete implementation of SQL Server 2012's T-SQL improvements in SQL Azure, though. SQL Azure is still waiting for:

1. Sequences
2. UTF-16 collations with supplimentary characters
3. Windowing (OVER clause enhancements, LAG/LEAD and others, new analytical functions)

I didn't notice any change in spatial support moving more towards to SQL Server 2012 feature set this time, maybe Ed noticed something.

Well, a bit at a time, I guess. Welcome the new T-SQL functions to SQL Azure.

@bobbeauch

Earlier this month I noticed that SQL Azure, with the latest upgrade has a version of 11...something (actually 11.0.1467.26, to be exact). And I wondered, aside from the long-touted new spatial library, if there were any Denali T-SQL enhancements that were now available. That is, what version of "version 11" is this? I did do some cursory tests of spatial functionality, but Ed Katibah (aka SpatialEd) will be posting the long list of spatial feature tests directly, I don't want to steal his thunder on those. But here's the tests for Denali T-SQL features, tested against SQL Azure, today.

Working in SQL Azure, now:
1. ORDER BY with OFFSET and FETCH
2. THROW (adjunct to TRY-CATCH)
3. Metadata discovery system procedures (sp_describe_first_result_set and friends)
4. EXECUTE ... WITH RESULT SETS
5. FORCESCAN and FORCESEEK hints with extensions

Not working in SQL Azure yet:
1. Sequences
2. UTF-16 collations with supplimentary characters
3. New functions (CONCAT, IIF, EOMONTH, and friends)
4. Windowing (OVER clause enhancements, LAG/LEAD and others, new analytical functions)

So it appears as if T-SQL and Spatial in SQL Azure is now approximately SQL Denali CTP1-- (that's one-minus-minus). Interesting. Wonder when the rest of the functionality will appear. And will it beat the SQL Denali release to RTM?

Bear in mind, of course, that most of the SQL Denali funtionality doesn't apply to SQL Azure (e.g. Always On, security enhancements) because SQL Azure doesn't support this level of control in the first place. But you could almost (if not exactly) state that SQL Azure databases were "contained" databases all along.

Followup from a friend on the SQL Azure team: "The features from Denali that are enabled in SQL Azure are fully supported and can be safely used. There aren't any changes to terms of use of the SLAs we provide."

@bobbeauch

I've been trying this out every day or so since I'd heard about the update (originally named the July 2011 Service Release), but I'd forgotten about it for about a week. So I don't exactly know when this happened on my SQL Azure server. But tonight, around midnight, SELECT @@version returns:

Microsoft SQL Azure (RTM) - 11.0.1465.26 Aug 10 2011 22:54:49 Copyright (c) Microsoft Corporation

So cool, we're at Denali on SQL Azure. Or are we just on Denali spatial (which was the new feature that was mentioned in the announcement). Or, where are we?

First to try some T-SQL Denali features.
  The new THROW statement works, as does OFFSET and FETCH, but...
  Creating a sequence fails, as does LAG/LEAD and EXECUTE WITH RESULT SETS
 
How about spatial? Not trying to be tremendously rigorous for now, I tested some obvious things.
  The new HasM and HasZ properties work, as does the new ShortestLineTo method. Probably some other methods (like the new ST-methods on the geography type) do too.
  The items that were caught by sys.dm_db_objects_impacted_on_version_change WERE affected by the change.
  A test of the new spatial precision indicates that the Denali improved precision is being used but...
  Cannot create a FULLGLOBE type or a CIRCULARSTRING type
  The new AUTO_GRID indexes are not supported
  Attempting to use geometry::UnionAggregate produces "A severe error occurred on the current command. The results, if any, should be discarded."

And BTW, both SSMS from SQL Server Denali and SQL Server 2008 R2 SP1 connect to the updated SQL Azure version just fine. I don't have any earlier SSMS versions to try right now.

Hmmm...well that's kind of a mixed bag now, isn't it? I know we're not in Kansas anymore, but we're not all in Denali either. I'm sure someone will come up with a feature matrix for this SQL Azure version (or I will), but there's enough of a subset of new features to make it interesting. More info as it arrives.

@bobbeauch

In the previous post, I insinuated that allowing LAG/LEAD to be sensitive to value RANGEs might help with series of data with missing values. It won't do that unless there's some logic that allows the offset that LAG/LEAD uses to be calculated on a row-by-row basis (i.e. LAG/LEAD uses an expression based on the window values). That's a little too much to expect; notice I've updated the original post to back off a bit on that one. I'll need my densified data for that.

However, a common scenario that would be helped by an expansion of use of RANGE in the window specification is rolling totals/rolling balances. Here's an example.

Suppose I wanted to calculate a 3-month rolling average (or rolling total) of sales on a per-employee basis. Looking at an employee who doesn't make at least one sale every month, I get a skewed total/average. The totals don't reflect the last three calendar months, but instead they reflect the last three months in which the employee had a nonzero sales total. Skews the figures upward a little....

SELECT  Employee,
        [Year] ,
        [Month] ,
        EmployeeTotal AS SalesThisMonth,
        COUNT(*) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS NumberOfMonths,
        SUM(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthTotal,
        AVG(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthAverage
FROM EmployeeSalesByMonth
WHERE Employee = 272
ORDER BY Employee, [Year], [Month];

                        SalesThisMonth Number   ThreeMonthTotal ThreeMonthAverage
272     2005    8       20544.7015      1       20544.7015      20544.7015
272     2005    9       2039.994          2       22584.6955      11292.3477  
272     2005    11     6341.551          3       28926.2465      9642.0821
272     2006    2       61206.4782      3       69588.0232      23196.0077
272     2006    3       18307.746        3       85855.7752      28618.5917
272     2006    4       33406.7043      3       112920.9285     37640.3095

Suppose I had a way to specify a time interval and use RANGE instead of ROWS. Something like: "RANGE BETWEEN '2 MONTHS' PRECEEDING AND CURRENT ROW" where '2 MONTHS' represents a time-interval data type. Then I'd get the "non-skewed" answer. Currently SQL Server Denali doesn't even allow that form the RANGE specification (RANGE BETWEEN <unsigned value specification> PRECEDING) with any data type currently. Instead, I can use the densified data (see previous post) with the rows window and get that "non-skewed" answer.

WITH SalesByMonth AS
(
SELECT Employee,
       DATEPART(yyyy,dt) as Year,
    DATEPART(MONTH,dt) as Month,
    ISNULL(EmployeeTotal, 0) as EmployeeTotal
FROM #allMonths m
OUTER APPLY (
SELECT  EmployeeTotal
FROM    EmployeeSalesByMonth e
WHERE m.Employee = e.Employee  
 AND    DATEPART(yyyy,dt) = [Year]
 AND    DATEPART(mm,dt) = [Month]
) AS t
)
SELECT  Employee,
        [Year] ,
        [Month] ,
        EmployeeTotal AS SalesThisMonth,
        COUNT(*) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS NumberOfMonths,
 SUM(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthTotal,
        AVG(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthAverage
FROM SalesByMonth
WHERE Employee = 272
ORDER BY Employee,
        [Year],
        [Month];
GO

                        SalesThisMonth Number   ThreeMonthTotal ThreeMonthAverage  
272     2005    7       0.00                 1       0.00                 0.00
272     2005    8       20544.7015      2       20544.7015      10272.3507
272     2005    9       2039.994          3       22584.6955      7528.2318
272     2005    10     0.00                 3       22584.6955      7528.2318
272     2005    11     6341.551          3       8381.545          2793.8483

Notice, however, that there's another subtle change in the answer. In choosing to start the time series exactly on 2005-07 (rather than the month when each employee had his/her first sale), the averages for the first couple of months for this employee are a little low. We can account for this, if we want, by basing the #allmonths table on each employee's hire date, for example. Yet another instance of "always pay attention to what you're actually measuring".

@bobbeauch
 

You've heard my rant before "measure what you think you are measuring". If not, follow the link. Here's an example using the LAG function, new in SQL Server Denali, to measure sales trends. We'll start with a view from AdventureWorksDW2008R2 (the double cast is for clarity).

CREATE VIEW dbo.EmployeeSalesByMonth
AS
SELECT  EmployeeKey as [Employee],
        DATEPART(yyyy, CAST(CAST(OrderDateKey as varchar(8)) as DATE)) AS [Year] ,
        DATEPART(mm, CAST(CAST(OrderDateKey as varchar(8)) as DATE)) AS [Month] ,
        SUM(SalesAmount) AS [EmployeeTotal]
FROM    dbo.FactResellerSales
WHERE    EmployeeKey IS NOT NULL
GROUP BY EmployeeKey,
        DATEPART(yyyy, CAST(CAST(OrderDateKey as varchar(8)) as DATE)),
        DATEPART(mm, CAST(CAST(OrderDateKey as varchar(8)) as DATE))
GO

Let's use LAG to get sales for this month, previous month, and three months ago. I'm choosing a specific employee, just leave the WHERE clause out to get everyone.

SELECT  Employee,
        [Year] ,
        [Month] ,
        EmployeeTotal AS SalesThisMonth,
        LAG(EmployeeTotal, 1, 0.00) OVER (PARTITION BY Employee ORDER BY [Year], [Month]) AS SalesLastMonth ,
        LAG(EmployeeTotal, 3, 0.00) OVER (PARTITION BY Employee ORDER BY [Year], [Month]) AS SalesThreeMonthsAgo
FROM dbo.EmployeeSalesByMonth
WHERE Employee = 272
ORDER BY Employee, [Year], [Month];
GO

The LAG function works perfectly. Or does it...? Let's look at the first few rows.

                               This Month      Last Month      Three Months Ago
272     2005    8       20544.7015      0.00               0.00
272     2005    9       2039.994         20544.7015     0.00
272     2005    11      6341.551        2039.994         0.00
272     2006    2       61206.4782      6341.551        20544.7015
272     2006    3       18307.746       61206.4782     2039.994

Although this sounded like a good idea, and the function is working as advertised, the answer isn't what we want. Look at the 2005-11 line, for example. Last month was 2005-10, but there was no sales that month, so we get 2005-9 sales. Three months ago (2005-08) there were sales, but we get zero. What gives?

We're only producing rows for months where there were sales for employee 272. Probably not what you'd want. Although we think we're counting months with this LAG function, we're actually counting ROWS. SQL Server BOL implementation of LAG states that the offset (first parameter) is "The number of rows back from the current row from which to obtain a value."

So, can we fix things to get "reasonable" values? Because LAG uses rows, we'd need to come up with some blank (zero total) rows for months where there are no sales. Some databases have a special data densification syntax, like "partition-by joins" to fill in the gaps. Using Itzik Ben-Gan's dbo.GetNums TVF, we can make a table of allMonths and all Employees between a date range. (Note: this code is a quick hack and pretty fragile, but we will end up with the right answer. It also assumes you have dbo.GetNums in tempdb, see the SQLMag article for that function).

DECLARE
 @startdt AS DATE = '20050701',  -- first date we care about
 @enddt   AS DATE = '20080731'   -- last date we care about
 
SELECT DATEADD(month, n-1, @startdt) AS dt, EmployeeKey AS Employee
INTO #allMonths
FROM tempdb.dbo.GetNums(DATEDIFF(month, @startdt, @enddt) + 1) AS Nums
CROSS JOIN
(
SELECT DISTINCT EmployeeKey
FROM dbo.FactResellerSales
) AS A; 
GO

With this table in hand, we can use an OUTER APPLY to "add in" the months with no sales, and our sales trend query becomes:

WITH SalesByMonth AS
(
SELECT Employee,
       DATEPART(yyyy,dt) as Year,
    DATEPART(MONTH,dt) as Month,
    ISNULL(EmployeeTotal, 0) as EmployeeTotal
FROM #allMonths m
OUTER APPLY (
SELECT  EmployeeTotal
FROM    EmployeeSalesByMonth e
WHERE m.Employee = e.Employee  
 AND    DATEPART(yyyy,dt) = [Year]
 AND    DATEPART(mm,dt) = [Month]
) AS t
)
SELECT  Employee,
        [Year] ,
        [Month] ,
        EmployeeTotal AS SalesThisMonth,
        LAG(EmployeeTotal, 1, 0.00) OVER (PARTITION BY Employee ORDER BY [Year], [Month]) AS SalesLastMonth ,
        LAG(EmployeeTotal, 3, 0.00) OVER (PARTITION BY Employee ORDER BY [Year], [Month]) AS SalesThreeMonthsAgo
FROM SalesByMonth
WHERE Employee = 272
ORDER BY Employee,
        [Year],
        [Month];
GO

A few corresponding rows...the sales figures now agree with the column headings

                               This Month      Last Month      Three Months Ago
272     2005    8       20544.7015      0.00               0.00
272     2005    9       2039.994         20544.7015     0.00
272     2005    10     0.00                2039.994         0.00
272     2005    11     6341.551         0.00               20544.7015
272     2005    12     0.00                6341.551        2039.994
272     2006    1       0.00                0.00                0.00
272     2006    2       61206.4782     0.00                 6341.551
272     2006    3       18307.746       61206.4782      0.00

So remember...although the default window for the OVER clause really is RANGE UNBOUNDED PRECEDING AND CURRENT ROW, the LAG function only counts ROWs, not RANGEs.

@bobbeauch  

So, quickly on the heels of the first window clause and last_value() question, came a followup:

OK smartie, why does last_value work fine here? I didn't have to change from the default window. What gives?

select SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID) as FirstOrderForSalesPerson,
 last_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID) as LastOrderForSalesPerson
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL
order by SalesPersonID, SalesOrderID

It's close to the last example, except that we're ordering by SalesPersonID instead of SalesOrderID. And, partitioning by (works like group by except that we keep all the detail rows in each group) SalesPersonID. So why DOES it produce the "right" answer for each SalesPersonID?

Remember that the query above is equivalent to:

select SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID
                                range between unbounded preceding and current row) as FirstOrderForSalesPerson,
 last_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID
                               range between unbounded preceding and current row) as LastOrderForSalesPerson
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL
order by SalesPersonID, SalesOrderID

And it gets the intuitive "right" answer because of the way the "range" windowing spec handles ties. Since the order by clause specifies "SalesPersonID", all the rows for the same SalesPersonID are ties. The window frame with range includes all rows with the same value (all ties) as the part of the frame. So, in the query above, SalesPersonID X has multiple (tied) rows in the window, range considers all of them.

Contrast to this, using the "rows" windowing spec, which *doesn't* consider multiple tied rows as part of the window, we're back to our "weird" answer.

select SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID
                                rows between unbounded preceding and current row) as FirstOrderForSalesPerson,
 last_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID
                               rows between unbounded preceding and current row) as LastOrderForSalesPerson
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL
order by SalesPersonID, SalesOrderID

A different way to prove to yourself that RANGE considers ties part of the frame and ROWS doesn't is to use the use the SUM aggregate instead of FIRST/LAST_VALUE().

-- all of SumOfTotal for a specific SalesPersonID are equal with RANGE
-- SumOfTotal is equal for each row, each SalesPersonID
select SalesPersonID, SalesOrderID, TotalDue,
 sum(TotalDue) over(partition by SalesPersonID order by SalesPersonID) as SumOfTotal
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

-- you get a "running total" when using ROWS
select SalesPersonID, SalesOrderID, TotalDue,
 sum(TotalDue) over(partition by SalesPersonID order by SalesPersonID
                    rows between unbounded preceding and current row) as SumOfTotal
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

However, notice that the running total isn't necessarily by SalesOrderID becuase without an "order by" in the OVER clause, the spec doesn't guarentee ordering by both columns. Note for example, that in the query above, for SalesPersonID 281, SalesOrderID 48327 comes after 48370 (at least it does on my machine, without additional indexes on SalesOrderHeader table).

If you truly want running total by SalesOrderID, in SalesOrderID order, add SalesOrderID to the "ORDER BY" in the OVER clause. When you add SalesOrderID to ORDER BY, now there are no ties in ordering. So either ROWS or RANGE will do the trick.

-- same answer as next query
select SalesPersonID, SalesOrderID, TotalDue,
 sum(TotalDue) over(partition by SalesPersonID order by SalesPersonID, SalesOrderID
                    rows between unbounded preceding and current row) as SumOfTotal
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

-- same answer as previous query, uses RANGE by default
select SalesPersonID, SalesOrderID, TotalDue,
 sum(TotalDue) over(partition by SalesPersonID order by SalesPersonID, SalesOrderID) as SumOfTotal
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

Hope this helps, Bob

@bobbeauch

I've been working with the new Denali T-SQL windowing functionality and ran into someone who asked about this "problem". It's almost sure to become an FAQ.

Why does last_value not always "work right"?

Take, as an example, the following query (against Adventureworks2008R2)

select  AccountNumber,  SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by AccountNumber order by SalesOrderID) as FirstOrderForAcct,
 last_value(SalesOrderID) over(partition by AccountNumber  order by SalesOrderID) as LastOrderForAcct
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

When you look as the results, FirstOrderForAcct seems OK, but LastOrderForAcct always returns the same number as the current account for each row. Not the last number. Why?

The reason is that, when using first/last_value with a window of rows, the default window is "range between unbounded preceding and current row". So the query above with the "weird" result is equivalent to this one.

select  AccountNumber,  SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by AccountNumber order by SalesOrderID
                                range between unbounded preceding and current row) as FirstOrderForAcct,
 last_value(SalesOrderID) over(partition by AccountNumber  order by SalesOrderID
                               range between unbounded preceding and current row) as LastOrderForAcct
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

If you want the "intuitive" answer for last_value, just change the window clause to:


select  AccountNumber,  SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by AccountNumber order by SalesOrderID
                                range between unbounded preceding and unbounded following) as FirstOrderForAcct,
 last_value(SalesOrderID) over(partition by AccountNumber  order by SalesOrderID
                               range between unbounded preceding and unbounded following) as LastOrderForAcct
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

@bobbeauch

I'm always leery when I hear people say "statistics show that..." followed by whatever their opinion is. Scientists do it. And your users probably do it too. I worked with a product called SAS once, on statistics for response time. Got some lovely reports and statistics, *from which other folks draw conclusions*. It's important to step back every once in a while, and make sure you're measuring what you think you're measuring. And that it really does add up. I've made that into kind of a koan "I don't want to hear your conclusion, I want to see your raw data". And, if I'm really interested, I'll do the math "by hand". Amazingly (well maybe not so amazingly) the conclusion doesn't always jibe with the raw data and/or what folks *think* they're measuring.

End of rant.

So SQL Server Denali CTP3 added some new analytical functions: PERCENT_RANK, CUME_DIST, PERCENTILE_DISC and PERCENTILE_CONT. The first three, I have no problem with. The Denali BOL lists how it works and I can confirm this using sample data. Good. The PERCENTILE_CONT function though...that's a little different.

BTW, There is a slight typo in BOL for PERCENT_RANK. It says "The range of values returned by PERCENT_RANK is greater than 0 and less than or equal to 1." Actually, PERCENT_RANK can, and is, in their example, sometimes *equal to 0*. It's CUME_DIST that is greater than 0 and less than or equal to 1. Fine.

For PERCENTILE_CONT, the SQL Server BOL says: "Calculates a percentile based on a continuous distribution of the column value... The result is interpolated and might not be equal to any of the specific values in the column." That's sufficiently vague, interpolated how? And "might not be equal to any of the specific values in the column"? That's as opposed to PERCENTILE_DISC which is always equal to one of the specific values of the column, by definition.

I've also read that "PERCENTILE_CONT(X) examines the percent_rank of values in a group until it finds one greater than or equal to X." In two or three places on the web. Nope. Not true. Doesn't jibe with "might not be equal to any of the specific values in the column".

So, I'm intrigued . And, last night, tried to figure it out by running a variation of BOL sample query.

USE AdventureWorks2008R2;

SELECT Name AS DepartmentName
      , ph.Rate
   ,PERCENT_RANK() OVER(PARTITION BY Name ORDER BY ph.Rate) as Percent_rank
      ,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ph.Rate)
                            OVER (PARTITION BY Name) AS MedianCont
   ,CUME_DIST() OVER(PARTITION BY Name ORDER BY ph.Rate) as Cume_Dist
      ,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ph.Rate)
                            OVER (PARTITION BY Name) AS MedianDisc
FROM HumanResources.Department AS d
INNER JOIN HumanResources.EmployeeDepartmentHistory AS dh
    ON dh.DepartmentID = d.DepartmentID
INNER JOIN HumanResources.EmployeePayHistory AS ph
    ON ph.BusinessEntityID = dh.BusinessEntityID
WHERE dh.EndDate IS NULL;

Let's look at a few groups in the raw data (the three columns that matter have headers):

                Rate     Percent Rank            MedianCont
-----------------------------------------------------------------------------------------------
Executive 39.06    0                              54.32695              0.25   48.5577
Executive 48.5577 0.333333333333333 54.32695                0.5   48.5577
Executive 60.0962 0.666666666666667 54.32695              0.75   48.5577
Executive 125.50   1                             54.32695                   1   48.5577

Tool Design 8.62       0                          25                             0.166666666666667 25.00
Tool Design 23.72     0.2                       25                             0.333333333333333 25.00
Tool Design 25.00     0.4                       25                             0.666666666666667 25.00
Tool Design 25.00     0.4                       25                             0.666666666666667 25.00
Tool Design 28.8462 0.8                       25                             0.833333333333333 25.00
Tool Design 29.8462 1                          25                                                       1   25.00

For "Executive", the value is interpolated, like the BOL says. Not a value in the set. For "Tool Design", the MedianCont (PERCENTILE_CONT(0.5)) is 25, a value in the set, but not a value where percent rank is > 0.5 (its 0.4). Oh. So I ripped out the rows that produced the resultset and played with these for a while, adding new rows to each set and watching how the results changed. No joy. After yet another web search (I DID use web search, I swear, but missed this one the first 10 times) I came across this in Oracle Database SQL Reference:

"The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+ (P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).
 
The final result will be:
 If (CRN = FRN = RN) then the result is
    (value of expression from row at RN)
  Otherwise the result is
    (CRN - RN) * (value of expression for row at FRN) +
    (RN - FRN) * (value of expression for row at CRN)
".

SQL Server's numbers do agree with Oracle's for the "demo" (that's scott/tiger) database examples. I tried them (yes, I really have a "demo" database on Denali, it's lamer than even pubs. But they've replaced it now, with more robust samples DBs). So let's work these two out.

For "Tool Design":
declare @p float = 0.5
declare @n int = 6
select 1+ (@p*(@n-1)) --3.5
select ceiling(1+ (@p*(@n-1))) --4
select floor(1+ (@p*(@n-1))) --3
-- value at row 3 = 25,value at row 4 = 25
select .5*25 + .5*25 -- 25

25 is the right answer

For "Executive":
declare @p float = 0.5
declare @n int = 4
select 1+ (@p*(@n-1)) --2.5
select ceiling(1+ (@p*(@n-1))) --3
select floor(1+ (@p*(@n-1))) --2
-- value at row 2 = ,value at row 3 =
select .5*48.5577 + .5*60.0962 -- 54.32695

54.32695 is the right answer

Got it? If you think you have it, run the sample query and work out dept "Finance". Then (if you're up late like I was, and I worked out most/all of the groups in the sample query), go to sleep. You've earned it. And...you may not ever use PERCENTILE_CONT in your day-to-day work.

[Rant on] To reiterate, when looking at reports that contain statistical calculations, make sure you know what the statistic means. And check it to make sure it DOES mean that, especially if you think the numbers look "weird" (ie, don't jibe with the description). We aren't all rocket scientists. And rocket scientists may not have a background in statistics anyhow. But I'll bet they (like everyone) have opinions. And are hankerin' to "prove" them.

@bobbeauch

In a previous blog entry a while ago, I wrote about writing a single SQL statement that did a SQL MERGE operation, used the OUTPUT clause to put out a rowset and directed the rowset into an INSERT statement. This happens in one statement without the need of explicitly defining a temporary table and using multiple SQL statements.

Turns out that this feature has a name: Composable DML. I've also heard it called (in SQL Server Books Online) "DML table source".

I've also heard the term composable queries, both in references to both Entity Framework functions and LINQ to SQL. They even have an "IsComposable" attribute in the Function definition) In this context, it means that the output of a function that produces a rowset (in this case a SQL Server table-valued function) or code that produces a rowset can be combined with further queries that do more filtering or projection, or even retrieve related rows. When the queries are submitted to the database, they are composed into a single SQL statement.

The point is pretty similar, reduce the resources required to accomplish a set of operations by reducing the number of statements or database round-trips needed.

One last SQL syntax post for the evening...

We've all heard about SQL Server 2008 row constructors. They allow syntax like this to work:

CREATE TABLE name_table (name varchar(20), age int);
go
INSERT INTO name_table VALUES ('Bob', 54), ('Mary', 30), ('Sam', 15), ('Buddy', 9);
go

But how about using them as a table source:

SELECT n.name, n.age, tab.species
FROM name_table n
JOIN
(
  VALUES ('Bob', 'person'), ('Mary', 'person'), ('Sam', 'cat'), ('Buddy', 'cat')
) tab (name, species)
ON n.name = tab.name;

You specify a table alias and name the columns, and its just another (synthesized on the fly) table.

SQL Server 2005 introduced the OUTPUT clause in DML statements. With the OUTPUT clause, you can output column values to a table variable, a table, or return these values to the user. You can even use multiple OUTPUT clauses to push the values into both a table variable and a table from the same statement.

In SQL Server 2008 there is an additional option. You can use your OUTPUT column values directly in an INSERT-SELECT statement. Here's what it would look like, using MERGE with an OUTPUT clause (and an example from one of the early webcasts):

CREATE TABLE Stock (Stock VARCHAR(10) PRIMARY KEY, Qty INT CHECK (Qty > 0));
CREATE TABLE Trades (Stock VARCHAR(10) PRIMARY KEY, Delta INT CHECK (Delta <> 0));
CREATE TABLE AuditChanges (Action varchar(6), Stock VARCHAR(6), Qty INT);
GO

INSERT Stock VALUES('MSFT', 10), ('BOEING', 5);
INSERT Trades VALUES('MSFT', 5), ('BOEING', -5), ('GE', 3);
GO

INSERT INTO AuditChanges
SELECT * FROM
(
MERGE Stock S
  USING Trades T
  ON S.Stock = T.Stock
  WHEN MATCHED AND (Qty + T.Delta = 0) THEN
    DELETE
  WHEN MATCHED THEN
    UPDATE SET Qty += T.Delta
  WHEN NOT MATCHED THEN
    INSERT VALUES(Stock, T.Delta)
  OUTPUT $action, T.Stock, inserted.Qty
) tab (action, stock, qty);
GO

Notice how the OUTPUT clause requires a table alias (in this case "tab") and needs to name the columns returned from OUTPUT.

I've been looking at the mapping of the new Katmai date/time data types to .NET types, both from the point of view of SQL Server ADO.NET clients and of SQLCLR procedures/functions/triggers. There are a couple of things that "interesting", if I'm not misunderstanding something obvious.

There are no new System.Data.SqlTypes to correspond to the new SQL Server data types. The beta Visual Studio docs state this as a fact. So, unless SQLCLR supports the generic Nullable types as parameters in Katmai, you'll pass these in and out as strings. Or only use them in UDFs with RETURNS NULL ON NULL INPUT. No word yet (that I'm aware of) on Nullable type support in SQL Server 2008, BUT the new HierarchyID type (which is .NET based) supports INullable, like SqlTypes do. So I can pass in a NULL HierarchyID to SQLCLR, but not a NULL TIME parameter.

SQL Server's TIME data type is mapped to TimeSpan, which is a time interval. data type. This might confuse folks that know SQL Server doesn't support SQL-99's date/time interval data types. But the TimeSpan seems to have appropriate semantics as long as you stay away from the "Days" related properties and methods.

There is a new .NET data type for DATETIMEOFFSET, System.DateTimeOffset. That's good.

To distinguish between SQL Server 2000/2005/2008 usage/mappings, there is a (client-side) connection string parameter: "Type System Version". You have a choice of 2008, 2005, 2000, or "Latest". This was already used in SQL Server 2005 to distinguish between 2005 (XML data type exists) vs 2000 (XML data type is a long string). And other distinctions.

I wonder if LINQ for SQL and Entity Framework will support these... ;-)

I've been trying out table-valued parameters along with ADO.NET support in Orcas and came across an interesting dilemma. It centers around INSERTs using TVP against a table with an identity column. ADO.NET can use DataTable, IDataReader or IList<SqlDataRecord> to represent a TVP parameter. Let's say that I want a TVP and a procedure for insert that looks like this:

CREATE TYPE dbo.JobsTableWithIdentity AS TABLE (
  job_id smallint identity primary key,
  job_desc varchar(50),
  min_lvl tinyint,
  max_lvl tinyint
);

The "insert proc" would look like this:

CREATE PROCEDURE dbo.InsertJobsID (@tvp1 dbo.JobsTableWithIdentity readonly)
as
INSERT INTO dbo.Jobs (job_desc, min_lvl, max_lvl)
  SELECT job_desc, min_lvl, max_lvl from @tvp1;

Using this in ADO.NET (with either DataTable or IDataReader as a parameter) produces the error: "INSERT into an identity column not allowed on table variables. The data for table-valued parameter "@tvp1" doesn't conform to the table type of the parameter." But I didn't DO an insert into an identity column in the proc. And this works just fine in T-SQL:

DECLARE @t dbo.JobsTableWithIdentity;
INSERT @t VALUES('hi', 10, 10);
INSERT @t VALUES('hi2', 10, 10);
EXEC InsertJobsID @t;

So its ADO.NET "deciding" this is an error. The ADO.NET workaround (if I do want to start with a DataTable that contains the identity column, add rows to it, and call Update) is this:

CREATE TYPE dbo.JobsTableWithoutIdentity AS TABLE (
  -- job_id smallint identity primary key,
  job_desc varchar(50),
  min_lvl tinyint,
  max_lvl tinyint
);

-- sproc dbo.InsertJobsNoID changed accordingly

// and then, in ADO.NET code
// DataTable "t" contains a real jobs table, to which I've added rows

DataTable added = t.GetChanges(DataRowState.Added);
added.Columns.Remove("job_id");
da.InsertCommand.CommandText = "dbo.InsertJobsNoID";
da.InsertCommand.CommandType = CommandType.StoredProcedure;
da.InsertCommand.Parameters.AddWithValue("@tvp1", added);

But, should I have to do this? Or modify the T-SQL code, given I've not used the IDENTITY column? But, I will need this column to UPDATE (or MERGE) using the TVP. A dilemma...

SQL Server 2008 will contain an ANSI SQL standard MERGE statement with extensions. Listening to the webcast last Friday, there's a fairly straightforward way to describe how this works.

Let's go back to first principals. Relational databases support two ways to do UPDATE and DELETE; positioned updates and searched updates. In a positioned update, you open an updatable cursor over a set of rows, navigate to the row you want, and issue an "UPDATE...WHERE CURRENT OF" statement. The searched update (UPDATE foo SET x=y WHERE z=1) is a SELECT and UPDATE in one statement. You specify the rows you want and also update them. The way you specify/gather the rows to update is mostly up to you, SQL Server even supports using a JOIN to do this.

update t
set t.name = s.name, t.age  = s.age
from [target] t
join [source] s on t.id = s.id;
go

MERGE in SQL Server 2008 is a searched operation that can operate on up to three different rowsets depending on how you specify the MERGE statement. These are actually the three rowsets you can get from a FULL OUTER JOIN. The statement:

select [target] t
inner join [source] s on t.id = s.id;

gets the rows in table T with a matching id value in table S. Let's call this rowset1. A left outer join will also include the rows in T that do not match S (rowset2); right outer join includes the rows in T that don't match S (rowset3), and FULL OUTER JOIN contains all three rowsets.

In MERGE, you can get up to all three rowsets (INNER, LEFT, and RIGHT join rowsets)

merge [target] t
 using [source] s on t.id = s.id
 when matched then update t.name = s.name, t.age = s.age -- use "rowset1"
 when not matched then insert values(id,name,age) -- use "rowset2"
 when source not matched then delete; -- use "rowset3"

The query processor will do the appropriate type of join to gather the rowsets that you need, as all clauses are not required. This optimizes performance over multiple statements, as you only have to gather the rows to process once.

Even, if you use only "when matched", MERGE is an improvement over our first "update using a join". If more than one row in the source matches one row in the target...

insert into t values(1, 'Fred', 42)
insert into s values(1, 'Buddy', 43)
insert into s values(1, 'Sam', '95)

The update using a join non-deterministically picks a row in the source to do the update, it could be Buddy or Sam. MERGE throws the following exception.

Msg 8672, Level 16, State 1, Line 1
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to  ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.

MERGE can actually do more than three operations using predicates in the "match/no match clauses", but that's it for now.

ORDPATH is a hierarchical labeling scheme used in the internal implementation of the XML data type in SQL Server 2005. It's meant to provide optimized representation of hierarchies, simplify insertion of nodes at arbitrary locations in a tree, and also provide document order. It's described in the academic whitepaper "ORDPATHs: Insert-Friendly XML Node Labels". In addition to being used internally when the XML data type is stored, its also part of the key of the PRIMARY XML INDEX, used to speed up XQuery. It's an implementation/specialization of the path enumeration model of representing hierarchies in relational databases, mentioned in Joe Celko's book "Trees and Hierarchies in SQL".

In SQL Server 2008, there are additional uses of ORDPATH. There is a new system data type HierarchyID, that will likely use ORDPATH in its implementation. This allows simply hierarchies to be represented as relational column and provides methods that optimize common hierarchical operations (like parent, child, sibling, ancestors, descendants) without being concerned about the intricacies of elements and attributes.

In addition to representing and indexing XML and hierarchies, Michael Rys mentioned at his TechEd chalktalk on spatial data that the spatial data types may be indexed using a multi-level grid system and that these indexes would also use ORDPATH. Since neither HierarchyID or spatial types are in the current CTP of SQL Server 2008, we'll have to wait a bit to see if this is truly "ordpath everywhere".

SQL Server 2008 will contain, if my count is correct, 7 new data types. Note: none of these are in the current CTP.

DATE - ANSI-compliant date data type
TIME - ANSI-compliant time data type with variable precision
DATETIMEOFFSET - timezone aware/preserved datetime
DATETIME2 - like DATETIME, but with variable precision and large date range

GEOMETRY - "flat earth" spatial data type
GEOGRAPHY - "round earth" spatial data type
HIERARCHYID - represents hierarchies using path enumeration model

The first four (date/time series) are NOT implemented/exposed as .NET system UDTs, but the last three are exposed as .NET system UDTs. This means that, for the first time, .NET will be used as part of SQL Server. In SQL Server 2005, there were originally DATE and TIME data types implemented in .NET, but implementing temporal data is an intricate process. After much wailing and gnashing of teeth by some members of the user community, these were removed.  One of the complaints was the .NET implementation. Hmmm...

One of the nice side-effects of implementing spatial and hierarchyid as .NET types is that these will be shipped as a separate assembly, and that the types will be available for client-side and middle-tier use as well as in the database. So if you want to do some massive number crunching of spatial sequences on a computation server and the network traffic from database to computation server is acceptable, you can do so.

In general, SQLCLR makes the "logic in database or middle-tier" argument easier to deal with. Although there's no "run on database or run on server" switch in VS, with minimal code changes you can move your logic, or even duplicate the logic between tiers if need be. You can't do this with T-SQL; although its faster and better for data access on the database, it doesn't run outside the database. Unless you want to use SQL Server Express Edition as an application server. But that's a discussion for another time.

The June CTP of SQL Server 2008 contains support for table-valued parameters. Here's a usage scenario for these that has been around for a while.

Imagine you are running an online store and deal with (among other data) orders and order detail lines. You'd like to have a stored procedure that can add an entire order in one server round trip, regardless of the number of items that I order (that is, 1 order header and 1->n order detail lines). You'd even settle for two round trips, one for the order header, one for the details. Before SQL Server 2008, there is no built-in mechanism that supports this. In the past, I've seen some pretty interesting workarounds, such as:

1. Compose an arbitrarily large SQL batch as a single "command text" by using string concatenation on the client/middle-tier. All SQL Server database APIs support one (and only one) batch per Command. ADO classic did something like this when you inserted/updated/deleted multiple rows in a disconnected Recordset and called for a "batch update".

2. Create a stored procedure with some "static" parameters and an arbitrarily large number of repeating parameters, most of which will always be NULL. The limit to the number of parameters in a stored procedure is 2100. Both this method and the previous one make for some pretty hideous-looking code.

3. Use multiple parameter sets. OLE DB does support multiple parameter sets and some databases can optimize inserts that use multiple parameter sets. The SQL Server providers, at least last time I looked with SQL Profiler, turn multiple parameter sets into multiple calls, that is, one round trip per parameter set. That's not what I wanted.
 
SQL Server 2008's solution to this age-old problem is table-valued parameters (TVPs, for short). You start using a TVP by creating a custom type, using the CREATE TYPE statement, like this:

CREATE TYPE lineitem_type (line_number INT, order_id INT, product INT, quantity INT);

Information about these table types appear in sys.types and also in a new metadata view, sys.table_types. Once you've created such a table type, you can use it in T-SQL like this:

CREATE PROC new_order (@order_id INT, @line_items lineitem_type)
AS
-- silly table names used for clarity
INSERT INTO orders_table VALUES (@order_id ... ) ;
INSERT INTO line_items_table
  SELECT * FROM @line_items;

One round trip. Compact, clean code. Nice.

SQL Server 2008 will include GROUPING SETS; a (very) short explanation is that these allow the equivalent of multiple GROUP BY clauses in a single SQL statement. The result is a UNION ALL of the resultsets. SQL Server 2008 also contains/allows standard syntax for ROLLUP and CUBE, which have been in SQL Server for a while.

One way to use (or to think of) grouping sets is that, while ROLLUP with N columns produces a UNION of N+1 results and CUBE produces N-squared -1 results, grouping sets can produce an intermediate number of results, when not all the dimensions produced by CUBE are needed. Grouping sets should allow better optimization of this type of dimensional query.

SQL Server 2008 Reporting Services will contain a new type of control, called the TABLIX. The SQL Server 2008 CTP BOL defines a TABLIX as: "A Reporting Services RDL data region that contains rows and columns resembling a table or matrix, possibly sharing characteristics of both." Grouping sets sound like a good fit with this component. When TABLIX is available, a short profiler session would confirm this.

Of course, in addition to performance benefits and TABLIX support, GROUPING SETS are part of the ISO-ANSI SQL-2006 spec. Another plus.

Here's another blog posting to answer a question from over a month ago.

With separation of users and schemas, its known that the CREATE USER DDL statement without a DEFAULT_SCHEMA parameter assigns a DEFAULT_SCHEMA of dbo. Which the new user usually has no access to. And that sp_adduser, for backward compatibility, will CREATE a SCHEMA named after the user and assign that SCHEMA as the user's default schema. Question was, "does sp_dropuser do the right thing and drop the schema named after the user?"

Short answer is "yes, it does". Pretty easy to prove. Actually, sp_adduser can eventually call sp_grantdbaccess and sp_dropuser eventually can call sp_revokedbaccess, so...

create login bob with password = 'A^#DNEfdfhkWD#*iubdwc )000ks1'
go

use adventureworks
go

sp_grantdbaccess 'bob'
go

select * from sys.schemas  -- bob schema is there
go

sp_revokedbaccess 'bob'
go

select * from sys.schemas  -- bob schema is gone
go

The more intriguing thing is when you check this by using the OBJECT_DEFINITION system function:

-- creates schema
PRINT OBJECT_DEFINITION(OBJECT_ID('sys.sp_grantdbaccess'))

-- drops schema
PRINT OBJECT_DEFINITION(OBJECT_ID('sys.sp_revokedbaccess'))

Sure enough, there is nicely commented code in sp_revokedbaccess to drop a schema. BUT rather than using the DROP SCHEMA DDL statement, it looks like this:

EXEC %%Owner(Name = @name_in_db).DropSchema(OwnerType = 1)

Hmmm....

I've been doing my talk on try-catch in T-SQL for a while and whining about not being able to "rethrow" (via RAISERROR) a system error. Even went as far as writing a RegEx decoder so that I can copy the error into a user error message text and parse the text back on the client. Two weeks ago in a class, and again this week at TechEd, someone suggested the simple, relatively low-tech solution; "Just add 50000". RAISERROR to rethrow and subtract 50000 at the client. Unless you have already defined user errors in the range that can't be rethrown + 50000 (up to 13000), this should work fine. Thanks for the suggestion.

There's an interesting article that caught my attention in this month's (July 2006) MSDN magazine by John Mollman about building the MSDN aggregation system. The system uses SQL Server 2005 Service Broker for reliable messaging, activation procedural written in T-SQL, SQLCLR for interactions with the world outside the database and the XML data type and schema support built in to SQL Server.

I found the use of SQL Server Service Broker because of its scalability potential, the data dependent routing implementation, and the plug-in object and interface based provider model most interesting. Check it out.

I've been going over how errors are raised in SQLCLR against the SQL Server 2005 RTM version. If you remember (search the blog on "6522" if you don't) how errors made their way from the managed to the unmanaged stack evolved throughout the beta. My goal in investigating this was to determine if you can make SQLCLR acts the same way that T-SQL acted with respect to errors. Then, you could replace T-SQL functions and procedures with SQLCLR without changing every caller.

6522 (that's a general error that means "an error in the SQLCLR stack occurred) is still with us. You get one if there is an unhandled exception in ExecuteNonQuery or ExecuteReader inside a SQLCLR proc.If you want this error to go away and want only the original SQL error (e.g. 547 referential integrity error) to be returned to the caller, the only way to do this is to use SqlPipe.ExecuteAndSend in your SQLCLR procs. If you have no catch block, both the 6522 and the "original" (e.g. 547) error will be returned. If you're called from client code, the error number is e.g. 547, the 6522 follows afterward. If you want to "lose" the 6522 altogether use ExecuteAndSend and use a dummy try-catch block in your SQLCLR code. It would look like this:

try {
...
...
SqlContext.Pipe.ExecuteAndSend(cmd)
}
catch { // no code here
}

In this case you only get the original error e.g. the 547. Note that if you use SqlCommand's ExecuteNonQuery or ExecuteReader with the dummy try-catch concept, you lose the error entirely.

Bear in mind that if you use T-SQL's TRY-CATCH in SQL Server 2005, there is no way to raise *exactly* the original error either. You can come close, but the is no RETHROW. You can use RAISERROR, but RAISERROR doesn't work with system errors. You can reformat the original error message in a user error.

Upshot of this is that if you convert to either SQLCLR or T-SQL TRY-CATCH and depend on seeing the original error number at the caller (normally the case even if you have error handling in your procs), you're going to be changing the callers' code. Or using only ExecuteAndSend.

Transact-SQL has been around for a lot longer, the list will be a bit smaller. But the T-SQL enhancements in 2005 were numerous but, of course, left some folks wanting more. As always.

More error handling improvements - TRY-CATCH is VERY nice. For next time, how about FINALLY and RETHROW? The ability to rethrow system errors, rather than having to convert them to user-defined errors would be nice too. The error levels and semantics could be made more consistent too, as Erland Sommarskog's dissertation on the subject points out. A personal request would be to implement some of the ANSI SQL error handling constructs, or subtly change the syntax to come closer. But no one doesn't like TRY-CATCH over @@ERROR, that I'm aware of.

An option for catalog time object resolution for procedural code. It would make sysdepends more dependable too.

Subselect support for the IN clause of PIVOT.

BEFORE TRIGGERs. I didn't ever use them much when I worked in [database that had them] but it would at least help conversion.

UPSERT (insert or update in a single statement).

When using the new OUTPUT clause, a way to get row values before or after triggers fire. You only get "before" now.

ANSI DOMAIN support.

Support for SQL/MED syntax (well, it IS a wish list, right?)

Explicit windowing functions. The new Windowing has inline only.

Support for "caching part of a rowset in memory". I'm unsure what the formal name for this is, all I keep hearing is that database [some popular database] has it.

Less restrictions on what can SQL be in INDEXED VIEWs. There is a set of restrictions that are similar for INDEXED VIEWs, Query Notifications (obvious why) and Recursive CTEs. Probably an engine optimization. The one that folks seems to ask for most is UNION support in these.

Ability to debug T-SQL procedural code in SSMS.

Using FOR XML with hierarhical CTEs. Didn't know where to put this, here or part 1.

SQLCLR is next, while I'm on a roll.

 

 

It's just a few weeks until the launch of SQL Server 2005. I'm been hiding under a rock since returning from Hong Kong. TechEd there was a great time, biggest attendence they've ever had. Security and SQL Server 2005 were the hot topics. Yes, even after PDC in September.

It's been over three years since I got my first look at SQL Server 2005, in September 2002. I knew then it was a major undertaking; kudos to everyone who worked on it so hard for so long. I've been teaching developer topics (its hard to make a hard distinction, for example, is snapshot isolation a developer or administrator topics?) for over 2 years, since the first class of 45-or-so Microsoft folks up in Redmond. Everyone I've presented features too has been really excited. Or pretended to be excited because I was.

They're already talking about the next version and Micheal Rys (of XQuery and XML data type fame) was asking what folks would like in the next version. Thought I'd take a shot at it, at least from the developer perspective. No better time...so here goes. I'll do multiple blog entires, starting with infrastructure. It's a big WISH list, I don't expect all of them. Or even most of them. Mostly, they are customer wishes, but some are mine too.

In this arena, we could use Row-level security. It was in beta 1 for a while, and folks always asked about it. Certain business sectors require it.

Next, true ANSI DATE and TIME data types. We had them too, for a while, but they were written using SQLCLR. Some folks didn't appreciate that, they were cut. They're needed to provide a conversion path from databases that use them. Like [main competitors names go here]. Frankly it didn't bother me as much as most *how* they were implemented just that we had them. I'm still looking for the promised source code so we could compile 'em and use 'em ourselves. Whatever happened to that? You could implement them yourself using SQLCLR, but a supported version would be better.

A hint/set of options so you could keep the default READ COMMITTED locking behavior, but use READ COMMITTED versioning (aka statement level snapshot) through a hint. Oh, and while we're at it, a separate “rollback database“ instead of using TEMPDB for old versions when using versioning.

Finally, some folks asked about DML event notifications (using Broker), FILESTREAM data type, higher capacity MAX data types, and there's a set of people who always ask about bitmapped indexes. You know who you are...

XML/XQuery wishes are next.

About a week or two ago, there was a LONG discussion on the SQLCLR beta newsgroup about the fact that the IsNull property that you use on CLR UDTs won't return TRUE or FALSE inside the server. It returns FALSE or NULL. Turns out that, although you use this property to *indicate* to the engine that your instance is NULL (database NULL, not null reference/value), the engine will optimize things by storing the fact that your instance is NULL. And so, a method called on a NULL instance yields NULL.

This was posted by the SQL CLR team here with a workaround if you *really* wanted this to work right, even inside the server. The workaround was to decorate your IsNull get method with:

[SqlMethod(OnNullCall=true)]

I tried this, it didn't work any better. But it WILL work if you use the correct field on the SqlMethod attribute. It's

[SqlMethod(InvokeIfReceiverIsNull=true)]

The difference is OnNullCall indicates whether a method will be called if any of its input parameters are NULL. This (OnNullCall=false) allows you to use non-SqlTypes as method parameters in your .NET code and not crash if someone passes in a NULL value. InvokeIfReceiverIsNull indicates whether the method will be called if the instance of the class itself is NULL. Obviously, not null class (you can't call a method on a null reference, for example), but database NULL.

This does work as advertised:

CREATE TABLE UDTTab (theUDT sometype);
go
INSERT UDTTab VALUES(NULL);
go
SELECT COUNT(*) FROM UDTTab where theUDT IS NULL
SELECT COUNT(*) FROM UDTTab where theUDT.IsNull = 1
go

both counts return 1.

I suppose its much easier "best practice" to remember is always use the SQL IS NULL in SQL statements. Because it's FASTER. They don't have to instanciate all those NULL UDT instances, just to confirm that IsNull is, indeed, true. And mark your “get” method for those who forget.

I commonly do a demo when teaching SQL Server 2005 where I write a SQLCLR UDF that's returns the string "Hello World". The define it, sans VS autodeploy, like this.

CREATE FUNCTION somefunc()
RETURNS NVARCHAR(4)
AS EXTERNAL NAME MyAssembly.MyClass.MyFunction

When invoked, it returns "Hell", silently truncating the string the CLR sent it. UNTIL Apr CTP. Now its returns "Truncation Exception". Surprise, surprise... After reporting this as a bug, I was told that "That's the way its supposed to work. To prevent silent data loss."

I agree. Except now its works differently than this T-SQL function:

CREATE FUNCTION somefuncT()
RETURNS NVARCHAR(4)
AS
BEGIN
  RETURN N'Hello World'
END

Correct again, it does work differently. The idea is the T-SQL one still silently truncates data, but the SQLCLR one does "the right thing". The T-SQL one still works the way it does for backward compatibility only. Maybe in the next release they'll work the same.

So what's your preference, backward compatibility or lack of silent data loss (and exceptions)? Just curious...

With April CTP came the new combined managed provider replaces System.Data.SqlServer with a new improved, works in-process or out, System.Data.SqlClient. I just call it "the combined provider" now. People that didn't work on the betas will look at me funny when the product RTMs; "was there ever anything other than System.Data.SqlClient?". Why yes, Virginia...

With the new provider, some of the error handling problems passing SQLCLR errors back to T-SQL resurfaced. Some work-arounds didn't work-around the same way. Some people noticed this on the newsgroups. I reported a bug on first day, but didn't want to be too "complain-y" here. And noticed what work-arounds (AKA coding practices so that things work right) still work.

The most severe problem was that if you tried to catch a SQLCLR error with a dummy try-catch block in your CLR code, AND executed your SQLCLR code inside a T-SQL TRY-CATCH, you got:

Msg 0, Level 11, State 0, Line 0
A severe error occurred on the current command. The results, if any,
should be discarded.

Oh. That was in SSMS. In SQLCMD you got nothing. No error from the CATCH, no results.

Bug is reported as fixed today. Cool. Although it was the fourth CTP after beta2, this was first *ever* release of the combined provider. They'll iron it out. Can't wait to try it in next CTP. Then I'll write about it.

BTW, transactions are MUCH improved in the new combined provider. Not only can you use System.Transactions (try rolling back in that trigger with Transaction.Current.Rollback() now), but using BeginTransaction and nesting transactions in nested stored procedures works exactly like it does in T-SQL. Excellent.

I'm still assimilating little tidbits of information about the Nov-Dec CTP build. Put this one in your pubs database in the CTP:

create procedure deletejob (@id int)
as
begin try
 delete jobs where job_id = @id
end try
begin catch
 -- you knew about these
 print error_number()
 print error_message()
 print error_severity()
 print error_state()
 -- these are undoc'd but work in the CTP
 print error_line()           -- line number
 print error_procedure()  -- procedure name
end catch
go

execute deletejob 2
go

OK, I couldn't help myself. The Dec 2004 CTP-specific readme had quite a number of things that are new. There are many improvements in SSIS, for example. The “real“ readme and the known issues file (sqlki.chm) still have July dates on them. But there are always a few little things I have queued up to try with a new build. Here's some that took 5 minutes after installing to test:

1. "Smiley face" XQuery comments still don't work. {-- comment --} still does.
2. INTERSECT and EXCEPT still not there. They are however, still in BOL.
3. FOR XML still does not work with UDTs. However, convert UDT to XML type DOES.

create table ctab (cnum complexnumber)
insert ctab values('1:1i')
insert ctab values('2:2i')

-- works! 2 rows returned with nicely formatted XML in them
select convert(xml, cnum) from ctab

-- fails, Msg 6865, FOR XML does not support CLR User Defined Types
select * from cnum for xml auto

I wonder why one works and not the other. Should use almost the same codepath for serializing UDT. Oh well. That's all for now. More later.

There's a new Community Technology Preview of SQL Server 2005 available on MSDN for universal subscribers. I'm sure you've all heard of it by now, I was out of the country last week with limited bandwidth and just downloaded mine yesterday. I was meandering through the BOL, looking for interesting things, when it dawned on me that the BOL title bar read "Microsoft SQL Server 2005 Beta 3". Oh.

I'd heard rumors that the SQL set operations INTERSECT and EXCEPT might be added in beta 3. So I went to my handy BOL Search tab (it took some time to get used to Search being a Tab in the main page), and entered INTERSECT. They were doc'd! Cool.

Brought up SQL Server Management Studio and hammered out my example tables for the test. Then entered:

SELECT id FROM table1 INTERSECT SELECT id FROM table2

drumroll...answer is: "Incorrect syntax near the keyword 'intersect'."

Darn...maybe next CTP release. Either the rumor was mistaken or BOL is just a wee bit ahead of the implementation/testing. It happens.

For more on INTERSECT and EXCEPT check out the BOL. Or your E.F. Codd or ANSI SQL-92 book. Looks like it will be coming. For those of us that remember set theory (I studied it in the public school system in the 60s in 4th grade, does this fact date me?) it's how sets "always worked". For SQL relational calculus afficiandos we've been missing these keywords, although it was relatively straightforward to get the same results with using EXISTS and NOT EXISTS. Yet another feature from the standard that's implemented. I'm sure there's more to it than that, such as interesting questions like "can I use INTERSECT and EXCEPT with indexed views? or in common table expressions? or updateable ADO recordsets?"

So in this version of SQL Server there's the complement of relational calculus operations in T-SQL. And with .NET procedural code, you can whip up UDFs with formulas from differential and integral calculus that run fast, too. [sorry. couldn't help it]

About a month ago, I'd written about unhandled errors in .NET procedural code always causing error 6522 to be returned to the (T-SQL) caller (see "Error 6522 and You"). Because the "real" error (the one that caused the .NET code exception) is wrapped in a 6522 error, retrieving the value of @@error will always give you 6522, not the real error. The canonical example is:

public static void Proc1
{
 SqlCommand cmd = SqlContext.GetCommand();
 // causes error 547 - reference constraint 
 cmd.CommandText = "delete authors where au_id like '1%'";
 SqlContext.GetPipe().Execute(cmd);
}

Called from T-SQL:

execute proc1
select @@error

The error you get is:

Msg 6522, Level 16, State 1, Procedure CauseError, Line 1
A .NET Framework error occurred during execution of user defined routine or aggregate 'CauseError':
System.Data.SqlServer.SqlException: DELETE statement conflicted with REFERENCE constraint 'FK__titleauth__au_id__0AD2A005'. The conflict occurred in database 'pubs', table 'titleauthor', column 'au_id'.

@@error = 6522

This week one of the students in SQL Server 2005 class discovered a neat workaround. Wrap the .NET code in a try-catch block and do nothing in the catch block. Like this:

public static void Proc1
{
 try
 {
   SqlCommand cmd = SqlContext.GetCommand();
   // causes error 547 - reference constraint 
   cmd.CommandText = "delete authors where au_id like '1%'";
   SqlContext.GetPipe().Execute(cmd);
 }
 catch { // dummy catch block }
}

This produces the expected error. And the expected value of @@error:

Msg 547, Level 16, State 0, Line 1
DELETE statement conflicted with REFERENCE constraint 'FK__titleauth__au_id__0AD2A005'. The conflict occurred in database 'pubs', table 'titleauthor', column 'au_id'.

@@error = 547

I'll have to try this in some other example exceptions, but it seems to do the trick in this one. Thanks to Bertil Syamken for the suggestion.

After having been immersed in .NET since the alpha version, I'd lost touch with how many products use OLE DB and ODBC. After all, when you talk about SQL Server and SQL 92/99/2003 standards, the standard API for a SQL database *is* still SQL CLI (call-language interface). And the canonical implementation of SQL-CLI is ODBC. When I spoke recently about SQL Server 2005 Web Services as a way to acheive interop between unlike client architectures for SQL Server, someone reminded me that ODBC drivers for "unlike architectures" have existed for years. Uh, yes, of...course.

Parts of SQL Server 2005 like linked servers, replication, reporting services, and DTS (renamed SQL Server Integration Services (SSIS) last week) use OLE DB and ODBC as well. Although SSIS supports ADO.NET too.

SQL Server 2005 ships with a new communication layer known as SNI and a new OLE DB provider and new ODBC driver that use it. These are bundled together in a part of the product that runs on both server and client called SQL Native Client (or SNAC for short). Folks that have existing non-.NET applications are *really* interested in:

1. How the SNAC provider/driver supports the new data types
2. Any subtle differences between the new provider/driver and the current ones

Two weeks ago, someone asked about support of the new "MAX" data types VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX) in both SQLOLEDB/SQL Server ODBC and SNAC OLE DB/ODBC. For example, does VARCHAR(MAX) resemble VARCHAR or TEXT? So I took out my handy-dandy OLE DB Rowset Viewer and had a look. I looked at column metadata (DBSCHEMA_COLUMNS) and the  DBCOLUMNINFO structure (after a SELECT) of the following table:

CREATE TABLE testclob (
  avar VARCHAR(8000),
  amax VARCHAR(MAX),
  atext TEXT)

Answer is: VARCHAR(MAX) looks almost exactly like a TEXT data type (as far as the API is concerned of course; they're way different as far as TSQL is concerned), both in SQLOLEDB and the SNAC OLE DB provider. And both providers yield almost the same metadata. The big caveat there is the "almost". Here's the two almosts:

1. Both VARCHAR(MAX) and TEXT have the same capabilities/properties with respect ISLONG and MAYDEFER (ie, supports deferred fetch). But, in the Schema Rowset the character_max_length and octet_length is 2147483647 (2gb) for TEXT and 0 (that zero) for VARCHAR(MAX). Be careful using this to size to allocate buffers.

2. SNAC's DBCOLUMNINFO listed (maximum) ColumnSize as 4294967295 (4 gb?) rather than 2 gb as SQLOLEDB provider did.

I'd thought that SNAC's behavior was going to be identical to SQLOLEDB to ease migration, and it appears as though, except for the one metadata anomaly this is true. And I didn't really expect SNAC and SQLOLEDB to be exactly the same regarding *new* features; after all, that's the point of using the SNAC provider/driver, new feature support. There may be a few other subtle differences, more on this later.

What's error 6522? It's the error that you always get from an unhandled exception in a SQLCLR stored procedure. If I write a SQLCLR error that throws an unhandled exception (let's write one on purpose):

[SqlProcedure]
public static void ErrorExecute()
{
    SqlCommand cmd = SqlContext.GetCommand();
    // everyone knows Bob can't type
    cmd.CommandText = "select * from authorss";
    SqlContext.GetPipe().Execute(cmd);
}

using it from T-SQL:

execute errorexecute
print @@error

Yields:

Msg 208, Level 16, State 1, Line 0
Invalid object name 'authorss'.
Msg 6522, Level 16, State 1, Procedure ErrorExecute, Line 1
A .NET Framework error occurred during execution of user defined routine or aggregate 'ErrorExecute':
System.Data.SqlServer.SqlException: Invalid object name 'authorss'.
   at System.Data.SqlServer.Internal.StandardEventSink.HandleErrors()
   at System.Data.SqlServer.Internal.RequestExecutor.HandleExecute(EventTranslator eventTranslator, SqlConnection conn, SqlTransaction tran, ClrLevelContext clrlvlctxtPipe, UrtExecutionType uetType, InternalResultSetOptions irsoOpts, CommandBehavior eBehavior, Object& objResult)
   at System.Data.SqlServer.Internal.RequestExecutor.ExecuteToPipe(SqlConnection conn, SqlT.
6522

Not that @@ERROR returns 6522, NOT 208. If you write an equivalent bad T-SQL proc:

create procedure errortsql
as
-- cant type any better in TSQL
select * from authorss
go

execute errortsql
print @@error

returns 208.

Why does this matter? Say that I have a T-SQL stored procedure (not this one, obviously) that I want to replace with a SQLCLR equivalent that say, runs faster. Say the procedure is used in 20 places in my application, each with semantics that check for specific values of @@ERROR. Won't work the same.

You can't catch the exception in SQLCLR and throw the "correct" one (using a SqlCommand with CommandText of RAISERROR(...)) either, because you can't throw a 208 error (or any other system error) with RAISERROR. So I'll have to change each one of my caller procs too when I switch to SQLCLR. The only current workaround is to change to the new T-SQL try-catch syntax

BEGIN TRY
  EXECUTE errorexecute
END TRY
BEGIN CATCH
  -- prints 208, whew...
  PRINT CONVERT(varchar(10), error_number())
END CATCH

Hopefully this will be changed or a workaround in SQLCLR will be available in the next beta. I want @@ERROR to somehow return 208, not 6522. Or this will be a great motivator for everyone to change to T-SQL TRY-CATCH.

Theme design by Nukeation based on Jelle Druyts