
Friday, 8 August 2008

Tracking vs. Capturing Changes

Change Data Capture has by far been my favourite feature every time I think about SQL Server 2008. The really neat bit is something I missed during my learning process: there are two flavours to capturing changes, the change itself and the data that has changed. This is what distinguishes Change Tracking from Change Data Capture in SQL Server 2008.

Change data capture provides historical change information for a user table by capturing both the fact that DML changes were made and the actual data that was changed. Changes are captured by an asynchronous process that reads the transaction log and has a low impact on the system. When you want to stage data in logical blocks, such as in a website publishing engine or a clearing system, this feature can prove very useful, mainly because of the granularity of the changes that are captured and the way they are stored, with no coupling to the object whose changes are captured.

Change tracking, on the other hand, captures the fact that rows in a table changed, but not the data itself. Applications can determine which rows have changed, with only the latest row data available in the user tables. Change tracking is therefore more limited in the historical questions it can answer than change data capture, but for applications that do not require historical information there is far less storage overhead, because the changed data itself is not captured; it is the captured data which causes the database to grow. A synchronous tracking mechanism is used to track the changes and has been designed to have minimal overhead on DML operations.
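As a minimal sketch of what enabling change tracking looks like (the database and table names here are hypothetical), note that the CHANGETABLE function returns only the keys and change metadata of changed rows, not the old column values:

-- Sketch only: SalesDb and dbo.Customer are hypothetical names
ALTER DATABASE SalesDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Customer
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

-- The version the application saved after its previous synchronization
DECLARE @last_sync_version bigint = 0;

-- Only keys and change metadata come back; the current data stays in the user table
SELECT CT.CustomerID, CT.SYS_CHANGE_OPERATION, CT.SYS_CHANGE_VERSION
FROM CHANGETABLE(CHANGES dbo.Customer, @last_sync_version) AS CT;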

Either of these features can be used to synchronize applications or their database engines. Synchronization can be implemented in two directions: one-way and two-way.

One-way synchronization applications, such as a client or mid-tier caching application, can be built using change tracking. For example, a caching application requires data to be stored in the database and to be cached in other data stores. In this scenario the application must be able to keep the cache up to date with any changes made to the database, but there are no changes to pass back to the Database Engine.

In two-way synchronization, the data in the Database Engine is synchronized with one or more data stores. The data in those stores can be updated, and the changes must be synchronized back to the database. A good example of two-way synchronization is an occasionally connected application, such as a mobile application. In this type of application, a client application queries and updates a local store. When a connection is available between client and server, the application synchronizes with the server, and changed data flows in both directions. Two-way synchronization applications must be able to detect conflicts; a conflict occurs when the same data is changed in both data stores between synchronizations. With the ability to detect conflicts, an application can make sure that changes are not lost.

So my interpretation that change capture and change tracking meant the same thing proved to be wrong. This really useful feature can be put to use effectively and scalably by choosing the right flavour of change capture based on the needs and nature of your application. There is no denying that applications on SQL Server versions prior to 2005 will need a major overhaul if there is an existing tracking mechanism in place; that said, it is best not to underestimate the work of implementing Change Data Capture for an existing application. New applications, however, can base their designs around this feature and benefit rapidly.

On this note, a quick word to people who use log shipping: it is useful when batch processing of transactions is to be done at a regular frequency, but it is still limited in not being able to identify each transaction individually. There is no denying, however, that it is the best choice for disaster recovery.

Friday, 25 July 2008

Astoria Data Services – IIS + SQL Server ?

A few months ago I had the opportunity to preview Astoria Data Services at a NextGen User Group meeting, and I kept thinking I had seen this before, though I failed to recollect where at the time; read the following and you will see why I am pointing this out. At the time of the preview MS promised to add security to Astoria; basically all I could see was SQL queries on the browser URL / request URL.

When you install IIS on a machine, the Windows management console gives you an option to configure SQL XML support for IIS. Do this, then configure a virtual directory in IIS, which offers quite a decent list of security settings. Configure these, browse to a URL with a SQL query in it, and see what happens; there ain't much difference, is there?

I am not being cynical, but I want to point out that this is not new. We will have to wait and see what additional features Astoria offers before we can comment further.

Wednesday, 9 July 2008

Data Modelling Jazz

When we think of data modelling it is often a pretty picture created in Visio; for someone more serious about data modelling it is representing entities, attributes and relationships in a meaningful manner. I didn't realise that modelling languages are distinct things in their own right, and that a tool such as Visio supports/works on such languages. For example, IDEF1X (Integration DEFinition for Information Modelling) is a modelling language. Woah, that definition really woke me up, and as usual it was developed by the US Air Force, in 1985.

The primary tool of a database designer is the data model. It's such a great tool because it can show the details not only of single tables at a time, but also the relationships between several entities at a time. Of course it is not the only way to document a database:

• Often a product that features a database as its central focus will include a document that lists all tables, data types, and relationships. (Developers think: can't be bothered.)
• Every good DBA has a script of the database saved somewhere for re-creating it. (Developers think: still not bothered.)
• SQL Server's metadata includes ways to add properties to database objects to describe them; a quick sketch follows. (Developers by now would think: oh, get a life, will you.)
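For that last point, a minimal sketch using the extended property mechanism (the table and description here are made up):

-- Attach a human-readable description to a (hypothetical) dbo.Customer table
EXEC sys.sp_addextendedproperty
    @name = N'MS_Description',
    @value = N'One row per registered customer',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'Customer';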

Some common terms you will come across are entities, which are synonymous with tables in a database; attributes, which are synonymous with column definitions in a table; and relationships, which represent how two entities relate to each other. We represent these pictorially or grammatically in written text. Anyway, my idea in blogging about data modelling was to drop a few notes on practices we could adopt while modelling data.

  • Entity names: There are two ways you can go about these: plural or singular. Some argue table names should be singular, but many feel that the table name refers to the set of rows and should be plural. Whatever convention you choose, be consistent with it; mixing and matching could end up confusing the person reading the data model.
  • Attribute names: It’s generally not necessary to repeat the entity name in the attribute name, except for the primary key. The entity name is implied by the attribute’s inclusion in the entity. The chosen attribute name should reflect precisely what is contained in the attribute and how it relates to the entity.
  • Relationships: Name relationships with verb phrases, which make the relationship between a parent and child entity a readable sentence. The sentence expresses the
    relationship using the entity names and the relationship cardinality. The relationship sentence is a very powerful tool for communicating the purpose of the relationships to non-technical members of the project team (e.g., customer representatives).
  • Domains: Define domains for your attributes, implementing type inheritance wherever possible to take advantage of domains that are similar. Using domains gives you a set of standard templates to use when building databases that ensures consistency across your database.
  • Objects: Define every object so it is clear what you had in mind when you created a given object. This is a tremendously valuable practice to get into, as it will pay off later when questions are asked about the objects, and it will serve as documentation to provide to other programmers and/or users.

Wednesday, 18 June 2008

SQL Server 2008 RC0 is out

SQL Server 2008 RC0 seems, for some reason, to have been released quietly; I didn't realise it was out. It is available to download at SQL Server 2008 RC.

PS: SQL Server 2008 RC0 will automatically expire after 180 days

Thursday, 22 May 2008

RESEED

Just learnt a new T-SQL term. Although I knew this functionality was achievable in other ways, I hadn't come across this one before: RESEED resets the seed value of the IDENTITY column. However, SQL Server 2000 behaves differently on RESEED for virgin tables compared with 2005/2008.

To reset the seed value of the identity column of table_name to new_value, DBCC CHECKIDENT(table_name, RESEED, new_value) does the trick.

In SQL Server 2000, RESEED always increments the seed value, but on 2005/2008 it doesn't increment and starts from new_value.
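A quick sketch (the table here is hypothetical): after reseeding to 0 on a table that has already had rows, the next insert on 2005/2008 gets identity 1 (new_value + increment).

-- Hypothetical table with an identity column
CREATE TABLE dbo.ReseedDemo (ID int IDENTITY(1,1), Name varchar(50));
INSERT INTO dbo.ReseedDemo (Name) VALUES ('a');
INSERT INTO dbo.ReseedDemo (Name) VALUES ('b');
DELETE FROM dbo.ReseedDemo;

DBCC CHECKIDENT ('dbo.ReseedDemo', RESEED, 0);

-- On 2005/2008 this row gets ID = 1 (0 + increment)
INSERT INTO dbo.ReseedDemo (Name) VALUES ('c');
SELECT * FROM dbo.ReseedDemo;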

Tuesday, 20 May 2008

Storing Hierarchical Data - HierarchyID

In SQL Server 2000 we were limited by the 32-level recursion limit for T-SQL, and storing and querying hierarchical data in the form of trees was difficult and inefficient; we used cursors or temporary tables to write these queries, sacrificing simplicity, maintainability or performance. Of course we could bundle a bit of code into the data layer of the application to share the load, but this didn't solve the problem in reporting scenarios where the data processing had to be done on the database server.

SQL Server 2005 improved things with the introduction of CTEs, which were beautiful solutions for querying hierarchical data. I use the word beautiful because they looked nice at the outset, when you used them without knowing the limitations, and they worked well in development environments. For example, using the AdventureWorks database we can use a CTE to query employee-manager data as shown below.

WITH UpperHierarchy(EmployeeId, LastName, Manager, HierarchyOrder)
AS
(
   SELECT emp.EmployeeId, emp.LoginId, emp.LoginId, 1 AS HierarchyOrder
   FROM HumanResources.Employee AS emp
   WHERE emp.ManagerId IS NULL
   UNION ALL
   SELECT emp.EmployeeId, emp.LoginId, Parent.LastName, HierarchyOrder + 1
   FROM HumanResources.Employee AS emp
        INNER JOIN UpperHierarchy AS Parent
              ON emp.ManagerId = Parent.EmployeeId
)
SELECT *
FROM UpperHierarchy

Although this decreased the complexity of writing queries, their performance was still challenged on large databases. Optimisation of CTE execution plans did improve things, but as in any database situation the optimizer is handicapped without indexing capabilities; indexes reduce the load of the query, increasing performance and scalability. In addition, in SQL 2005 the underlying storage structure was still something users had to design to suit their requirements. This gets better with the introduction of a new managed SQL CLR data type called HierarchyID in SQL Server 2008, available ready to use in the database server now. If you look back and remember the introduction of the CLR into SQL Server in Yukon, you will appreciate this feature even more and how it has panned out.

This data type does not store the identifier of the parent element but a set of information to locate the element in the hierarchy; the type represents a node in the tree structure. If you look at the values contained in a column of HierarchyID type, you realise that they are binary values. It is extremely compact and supports arbitrary inserts and deletions. As per MS, a node in an organizational hierarchy of 100,000 people with an average fanout of 6 levels takes about 38 bits, rounded up to 40 bits, or 5 bytes, of storage. Because it stores the element's position in the hierarchy in its entirety, it is also indexable now.

As always there are a few limitations

  • It can hold up to 892 bytes, but to be honest that should allow you to span out into a really big tree structure under a single node.
  • A query with the FOR XML clause will fail on a table with a HierarchyID column unless the column is first converted to a character data type. Use the ToString() method to convert the HierarchyID value to its logical representation as an nvarchar(4000) data type.

We can represent the HierarchyID type in a string format, which clearly shows the information carried by this type. The string representation is formatted as: /<index level 1>/<index level 2>/…/<index level N>. This representation corresponds to a tree structure. Note that the first child of a node does not always have a value of 1; it can have the value /1.2/. So, to play around a bit, we first need a table with a column of HierarchyID type and an index on that column.

CREATE TABLE Organization
(
    EmployeeID hierarchyid NOT NULL,
    EmployeeName nvarchar(50) NOT NULL
)

ALTER TABLE dbo.Organization
ADD HierarchyLevel AS EmployeeID.GetLevel()

CREATE INDEX IX_Employee
ON Organization(HierarchyLevel, EmployeeID);

To populate the table we can reuse the CTE mentioned earlier, extended to emit a hierarchyid path column (here called Node; a sketch follows), with the SELECT statement modified like the following:

Insert Into dbo.Organization(EmployeeId, EmployeeName)
Select Node, LastName 
 From UpperHierarchy
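Note that the CTE shown earlier does not actually produce a hierarchyid column, so a path string has to be built up during the recursion and then converted. A minimal sketch, assuming the same AdventureWorks columns, might look like this:

WITH Paths(EmployeeId, LastName, Node)
AS
(
   -- The root gets the path '/'
   SELECT emp.EmployeeId, emp.LoginId, CAST('/' AS varchar(300))
   FROM HumanResources.Employee AS emp
   WHERE emp.ManagerId IS NULL
   UNION ALL
   -- Children append their own id to the parent's path
   SELECT emp.EmployeeId, emp.LoginId,
          CAST(Parent.Node + CAST(emp.EmployeeId AS varchar(10)) + '/' AS varchar(300))
   FROM HumanResources.Employee AS emp
        INNER JOIN Paths AS Parent
              ON emp.ManagerId = Parent.EmployeeId
)
INSERT INTO dbo.Organization(EmployeeID, EmployeeName)
SELECT CAST(Node AS hierarchyid), LastName
FROM Paths;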

Hierarchical data can now be queried using the type's methods.

The HierarchyID data type can be manipulated through a set of methods: GetAncestor, GetDescendant, GetLevel, GetRoot, ToString, IsDescendant, Parse, Read, Reparent and Write. For details on these please refer to the CTP documentation of Katmai, but most of them are self-explanatory.

For example, to see how we use these methods, let us insert a node as the last child of an existing node. To do this we first retrieve the last sibling node.

DECLARE @Parent hierarchyid, @sibling hierarchyid, @Name nvarchar(50)
-- @Parent and @Name are assumed to have been set beforehand
--finding the last sibling node
SELECT @sibling = Max(EmployeeID)
FROM dbo.Organization
WHERE EmployeeID.GetAncestor(1) = @Parent;
--inserting the node
INSERT dbo.Organization(EmployeeID, EmployeeName)
VALUES(@Parent.GetDescendant(@sibling, NULL), @Name)

We do not always want to (or cannot) retrieve the sibling node to perform an insertion; there may be an implied policy that determines node position. For example, let's say we have an [order] column which positions nodes among their siblings. We can compute the node path as a string: in this example, since the node @Parent is the root, that gives /<order>/. Thanks to the Parse() function, we can use this value to create the new node.

-- [Order] stands in for the ordering value described above
DECLARE @Parent AS hierarchyid = hierarchyid::GetRoot()
DECLARE @NewPath AS varchar(10) = @Parent.ToString() + CAST([Order] AS varchar(3)) + '/'
INSERT dbo.Organization(EmployeeId, EmployeeName) VALUES(hierarchyid::Parse(@NewPath), 'aChild')

You will have noted the new SQL Server 2008 syntax to declare and assign variables in a single line; :: denotes a static method on a SQL CLR type in T-SQL. So what am I getting at? Finally, the CTE is not so much of a beauty anymore; just run this query to see what it returns.

DECLARE @BossNode hierarchyid
-- point @BossNode at the root (or any node in the tree)
SELECT @BossNode = EmployeeID FROM dbo.Organization WHERE HierarchyLevel = 0

SELECT *
FROM dbo.Organization
WHERE @BossNode.IsDescendant(EmployeeID)

If you run this query alongside the CTE query and compare the execution plans, you will see why this new feature is being talked about :)

Monday, 19 May 2008

SQL Profiler for 2005 Express Edition

I was trying to see how I could profile SQL 2005 Express Edition, as its Management Studio does not include a profiler. I found one at the following location; it is a very useful and quite lightweight tool: http://code.google.com/p/sqlexpressprofiler/downloads/list

Friday, 16 May 2008

NEW INSERT STATEMENT

In SQL 2008 some new row value constructors have been added. We are familiar with the INSERT statement, which has been around for ages, the ANSI way:

INSERT INTO Table1 (column1 ,column2 ,... columnN) VALUES (value1,value2,....valueN)

Another way of inserting a single row of data is as follows

INSERT INTO Table1 SELECT value1,value2,....valueN

Similarly for multiple rows of data

INSERT INTO Table1 
SELECT value1,value2,....valueN
UNION SELECT value1,value2,....valueN
UNION SELECT value1,value2,....valueN

Now the new ROW VALUE CONSTRUCTOR allows the following to add multiple rows of data

INSERT INTO Table1 (column1, column2, ... columnN) VALUES
(value1, value2, ... valueN),
(value1, value2, ... valueN),
(value1, value2, ... valueN),
(value1, value2, ... valueN),
(value1, value2, ... valueN)

We normally use INSERT statements in stored procedures, passing parameters to the stored procedure, so the above row value constructor is not always useful there; it seems the use of a table valued parameter to insert multiple rows of data is a better option in SQL Server 2008.

Databases and Software Development

Database development is a form of software development, and yet all too often the database is treated as a secondary entity when development teams discuss architecture and test plans; many developers do not seem to believe, understand, or even feel the need to understand, that standard software development best practices apply to database development. Virtually every application imaginable requires some form of data store, and many in the development community go beyond simply persisting data, creating applications that are data driven. Given this dependency, data is the central factor that dictates the value any application can bring to its users; without the data, there is no need for the application.

The very reason we use the word legacy to refer to an application as young as three years old is, more often than not, the database. As applications grow with new features, the thought developers put into refactoring front-end code or developing new code is not put into database development. What happens, essentially, is a situation where your front end is two years ahead of the database, and as time progresses your application's capabilities and features get pulled back by the limitations of database quality and standards. We can all deny this, but in reality it results in either a limitation or the extra cost of creating workarounds.

The central argument on many a database forum is what to do with that ever-present required logic. Sadly, try as we might, developers have still not figured out how to develop an application without the need to implement business requirements, and so the debate rages on. Does "business logic" belong in the database? In the application tier? What about the user interface? And what impact do newer application architectures have on this age-old question?

In recent times I have been able to look at technologies like Astoria, LINQ, the Entity Framework and Katmai. I was amazed at how little or no database code needs to be written by a developer doing UI or business layer development. At the same time, being a SQL Server fan myself, I was worried that my database skills would slowly vaporise into thin air. Hmm, that's not as bad as it sounds. All these new technologies are abstractions of the database and allow developing a logical data layer which maps to a physical database, so developers who work primarily at the UI or business layer level will slowly stop writing any SQL at all, instead churning out code that interests them against a logical data layer; what contradicts this slightly is the need to learn a new syntax in the form of LINQ. On the other hand, database development will shift towards being a specialist job: the design, development and management activities of the database developer and administrator will emerge as specialist skills, as opposed to generalist skills, in the near future. The future of the database specialist seems bright, but what remains to be seen is how organisations look at this shift in software development ideology. To be fair, this model of specialists and generalists is not new.

Thursday, 15 May 2008

TSQL - GROUP BY and ALL


We have all used GROUP BY in our SQL queries to group data and get result sets with aggregates from SQL Server. But before I explain, let's create a temporary table of order details using the Northwind database on SQL 2000.

I will just join the Orders and [Order Details] tables in Northwind to get the data I need into a temporary table, as shown below:

SELECT 
  O.OrderID, 
  ProductID, 
  UnitPrice, 
  Quantity, 
  (UnitPrice*Quantity) AS Amount,
  CustomerID
INTO 
  #tempOrders
FROM 
  Orders O 
INNER JOIN
  [order details] D 
ON
  O.[orderid] = D.[orderid]
ORDER BY
  ProductID

So now I have a table called #tempOrders with the order details I need.

Now suppose I'd like to see the customers that were sold Product #1, along with the total amount that they spent. I will use a query with a GROUP BY clause, as below, with a WHERE condition to filter records:

SELECT 
   CustomerID,
   SUM(Amount) AS TotalAmount
 FROM
   #tempOrders
 WHERE 
   ProductID = 1
 GROUP BY 
   CustomerID


Now, let's say that I'd like to see all customers that have been sold any products, but we still just want to see the "TotalAmount" for ProductID #1. For customers that have never ordered ProductID #1, it should output a "TotalAmount" value of 0. One way to do this is with a CASE expression as shown below

SELECT 
  CustomerID,
  SUM(CASE WHEN ProductID = 1 THEN Amount ELSE 0 END) AS TotalAmount
FROM
  #tempOrders
GROUP BY 
  CustomerID

This returns customers who haven't purchased Product #1 with a total of 0. In situations like these the SUM(CASE...) expression can be replaced with GROUP BY ALL.

SELECT 
    CustomerID,
    ISNULL(SUM(Amount), 0) AS TotalAmount
  FROM
    #tempOrders
  WHERE 
    ProductID = 1
  GROUP BY ALL CustomerID

Rows that are excluded from the aggregation by the WHERE clause have NULL values returned; the ISNULL function makes sure all customers who haven't ordered Product #1 get a total of 0 instead of NULL. The ALL option basically says "ignore the WHERE clause when doing the grouping, but still apply it for any aggregate functions". So, in this case, the WHERE clause is not considered when generating the population of CustomerID values, but it is applied when calculating the SUM. This is very much like our earlier solution, where we removed the WHERE clause completely and used a SUM(CASE...) expression to conditionally calculate the aggregate.

GROUP BY ALL is kind of obscure and neat to know, but not really useful in most situations, since there are usually easier or better ways to get this result; a sketch of one follows. It won't work if we want all customers to be displayed, since a customer must have at least one order to show up in the result. Another limitation is that we cannot use GROUP BY ALL if we want to return a grand total for all orders.
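For example, a derived-table LEFT JOIN gives the same result as the GROUP BY ALL query above without the obscurity; a sketch against the same #tempOrders table:

SELECT
  c.CustomerID,
  ISNULL(p.TotalAmount, 0) AS TotalAmount
FROM
  (SELECT DISTINCT CustomerID FROM #tempOrders) AS c
LEFT JOIN
  (SELECT CustomerID, SUM(Amount) AS TotalAmount
   FROM #tempOrders
   WHERE ProductID = 1
   GROUP BY CustomerID) AS p
ON c.CustomerID = p.CustomerID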

Tuesday, 13 May 2008

PIVOT and UNPIVOT

There may be situations while writing SQL code where we would like totals or aggregates laid out alongside another column, e.g. the number of points scored by a sales person with their sales, as well as separate sales by month or year.

So let's first create a table which represents sales data, as below:

CREATE TABLE #tempTest
(
  empName VARCHAR(25),
  noOfSales INT,
  [monthName] VARCHAR(3),
  Points INT
)

Just add a few rows of data to this table

INSERT INTO #tempTest VALUES('James',2,'Jan', 2);
INSERT INTO #tempTest VALUES('Jason',1,'Jan', 1);
INSERT INTO #tempTest VALUES('Mark',3,'Jan', 3);
INSERT INTO #tempTest VALUES('Mark',1,'Jan', 1);
INSERT INTO #tempTest VALUES('James',1,'Feb', 1);
INSERT INTO #tempTest VALUES('James',2,'Feb', 2);
INSERT INTO #tempTest VALUES('Jason',3,'Feb', 3);
INSERT INTO #tempTest VALUES('Jason',2,'Feb', 2);
INSERT INTO #tempTest VALUES('Mark',1,'Mar', 1);
INSERT INTO #tempTest VALUES('James',2,'Mar', 2);
INSERT INTO #tempTest VALUES('Jason',2,'Mar', 2);

Now, to show how many sales each sales person makes per month, we can use PIVOT to achieve this easily:


SELECT
  empName,
  [Jan] AS Total_Sales_Jan,
  [Feb] AS Total_Sales_Feb,
  [Mar] AS Total_Sales_Mar
FROM
  (
    SELECT
      empName, noOfSales, [monthName]
    FROM
      #tempTest
  ) AS Source PIVOT (SUM(noOfSales) for [monthName] IN ([Jan],[Feb],[Mar]) )AS PivotTable

This query produces the following result:

[Image: Results Pivot 1]

Please note that PIVOT can have only one non-pivoted column. The example below shows how PIVOT can be used incorrectly:

SELECT
  empName,
  SUM([Jan]) AS Jan,
  SUM([Feb]) AS Feb,
  SUM([Mar]) AS Mar,
  SUM(Points) AS Points
FROM
  (
    SELECT
      empName,
      noOfSales,
      [monthName],
      Points
    FROM
      #tempTest
  ) AS Source PIVOT (SUM(noOfSales) for [monthName] IN ([Jan],[Feb],[Mar]) )AS PivotTable
GROUP BY
  empName

[Image: Results Pivot 2]

Note that the results are incorrect for Points; they don't sum up correctly for each sales person as expected, because you cannot use more than one unpivoted column.

UNPIVOT is the exact opposite of PIVOT. Although its usage is likely to be minimal, there are occasions where it may be used; a sketch follows.
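As a sketch of the opposite direction (the #pivoted table here is hypothetical, holding one row per sales person with Jan/Feb/Mar columns, as produced by the first PIVOT query):

SELECT empName, [monthName], noOfSales
FROM
  (
    SELECT empName, [Jan], [Feb], [Mar]
    FROM #pivoted
  ) AS p
UNPIVOT
  (noOfSales FOR [monthName] IN ([Jan], [Feb], [Mar])) AS u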

Friday, 9 May 2008

Deleting Duplicates using CTE's

While writing stored procedures and dealing with temporary data within the scope of a stored procedure, we come across situations where our temporary data has duplicate records and we want to get rid of the duplicates. What we normally do in SQL 2000 is create a new temporary table and select distinct records into it from the temporary table which has duplicates, then select the data from the new, cleaned-up table.

This is evidently not very performant. From SQL 2005 onwards we have been able to do this efficiently with CTEs, as below.

Deleting from the CTE actually changes the underlying table, so be careful how you set up your CTE; you could have some unintended deletes without the right logic.

CREATE TABLE #itemtable(
  Item_Code varchar(10),
  Item_Name varchar(100)
)

  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('1','Item 1');
  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('2','Item 2');
  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('3','Item 3');
  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('3','Item 3');
  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('4','Item 4');
  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('5','Item 5');
  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('6','Item 6');
  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('6','Item 6');
  INSERT INTO #itemtable(Item_Code, Item_Name) VALUES ('7','Item 7');

GO

  SELECT * FROM #itemtable;

  WITH Duplicates AS
  (
    SELECT
      *,
      row_number() OVER(PARTITION BY Item_Code ORDER BY Item_Code) as RowNum
    FROM
      #itemtable
  )

  DELETE FROM Duplicates WHERE RowNum > 1;

SELECT * FROM #itemtable;

Wednesday, 7 May 2008

Table Valued Parameters

In recent times I haven’t written any complex T-SQL code that needed to scale; possibly a scenario you may come across while working on a reporting application, where most of the processing is done on the database server using T-SQL. In SQL 2005 user defined types were available, but in Katmai this has been taken further to allow table valued parameters to be passed to stored procedures; we can think of them as instances of user defined table types. At the end of the day it boils down to being able to write more efficient and more performant code, but this feature does seem to have a limitation at the moment.

The table data type allows variables that are, in effect, tables. They also allow you to return tabular data from functions without doing a SELECT at the end of the function. That's great, but don't temporary tables do the same? Yes and no. While table variables and temporary tables are similar, there are several key differences that determine which option you use.
For example, consider the sample below in SQL 2005 for adding orders to a database table using T-SQL: I need a stored procedure which takes each column in a row as a parameter, and moreover each row needs a round trip to the database server.

CREATE PROC sp_add_order(@order_id int, @customer_id int, @order_total money)
AS
INSERT INTO [Order]
(
order_id,
customer_id,
order_total
)
VALUES
(
@order_id,
@customer_id,
@order_total
)
GO

Now, with SQL 2008, we are able to insert several rows of data with just one round trip to the server by using a user defined table type, here called OrderType. That is efficient and quite performant…

CREATE TYPE OrderType AS TABLE
(
[order_id] int NOT NULL,
customer_id int NOT NULL,
[OrderTotal] [money] NULL
)
GO

CREATE PROC sp_new_order (@order OrderType READONLY)
AS
INSERT INTO [OrderTest]
(
order_id,
customer_id,
[OrderTotal]
)
SELECT * FROM @order

RETURN
GO
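Calling the procedure is then a matter of declaring a variable of the table type, filling it, and passing it in a single round trip (the values here are made up):

DECLARE @orders OrderType;

-- One multi-row insert fills the table valued parameter locally
INSERT INTO @orders (order_id, customer_id, OrderTotal)
VALUES (1, 100, 25.00),
       (2, 101, 99.95),
       (3, 102, 12.50);

-- One round trip to the server for all three rows
EXEC sp_new_order @order = @orders;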

The one limitation to note here is the READONLY option on the parameter. At the moment the CTP only allows us to pass read-only table valued parameters; we are unable to use OUT parameters of a table valued type. It would be great to be able to do this, as we then wouldn't have to make nested calls to stored procedures in reporting scenarios. It remains to be seen how this pans out in the final release; although the feature doesn't look big, the sheer benefit of being able to use table valued parameters comes as a blessing for many T-SQL developers.

Tuesday, 6 May 2008

Spatial Data - Delivering Location based Intelligence


I have been on and off blogging recently due to my part-time course at the Uni, which has kept me busy. By the way, I did look through most of the new features in Katmai and actually did a small lunchtime session at work on it; one of the topics I spoke about in detail was spatial data support in Katmai.

SQL Server 2008 provides the geography (for geodetic spatial data) and geometry (for planar spatial data) data types. Both are .NET CLR types, just like the HierarchyID data type which I will post about in another blog soon. The presence of the CLR in SQL Server provides an avenue for some really new and exciting features. When the CLR was introduced on the server in Yukon I did think it was a bit too much to have in a database server, but now that some of the new features in Katmai have panned out as a result of it, it does all seem to make sense.

The geography data type provides a storage structure for spatial data which is ellipsoidal in nature, and is represented by latitude and longitude coordinates. Typical uses of this kind of data include defining roads, buildings, or geographical features as vector data that can be overlaid onto a raster-based map, basically any round earth data.

The geometry data type provides a storage structure for spatial data which is planar or in other words Euclidean in nature. This kind of data is commonly used for maps and interior floor plans where the curvature of the Earth does not need to be taken into account.

To create data of type geometry or geography we can create instances using input in the Well Known Text (WKT) or Well Known Binary (WKB) formats, both specified by the Open Geospatial Consortium (OGC) to permit geometry data to be exchanged between the client and SQL Server. The geometry data type provides properties and methods that are aligned with the OGC Simple Features Specification for SQL and enable you to perform operations on geometric data that produce industry-standard behaviour.

SQL Server 2008 increases the maximum size for CLR types in the database from the 8000 bytes limit that was imposed in SQL Server 2005 to two gigabytes (2 GB), which makes it possible to store extremely complex spatial data elements, such as polygons, which are defined by a large number of points. By storing spatial data in relational tables, SQL Server 2008 makes it possible to combine spatial data with any other kind of business data; this removes the need to maintain a separate, dedicated spatial data store and enables high performance queries that do not need to combine data from multiple external sources. Performance of queries against spatial data is further enhanced by the inclusion of spatial index support in SQL Server 2008.

Both spatial data types in SQL Server 2008 provide a comprehensive set of instance and static methods that you can use to perform queries and operations on spatial data. If we explore these two data types we find that Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon, GeometryCollection are all instances of spatial data types recognized by both geometry and geography data types.

To construct instances of a Point, MultiPoint, etc. from WKT format, Katmai provides functions as follows:

  • Point instance from WKT input: use STPointFromText
  • MultiPoint instance from WKT input: use STMPointFromText
  • LineString instance from WKT input: use STLineFromText
  • MultiLineString instance from WKT input: use STMLineFromText
  • Polygon instance from WKT input: use STPolyFromText
  • MultiPolygon instance from WKT input: use STMPolyFromText
  • GeometryCollection instance from WKT input: use STGeomCollFromText


To construct instances from WKB format, Katmai provides the following functions:

  • Point instance from WKB input: use STPointFromWKB
  • MultiPoint instance from WKB input: use STMPointFromWKB
  • LineString instance from WKB input: use STLineFromWKB
  • MultiLineString instance from WKB input: use STMLineFromWKB
  • Polygon instance from WKB input: use STPolyFromWKB
  • MultiPolygon instance from WKB input: use STMPolyFromWKB
  • GeometryCollection instance from WKB input: use STGeomCollFromWKB

There are also methods which allow developers to compare or determine the relationship between geometry instances, such as the ones listed below:

  • STEquals determines if two instances comprise the same point set
  • STDisjoint determines if two instances are disjoint
  • STIntersects determines if two instances intersect
  • STTouches determines if two instances touch
  • STOverlaps determines if two instances overlap
  • STCrosses determines if two instances cross
  • STWithin determines if one instance is within another
  • STContains determines if one instance contains another
  • STRelate determines if two instances are spatially related
  • STDistance determines the shortest distance between points in two geometries

An equivalent set of functions and methods is available for the geography type as well.
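A quick illustration of two of these methods with arbitrary coordinates (this also uses the new single-line declare-and-assign syntax):

DECLARE @p1 geometry = geometry::STGeomFromText('POINT (0 0)', 0);
DECLARE @p2 geometry = geometry::STGeomFromText('POINT (3 4)', 0);

SELECT @p1.STDistance(@p2) AS Distance;     -- 5
SELECT @p1.STEquals(@p2)   AS SamePointSet; -- 0 (false)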

To create and add spatial data

USE Adventureworks
GO

IF OBJECT_ID ( 'dbo.SpatialTable', 'U' ) IS NOT NULL DROP TABLE dbo.SpatialTable;
GO

CREATE TABLE SpatialTable
( id int IDENTITY (1,1),
GeomCol1 geometry,
GeomCol2 AS GeomCol1.STAsText() );
GO


INSERT INTO SpatialTable (GeomCol1)
VALUES (geometry::STGeomFromText('LINESTRING (100 100, 20 180, 180 180)', 0));

INSERT INTO SpatialTable (GeomCol1)
VALUES (geometry::STGeomFromText('POLYGON ((0 0, 150 0, 150 150, 0 150, 0 0))', 0));
GO
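Since spatial indexes were mentioned earlier, here is a sketch of indexing GeomCol1. It assumes we first give the table the clustered primary key a spatial index requires, and the bounding box values are arbitrary:

-- Spatial indexes require a clustered primary key on the base table
ALTER TABLE SpatialTable
ADD CONSTRAINT PK_SpatialTable PRIMARY KEY CLUSTERED (id);

CREATE SPATIAL INDEX SIdx_SpatialTable_GeomCol1
ON SpatialTable (GeomCol1)
USING GEOMETRY_GRID
WITH (BOUNDING_BOX = (xmin = 0, ymin = 0, xmax = 500, ymax = 200));
GO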


To create a Geography instance from WKT format

DECLARE @g geography;
SET @g = geography::STGeomFromText('LINESTRING(47.656 -122.360, 47.656 -122.343)', 4326);
SELECT @g.ToString();

To create a Geometry instance from WKB format

DECLARE @g geometry;
SET @g = geometry::STGeomFromWKB(0x010200000003000000000000000000594000000000000059400000000000003440000000000080664000000000008066400000000000806640, 0);
SELECT @g.STAsText();


-- Point from WKB
SET @g = geometry::STPointFromWKB(0x010100000000000000000059400000000000005940, 0);
SELECT @g.STAsText();

To create a Geometry instance from WKT format

-- Creating an instance with WKT
DECLARE @g geometry;
SET @g = geometry::STGeomFromText('LINESTRING (100 100, 20 180, 180 180)', 0);
SELECT @g.ToString();

-- Parse is also useful to convert data from WKT
SET @g = geometry::Parse('LINESTRING (100 100, 20 180, 180 180)');
SELECT @g.ToString();

-- Point instance from WKT
SET @g = geometry::STPointFromText('POINT (100 100)', 0);
SELECT @g.ToString();

Determining if the instances intersect

USE AdventureWorks
GO

DECLARE @geom1 geometry;
DECLARE @geom2 geometry;
DECLARE @result geometry;

SELECT @geom1 = GeomCol1 FROM SpatialTable WHERE id = 1;
SELECT @geom2 = GeomCol1 FROM SpatialTable WHERE id = 2;
SELECT @result = @geom1.STIntersection(@geom2);
SELECT @result.STAsText();

Sample code


As the integration of geospatial information into applications becomes more prevalent, we will require database systems that can store and manipulate spatial data. With the introduction of the spatial types, SQL Server 2008 provides a data storage solution for spatial data, and enables organizations of any scale to integrate geospatial features into their applications and services.

Friday, 15 February 2008

Building Time Zone awareness - Date Time types


I have been working for the last two years on projects which involve synchronising data between web applications, and often the date and time on the web application used as the source of the replication is not in the same time zone as the destination web application. Sometimes it is necessary that some portion of the date/time values moved across these applications be time zone aware; in other words, some date/time values need to be moved into the local time where the server is present, while some need to be moved at absolute times (the same as specified on the source).

For example:

Consider a table with four columns in a web application in the UK on a SQL 2000 server:
ID - identity
Name - VARCHAR(100)
SellingDate - datetime -- the date and time at which the product should start selling
ModifiedDate - datetime -- the date and time at which the product record was last modified

Now, when I move this data across to a web application in India, the selling date should remain the absolute date and time specified, but the last modified time should be moved across relative to the destination's time zone.

This is quite a constraint, since you end up having a business rule for every datetime column of your database, and this rule is only an ad hoc one determined by functionality rather than any real type-based information. It would be great to distinguish these time-zone-aware columns at the database level.

Another problem: when a .NET datetime value is serialised, it has a time zone offset appended at the end. This, when passed to SQL 2000 parameters, was not usable in its raw string form (say, when passed inside an XML fragment parameter containing this value).

In SQL 2008 the date and time data types are as below (the first two and the last two are new):
DATE - allows only dates; accuracy is one day, with the range extended back to the year 1
TIME - allows only times, to an accuracy of 100 nanoseconds
SMALLDATETIME - allows dates and times; accuracy is one minute
DATETIME - allows date and time as before, with roughly 3 ms accuracy and the familiar range of 01-01-1753 to 31-12-9999
DATETIME2 - allows specifying date and time like DATETIME but to an accuracy of 100 nanoseconds, YYYY-MM-DD hh:mm:ss[.nnnnnnn], with the extended date range
DATETIMEOFFSET - allows users to specify a date and time with a time zone offset

With these new date and time types in SQL Server 2008, time-zone-aware applications can be designed and developed more efficiently and easily.
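A few quick declarations show the shape of each new type (the values below are arbitrary, and this uses the 2008 declare-and-assign syntax):

DECLARE @d   date           = '2008-02-15';
DECLARE @t   time(7)        = '10:30:00.1234567';
DECLARE @dt2 datetime2(7)   = '2008-02-15 10:30:00.1234567';
DECLARE @dto datetimeoffset = '2008-02-15 10:30:00 +05:30';

SELECT @d AS d, @t AS t, @dt2 AS dt2, @dto AS dto;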

Now that we have seen the date and time types, let's see what functions are available. It's great to see the introduction of DATETIMEOFFSET, but it's even better to play around with it at the database level.

Functions which return DATETIME values, as present in SQL 2000/2005: GETDATE(), CURRENT_TIMESTAMP and GETUTCDATE() are still available.

Functions which return DATETIME2 values, otherwise referred to as high-precision date/time functions: SYSDATETIME(), SYSUTCDATETIME(), and SYSDATETIMEOFFSET() (which returns a DATETIMEOFFSET value).

Other functions from previous versions, such as DATEPART, DATEDIFF and DATEADD, are still available, but two new useful functions which complement the DATETIMEOFFSET data type have been introduced:

SWITCHOFFSET() and TODATETIMEOFFSET()

e.g.

SELECT SWITCHOFFSET(SYSDATETIMEOFFSET(), '+08:00')

The SWITCHOFFSET function converts a value of datetimeoffset type from its stored time zone offset to the specified time zone offset.

SELECT TODATETIMEOFFSET ( GETDATE() , '+08:00' )

The TODATETIMEOFFSET function converts a datetime value into a DATETIMEOFFSET value with the specified offset.

ISDATE()

Another useful date/time function is ISDATE(), which validates that the value specified in the parameter is a valid date/time value. There are several different scenarios in which to use it, but in general it is something like the below:

SELECT ISDATE(NULL); --returns 0


USE AdventureWorks;
GO
SELECT ISDATE(s.OrderDate) FROM Sales.SalesOrderHeader s;
GO
--Returns 1 for each row



All the information above is based on the November CTP of Katmai. Looking at all of the above, it shows good prospects for how developers will handle dates and times.

Tuesday, 12 February 2008

Change Data Capture (CDC) - I

 


Oracle 9i has had this feature since 2003, as far as I can recollect, but being a SQL Server developer I am quite pleased it is available in Katmai. To give a bit of background as to why CDC is a crucial feature, let us try to understand ETL in data warehousing. A crucial function of data warehousing is extracting data on a regular basis, applying some kind of transform if necessary, and storing it in the warehouse. So, if I have database replication set up, I would run DTS packages which replicate and transform all data between the transactional database and the warehouse database. If you are a DBA or a developer involved in a data integration project that involves the regular extraction and transportation of a large amount of data from one system to another system or systems, you could consider CDC to help reduce your ETL time. The point is, it is more efficient than our previous replication methodologies for sure, and it is particularly effective as the database size scales up.
In SQL Server 2000 and 2005 you would have written triggers to copy data into another table where the metadata about the change is stored, and then written more plumbing code to put it all together. I work on a product which has a similar concept; believe me, it's not easy on 2000 and 2005. We were able to track data even though it was complicated, but now, in Katmai, we will be able to capture the details of the event which caused the data change on the relevant database table.

In Katmai, when we use the Change Data Capture feature on a database table, a mirror of the tracked table schema is created with additional columns that include metadata; the metadata holds all the information about the change.

To know if CDC is enabled for a database, we could run the following query

SELECT * FROM sys.databases WHERE is_cdc_enabled = 1

Enabling the database for CDC

USE AdventureWorks
GO
EXEC sys.sp_cdc_enable_db_change_data_capture
GO


When you execute this command, five tables in a schema called cdc are created. In addition, two table valued functions are added to the cdc schema of the database; these functions behave like stubs for the table-based CDC functions.

The five tables created are as below:

cdc.captured_columns
cdc.ddl_history
cdc.index_columns
cdc.lsn_time_mapping
cdc.change_tables



In addition, you will notice there are two jobs, one for capture and one for cleanup, added under the SQL Server Agent jobs for the database's CDC functionality.

[Image: CDC schema]

Enabling the table for CDC

On the AdventureWorks database, to enable CDC on the HumanResources.Employee table we use the following:

USE AdventureWorks
GO
EXEC sys.sp_cdc_enable_table_change_data_capture
@source_schema = 'HumanResources',
@source_name = 'Employee',
@role_name = 'cdc_AdventureWorks',
@capture_instance = 'HumanResources_Employee_CT'
GO


When we execute this command a new capture instance is created: a change table named after the capture instance, with the schema as shown in the diagram.



Note that the same columns as in the Employee table are created, in addition to which there are five extra columns which store metadata about the change.

Now we update a row on the HumanResources.Employee table, as below:

UPDATE HumanResources.Employee
SET LoginID = 'adventureworks\guy2'
WHERE EmployeeID = 1

Now select the data in HumanResources_Employee_CT; we will find two rows inserted in this table, one representing the data before the update and the other the data after the update was applied.
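The same rows can also be read through the table valued function generated for the capture instance; a sketch over the full LSN range captured so far (function names follow the capture instance name and are as per the CTP, so they may differ in later builds):

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('HumanResources_Employee_CT');
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_HumanResources_Employee_CT(@from_lsn, @to_lsn, 'all');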

Disabling CDC on a table/capture instance

USE AdventureWorks
GO
EXEC sys.sp_cdc_disable_table_change_data_capture
@source_schema = 'HumanResources',
@source_name = 'Employee',
@capture_instance = 'HumanResources_Employee_CT'
GO

I guess I will pause here and continue with the rest in my next blog. There are several CDC-related system functions which can be used for different purposes; I will detail each of these in my next blog, and I will also detail the schema a bit once I get my head around it.

Download - Scripts CDC

Monday, 11 February 2008

Resource Governor

Before we look at the Resource Governor, let us consider an example where we have hosted two databases, App1 and App2, on a single SQL Server 2005 instance with a query execution timeout of 600 seconds. Let's say the App1 database should not allow queries to execute for more than 5 minutes (300 seconds), while App2 should allow 600 seconds. This was not easy to achieve on SQL 2005. (We may argue that we could specify a connection timeout on the client connection string in ADO.NET, but that's just a workaround.) Alright; and what if I want to give connections to the App1 database more priority than App2 connections when server resources are limited?

One of the new features in SQL 2008 (Katmai) is the Resource Governor. It is a component that allows administrators to configure pools of resources; consider it an analogy to the application pool in IIS 6.

The two main components of the Resource Governor that matter are the Resource Pool and the Workload Group.



Resource Pool

The Resource Pool allows you to set server-level resource usage (as in the physical server on which SQL Server is installed), such as:

MAX_CPU_PERCENT, MIN_CPU_PERCENT and MAX_MEMORY_PERCENT, MIN_MEMORY_PERCENT

Using TSQL

CREATE RESOURCE POOL poolNorthwind
WITH
(
MAX_CPU_PERCENT = 30,
MAX_MEMORY_PERCENT = 50
);

Workload Group

The Workload Group allows you to control some specific SQL Server related conditions, as well as to apply Resource Pools. It allows you to control settings such as connection priority, query timeout, maximum parallel requests and the other parameters mentioned below:

IMPORTANCE, REQUEST_MAX_MEMORY_GRANT_PERCENT, REQUEST_MAX_CPU_TIME_SEC, REQUEST_MEMORY_GRANT_TIMEOUT_SEC, MAX_DOP, GROUP_MAX_REQUESTS

Using TSQL

CREATE WORKLOAD GROUP workGroupNorthwind
WITH
(
IMPORTANCE = MEDIUM, -- default is MEDIUM; it's like OS thread priority
REQUEST_MEMORY_GRANT_TIMEOUT_SEC = 300, -- max memory grant wait of 5 min
MAX_DOP = 8, -- maximum degree of parallelism of 8
GROUP_MAX_REQUESTS = 30
)
USING poolNorthwind;

In order for SQL Server to be able to use the Workload Groups you've created, you need to classify each connection to determine which group that specific connection falls into. This is done through a Classifier Function. Classifier functions are new in SQL Server 2008 and execute each time a new connection is made to the server; they are scalar user-defined functions and are basically used to return the name of the target Workload Group for each user connection.

Code Example

CREATE FUNCTION ufnNorthwindRsGovMgr()
RETURNS SYSNAME WITH SCHEMABINDING
AS
BEGIN
DECLARE @grpName SYSNAME

IF (APP_NAME() = 'Microsoft SQL Server Management Studio')
SET @grpName = 'workGroupNorthwind'
ELSE
SET @grpName = 'default'
RETURN @grpName
END;

The final step in the configuration of the Resource Governor is to assign the classifier function and activate the configuration.

ALTER RESOURCE GOVERNOR
WITH
(
CLASSIFIER_FUNCTION=dbo.ufnNorthwindRsGovMgr
);

ALTER RESOURCE GOVERNOR RECONFIGURE;
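Once reconfigured, the live configuration can be inspected through the new dynamic management views, e.g.:

-- Configured pools and groups
SELECT * FROM sys.dm_resource_governor_resource_pools;
SELECT * FROM sys.dm_resource_governor_workload_groups;

-- Which workload group each user session was classified into
SELECT session_id, group_id
FROM sys.dm_exec_sessions
WHERE session_id > 50;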

Using the Management Studio

The Management Studio which comes with the Developer edition of the SQL 2008 CTP allows you to create the Resource Pool using the Resource Governor properties dialog, shown in the picture on this entry.



I have created a solution with the above code.
Download



Features on SQL Server 2008 - Katmai

For the last three weeks I have been playing around with the SQL Server 2008 (Katmai) CTP. I managed to install it on Windows 2003 Server without any problems, although I would advise people not to install it alongside SQL Server 2005. If you want to use the Management Studio of the SQL 2008 CTP, be sure to use the Developer edition CTP; the Express edition only comes with some client tools. Since it's a CTP and you are only trying to get your head around new features, it is useful to have Management Studio installed. As with any CTP I didn't know where to start, and I found that Chad Boyd has listed the new features of Katmai quite extensively on his blog. Since I am just a developer, I filtered the list down to features that may be more useful for a developer to look at. I'm planning to plough through one feature at a time and put up some information on this site for people who are interested.


Performance

Management

  • Policy based management framework (manage via policies vs. scripts, enterprise wide support, automated monitoring/enforcement, etc.).
  • Integrate with Microsoft System Centre.
  • Extended Events (high perf lightweight tracing infrastructure, NOT sql trace, integrated with ETW).

Development Enhancements

Service Broker

  • New UI and tools for working with Service Broker (add/drop/edit functionality within SSMS, diagnostic tools).
  • Conversation Priority (set message ordering, send/receive impact, 1-10 levels).

Data Storage

  • Data compression (see above).
  • FILESTREAM attribute (get the 'best of both' functionality from BLOBs in the DB vs. BLOBs on filesystem, no more "to blob or not to blob").
  • Integrated Full Text Search (FTS fully integrated into DB engine, no external storage, no external service, more efficient and reliable costing).
  • Sparse columns (more efficient storage for 'wide' tables with many columns that frequently contain no data).
  • New index types (spatial indexes, hierarchical indexes, FILTERED indexes (indexes on filtered values within columns), etc.).
  • Reporting

    • IIS-agnostic Reporting Services deployment (no IIS required to run RS any longer).
    • Rich text support.
    • Enhanced visualisation (graphing).
    • New Word rendering (render reports to Microsoft Word).