SQL Archives - Blog IT

.NET EF Core 6 support for groupby top(n) queries

Telmo Rodrigues — Fri, 04 Feb 2022 16:14:46 +0000

EF Core 6 comes with several improvements to GroupBy queries. In this post I want to talk about the improvements related to “group by top(n)” queries.

Let’s say we have the following table Documents:

CREATE TABLE [dbo].[Documents](
    [Id] [int] PRIMARY KEY,
    [UserId] [int] NOT NULL,
    [Title] [nvarchar](50) NOT NULL,
    [Body] [nvarchar](250) NOT NULL,
    [CreatedOn] [datetime] NOT NULL
)

and that we want to get the two most recent documents for each user. For instance, if we have the following records:

the query should return

To do this query using EF Core and LINQ we can try to group the Documents by UserId and then sort each group by the CreatedOn column to pick the first two documents for each user.

We can start by grouping all documents by user:

    // Not executed yet: this only builds the grouped query
    var docsByUser = ctx
        .Documents
        .GroupBy(doc => doc.UserId);

and then, for each group, we sort its elements by the CreatedOn column and take the first two:

    var usersDocs = await ctx
        .Documents
        .GroupBy(doc => doc.UserId)
        .SelectMany(userDocs => userDocs.OrderByDescending(doc => doc.CreatedOn).Take(2))
        .ToArrayAsync();

If we try to execute this query using a version of EF Core earlier than 6, we get the following error:

    .OrderByDescending(doc => doc.CreatedOn)' could not be translated. Either rewrite the query in a form that can be translated, or switch to client evaluation explicitly by inserting a call to 'AsEnumerable', 'AsAsyncEnumerable', 'ToList', or 'ToListAsync'. See https://go.microsoft.com/fwlink/?linkid=2101038 for more information.'

This is because previous versions of EF Core don’t know how to translate the GroupBy inner expressions to SQL. After following their suggestion I ended up with this query:

    var usersDocs = await ctx
        .Documents
        .Select(doc => doc.UserId)
        .Distinct()
        .SelectMany(userId => 
            ctx
            .Documents
            .Where(doc => doc.UserId == userId)
            .OrderByDescending(doc => doc.CreatedOn)
            .Take(2)
        )
        .ToArrayAsync();

However, this query has two main issues: it does a DISTINCT over the UserId column and loads all the user ids into application memory, and then it runs one query per user, resulting in an N+1 query problem.

One way to solve these issues is to forget LINQ and rewrite the query in raw SQL, using a CTE with the ROW_NUMBER() window function partitioned by UserId.
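As a rough sketch of that raw SQL alternative, assuming the Documents table defined above: the CTE numbers each user's documents from newest to oldest, and the outer query keeps the first two. The CTE name RankedDocuments and the RowNum alias are illustrative and not taken from the original query.

```sql
-- Two most recent documents per user, using a CTE and ROW_NUMBER().
WITH RankedDocuments AS (
    SELECT
        d.Id, d.UserId, d.Title, d.Body, d.CreatedOn,
        -- Numbering restarts at 1 for every UserId, newest document first.
        ROW_NUMBER() OVER (PARTITION BY d.UserId
                           ORDER BY d.CreatedOn DESC) AS RowNum
    FROM dbo.Documents AS d
)
SELECT Id, UserId, Title, Body, CreatedOn
FROM RankedDocuments
WHERE RowNum <= 2
ORDER BY UserId, CreatedOn DESC;
```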

Now, with EF Core 6, we can use the first version of the query: the EF Core team has added support for translating some GroupBy inner expressions, so this LINQ query is translated to a single SQL query.

    var usersDocs = await ctx
        .Documents
        .GroupBy(doc => doc.UserId)
        .SelectMany(userDocs => userDocs.OrderByDescending(doc => doc.CreatedOn).Take(2))
        .ToArrayAsync();

You can read more about these features and other improvements added to GroupBy queries here, and check the related GitHub issues 12088 and 13805.

Happy coding!

The post .NET EF Core 6 support for groupby top(n) queries appeared first on Blog IT.

NoSQL First Act – a historical introduction

Gonçalo Melo — Tue, 13 Aug 2019 17:57:42 +0000

NoSQL gets a lot of “heat” for not having a good, direct definition, and the term NoSQL only gives some clues about what it is not.

NoSQL is a new name for something that is a database but differs from the usual relational model. We can draw a parallel by going back 30 years: back then, probably no one knew what a relational database was either. Let’s use this and start with a short historical and motivational view of NoSQL. PS: I was not a developer back then.

This article is the first part of a series. The goal is to bring together some knowledge about this topic.

Starting in the 1980’s

Relational databases emerged, bringing the ACID properties: Atomicity, Consistency, Isolation and Durability. These properties are taken for granted nowadays. They also brought the SQL language. Although there are different flavors, SQL is common enough across systems that it can almost be considered a standard.

Relational databases also allowed for a simple and very common integration mechanism between two or more systems: data can easily be shared by reading or writing a table in a shared relational database. This is still a very common integration pattern today.

In the 1990’s

Object databases started to gain strength, although they had been around for some time. Most importantly, they implement a different database paradigm, one that tried to solve the impedance mismatch problem. The impedance mismatch emerges from the need to map the objects our application holds in memory to tables in a relational database. This page https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch is a good reference for this issue.

Saving an application object into this database model should be a direct operation: conceptually, no mapping is necessary when an object database is used. Despite these gains, object databases were not able to replace relational databases.

In the 2000’s

As internet availability grew, two things started to have a big impact:

Generation of enormous amounts of data that had to be stored and processed;
Accesses from anywhere on the globe became more and more frequent, so latency became an issue to consider. Even the speed of light contributes significantly to this, since data can now be stored far from where we are. For example, a distance of 10,000 km adds about 100 ms to each round trip. Therefore, data needs to be spread around the globe to provide quick access.

In this environment, relational databases do not always solve customer needs, because of flexibility, price or performance issues:

Flexibility, because relational databases enforce a rigid set of rules in the relational model, which affects the flexibility of application development;
Price, because of the need for more powerful machines and the software licenses that go with them;
Performance, because relational databases do not naturally keep up with the increasing workload. Huge amounts of CPU, memory, storage and throughput are needed to perform in this environment, and for the relational model vertical scaling can only get you so far.

Scaling Vertically versus Horizontally

The usual scaling method for relational databases is vertical: whenever the workload increases, we use a bigger machine with more hardware resources.

On the other hand, the big internet companies adopted a horizontal scaling paradigm. This is a cluster environment with lots and lots of machines; in other words, more machines are used to handle bigger workloads.

Relational databases do not thrive in this paradigm, as they hardly benefit from using more machines. Hence, NoSQL-like databases arose to take advantage of the horizontal scaling paradigm. Companies like Google and Amazon started researching in this area; as a result, Google created BigTable and Amazon created Dynamo.

Relational databases dominated the market, so why are NoSQL databases being used now?

I think this is important. Dominance and usage are usually driven by a combination of factors. What has been changing:

Developers hide databases behind integration layers. This makes them simpler to use and, in addition, makes it easier to replace one database or persistence method with another;
Cloud growth in some cases means we can try different approaches with less effort regarding infrastructure;
Handling large quantities of data introduced new needs. NoSQL provides good development flexibility and can also take advantage of cluster solutions.

Finally, what is NoSQL?

Common NoSQL databases characteristics

NoSQL doesn’t have a clear, direct meaning. The term alludes to something like “Not Only SQL”. A more exact name would be “non-relational database”, which would emphasize the new model paradigm rather than the SQL language itself. The fact that some NoSQL databases actually support some form of SQL adds even more to the confusion.

But the name is catchy and so common that it will probably stay. The point is: defining NoSQL is hard. Therefore, we will do the next best thing and introduce the dominant traits of these database systems:

Non-relational;
Usually (but not always) cluster-friendly;
Most of them are open source;
No-schema / schema-less;
Driven by big internet needs (cluster and big data friendly).

By these common traits, almost any non-relational database can be a NoSQL database! Also, there are probably published studies in all these areas dating back 30 or 40 years. Out of curiosity, the name NoSQL itself is said to have been born in 2009: the term emerged as a Twitter hashtag for a meetup where people talked about these subjects, even though some of these traits were not new at the time.

Next steps

NoSQL has a lot to explore. It provides a more direct paradigm that can simplify specific needs, such as querying JSON objects inside SQL Server and the related performance considerations.

My plan for the next article(s) will be to:

Present some of the NoSQL data models;
Talk about the no-schema or schema-less feature;
Discuss the aggregate concept and how it relates to the CAP theorem.

Further reads

There is a lot of information online. Personally, I like Martin Fowler’s approach, and some of this information comes from his content. You can check his website here: https://martinfowler.com/nosql.html. He also has a book: https://martinfowler.com/books/nosql.html.


Query performance for JSON objects inside SQL Server using JSON_VALUE function

Gonçalo Melo — Thu, 20 Dec 2018 09:39:12 +0000

Following up on this article about querying JSON data, I would like to talk about how to improve searches on JSON data inside SQL Server. Starting with SQL Server 2016, Microsoft introduced a set of functions that allow us to work with JSON data in a structured way inside SQL Server.

I will introduce a small usage sample for the JSON_VALUE function in combination with indexes to improve information retrieval from a table containing one JSON object. For our tests, we have a UserDetailTest table with more than 500k rows and 2 columns: a UserId and an nvarchar(max) column that holds a small JSON details string like the following:

SELECT *, LEN(DetailsJSON) AS [Len(DetailsJSON)] FROM UserDetailTest

I will activate time statistics and clean SQL Server cache between each query to have some consistency across the execution times using the following SQL statements:

CHECKPOINT
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON

JSON_VALUE Function intro

The JSON_VALUE function extracts a value from a JSON string. This function receives 2 arguments: the first is an expression containing the JSON text and the second is a path to the value we want to obtain. A simple sample with an inline JSON object would be the following:

SELECT JSON_VALUE('{"PostalCode":"376-3765","PhoneNumber":"351003765718"}', '$.PhoneNumber')

The return is a scalar value (nvarchar(4000)) for the PhoneNumber element.

Query scenarios

Starting with one of the worst-case scenarios (yes, this can happen…):

More than 17 seconds really does look like a worst-case scenario! Next, we will use the JSON_VALUE function to get the PhoneNumber from the JSON string and use it in our WHERE clause:

This still takes more than 3 seconds. A good improvement, but still a high cost if we need to search information this way.
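As a sketch, the two searches compared above could be written as follows. The WHERE clauses are the ones listed in the summary results at the end of this post; the SELECT * list is an assumption.

```sql
-- Worst case: pattern match over the raw JSON text.
SELECT *
FROM UserDetailTest
WHERE DetailsJSON LIKE '%PhoneNumber":"351003765718"%';

-- JSON_VALUE extracts the PhoneNumber before comparing it.
SELECT *
FROM UserDetailTest
WHERE JSON_VALUE(DetailsJSON, '$.PhoneNumber') = '351003765718';
```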

Index creation

Let’s add a new virtual (computed) column to the table that exposes the result of the JSON_VALUE function. This will allow us to create an index and simplify the SELECT queries.

ALTER TABLE UserDetailTest 
	ADD vPhoneNumber AS 
	CAST(JSON_VALUE(DetailsJSON, '$.PhoneNumber') AS NVARCHAR(255));

The cast truncates the output of JSON_VALUE to ensure that it does not exceed the maximum length for the index key. Now, if we search using the column, the result is still more or less the same 3 seconds as before:

Let’s create an index and perform the search again:

CREATE INDEX idx_vPhoneNumber ON [UserDetailTest] (vPhoneNumber);

The improvement is huge: 13 milliseconds is much more interesting! But what’s the cost?

The cost

When we create an index, we’re basically trading space for time: more occupied space versus faster operations. Let’s see what the increase is, using the sp_spaceused stored procedure before and after index creation:

	rows	reserved	data	index_size	unused
Initial	557801	1120904 KB	1115584 KB	5224 KB	96 KB
After Index	557801	1148752 KB	1115584 KB	32992 KB	176 KB
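The figures above come from sp_spaceused. A minimal call, assuming the table lives in the current database and the default schema, looks like this:

```sql
-- Reports name, rows, reserved, data, index_size and unused for one table.
EXEC sp_spaceused N'UserDetailTest';
```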

An interesting comparison would be to have the PhoneNumber as an explicit column in that table containing the value. Let’s do that!

ALTER TABLE UserDetailTest  
	ADD PhoneNumberCopy NVARCHAR(255)

UPDATE UserDetailTest
	SET PhoneNumberCopy = vPhoneNumber

The sp_spaceused output remained the same, thanks to SQL Server’s internal black magic regarding space allocation (for another time!). But performance-wise, with this approach, on my machine this query still took more than 2 seconds to complete:

This is only slightly better than the query with the JSON function, but the advantage is that in this scenario SQL Server can better optimize the searches. Subsequent runs of the same query without clearing the cache return results much faster, each taking around 220 milliseconds to complete:

Summary Results

To sum up, here are the results for the different types of searches, comparing the first execution after cache cleanup with the following runs. I did several iterations of each step to make sure the results were consistent, although the goal was just to have a baseline. Certainly, the usefulness of this will depend on each case.

WHERE clause type	First run	Following runs
WHERE DetailsJson LIKE '%PhoneNumber":"351003765718"%'	17754 ms	around the same
WHERE JSON_VALUE(DetailsJson, '$.PhoneNumber') = '351003765718'	3375 ms	1958 ms
WHERE vPhoneNumber = '351003765718' (vPhoneNumber is not indexed)	3339 ms	1332 ms
WHERE vPhoneNumber = '351003765718' (vPhoneNumber is indexed)	13 ms	0 ms
WHERE PhoneNumberCopy = '351003765718' (PhoneNumberCopy is of type nvarchar(255) and explicitly contains all values)	2309 ms	222 ms


Query a JSON array in SQL

Diogo Guiomar — Mon, 26 Feb 2018 11:07:45 +0000

For the purpose of this post, let’s not evaluate the database design option and let’s focus on the operations on the JSON column.

Let’s say we have a table of customers with an Id, a CompanyName and an Address. These customers can also have some information related to each service; for example, they may have a specific customer id for each service.

Customer

Id: the id

CompanyName: the company name

Address: the address

--- ServiceId: 1

--- ServiceCustomerId: the customer id for service 1

--- ServiceId: 2

--- ServiceCustomerId: the customer id for service 2

One way to store this in a single table is with these columns: Id / Name / Address / ServicesDataInJson

So for this example, the information in ServicesDataInJson is an array of ‘objects’ that contain the information of our customer for each service. Since there is no native JSON data type in SQL Server, the data type of this column is just nvarchar(max).

Here’s what a simple SELECT looks like:

Now we want to query our customers table by a customer service id (which is inside that JSON). How can we query this?


SELECT *
FROM Customers c
CROSS APPLY OPENJSON(c.ServicesDataInJson)
WITH (ServiceId int '$.ServiceId',
      ServiceCustomerId nvarchar(255) '$.ServiceCustomerId') AS jsonValues
WHERE jsonValues.ServiceCustomerId = @TheIdWeWantToSearchFor

OPENJSON is a table-valued function that parses the JSON into a row/column result, and the WITH clause lets us define the shape of that output.
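To see OPENJSON on its own, here is a small sketch that parses an inline array instead of a table column. The @json variable and the sample ids A-100 and B-200 are made up for illustration; the WITH clause is the same one used in the query above.

```sql
-- Hypothetical sample data: one array element per service.
DECLARE @json nvarchar(max) = N'[
  {"ServiceId": 1, "ServiceCustomerId": "A-100"},
  {"ServiceId": 2, "ServiceCustomerId": "B-200"}
]';

-- Each array element becomes one row with two typed columns.
SELECT ServiceId, ServiceCustomerId
FROM OPENJSON(@json)
WITH (
    ServiceId         int           '$.ServiceId',
    ServiceCustomerId nvarchar(255) '$.ServiceCustomerId'
);
```

With CROSS APPLY, the same parsing is repeated for every customer row, which is why a customer appears once per matching service entry.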

Here is the query without the WHERE clause:

Note: The OPENJSON function will not work if the database compatibility level is lower than 130.

To change it:

ALTER DATABASE DatabaseName SET COMPATIBILITY_LEVEL = 130
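Before changing it, the current level can be checked with a query like the one below. This is a small sketch against the sys.databases catalog view; DatabaseName is the same placeholder used above.

```sql
-- A value below 130 means OPENJSON is not available in that database.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'DatabaseName';
```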


Converting a Vertical Table to an Horizontal Table in SQL Server

Ricardo Costa — Thu, 27 Oct 2016 21:24:13 +0000

Today I encountered a vertical table in a SQL database and I wanted to transform it into a horizontal one. A vertical table follows what is known as the EAV (entity-attribute-value) model.

Imagine you have this table


CREATE TABLE VerticalTable
(
Id int,
Att_Id varchar(50),
Att_Value varchar(50)
)

INSERT INTO VerticalTable
SELECT 1, 'FirstName', 'John' UNION ALL
SELECT 1, 'LastName', 'Smith' UNION ALL
SELECT 1, 'Email', 'john.smith@dummy.com' UNION ALL
SELECT 2, 'FirstName', 'Jack' UNION ALL
SELECT 2, 'LastName', 'Daniels' UNION ALL
SELECT 2, 'Email', 'jack.daniels@dummy.com'

If you run


SELECT * FROM VerticalTable

you get this

	Id	Att_Id	Att_Value
1	1	FirstName	John
2	1	LastName	Smith
3	1	Email	john.smith@dummy.com
4	2	FirstName	Jack
5	2	LastName	Daniels
6	2	Email	jack.daniels@dummy.com

To convert this into a horizontal table I’m going to use PIVOT.


SELECT [Id], [FirstName], [LastName], [Email] 
FROM
(
 SELECT Id, Att_Id, Att_Value FROM VerticalTable
) as source
PIVOT
(
 MAX(Att_Value) FOR Att_Id IN ([FirstName], [LastName], [Email])
) as target

And I will get this

	Id	FirstName	LastName	Email
1	1	John	Smith	john.smith@dummy.com
2	2	Jack	Daniels	jack.daniels@dummy.com
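For comparison, the same result can also be produced without PIVOT, using conditional aggregation. This is an alternative technique, not the one used above, sketched against the same VerticalTable:

```sql
SELECT
    Id,
    -- Each CASE keeps the value of one attribute and yields NULL otherwise;
    -- MAX ignores the NULLs and collapses the rows of each Id into one value.
    MAX(CASE WHEN Att_Id = 'FirstName' THEN Att_Value END) AS FirstName,
    MAX(CASE WHEN Att_Id = 'LastName'  THEN Att_Value END) AS LastName,
    MAX(CASE WHEN Att_Id = 'Email'     THEN Att_Value END) AS Email
FROM VerticalTable
GROUP BY Id
ORDER BY Id;
```

MAX plays the same role here as inside the PIVOT: it picks the single value that exists for each Id and attribute pair.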

You can find the code here.


How to Access the Previous Row and Next Row value in SELECT statement?

Ricardo Costa — Fri, 07 Nov 2014 11:09:37 +0000

LAG – http://msdn.microsoft.com/en-us/library/hh231256.aspx

USE AdventureWorks2012;
GO
SELECT BusinessEntityID, YEAR(QuotaDate) AS SalesYear, SalesQuota AS CurrentQuota, 
       LAG(SalesQuota, 1,0) OVER (ORDER BY YEAR(QuotaDate)) AS PreviousQuota
FROM Sales.SalesPersonQuotaHistory
WHERE BusinessEntityID = 275 and YEAR(QuotaDate) IN ('2005','2006');

LEAD – http://msdn.microsoft.com/en-us/library/hh213125.aspx

USE AdventureWorks2012;
GO
SELECT BusinessEntityID, YEAR(QuotaDate) AS SalesYear, SalesQuota AS CurrentQuota, 
    LEAD(SalesQuota, 1,0) OVER (ORDER BY YEAR(QuotaDate)) AS NextQuota
FROM Sales.SalesPersonQuotaHistory
WHERE BusinessEntityID = 275 and YEAR(QuotaDate) IN ('2005','2006');
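To try both functions without the AdventureWorks2012 database, here is a small self-contained sketch. The inline rows and the column names SalesYear and Quota are made up for illustration.

```sql
SELECT
    q.SalesYear,
    q.Quota AS CurrentQuota,
    -- Value from the previous row; 0 when there is no previous row.
    LAG(q.Quota, 1, 0)  OVER (ORDER BY q.SalesYear) AS PreviousQuota,
    -- Value from the next row; 0 when there is no next row.
    LEAD(q.Quota, 1, 0) OVER (ORDER BY q.SalesYear) AS NextQuota
FROM (VALUES (2005, 100), (2006, 200), (2007, 300)) AS q (SalesYear, Quota)
ORDER BY q.SalesYear;

-- SalesYear  CurrentQuota  PreviousQuota  NextQuota
-- 2005       100           0              200
-- 2006       200           100            300
-- 2007       300           200            0
```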

Both functions are available in SQL Server 2012 and 2014.


Strange behavior with “Merge Join” in SSIS

Ricardo Costa — Fri, 24 Oct 2014 10:04:36 +0000

If the data sources of the Merge Join block in SSIS aren’t sorted, the merge will not work correctly.

Strange behavior will occur if the data sources aren’t sorted the same way, by the same key.

http://msdn.microsoft.com/en-us/library/ms141775.aspx

“The Merge Join Transformation requires sorted data for its inputs.”

http://msdn.microsoft.com/en-us/library/ms137653.aspx
