
This might be difficult for me to describe clearly, but I will try anyway. Also note that this is entirely speculative; it might not even be possible to make it run any faster (that is why I am asking).

This is a question about performance. I have been trying to improve the performance of a method and have managed to speed it up quite a bit already (around 40% faster). Right now, the entire process takes about 1 minute and 15 seconds to complete. I would like to see if you can spot some obvious faults in my code that would improve it even further, seeing as I am not quite satisfied yet.

The method fetches a large amount of data from a couple of databases, and the result needs to be displayed in a grid in a certain way. Here is the heavy part of the method; I will describe it in more detail afterwards:

DataClasses1DataContext dbContext = new DataClasses1DataContext(sqlConnectionString);

//Find minimum value of N in the specified timespan
var headerMin = (from y in dbContext.Testheaders
                 where Convert.ToDateTime(y.StartTime) >= GlobalVariables.StartTimeMSA
                 select y.N).Min();

//Find maximum value of N in the specified timespan
var headerMax = (from y in dbContext.Testheaders
                 where Convert.ToDateTime(y.StartTime) <= GlobalVariables.EndTimeMSA
                 select y.N).Max();

//Find appropriate data in the specified timespan (min and max headerid)
var msaData = from x in dbContext.Testdatas
              where x.HeaderID >= headerMin && x.HeaderID <= headerMax
              select new
              {
                  StartTime = Convert.ToDateTime((from z in dbContext.Testheaders
                                                  where x.HeaderID == z.N
                                                  select z.StartTime).FirstOrDefault()),
                  DUT_id = Convert.ToString((from y in dbContext.Testheaders
                                             where x.HeaderID == y.N
                                             select y.DUT_id).FirstOrDefault()),
                  Meas = (double)x.Meas,
                  x.Step,
                  x.LogID,
                  x.HeaderID,
                  printID = Convert.ToInt32((from y in dbContext.Testheaders
                                             where x.HeaderID == y.N
                                             select y.PrintID).FirstOrDefault()),
                  Prod_Blok = Convert.ToInt32((from y in dbContext.Testheaders
                                               where x.HeaderID == y.N
                                               select y.Prod_Blok).FirstOrDefault()),
                  Rep_Count = Convert.ToInt32((from y in dbContext.Testheaders
                                               where x.HeaderID == y.N
                                               select y.Rep_Count).FirstOrDefault()),
                  Operator = Convert.ToString((from y in dbContext.Testheaders
                                               where x.HeaderID == y.N
                                               select y.Operator).FirstOrDefault()),
              };
msaDataTable.Columns.Add("Time");
msaDataTable.Columns.Add("DUT");
msaDataTable.Columns.Add("Position");

var functionDistinct = msaData.Select(x => new { x.Step, x.LogID })
                              .Distinct()
                              .ToList()
                              .OrderBy(x => x.Step)
                              .ThenBy(x => x.LogID);

//Add columns to datatable corresponding to each function. The number of columns is only known at runtime.
foreach (var item in functionDistinct)
{
    var column = new DataColumn()
    {
        DataType = typeof(double),
        ColumnName = String.Format("{0}-{1}", item.Step, item.LogID),
        AllowDBNull = true
    };
    msaDataTable.Columns.Add(column);
}

var distinctHeaderID = msaData.Select(x => x.HeaderID).Distinct();
foreach (var headerID in distinctHeaderID)
{
    ct.ThrowIfCancellationRequested();

    //Fetch the data for the current headerID
    var headerIDData = msaData.Where(x => x.HeaderID == headerID)
                              .Select(x => new { x.StartTime, x.DUT_id, x.printID, x.Step, x.LogID, x.Meas })
                              .ToList();
    if (headerIDData != null)
    {
        DataRow row = msaDataTable.NewRow();
        row["Time"] = headerIDData.FirstOrDefault().StartTime;
        row["DUT"] = headerIDData.FirstOrDefault().DUT_id;
        row["Position"] = headerIDData.FirstOrDefault().printID;

        //Run through the data for the current headerID and assign the values to the right function.
        foreach (var data in headerIDData)
        {
            string function = String.Format("{0}-{1}", data.Step, data.LogID);
            row[function] = data.Meas;
        }
        msaDataTable.Rows.Add(row);
    }
}

Some sample rows from the database query (msaData) could look like this (these are just made up):

[Image: sample rows]

I am basically taking those sample rows and combining them into a single row, where each distinct Step/LogID pair becomes a "function" shown as a column, with the measurements as cell values. To illustrate, here is how the data above would be displayed correctly:

[Image: sample rows, combined correctly]

A few notes:

  • Step and LogID combined are known as a "function". In the database they are separated into two columns instead of just the function. For example, Step could have a field value of 5 and LogID 1. Then the function would be known as 5-1. This is what I am displaying as columns instead.

  • To achieve the goal I am creating a DataTable and binding it to the grid after I am done filling it with appropriate data.

  • I need to create the columns dynamically as there is no way to know how many functions there are at design time. That is the reason I am using a DataTable in the first place instead of just creating a class and an ObservableCollection for example.

  • The sample rows above are a tiny fraction of what the actual data will consist of. We are talking 400,000 rows in the test scenario I am currently running.

Are there any immediate problems in my code that you can spot? Perhaps some places where a ToList() call could improve performance (even though I have already tried a few), or perhaps another way to tackle how I combine the rows?

asked Dec 19, 2014 at 12:05
  • Just out of curiosity, why would you return 400k rows to an end user? That's sure to cause the user's brain to overflow with too much data. Commented Dec 19, 2014 at 12:26
  • Well, the thing is that the number of rows will be reduced greatly before it is shown to the user, since the number of functions can be as large as about 70. This is a client request to get it shown this way, and with my method it is shown the way they want it. I am using DevExpress, so it will be easy for them to search for items in the grid. But the most common use will be to export the grid data into Excel, where they will use some statistical program (that I have no idea how it works or what it actually is) to do some automatic modifications/readings on the data. Commented Dec 19, 2014 at 12:32
  • Furthermore, it might be acceptable for them for the application to take this long before spewing out the data, but I find it somewhat slow. I am asking the question out of curiosity to see if anyone can spot some obvious flaws, seeing as I am relatively new to C# and haven't actually done an application like this before. Hence why I stated the question as merely speculative :) Commented Dec 19, 2014 at 12:34
  • Have you looked at the generated SQL, and at the plan and stats for it? Commented Dec 19, 2014 at 13:47
  • Oh sorry, yeah, have a google around analysing query plans in SQL Server. You will be able to speed your function up more effectively by speeding up the SQL aspect. Commented Dec 19, 2014 at 16:07

2 Answers


I'm thinking that the two calls at the beginning might be reduced to one, which should cut down the time somewhat. Something like this:

var headers = from y in dbContext.Testheaders
              let startTime = Convert.ToDateTime(y.StartTime)
              where startTime >= GlobalVariables.StartTimeMSA &&
                    startTime <= GlobalVariables.EndTimeMSA
              select y.N;

//Find minimum value of N in the specified timespan
var headerMin = headers.Min();
//Find maximum value of N in the specified timespan
var headerMax = headers.Max();

In a similar vein, extracting the appropriate header only once for each record you're looking at in Testdatas should eliminate the repeated subqueries you're doing in the select new block. Something like this:

var msaData = from x in dbContext.Testdatas
              where x.HeaderID >= headerMin && x.HeaderID <= headerMax
              let header = (from h in dbContext.Testheaders
                            where x.HeaderID == h.N
                            select h).FirstOrDefault()
              select new
              {
                  StartTime = Convert.ToDateTime(header.StartTime),
                  DUT_id = Convert.ToString(header.DUT_id),
                  Meas = (double)x.Meas,
                  x.Step,
                  x.LogID,
                  x.HeaderID,
                  printID = Convert.ToInt32(header.PrintID),
                  Prod_Blok = Convert.ToInt32(header.Prod_Blok),
                  Rep_Count = Convert.ToInt32(header.Rep_Count),
                  Operator = Convert.ToString(header.Operator),
              };
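As a variation on the above, the same lookup could be written as an explicit join, which the provider can translate to a single SQL JOIN rather than a per-row subquery. This is only a sketch against the same Testdatas/Testheaders mappings, and note the changed semantics: an inner join drops any rows whose HeaderID has no matching header, whereas the FirstOrDefault version keeps them with null header fields.

```
// Sketch only: same hypothetical mappings as above, but with an explicit inner join.
var msaData = from x in dbContext.Testdatas
              join h in dbContext.Testheaders on x.HeaderID equals h.N
              where x.HeaderID >= headerMin && x.HeaderID <= headerMax
              select new
              {
                  StartTime = Convert.ToDateTime(h.StartTime),
                  DUT_id = Convert.ToString(h.DUT_id),
                  Meas = (double)x.Meas,
                  x.Step,
                  x.LogID,
                  x.HeaderID,
                  printID = Convert.ToInt32(h.PrintID),
                  Prod_Blok = Convert.ToInt32(h.Prod_Blok),
                  Rep_Count = Convert.ToInt32(h.Rep_Count),
                  Operator = Convert.ToString(h.Operator)
              };
```

If the left-join behaviour matters, `join ... into g from h in g.DefaultIfEmpty()` restores it.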
answered Dec 23, 2014 at 2:58
  • Hmm, interesting; I haven't heard of the let clause before. I'll try this out. Commented Dec 23, 2014 at 6:48
  • Doing this actually did cut the elapsed time a bit. Now it is down to about 50 seconds, which I do find somewhat acceptable. Thanks! I'll go read what let actually does to gain a bit of understanding of what you did. Commented Dec 23, 2014 at 7:07
  • You should call ToList() on the big Select() and remove the calls to ToList() here

        var functionDistinct = msaData.Select(x => new { x.Step, x.LogID })
                                      .Distinct()
                                      .ToList()
                                      .OrderBy(x => x.Step)
                                      .ThenBy(x => x.LogID);

    or here

        var headerIDData = msaData.Where(x => x.HeaderID == headerID)
                                  .Select(x => new { x.StartTime, x.DUT_id, x.printID, x.Step, x.LogID, x.Meas })
                                  .ToList();

  • Skip the creation of the anonymous types where you don't need them.

  • Replace the check headerIDData != null with headerIDData.Any(): ToList() never returns null, only a (possibly empty) list.

  • Call Distinct() before Select() by implementing an IEqualityComparer<T> for

        var distinctHeaderID = msaData.Select(x => x.HeaderID).Distinct();
    
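The last bullet is terse, so here is a minimal sketch of what that comparer could look like once the rows are materialized. MsaRow and HeaderIdComparer are hypothetical names, not from the original code; a named row type is needed because a comparer class cannot refer to an anonymous type.

```
using System.Collections.Generic;
using System.Linq;

// Sketch only: MsaRow stands in for the query's anonymous row type.
class MsaRow
{
    public int HeaderID { get; set; }
    public int Step { get; set; }
    public int LogID { get; set; }
}

// Treats two rows as equal when their HeaderID matches, so Distinct()
// can run on the full rows before any Select().
class HeaderIdComparer : IEqualityComparer<MsaRow>
{
    public bool Equals(MsaRow a, MsaRow b)
    {
        if (a == null || b == null) return ReferenceEquals(a, b);
        return a.HeaderID == b.HeaderID;
    }

    public int GetHashCode(MsaRow r)
    {
        return r == null ? 0 : r.HeaderID.GetHashCode();
    }
}

// Usage:
// var distinctRows = rows.Distinct(new HeaderIdComparer()).ToList();
```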
answered Dec 19, 2014 at 12:59
  • Alright, tried removing the ToList() calls where you specified. Here are the elapsed times: 00:01:06.7339292 (with the ToList() calls) versus 00:03:20.5137437 (without them). So removing the ToList() calls basically triples the elapsed time. Commented Dec 19, 2014 at 13:10
  • Oh yeah, let me update the question with that query. Commented Dec 19, 2014 at 13:19
  • Once again, the elapsed time was way longer with calling ToList() on the initial msaData and removing it from functionDistinct and headerIDData. Performance: 00:05:41.6543591. Performance with old code: 00:00:58.3041223. Commented Dec 19, 2014 at 13:44
  • So, the call to ToList() will take almost 5 minutes? Commented Dec 19, 2014 at 13:50
  • Apparently, yes. Does seem a bit weird. Commented Dec 19, 2014 at 13:58
