SQL Calculation based on 'similar' rows within table

Question 1

I'm using SQL Server 2008 R2, building a fact table for SSAS.

I can therefore accept solutions to this via either SQL query to SSIS task.

Let's say I have a fact table that looks a little like this, its a record of when people have eaten cakes, and how many (for illustrative purposes).

Surrogate Key | CakeType | PersonKey | CakesEaten | Date | HadSeconds
1 20 1234 4 20120412 0
2 41 2345 3 20120401 0 
3 25 3456 4 20120408 0
4 39 2345 2 20120402 0
5 20 4567 3 20120409 0
6 89 9876 2 20120405 0
7 31 1579 4 20120408 0
8 29 2709 3 20120402 0
9 27 3456 3 20120410 0
10 26 2345 2 20120408 0

CakeType and PersonKey and Date are all foreign keys to dimensions.

CakesEaten and HadSeconds are measures.

Techie type question: What I want to do, is to mark a fact as 'HadSeconds' where a Person has a record in the table within 2 days of the previous fact.

Business type question: Show me the instances of people going to eat cake when they've already had some cake in the previous 2 days.

For example, Person '2345' has 3 records in this fact table, IDs 2, 4, 10.

I want to mark 'HadSeconds' for ID '4' as a '1', because the Date for ID 4 is within 2 days of ID2, therefore the person ate cake in quick succession, and this instance needs to be marked as 'HadSeconds'. ID '10' has a datestamp which is 6 days after ID '4', therefore the 'HadSeconds' column for ID 10 does not need to be marked.

Note that only records were a Person has multiple entries in the fact need to be considered.

The best (lol) thing? This fact table has approx. 60 million records in it.

What's the most efficient way to achieve this? Breaking stuff down into temporary tables? Sorting? Cursors (ew...), Aggregration tasks in SSIS...

Question 2

Just an UPDATE using EXISTS will be your best bet:

UPDATE c
SET HadSeconds = 1
FROM Caketable C
WHERE EXISTS (SELECT 1
 FROM CakeTable C2
 WHERE c2.personid = c.personid
 AND c2.date > DATEADD(day, -2, c.date))

If you think this will run too long, you can always break it down by date range, cake type, etc to do smaller batches.

If you have a composite index on personid, date this should be pretty fast.

You could also do a self-join but I'm pretty confident the EXISTS will run a lot faster:

UPDATE c
SET HadSeconds = 1
FROM Caketable C
INNER JOIN CakeTable C2
 ON c2.personid = c.personid
 AND c2.date > DATEADD(day, -2, c.date)

Question 3

Missing a closing bracket round the subquery in that code snippet, (also I think you mean c2.date instead of cd.date but that self-join seems to be the way forward I think... Cheers!

Question 4

@JamesLove fixed the syntax. Let me know if you want a self join example

Question 5

? I thought that was a self join inside the EXISTS... what else did you have in mind otherwise?

Question 6

It's not a JOIN, it's an exists. I'll post an example.

Question 7

@JamesLove The exists should be faster since it doesn't JOIN all the non-matching records, just returns a boolean if there is a match.

JNK JNK 18.1k6 gold badges63 silver badges98 bronze badges · Accepted Answer · 2012-04-12 20:25:08Z

3

Just an UPDATE using EXISTS will be your best bet:

UPDATE c
SET HadSeconds = 1
FROM Caketable C
WHERE EXISTS (SELECT 1
 FROM CakeTable C2
 WHERE c2.personid = c.personid
 AND c2.date > DATEADD(day, -2, c.date))

If you think this will run too long, you can always break it down by date range, cake type, etc to do smaller batches.

If you have a composite index on personid, date this should be pretty fast.

You could also do a self-join but I'm pretty confident the EXISTS will run a lot faster:

UPDATE c
SET HadSeconds = 1
FROM Caketable C
INNER JOIN CakeTable C2
 ON c2.personid = c.personid
 AND c2.date > DATEADD(day, -2, c.date)

Share

Improve this answer

edited Apr 12, 2012 at 20:30

answered Apr 12, 2012 at 20:25

JNK's user avatar

JNK JNK

18.1k6 gold badges63 silver badges98 bronze badges

6

Missing a closing bracket round the subquery in that code snippet, (also I think you mean c2.date instead of cd.date but that self-join seems to be the way forward I think... Cheers!

James Love
– James Love

2012年04月12日 20:27:38 +00:00
Commented Apr 12, 2012 at 20:27
@JamesLove fixed the syntax. Let me know if you want a self join example

JNK
– JNK

2012年04月12日 20:28:24 +00:00
Commented Apr 12, 2012 at 20:28
? I thought that was a self join inside the EXISTS... what else did you have in mind otherwise?

James Love
– James Love

2012年04月12日 20:29:26 +00:00
Commented Apr 12, 2012 at 20:29
It's not a JOIN, it's an exists. I'll post an example.

JNK
– JNK

2012年04月12日 20:30:00 +00:00
Commented Apr 12, 2012 at 20:30
@JamesLove The exists should be faster since it doesn't JOIN all the non-matching records, just returns a boolean if there is a match.

JNK
– JNK

2012年04月12日 20:31:43 +00:00
Commented Apr 12, 2012 at 20:31

| Show 1 more comment

Stack Exchange Network

SQL Calculation based on 'similar' rows within table

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

SQL Calculation based on 'similar' rows within table

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions