I'm using SQL Server 2008 R2, building a fact table for SSAS.
I can therefore accept solutions to this via either SQL query to SSIS task.
Let's say I have a fact table that looks a little like this, its a record of when people have eaten cakes, and how many (for illustrative purposes).
Surrogate Key | CakeType | PersonKey | CakesEaten | Date | HadSeconds
1 20 1234 4 20120412 0
2 41 2345 3 20120401 0
3 25 3456 4 20120408 0
4 39 2345 2 20120402 0
5 20 4567 3 20120409 0
6 89 9876 2 20120405 0
7 31 1579 4 20120408 0
8 29 2709 3 20120402 0
9 27 3456 3 20120410 0
10 26 2345 2 20120408 0
CakeType and PersonKey and Date are all foreign keys to dimensions.
CakesEaten and HadSeconds are measures.
Techie type question: What I want to do, is to mark a fact as 'HadSeconds' where a Person has a record in the table within 2 days of the previous fact.
Business type question: Show me the instances of people going to eat cake when they've already had some cake in the previous 2 days.
For example, Person '2345' has 3 records in this fact table, IDs 2, 4, 10.
I want to mark 'HadSeconds' for ID '4' as a '1', because the Date for ID 4 is within 2 days of ID2, therefore the person ate cake in quick succession, and this instance needs to be marked as 'HadSeconds'. ID '10' has a datestamp which is 6 days after ID '4', therefore the 'HadSeconds' column for ID 10 does not need to be marked.
Note that only records were a Person has multiple entries in the fact need to be considered.
The best (lol) thing? This fact table has approx. 60 million records in it.
What's the most efficient way to achieve this? Breaking stuff down into temporary tables? Sorting? Cursors (ew...), Aggregration tasks in SSIS...
1 Answer 1
Just an UPDATE
using EXISTS
will be your best bet:
UPDATE c
SET HadSeconds = 1
FROM Caketable C
WHERE EXISTS (SELECT 1
FROM CakeTable C2
WHERE c2.personid = c.personid
AND c2.date > DATEADD(day, -2, c.date))
If you think this will run too long, you can always break it down by date range, cake type, etc to do smaller batches.
If you have a composite index on personid, date
this should be pretty fast.
You could also do a self-join but I'm pretty confident the EXISTS
will run a lot faster:
UPDATE c
SET HadSeconds = 1
FROM Caketable C
INNER JOIN CakeTable C2
ON c2.personid = c.personid
AND c2.date > DATEADD(day, -2, c.date)
-
Missing a closing bracket round the subquery in that code snippet, (also I think you mean
c2.date
instead ofcd.date
but that self-join seems to be the way forward I think... Cheers!James Love– James Love2012年04月12日 20:27:38 +00:00Commented Apr 12, 2012 at 20:27 -
@JamesLove fixed the syntax. Let me know if you want a self join exampleJNK– JNK2012年04月12日 20:28:24 +00:00Commented Apr 12, 2012 at 20:28
-
? I thought that was a self join inside the EXISTS... what else did you have in mind otherwise?James Love– James Love2012年04月12日 20:29:26 +00:00Commented Apr 12, 2012 at 20:29
-
It's not a
JOIN
, it's an exists. I'll post an example.JNK– JNK2012年04月12日 20:30:00 +00:00Commented Apr 12, 2012 at 20:30 -
@JamesLove The exists should be faster since it doesn't JOIN all the non-matching records, just returns a boolean if there is a match.JNK– JNK2012年04月12日 20:31:43 +00:00Commented Apr 12, 2012 at 20:31