I have a wide table (80 columns) with 70 million rows in it. I am currently trying to populate a foreign key column across all rows in the table.
Approach one: an update statement over the entire table, such as:
BEGIN TRAN
UPDATE Cx
SET Cx.CohortId = Chx.CohortId
FROM Customers Cx
INNER JOIN Cohorts Chx ON Chx.NameField = Cx.NameField
COMMIT TRAN
I was informed this has the potential to fill up the transaction log (please note the recovery model is currently set to SIMPLE) and that it would be better to batch the updates:
Approach two: update in batches
DECLARE @Rows INT,
        @BatchSize INT;

SET @BatchSize = 20000;
SET @Rows = @BatchSize;

WHILE (@Rows = @BatchSize)
BEGIN
    UPDATE TOP (@BatchSize) Cx
    SET Cx.CohortId = Chx.CohortId
    FROM Customers Cx
    INNER JOIN Cohorts Chx ON Chx.NameField = Cx.NameField
    WHERE Cx.CohortId IS NULL; -- do not update records that have already been uplifted to include the FK

    SET @Rows = @@ROWCOUNT;
END
Does the second approach yield any benefit in terms of the transaction log on large table updates?
There are temporary indexes in place to speed up my joins, and there is an index on the column I am trying to populate. The execution plan tends to do non-clustered index scans rather than looking at the clustered index, which is fine for my needs on an integer column. – Andrew Secondary, Jul 24, 2018 at 12:30
This directly answers your question about the transaction log: Break large delete operations into chunks. – Erik Reasonable Rates Darling, Jul 24, 2018 at 13:26
1 Answer
I was informed this has the potential to fill up the transaction log . . . and it would be better to batch the updates
It would be better if it didn't fill up the transaction log. But the proposed remedy is the wrong one.
You should size your transaction log to support your transactions, not the other way around.
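If you are not sure how big the log currently is or how full it gets, DBCC SQLPERF(LOGSPACE) reports the size and percentage used for every database, and you can pre-grow the log in one step rather than relying on repeated autogrowths. The database and logical file names below are placeholders; substitute your own:

-- report log size and percent used for every database on the instance
DBCC SQLPERF(LOGSPACE);

-- pre-grow the log file once, up front, instead of paying for repeated autogrowths
-- (MyDb and MyDb_log are placeholder names)
ALTER DATABASE [MyDb]
MODIFY FILE (NAME = N'MyDb_log', SIZE = 80GB);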
Does the second approach yield any benefit in terms of the transaction log on large table updates?
Yes. The transaction log can be reused between batches, so long as the batches aren't all wrapped in one outer transaction. Under the SIMPLE recovery model, the log space used by each committed batch becomes reusable at the next checkpoint, so the log only needs to be big enough to hold the largest single batch.
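As a minimal sketch of that idea, using the same loop as in the question, you can issue an explicit CHECKPOINT after each batch so the log records of the just-committed batch become reusable immediately rather than waiting for an automatic checkpoint:

DECLARE @Rows INT = 20000,
        @BatchSize INT = 20000;

WHILE (@Rows = @BatchSize)
BEGIN
    UPDATE TOP (@BatchSize) Cx
    SET Cx.CohortId = Chx.CohortId
    FROM Customers Cx
    INNER JOIN Cohorts Chx ON Chx.NameField = Cx.NameField
    WHERE Cx.CohortId IS NULL;

    SET @Rows = @@ROWCOUNT;

    CHECKPOINT; -- under SIMPLE recovery, lets the log space from the committed batch be reused
END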
increasing the disk space to support that would be troublesome
Yes, sometimes you do have to code around space issues, but coding isn't free. Spending money on infrastructure to reduce coding time is usually a wise move.
Also, if you are trying to minimize log usage, consider loading an empty table with your query and then performing an ALTER TABLE ... SWITCH. Loading the new table can be minimally logged, and the ALTER TABLE ... SWITCH is a metadata-only operation.
This will also enable you to perform the operation in a transaction, and not commit intermediate results.
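A sketch of that pattern is below, under a few stated assumptions: the column names are placeholders (the real table has 80 columns), dbo.Customers_New has a schema identical to dbo.Customers (including indexes, constraints, and filegroup), and the database is in the SIMPLE recovery model so the TABLOCK insert qualifies for minimal logging:

-- build the replacement table with the foreign key already populated;
-- the LEFT JOIN leaves CohortId NULL where no cohort matches, mirroring the UPDATE
INSERT INTO dbo.Customers_New WITH (TABLOCK) -- TABLOCK enables minimal logging
        (CustomerId, NameField, CohortId)
SELECT  Cx.CustomerId, Cx.NameField, Chx.CohortId
FROM    dbo.Customers Cx
LEFT JOIN dbo.Cohorts Chx ON Chx.NameField = Cx.NameField;

-- swap the tables inside one transaction; both steps are metadata operations
BEGIN TRAN;
    TRUNCATE TABLE dbo.Customers; -- SWITCH requires the target table to be empty
    ALTER TABLE dbo.Customers_New SWITCH TO dbo.Customers;
COMMIT TRAN;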
Your point about sizing the transaction log appropriately is very true, I feel. The issue is that the table is 25 GB and I will be doing three updates across the entire table. Wouldn't that result in a 75 GB transaction log? Unfortunately, increasing the disk space to support that would be troublesome, to say the least. – Andrew Secondary, Jul 24, 2018 at 13:21
I'll take a look at the options. This is a one-time operation which needs to be completed within the span of hours as it stands. I'll review the options with the relevant people tomorrow and see what would work, but your point about space being cheaper to add than dev time is probably apt. – Andrew Secondary, Jul 24, 2018 at 13:37
@AndrewSecondary Also, have you considered using a clustered columnstore instead of a rowstore table? – David Browne - Microsoft, Jul 24, 2018 at 13:40
@DavidBrowne-Microsoft Yes, we did in fact use a clustered columnstore for a short while; however, the system is heavily transactional and it caused some deadlocking issues. The system was performing incredibly quickly with the clustered columnstore, though. – Andrew Secondary, Jul 24, 2018 at 14:13