Aug 18, 2011

Control the Size of UPDATE Transactions

In several of my production databases I have to deal with tables that have hundreds of millions of records.  Querying these tables are challenging; doing updates to them can be nightmarish.
I recently had to add a column to a table that had 65 million records.  The table is an aggregate table so the new column needed to be loaded from a table that had 180 million records.
Doing an unmanaged UPDATE would only serve to fill the transaction log and waste a bunch of time and system resources.
After doing a little tinkering I was able to break the transactions into pieces by using a view.  First let’s layout the setup for this.
CREATE TABLE tmp_top_view ( id           int   identity (1, 1),
                            ContactID    int,
                            test_bit     bit,
                            loopcount    int );

INSERT INTO tmp_top_view ( ContactID )
    SELECT    ContactID
    FROM    Person.Contact;

CREATE VIEW v_tmp_top_view AS
    SELECT    TOP 500 id
    FROM    tmp_top_view  WITH (NOLOCK)
    WHERE    test_bit is null;

The tmp_top_view table would be the production table we want to update.  In a production situation it would have tens of millions of records.  Certainly more than you would want to have in a singe UPDATE transaction.
The test_bit  column is the target column we want to update from some other table, or just set to a value.
The loop_count column is just something we’ll use to show how many records get updated per transaction.
The INSERT statement is just populating the table with some dummy data from the AdventureWorks Person.Contact table.  In this case we are inserting 19,972 records into our table.
So far nothing interesting…
The CREATE VIEW is a key part of the solution.  You’ll notice the TOP 500 in the SELECT statement of the view.  This is literally setting the max size of the UPDATE transactions.
The next key piece of the puzzle is the WHERE clause that filters on NULL test_bit.  This column will be updated in our query, below.  When the column becomes non-NULL it will drop out of the view.
Also, very important, notice the use of the NOLOCK hint.  Without this hint the batch will run much slower.
OK, we’re all setup.  Now let’s get some work done:
DECLARE @rowcount    int,
        @loopcount    int;

SELECT  @rowcount = 1,
        @loopcount = 1;

WHILE @rowcount > 0
BEGIN
    BEGIN TRAN

    UPDATE  t
    SET     test_bit = 1,
            loopcount = @loopcount
    FROM        tmp_top_view t
            INNER JOIN v_tmp_top_view v
                ON v.id = t.id;

    SELECT  @rowcount = @@rowcount,
            @loopcount += 1;

    COMMIT;
END

Observer the test_bit column getting set to one.  The value that we are setting it to does not matter.  What’s important is that it’s no longer NULL.  Once it stops being NULL, it stops being a member of the view.
Fortunately, in practice, the newly updated record doesn’t drop out of the view immediately.  If it did the view would keep adding records to itself and the entire 19,972 record dataset would be updated in one transaction, which is exactly what we are trying to avoid.
In this sample set the data gets updated with 40 transactions; the first 39 are 500 record transactions; the last transaction updates 472 records.
To see how the records were updated use the following query:
SELECT  loopcount,
        COUNT(*) AS [Count of loopcount]
FROM        tmp_top_view
GROUP BY loopcount
ORDER BY loopcount;


Finding ways to neatly break up large transactions is an art as well as a science.  What I didn’t discuss here is the 2 hours I spent trying to figure out the optimal transaction size to get all 64M records to update in the fastest time.
A key piece of the performance ingredient is to use the NOLOCK hint in the view definition.
I attempted to populate an indexed temp table to capture a subset of the main table (instead of the TOP 500 view) but found the time to write to the temp table and manage the index was too high.
I took a swag at using an indexed view but that doesn’t fly because of the use of the TOP in the view definition.
What solutions have you devised to break up large UPDATE transactions?  Have you found anything unique about DELETE vs UPDATE vs INSERT transactions?  Send me a pointer to your solutions!

No comments:

Post a Comment

Hi,

Thanks for your visit to this blog.
We would be happy with your Queries/Suggestions.