Aug 18, 2011

RowCount

I had a simple enough task at hand.  An existing EXTRACT that loads a result set to a DATA file needed to have a corresponding COUNT flat file that contained the number of rows in the extract’s data file.

# Step 1
How Not To Do It #1: Use the SSIS ROW COUNT Transformation
SSIS has a built-in, ready to go, out of the box transformation called ROW COUNT.  Basically what it does is set the value of a variable to the number of records passing through it.  This was my first approach and it seemed like such a simple task to write the value out to a flat file, until I got to the point of writing the file.
So here’s how I did it:
1.  Create an OLE DB query to pull some records out of a database.  In this case I’ll use AdventureWorks.  I’ll grab about 2 percent of the 19,000+ records in the Sales.Customer table using TABLESAMPLE.
image
2.  Create an INTEGER variable called RowCount.
3.  Add a ROWCOUNT transform after the OLE DB. Configure the ROWCOUNT transform to store the results to a variable.
image
3.  Add a DERIVED TABLE transformation and attach it after the ROWCOUNT transformation.
4.  Configure the DERIVED TABLE transform to use the RowCount variable by dragging the USER:RowCount variable down to the “Derived Column Name”.
image
5.  Finally, add the flat file destination and configure the flat file destination to wherever you want the file to be created.  Configure the mappings to include only the RowCount; exclude the other columns:
image
6.  Run the package.  The final product will look like this:
image
7.  Check the output file.  Oh know!!!!  There are 188 zeros in it!
So what went wrong?
1.  We still have 188 rows.  To get this to display just one row with the value of the RowCount variable we could add the Row Sampling transform, which is equivalent to the T-SQL TOP command.  Set the row sample size to 1.
2.  But we still have a problem.  The row count is still zero!  The problem is that the variable that gets set by the ROWCOUNT transformation does not get set until the last row passes through the data flow.  In other words, the user variable is not usable until the data flow exits.  Please reference the MSDN page for the ROW COUNT transform for details.
One of those 2 issues alone might seem like a show stopper.  But not quite and the learning process is working in all its glory.
In step 2, we will modify this package to give us the row count to the flat file that we are looking for.
# Step 2
We thought the process was straight forward, but it didn’t work.  Our approach was:
1.  Create a query that processed some records.
2.  Send the result set of the query to the ROW COUNT transform, to count the rows.
3.  Since the ROW COUNT transform populates a variable, use a DERIVED COLUMN transform to place the variable into a virtual table column.
4.  Write to a flat file destination the value of the ROW COUNT variable.
The result set was a zero for every row that the original query generated.  The reason for the zeros is because the ROW COUNT transform does not populate the SSIS variable with the row count until the DATA FLOW completes.
The second problem with the output was that we didn’t want 188 copies of the row count, we only wanted one.  That means we need to find a way to process 188 records but write out the row count only once.
Today’s post will build on what we learned yesterday to create a package that will actually work.  The trick of making the ROW COUNT transformation usable is to access the variable OUTSIDE of the DATA FLOW that populates it.
Here’s what the control flow will look like for our working package.
image
The general principal behind this is the first data flow will use the ROW COUNT transformation to populate a RowCount variable.  The second data flow will use the local variable set by the ROW COUNT transform to write the value, one time, to a flat file.
In a production package the first data flow would probably write out the data of an extract file; the second data flow would be used to write out a control file, containing the number of rows expected in the data file and any other related information, such as the extraction date.
Remember, the ROW COUNT transform does not actually set the value of the variable until all the data has passed through.  Hence, the need for the second data flow.
Now let’s take a look under the hood of each of the data flows.
DATA FLOW 1:  Here’s what the first data flow looks like:
image
We re-used yesterday’s query, which is a query that selects about 2 percent of the rows from the AdventureWorks Sales.Customer table.  Then we pump the result set into a ROW COUNT transformation that sets the user variable RowCount variable.
The configuration details of these two transformations were posted in yesterday’s Part I post.
One additional lesson that one might get from this… The ROW COUNT transform can be used as a terminator for your data flow without actually doing anything.  It’s simple to setup and is an option when testing other data source functionality.
DATA FLOW 2:  The second data flow is only slightly more interesting so I’ll go into slightly more detail on it’s configuration.
image
The DERIVED COLUMN and FLAT FILE DESTINATION transforms are literally cut and paste from yesterday’s single data flow approach that didn’t work.  Nothing needed to be changed.
The concept behind this data flow is to map the RowCount user variable that was set AFTER the first data flow exited to a data column and write the value of that column to the output file.
At first I attempted to build this data flow without the OLE DB data source.  Although the package ran, no data was written to the output file.  The reason is because the DERIVED COLUMN transform needs some sort of data driver, even though we only want to utilize the RowCount variable.
In order to get the RowCount variable to be written out I prefixed the DERIVED COLUMN with an OLE DB transform.  Here’s what’s in the OLE DB transform:
image
It doesn’t get any simpler than this.  This simple query does two things:
1.  Generates an input data stream that will allow the DERIVED COLUMN to fire.
2.  Generates a single row so the ROW COUNT will only be written out once.
Just for completeness, here’s the configuration of the DERIVED COLUMN transform:
image
Again, it doesn’t get any simpler than that.
CONCLUSION  This shows how to use the ROW COUNT transformation, utilizing 2 distinct data flows, to write the row count variable out to a flat file.  Although the focus has been on the RowCount variable, the use of the DERIVED COLUMN transform can be used to write ANY variable out to ANY destination.

# Step 3
This is the third and final part to this thread on how to capture the number of rows being processed in SSIS to a flat file.
In Part I we learned how not to do it and why it didn’t work.
In Part II we showed how to use the ROW COUNT transformation to populate an SSIS variable, use the DERIVED COLUMN transformation to convert the value of the variable to a column of data, and write the row count to a flat file.  We learned that we needed to use two data flows to do this.  The biggest benefit of this approach is that we can use the same technique of using the DERIVED COLUMN transform to write any SSIS variable out to a flat file, such as an audit file.
In this post we will show a simpler way to capture the row count of a transaction.  The downside is that it will not work for any variable in the package, as in the case of the DERIVED COLUMN.
So let’s get to work…
1.  We’ll only need a single data flow for this solution.  And here’s what it looks like:
image
What’s new here is the use of the AGGREGATE transformation circled in red.  Basically all this transform is doing is the classic T-SQL COUNT(*).
2.  To configure the AGGREGATE TRANSFORM just select the (*) input column.
image
3.  The last step is to map the output of the AGGREGATE to the input of theFLAT FILE DESTINATION transform, or any other target transform you need.
The following execution of the plan shows how the AGGREGATE transform takes the 940 rows and outputs the singe row count row that we were looking for.
image
CONCLUSION:  SSIS provides a powerful set of options when doing just about any task.  That certainly holds true even when doing something as simple as capturing the number of rows in a data flow.  To capture the row count all you really need to do is use the AGGREGATE transform.  If you’re more interested in being able to write any SSIS variable out than use the DERIVED COLUMN transform.

No comments:

Post a Comment

Hi,

Thanks for your visit to this blog.
We would be happy with your Queries/Suggestions.