Aggregator Not Preceded by ‘Check’ Sort
| Rule name | Aggregator not preceded by a ‘Check’ Sort. |
|---|---|
| Parallel Job | Yes |
| Server Job | - |
| Job Sequence | - |
| Description | Identifies Parallel Aggregator Stages not preceded by a ‘Check’ Sort Stage. |
Introduction
The Aggregator stage summarises data rows from a single input link into groups, computing totals or other aggregate functions for each group. To correctly configure an Aggregator stage, you need to ensure two things:
The input link is partitioned on the Aggregator’s specified grouping keys, to ensure records with identical grouping key values are present in the same partition, and
The input link rows are sorted on the Aggregator’s specified grouping keys.
The second criterion can be met in several ways: sort the data in the job using a Sort stage, or read the data from a pre-sorted source (such as a Database connector using an ORDER BY clause). In either case, this optional rule enforces a design pattern in which you test that the Aggregator’s incoming data is sorted appropriately using a ‘Check’ sort.
The existence of this rule does not imply that it should be used in all (or indeed any!) instances. It’s provided as an example should this be a rule your organization wishes to apply.
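As a rough illustration of why the sort criterion matters, the sketch below uses plain Python rather than DataStage. It assumes a streaming group-by that, like a sort-based aggregation, only compares each row with the one before it. The column names `region` and `amount` and the function `aggregate_sorted` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows, key="region", value="amount"):
    """Sum `value` per `key`, assuming rows already arrive sorted on `key`."""
    # groupby only merges ADJACENT rows with equal keys, so unsorted input
    # produces several partial groups for the same key value.
    return [
        (k, sum(r[value] for r in group))
        for k, group in groupby(rows, key=itemgetter(key))
    ]


unsorted_rows = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 5},
    {"region": "EU", "amount": 7},
]
sorted_rows = sorted(unsorted_rows, key=itemgetter("region"))

split_groups = aggregate_sorted(unsorted_rows)    # "EU" is emitted twice
correct_groups = aggregate_sorted(sorted_rows)    # one total per region
```

With unsorted input the same grouping key yields two partial totals, which is the kind of incorrect result that sorted input prevents.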
Description
This rule identifies Parallel Aggregator Stages that are not preceded by a ‘Check’ Sort Stage. A ‘Check’ sort is a Sort Stage in which every sort key has a Sort Key Mode property of 'Do not Sort (Previously Sorted)'.
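The test this rule applies could be sketched roughly as follows. This is not the rule's actual implementation: the `Stage` class, its fields, and the stage names are hypothetical stand-ins for a job design, and the only value taken from the product is the 'Do not Sort (Previously Sorted)' Sort Key Mode.

```python
from dataclasses import dataclass, field
from typing import Optional

CHECK_MODE = "Do not Sort (Previously Sorted)"


@dataclass
class Stage:
    """Hypothetical, simplified model of a stage in a Parallel job."""
    name: str
    stage_type: str                                  # e.g. "Aggregator", "Sort"
    key_modes: dict = field(default_factory=dict)    # sort key -> Sort Key Mode
    upstream: Optional["Stage"] = None               # stage feeding the input link


def is_check_sort(stage):
    """A 'Check' sort: a Sort stage whose keys ALL use the check-only mode."""
    return (
        stage is not None
        and stage.stage_type == "Sort"
        and bool(stage.key_modes)
        and all(mode == CHECK_MODE for mode in stage.key_modes.values())
    )


def violations(stages):
    """Names of Aggregator stages not directly preceded by a 'Check' sort."""
    return [
        s.name
        for s in stages
        if s.stage_type == "Aggregator" and not is_check_sort(s.upstream)
    ]


db = Stage("Orders_DB", "Connector")
check = Stage("Check_Sort", "Sort", {"region": CHECK_MODE}, upstream=db)
# "Sort" is a placeholder for any mode other than the check-only one.
full_sort = Stage("Full_Sort", "Sort", {"region": "Sort"}, upstream=db)
agg_ok = Stage("Agg_OK", "Aggregator", upstream=check)
agg_bad = Stage("Agg_Bad", "Aggregator", upstream=full_sort)
agg_none = Stage("Agg_None", "Aggregator", upstream=db)

flagged = violations([db, check, full_sort, agg_ok, agg_bad, agg_none])
```

In this sketch only `Agg_OK` passes: `Agg_Bad` follows a Sort stage that actually sorts, and `Agg_None` has no Sort stage upstream at all.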
The DataStage Parallel execution framework typically inserts sorts before any stage that requires matched key values or ordered groupings (Join, Merge, Remove Duplicates, and Aggregator). Sorts are only inserted automatically when the Job does not explicitly define an input sort. Although they ensure correct results, inserted sorts can have a significant (and often unnecessary) performance impact. There are two ways to prevent the Parallel framework from inserting an unnecessary sort:
Insert an upstream Sort stage on each link, defining all sort key columns with the “Do not Sort (Previously Sorted)” Sort Key Mode property, or
Set the environment variable APT_SORT_INSERTION_CHECK_ONLY. This verifies sort order but does not perform a sort, aborting the job if data is not in the required sort order.
This rule verifies you have adopted the former approach.
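Conceptually, both approaches replace a sort with a verification pass: rows flow through unchanged, and processing stops if a row arrives out of order. The sketch below shows that idea in plain Python. It is an analogy only, not how the Parallel framework is implemented, and `check_sorted` and the sample rows are hypothetical.

```python
def check_sorted(rows, keys):
    """Yield rows unchanged, raising if they are not in ascending key order."""
    previous = None
    for index, row in enumerate(rows):
        current = tuple(row[k] for k in keys)
        # Compare only with the previous row: no buffering, no re-ordering.
        if previous is not None and current < previous:
            raise ValueError(
                f"row {index} breaks sort order on {keys}: "
                f"{current} after {previous}"
            )
        previous = current
        yield row


in_order = [{"region": "EU"}, {"region": "EU"}, {"region": "US"}]
out_of_order = [{"region": "US"}, {"region": "EU"}]

passed_through = list(check_sorted(in_order, ["region"]))

try:
    list(check_sorted(out_of_order, ["region"]))
    aborted = False
except ValueError:
    aborted = True
```

The check costs one comparison per row and holds no more than the previous key in memory, which is why it is far cheaper than a full sort of data that is already ordered.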
Actions
Ensure each Parallel Aggregator Stage is preceded by a Sort stage with all sort keys specified, and ensure all of those sort keys have a Sort Key Mode property of 'Do not Sort (Previously Sorted)'.