Statistical information is vital for the optimizer
when it builds query plans. But collecting statistics takes time and
resources. By understanding and combining several different statistics
gathering techniques, users of Teradata can find the correct balance between
good query plans and the time required to ensure adequate statistical
information is always available.
This recently-updated compilation of
statistics collection recommendations is intended for sites that are on any of
the Teradata Database 14.10, 15.0, or 15.10 software release levels.
Some of these recommendations apply to releases earlier than Teradata Database
14.10 and some rely on new features available starting in Teradata Database
14.10. Statistics collection functionality in the Teradata
Database works the same in 14.10, 15.0 and 15.10 releases.
For greater detail on collecting statistics
for Teradata Database 14.10, see the orange book Teradata Database 14.10 Statistics Enhancements, by Rama Krishna Korlapati.
Collect Full Statistics
· Non-indexed columns used in predicates
· All NUSIs
· USIs/UPIs if used in non-equality predicates (range constraints)
· Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
· Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
· PARTITION for all partitioned tables undergoing upward growth
· Partitioning columns of a row-partitioned table
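As a sketch of what these collections can look like (the table, column, and index names below are hypothetical), full statistics are gathered with ordinary COLLECT STATISTICS statements:

-- Non-indexed column used in predicates
COLLECT STATISTICS COLUMN (order_status) ON sales_history;

-- A NUSI
COLLECT STATISTICS INDEX (store_id) ON sales_history;

-- PARTITION and the partitioning column of a row-partitioned table
COLLECT STATISTICS COLUMN (PARTITION) ON sales_history;
COLLECT STATISTICS COLUMN (order_date) ON sales_history;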
Can Rely on Dynamic AMP Sampling
· USIs or UPIs if only used with equality predicates
· NUSIs with an even distribution of values
· NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
· See “Other Considerations” for additional points related to dynamic AMP sampling
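One way to judge whether a NUSI or NUPI is evenly distributed enough to rely on dynamic AMP sampling is a simple demographic query; the table and column names here are illustrative only:

-- Rows per value; a roughly flat profile suggests dynamic AMP sampling is adequate
SELECT store_id,
       COUNT(*) AS rows_per_value
FROM   sales_history
GROUP  BY store_id
ORDER  BY rows_per_value DESC;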
Collect Multicolumn Statistics
· Groups of columns that often appear together with equality predicates. These statistics are used for single-table estimates.
· Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more correlated the actual combinations of values are, the greater the value of collecting multicolumn statistics in this situation.
· Specify a name for such statistics, for ease of recollection, viewing, and/or dropping.
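For example, a named multicolumn statistic on a hypothetical pair of correlated columns might look like the following; the name can later be used to recollect, view, or drop just that statistic:

COLLECT STATISTICS
   COLUMN (store_id, region_id) AS stats_store_region
ON sales_history;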
General Suggestions for the Statistics Collection Process
· When multiple statistics on a table are collected for the first time, group all statistics with the same USING options into a single request, as shown in the sketch after this list.
· After first-time collections, collect all statistics being refreshed on a table in one statement, and if all are being refreshed, re-collect at the table level.
· Do not rely on copied or transferred SUMMARY statistics or PARTITION statistics between tables within a system or across systems. Recollect them natively. This ensures that changes to the configuration and other internal details (such as how internal partitions are arranged) are available to the optimizer.
· Recollect table-level SUMMARY statistics after data loading events in order to provide the optimizer with current table row counts and other detail. This operation runs very quickly and supports more effective extrapolation of statistics that could not be recollected after updates.
· Do not drop and then recollect statistics, as history records for the statistic are lost when it is dropped, making it less likely that the optimizer will skip statistics collection or downgrade to sampling.
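The sketch below, using invented object names, illustrates several of these suggestions: group first-time collections into a single request, refresh at the table level, and recollect SUMMARY statistics after a load:

-- First-time collection: one request for several statistics on the same table
COLLECT STATISTICS
   COLUMN (customer_id),
   COLUMN (order_status),
   COLUMN (PARTITION)
ON sales_history;

-- Later, refresh everything already collected on the table in one statement
COLLECT STATISTICS ON sales_history;

-- Quick table-level SUMMARY recollection after a data load
COLLECT SUMMARY STATISTICS ON sales_history;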
New Recommendations for Teradata Database 14.10
· If migrating from a previous release, set the DBS Control internal field NoDot0Backdown to true in order to make use of the new Version 6 statistics histograms, which can carry update, delete, and insert counts.
· Enable DBQL USECOUNT logging for all important databases whose table row counts may change over time. This provides information about the updates, deletes, and inserts performed on each table within the logged databases and contributes to better extrapolations. (See the sketch after this list.)
· Benefit from the default system threshold option, which may allow some submitted statistics to be skipped, by turning on DBQL USECOUNT logging and building up statistics history records. Skipping can potentially reduce the resources required by statistics recollection.
· Expect to perform several full collections before statistics skipping or automatic downgrade to sampling is considered.
· Use the collection statement-level THRESHOLD option only for cases where there is a specific need to override the global threshold default.
· Consider collecting statistics (and providing a name) on SQL expressions if they are frequently used in queries and if they reference columns from a single table.
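The statements below are a hedged sketch of these 14.10 recommendations. The database, table, column, and statistic names are invented, and the exact THRESHOLD and expression syntax should be confirmed against the release documentation:

-- Enable object use count (USECOUNT) logging for an important database
BEGIN QUERY LOGGING WITH USECOUNT ON sales_db;

-- Override the global threshold default only where there is a specific need
COLLECT STATISTICS USING THRESHOLD 10 PERCENT
   COLUMN (customer_id)
ON sales_history;

-- A named statistic on a single-table SQL expression that appears frequently in queries
COLLECT STATISTICS
   COLUMN (EXTRACT(YEAR FROM order_date)) AS stats_order_year
ON sales_history;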
Other Considerations
1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far from being 75% unique (for example, it’s 90% unique, or on the other side, 60% unique or less) benefits from having statistics collected, including a NUPI composed of multiple columns, regardless of the length of the concatenated values. However, if it is close to being 75% unique, dynamic AMP samples are adequate. To determine the uniqueness of a NUPI before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT NUPI-column FROM table;
3. For a partitioned table, it is recommended that you always collect statistics on the following (a combined example appears after Point 4):
· PARTITION. This tells the optimizer how many row partitions are empty, provides a histogram of how many rows are in each row partition, and gives the compression ratio for column partitions. This statistic is used for optimizer costing.
· Any partitioning columns. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
4. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
· (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
· (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
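As a combined sketch of Points 3 and 4, assume a hypothetical table sales_history with primary index (store_id) that is row-partitioned on order_date; the recommended statistics could be collected in one request:

COLLECT STATISTICS
   COLUMN (PARTITION),
   COLUMN (order_date),
   COLUMN (PARTITION, store_id),
   COLUMN (PARTITION, store_id, order_date)
ON sales_history;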
5. Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). Dynamic all-AMP sampling has these particular advantages:
· It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large) and the NUPI has an uneven distribution of values.
· Statistics extrapolation for any column in a table is not attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP dynamic sampling is turned on. Because a dynamic AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate dynamic AMP sample row count is critical for determining whether collected statistics are stale.
6. For temporal tables, follow all collection recommendations made above. Currently, statistics are not supported on BEGIN and END period types. That capability is planned for a future release.
[1] Any column that is over 95% unique is considered a nearly-unique column.
[2] Dynamic AMP sampling is sometimes referred to as random AMP sampling.