Statistical information is vital for the optimizer
when it builds query plans. But collecting statistics takes time and
resources. By understanding and combining several different statistics
gathering techniques, users of Teradata can find the correct balance between
good query plans and the time required to ensure adequate statistical
information is always available.
This recently-updated compilation of
statistics collection recommendations is intended for sites that are on any of
the Teradata Database 14.10, 15.0, or 15.10 software release levels.
Some of these recommendations apply to releases earlier than Teradata Database
14.10 and some rely on new features available starting in Teradata Database
14.10. Statistics collection functionality in the Teradata
Database works the same in 14.10, 15.0 and 15.10 releases.
For greater detail on collecting statistics
for Teradata Database 14.10, see the orange book Teradata Database 14.10 Statistics Enhancements, by Rama Krishna Korlapati.
Collect Full Statistics
· Non-indexed columns used in predicates
· All NUSIs
· USIs/UPIs if used in non-equality predicates (range constraints)
· Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
· Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
· PARTITION for all partitioned tables undergoing upward growth
· Partitioning columns of a row-partitioned table
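As a sketch of what these collections can look like (the table, column, and index names below are hypothetical), full statistics are gathered with ordinary COLLECT STATISTICS statements:

-- Non-indexed column used in predicates
COLLECT STATISTICS COLUMN (order_status) ON sales_history;

-- A NUSI
COLLECT STATISTICS INDEX (store_id) ON sales_history;

-- PARTITION and the partitioning column of a row-partitioned table
COLLECT STATISTICS COLUMN (PARTITION) ON sales_history;
COLLECT STATISTICS COLUMN (order_date) ON sales_history;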
Can Rely on Dynamic AMP Sampling
· USIs or UPIs if only used with equality predicates
· NUSIs with an even distribution of values
· NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
· See “Other Considerations” for additional points related to dynamic AMP sampling
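One way to judge whether a NUSI or NUPI is evenly distributed enough to rely on dynamic AMP sampling is a simple demographic query; the table and column names here are illustrative only:

-- Rows per value; a roughly flat profile suggests dynamic AMP sampling is adequate
SELECT store_id,
       COUNT(*) AS rows_per_value
FROM   sales_history
GROUP  BY store_id
ORDER  BY rows_per_value DESC;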
Collect Multicolumn Statistics
· Groups of columns that often appear together with equality predicates. These statistics are used for single-table estimates.
· Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more correlated the actual combinations of values are, the greater the value of collecting multicolumn statistics in this situation.
· Specify a name for such statistics, for ease of recollection, viewing, and/or dropping.
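For example, a named multicolumn statistic on a hypothetical pair of correlated columns might look like the following; the name can later be used to recollect, view, or drop just that statistic:

COLLECT STATISTICS
   COLUMN (store_id, region_id) AS stats_store_region
ON sales_history;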
General Suggestions for the Statistics Collection Process
· When multiple statistics on a table are collected for the first time, group all statistics with the same USING options into a single request, as shown in the sketch after this list.
· After first-time collections, collect all statistics being refreshed on a table in one statement, and if all are being refreshed, re-collect at the table level.
· Do not rely on copied or transferred SUMMARY statistics or PARTITION statistics between tables within a system or across systems. Recollect them natively. This ensures that changes to the configuration and other internal details (such as how internal partitions are arranged) are available to the optimizer.
· Recollect table-level SUMMARY statistics after data loading events in order to provide the optimizer with current table row counts and other detail. This operation runs very quickly and supports more effective extrapolation of statistics that could not be recollected after updates.
· Do not drop and then recollect statistics, as history records for the statistic are lost when it is dropped, making it less likely that the optimizer will skip statistics collection or downgrade to sampling.
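The sketch below, using invented object names, illustrates several of these suggestions: group first-time collections into a single request, refresh at the table level, and recollect SUMMARY statistics after a load:

-- First-time collection: one request for several statistics on the same table
COLLECT STATISTICS
   COLUMN (customer_id),
   COLUMN (order_status),
   COLUMN (PARTITION)
ON sales_history;

-- Later, refresh everything already collected on the table in one statement
COLLECT STATISTICS ON sales_history;

-- Quick table-level SUMMARY recollection after a data load
COLLECT SUMMARY STATISTICS ON sales_history;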
New Recommendations for Teradata Database 14.10
· If migrating from a previous release, set the DBS Control internal field NoDot0Backdown to true in order to make use of the new Version 6 statistics histograms, which can carry update, delete, and insert counts.
· Enable DBQL USECOUNT logging for all important databases whose table row counts may change over time. This provides information about the updates, deletes, and inserts performed on each table within the logged databases and contributes to better extrapolations. (See the sketch after this list.)
· Benefit from the default system threshold option, which may allow some submitted statistics to be skipped, by turning on DBQL USECOUNT logging and building up statistics history records. Skipping can potentially reduce the resources required by statistics recollection.
· Expect to perform several full collections before statistics skipping or automatic downgrade to sampling is considered.
· Use the collection statement-level THRESHOLD option only for cases where there is a specific need to override the global threshold default.
· Consider collecting statistics (and providing a name) on SQL expressions if they are frequently used in queries and if they reference columns from a single table.
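The statements below are a hedged sketch of these 14.10 recommendations. The database, table, column, and statistic names are invented, and the exact THRESHOLD and expression syntax should be confirmed against the release documentation:

-- Enable object use count (USECOUNT) logging for an important database
BEGIN QUERY LOGGING WITH USECOUNT ON sales_db;

-- Override the global threshold default only where there is a specific need
COLLECT STATISTICS USING THRESHOLD 10 PERCENT
   COLUMN (customer_id)
ON sales_history;

-- A named statistic on a single-table SQL expression that appears frequently in queries
COLLECT STATISTICS
   COLUMN (EXTRACT(YEAR FROM order_date)) AS stats_order_year
ON sales_history;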
Other Considerations
1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far from being 75% unique (for example, it’s 90% unique, or on the other side, 60% unique or less) benefits from having statistics collected, including a NUPI composed of multiple columns, regardless of the length of the concatenated values. However, if it is close to being 75% unique, dynamic AMP samples are adequate. To determine the uniqueness of a NUPI before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT NUPI-column FROM table;
3. For a partitioned table, it is recommended that you always collect statistics on the following (a combined example appears after Point 4):
· PARTITION. This tells the optimizer how many row partitions are empty, provides a histogram of how many rows are in each row partition, and gives the compression ratio for column partitions. This statistic is used for optimizer costing.
· Any partitioning columns. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
4. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
· (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
· (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
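As a combined sketch of Points 3 and 4, assume a hypothetical table sales_history with primary index (store_id) that is row-partitioned on order_date; the recommended statistics could be collected in one request:

COLLECT STATISTICS
   COLUMN (PARTITION),
   COLUMN (order_date),
   COLUMN (PARTITION, store_id),
   COLUMN (PARTITION, store_id, order_date)
ON sales_history;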
5. Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). Dynamic all-AMP sampling has these particular advantages:
· It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large) and the NUPI has an uneven distribution of values.
· Statistics extrapolation for any column in a table is not attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP dynamic sampling is turned on. Because a dynamic AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate dynamic AMP sample row count is critical for determining whether collected statistics are stale.
6. For temporal tables, follow all collection recommendations made above. Currently, statistics are not supported on BEGIN and END period types. That capability is planned for a future release.
[1] Any column that is over 95% unique is considered a nearly-unique column.
[2] Dynamic AMP sampling is sometimes referred to as random AMP sampling.