
Wednesday, April 28, 2010

Product Join and Merge Join and Hash Join

A product join compares every row of the left table with every row of the right table. It is typically chosen as the join method when one of the tables is small. Since it reads every row of the left table once and every row of the right table n times, where n is the number of rows in the left table, it can be expensive to execute when the left table is large.
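The nested-loop idea behind a product join can be sketched in a few lines of Python (the table contents and column positions here are made up for illustration):

```python
# Toy product (nested-loop) join: every row of the left table is
# compared against every row of the right table.
def product_join(left, right, condition):
    result = []
    for l in left:           # the left table is read once
        for r in right:      # the right table is re-read for every left row
            if condition(l, r):
                result.append((l, r))
    return result

orders = [("o1", "c1"), ("o2", "c2"), ("o3", "c1")]
customers = [("c1", "Alice"), ("c2", "Bob")]

# Join on customer id (second field of orders, first field of customers).
rows = product_join(orders, customers, lambda l, r: l[1] == r[0])
```

The inner loop runs len(left) × len(right) times, which is why the method is only attractive when one input is small.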

The typical alternative is a merge join. It requires an extra step, however: the two source tables must be sorted on the join key. If the tables are already in sorted order, this step is unnecessary. A merge join executes very fast, since it reads each table only once. In addition, depending on the join selectivities, it is possible that not all the data blocks of both tables need to be read.
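A minimal Python sketch of the merge-join idea (sample data and key positions are invented; a real implementation works on data blocks, not Python lists):

```python
# Toy merge join: both inputs are sorted on the join key, then each is
# read once, advancing whichever side has the smaller key.
def merge_join(left, right, key_l, key_r):
    left = sorted(left, key=key_l)    # sort step, skipped if already sorted
    right = sorted(right, key=key_r)
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key_l(left[i]), key_r(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Emit every right row matching this key for the current left row.
            j2 = j
            while j2 < len(right) and key_r(right[j2]) == kl:
                result.append((left[i], right[j2]))
                j2 += 1
            i += 1
    return result

orders = [("o1", "c1"), ("o2", "c2"), ("o3", "c1")]
customers = [("c1", "Alice"), ("c2", "Bob")]
rows = merge_join(orders, customers, lambda l: l[1], lambda r: r[0])
```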

A hash join does not require that either table be sorted. The smaller table (or spool) is "hashed" into memory, sometimes using multiple hash partitions. The larger table is then scanned, and for each of its rows the matching row is looked up in the in-memory hash table built from the smaller table. If the smaller table must be broken into partitions to fit into memory, the larger table must be broken into the same partitions prior to the join.

So the main advantages of the hash join are that the big table does not have to be sorted, and that the small table can be much larger than it could be for a product join.
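The build/probe structure described above can be sketched like this in Python (single in-memory partition only; data is made up):

```python
# Toy hash join: build a hash table on the smaller input, then probe it
# once per row of the larger input. Neither input needs to be sorted.
def hash_join(small, large, key_s, key_l):
    # Build phase: hash the smaller table into memory.
    table = {}
    for s in small:
        table.setdefault(key_s(s), []).append(s)
    # Probe phase: scan the larger table once, looking up each row's key.
    result = []
    for l in large:
        for s in table.get(key_l(l), []):
            result.append((s, l))
    return result

customers = [("c1", "Alice"), ("c2", "Bob")]        # smaller input
orders = [("o1", "c1"), ("o2", "c2"), ("o3", "c1")] # larger input
rows = hash_join(customers, orders, lambda s: s[0], lambda l: l[1])
```

When the build side does not fit into memory, both inputs would be hash-partitioned first and each partition pair joined this way.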


NOPI feature in Teradata 13.0

The purpose of the Teradata 13.0 feature that allows us to create a table with no primary index (NoPI) is to improve FastLoad performance. Because a NoPI table has no primary index, its rows are not hashed to an AMP based on a primary-index value. Instead, Teradata Database either hashes on the Query ID for a row, or it uses a different algorithm to assign the row to its home AMP once it reaches the AMP onto which it has been FastLoaded. Teradata Database then generates a RowID for each row in a NoPI table by randomly selecting an arbitrary hash bucket that the AMP owns and using it to generate the RowID.

Using a NoPI table as a staging table avoids the row redistribution and sorting required for primary-indexed staging tables. Another advantage of NoPI tables is that you can quickly load data into them and be finished with the acquisition phase of the utility operation, which frees client resources for other applications.

LIMITATIONS :
• You cannot create a NoPI queue table.
• You cannot create a NoPI error table.
• You cannot create a NoPI table as a SET table. The default table type for NoPI tables in both Teradata and ANSI session modes is always MULTISET.
• If neither PRIMARY INDEX (column_list) nor NO PRIMARY INDEX is specified explicitly in a CREATE TABLE request, whether the table is created with or without a primary index generally depends on the setting of the DBS Control flag DisableNoPI.
• NoPI tables cannot specify a permanent journal.
• NoPI tables cannot specify an identity column.
• Hash indexes cannot be defined on NoPI tables, because hash indexes inherit the primary index of their underlying base table, and NoPI tables have no primary index.
• SQL UPDATE triggers cannot update a NoPI table.
• SQL UPDATE and UPDATE (Upsert Form) requests cannot update a NoPI target table. They can, however, update a primary-indexed target table from a NoPI source table.
• SQL MERGE requests cannot update or insert into a NoPI target table. They can update or insert into a primary-indexed target table from a NoPI source table.
• You cannot load rows into a NoPI table using the MultiLoad utility.

Saturday, April 24, 2010

difference between TRUNCATE and DELETE

TRUNCATE :
is a DDL command.
being DDL, it auto-commits.
can only truncate the whole table (a WHERE clause cannot be used).
once a table is truncated, the changes cannot be rolled back.
when a table is truncated, the space it occupied is released.


DELETE :
is a DML command and can be rolled back.

is slower than TRUNCATE, since as DML it has to go through rollback segments etc.
a WHERE clause can be used with DELETE.
when rows are deleted, the space they occupied is not released.

Monday, April 19, 2010

difference between OLAP and aggregate functions.

OLAP functions are similar to aggregate functions in that they:
Operate on groups of rows (like the GROUP BY clause)
Can filter groups using QUALIFY (like the HAVING clause)

OLAP functions are unlike aggregate functions because they:
Return a data value for each qualifying row, not for each group.
May not be performed within subqueries.
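The row-versus-group distinction can be shown with a small Python sketch (column names and data are invented; the dict/list logic stands in for GROUP BY and an OVER (PARTITION BY ...) window):

```python
# Aggregate: one value per group. OLAP/window: one value per row.
sales = [("east", 10), ("east", 20), ("west", 5)]

# Aggregate style (like SUM(amt) ... GROUP BY region):
# the detail rows collapse into one row per group.
totals = {}
for region, amt in sales:
    totals[region] = totals.get(region, 0) + amt
# totals == {"east": 30, "west": 5}

# OLAP style (like SUM(amt) OVER (PARTITION BY region)):
# every input row survives, each carrying its group total
# alongside its own detail value.
windowed = [(region, amt, totals[region]) for region, amt in sales]
# windowed == [("east", 10, 30), ("east", 20, 30), ("west", 5, 5)]
```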

Thursday, April 15, 2010

FastExport sessions and Bteq Export

advantages of FastExport over BTEQ Export :
FastExport has
full restart capability,
uses multiple sessions,
uses block I/O.


By default, FastExport uses one session for each AMP.
The maximum session specification must be greater than zero and no more
than the total number of AMPs on the system.

Wednesday, April 14, 2010

All AMP Operations vs Full Table Scan

All-AMP Operation (NUSI) :
1. The SQL is submitted, specifying a NUSI. The hashing algorithm calculates a row hash value for
the NUSI value.
2. All AMPs are activated to find the hash value of the NUSI in their index subtables. The AMPs
whose subtables contain that value become the participating AMPs in this request; the other
AMPs discard the message.
3. Each participating AMP locates the row IDs (row hash value plus uniqueness value) of the base
rows corresponding to the hash value.
4. The participating AMPs access the base table rows, which are located on the same AMP as the
NUSI subtable.
5. The qualifying rows are sent over the BYNET to the PE, and the PE sends the answer set on to
the client application.
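The five steps above can be modeled with a toy sketch in Python. Everything here is invented for illustration: Python's `hash()` stands in for Teradata's row-hash algorithm, and each "AMP" is just a dict holding a NUSI subtable and its local base-table rows.

```python
# Toy model of an all-AMP NUSI access: every AMP probes its index
# subtable; only AMPs whose subtable contains the hash participate,
# and each participating AMP reads its own local base rows.
def nusi_lookup(amps, index_value):
    hash_val = hash(index_value)          # stand-in for the row hash
    qualifying = []
    for amp in amps:                      # all AMPs are activated
        row_ids = amp["subtable"].get(hash_val)
        if row_ids is None:
            continue                      # non-participating AMP drops out
        for rid in row_ids:               # base rows live on the same AMP
            qualifying.append(amp["base"][rid])
    return qualifying                     # answer set sent back to the PE

amps = [
    {"subtable": {hash("NY"): [1]}, "base": {1: ("Alice", "NY")}},
    {"subtable": {hash("NY"): [7]}, "base": {7: ("Bob", "NY")}},
    {"subtable": {}, "base": {3: ("Carol", "LA")}},
]
rows = nusi_lookup(amps, "NY")
```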


Full-Table Scan (accessing data without indexes) :
If the request does not use a defined index, the Teradata Database does a full-table scan. A full-table scan is another way to access data without using Primary or Secondary Indexes. In evaluating an SQL request, the Optimizer examines all possible access methods and chooses the one it believes to be the most efficient.

A request could turn into a full-table scan when:
The SQL request searches on a NUSI column with many duplicates.
An SQL request uses a non-equality WHERE clause on an index column.
An SQL request uses a range WHERE clause on an index column.

For an index to be used for access, the request must specify a value for each column in the index; otherwise a full-table scan may result.


Data File Formats in BTEQ and load utilities

BTEQ :

While creating the data file (exporting from a table), the following modes are accepted:
DATA : Record Mode (raw binary data).
INDICDATA : contains indicator bytes, 1 bit per column; a bit is
set to 1 if the value for that particular column is NULL, else it is set to 0.

REPORT : Field Mode (with column headings). Data is exported one column at a time. This is the default format.
DIF : Data Interchange Format, used to transport data to various PC programs.
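As a sketch of how the INDICDATA indicator bytes could be built, here is a small Python function. The bit ordering (first column in the most significant bit, rounded up to whole bytes) is an assumption for illustration:

```python
# Sketch of INDICDATA-style indicator bytes: one bit per column,
# first column in the most significant bit of the first byte;
# a bit is 1 when that column's value is NULL.
def indicator_bytes(row):
    n_bytes = (len(row) + 7) // 8      # round up to whole bytes
    bits = bytearray(n_bytes)
    for i, value in enumerate(row):
        if value is None:              # NULL column -> set its bit
            bits[i // 8] |= 0x80 >> (i % 8)
    return bytes(bits)

# Three columns with the second NULL -> one indicator byte, bit 2 set.
assert indicator_bytes(["a", None, 5]) == b"\x40"
```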


While importing into a target table from a data file, the following modes are accepted:
DATA :
INDICDATA :
VARTEXT : each column is separated by a delimiter.


FastLoad, MultiLoad, FastExport, TPump, TPT
There are five data formats available for the load and unload utilities.
Formatted :
Each record is in a format traditionally known as FastLoad or Teradata format. The data is prefixed with the data length and followed by an end-of-record (EOR) marker. This is the most stringent and robust data format and allows any TPT-supported data type to be included.
Let’s consider a record containing 6 bytes of inferred text data "abcdef":

x'00 06 61 62 63 64 65 66 0a' and in character: "..abcdef."

Text :
There is no length indicator as with formatted data (above). Records are separated only by an end-of-record (EOR) marker. To ensure data integrity, each record must be entirely character data.
The EOR marker can be either a single-byte line feed (X'0A') or a double-byte carriage-return/line-feed pair (X'0D0A'). Whether a single- or double-byte EOR is used is determined by the EOR marker encountered for the first record. If the first record has a single-byte EOR and contains the text "abcd", it would look like this:
x'61 62 63 64 0a' and in character: "abcd."

Every subsequent record is expected to use the same single byte EOR.
If the first record has a double-byte EOR and contains the text "abcd", it would look like this:
x'61 62 63 64 0d 0a' and in character: "abcd.."

Again, every subsequent record is expected to use the same double-byte EOR.
Single- and double-byte EORs cannot be mixed in the same data.

Binary :
This format is the same as formatted (see above) except that there is no trailing EOR byte. Each record contains a 2-byte integer data length, n, followed by n bytes of data. As with formatted data, all TPT data types are supported.
E.g., a record containing 4 bytes of text "abcd" would be:
x'00 04 61 62 63 64' and in character: "..abcd"

Delimited / variable-length Text :
This format is an extension of the text format. It is similar in that it must contain all character data and be terminated by an EOR marker. But it has a different restriction: all columns in the TPT schema must be VARCHAR** (as opposed to CHAR/ANSIDATE for simple text). The advantage of this format is that it supports an additional feature: individual fields (which map to DBS table row columns) can be defined, separated by a special delimiter character. To successfully parse delimited data, the delimiter character cannot appear within the data. A delimiter embedded within a column's data will result in a data column count error.
UnFormatted :
The data does not conform to any external format. There is neither a prepended length, as with binary and formatted data, nor an EOR marker, as with text and delimited data.
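The "Formatted" layout described above can be sketched as a pair of Python helpers. The big-endian byte order of the length prefix is an assumption here, chosen only to match the hex example in the text (x'00 06' for a 6-byte payload):

```python
import struct

EOR = b"\x0a"  # single-byte end-of-record marker from the examples above

def write_formatted(payload: bytes) -> bytes:
    """2-byte length prefix, then the data, then the EOR marker."""
    return struct.pack(">H", len(payload)) + payload + EOR

def read_formatted(record: bytes) -> bytes:
    """Parse one formatted record back into its payload."""
    (length,) = struct.unpack(">H", record[:2])
    assert record[2 + length:] == EOR, "missing end-of-record marker"
    return record[2:2 + length]

# Round-trips the 6-byte "abcdef" example from the text.
assert write_formatted(b"abcdef") == b"\x00\x06abcdef\x0a"
assert read_formatted(b"\x00\x06abcdef\x0a") == b"abcdef"
```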



limitations of fastload and multiload

FastLoad Limitations :

Loads only one table.

The table must be empty.

No secondary indexes.

No enabled triggers.

No referential integrity.

Duplicate rows cannot be loaded.


MultiLoad Limitations :

Loads up to five tables.

Tables may be empty or populated.

No unique secondary indexes. (NUSIs are permitted; USI subtable rows are stored on a different AMP than the base row.)

No enabled triggers.

No referential integrity.

No join indexes.


Tuesday, April 13, 2010

Difference between FastLoad and MultiLoad in terms of performance

The basic concept is:
FastLoad is used to load an empty table, no secondary indexes are allowed on the target table, and it can load only one table.
MultiLoad can load an already populated table and can load up to five tables. NUSIs are allowed on the target tables.

In MultiLoad, because the target table can have NUSIs, it has to load both the NUSI subtables and the base table, whereas in FastLoad there are no subtables to maintain, since no secondary indexes are allowed.

One more difference is that MultiLoad uses work tables.
The purpose of the work tables is to hold the DML requests and the input data that is ready to APPLY to the AMPs.

So MultiLoad has to load the work table first, and then the target table.

Loading the work table is much like fastloading it, so the job consumes roughly twice the resources, and hence performance is lower compared to FastLoad.

different phases of fastload in brief

Two Phases :

Acquisition Phase :

The PE sends blocks to each AMP in round-robin fashion.

Each AMP hashes the rows and redistributes them to the appropriate AMPs.

Records are written to the AMP's spool file; the records are unsorted.

Application Phase :

This phase starts once END LOADING is encountered.

Each AMP sorts its spool file and creates data blocks.

The data blocks are written to disk.

Fallback rows are generated if required.
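The two phases above can be condensed into a toy Python model. Python's `hash()` stands in for Teradata's primary-index row hash, the AMP count is arbitrary, and the round-robin delivery of blocks from the PE is only noted in a comment:

```python
# Toy model of FastLoad's two phases on a handful of rows.
# Each row is (primary_index_value, payload).
def fastload(rows, n_amps):
    spool = [[] for _ in range(n_amps)]
    # Acquisition phase: blocks arrive at the AMPs round-robin from the
    # PE; each receiving AMP hashes every row on its primary-index value
    # and redistributes it to the AMP that owns that hash. Rows land in
    # the target AMP's spool file unsorted.
    for row in rows:
        target_amp = hash(row[0]) % n_amps
        spool[target_amp].append(row)
    # Application phase (after END LOADING): each AMP sorts its spool
    # file and writes the sorted rows out as data blocks.
    return [sorted(amp_spool) for amp_spool in spool]

rows = [(3, "c"), (1, "a"), (2, "b"), (5, "e")]
blocks = fastload(rows, 2)
```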

MultiLoad Modes: IMPORT and DELETE

MultiLoad provides two types of operations via modes:
IMPORT and DELETE.
In MultiLoad IMPORT mode,
you have the freedom to "mix and match" up to twenty (20) INSERTs, UPDATEs or DELETEs on up to five target tables.
The execution of the DML statements is not mandatory for all rows in a table.
Instead, their execution hinges upon the conditions contained in the APPLY clause of the script.
Once again, MultiLoad demonstrates its user-friendly flexibility.
For UPDATEs or DELETEs to be successful in IMPORT mode, they must reference the Primary Index in the WHERE clause.

The MultiLoad DELETE mode is used to perform a global (all-AMP) delete on just one table.
The reason to use .BEGIN DELETE MLOAD is that it bypasses the Transient Journal (TJ) and can be
RESTARTed if an error causes it to terminate prior to finishing.
When performing in DELETE mode, the DELETE SQL statement cannot reference the Primary Index
in the WHERE clause. This is due to the fact that a primary-index access goes to a specific AMP,
whereas this is a global operation.
The other factor that makes a DELETE mode operation so good is that it examines an entire block
of rows at a time. Once all the eligible rows have been removed, the block is written one time and a
checkpoint is written. So, if a restart is necessary, it simply resumes deleting rows from the first
block that has not yet been checkpointed. This is a smart way to continue. Remember, when using
the TJ, all deleted rows are put back into the table from the TJ as a rollback. A rollback can take
longer to finish than the delete. MultiLoad does not do a rollback; it does a restart.

Phases Of MultiLoad

Preliminary Phase :
-- Checks that the SQL and MultiLoad commands are valid.
-- Establishes MultiLoad sessions with Teradata.
-- Creates the error tables, work tables, and log table.
Error Tables : MultiLoad requires two error tables per target table.
The first contains constraint violations.
The second contains UPI violations.
Work Tables : hold the DML tasks requested and the input data that is ready to APPLY to the AMPs.
LOGTABLE : keeps a record of the results from each phase of the load so that MultiLoad knows the
proper point from which to restart.

-- Applies utility locks to the target tables.



DML Transaction Phase :
-- All of the SQL DML statements are sent to the database.
-- The PE parses the DML and generates a step-by-step plan to execute the request.
-- This execution plan is then communicated to each AMP and stored in the appropriate work table for each target table.
-- Later, during the Acquisition Phase, the actual input data will also be stored in the work table so that it may be applied in the Application Phase.
-- Next, a match tag is assigned to each DML request that will match it with the appropriate rows of input data. The match tags are not actually used until the data has been acquired and is about to be applied from the work table. This is somewhat like a student who receives a letter from the university in the summer listing his courses, professors’ names, and classroom locations for the upcoming semester: the letter is a “match tag” for the student to his school schedule, although it will not be used for several months. This matching of tags to SQL is the reason the data is replicated for each SQL statement that uses the same data record.

Acquisition Phase :
-- MultiLoad now acquires the data in large, unsorted 64K blocks from the host and sends them to the AMPs (round robin).
-- Each AMP hashes each row on the primary index and sends it over the BYNET to the proper AMP where it will ultimately be used. But the row does not get inserted into its target table just yet.
-- Each AMP puts all of the hashed rows it has received from other AMPs into the work tables, where it assembles them with the matching SQL.
-- Now the utility places a load lock on each target table in preparation for the Application Phase.

Application Phase :
The purpose of this phase is to write, or APPLY, the specified changes to both the target tables and NUSI subtables.

Cleanup phase :
The utility looks at the final Error Code (&SYSRC). MultiLoad believes the adage, “All is well that ends well.” If the last error code is zero (0), all of the job steps have ended successfully (i.e., all has certainly ended well). This being the case, all empty error tables, worktables and the log table are dropped. All locks, both Teradata and MultiLoad, are released. The statistics for the job are generated for output (SYSPRINT) and the system count variables are set. After this, each MultiLoad session is logged off.