
Tuesday, March 30, 2010

Difference between a subquery and a correlated subquery

Conceptually, the subquery is processed first so that all the values are expanded into the list for comparison with the column specified in the WHERE clause. These values in the subquery SELECT can only be used for comparison against the column or columns referenced in the WHERE.

Columns inside the subquery SELECT cannot be returned to the user via the main SELECT. The only columns available to the client are those in the tables named in the main (first) FROM clause. The query in parentheses is called the subquery and it is responsible for building the IN list.

The correlated subquery is a very powerful tool. It is an excellent technique to use when there is a need to determine which rows to SELECT based on one or more values from another table. This is especially true when the value for comparison is based on an aggregate. It combines subquery processing and join processing into a single request.

The operation for a correlated subquery differs from that of a normal subquery. Instead of comparing the selected subquery values against all the rows in the main query, the correlated subquery works backward. It first reads a row in the main query, and then goes into the subquery to find all the rows that match the specified column value. Then, it gets the next row in the main query and retrieves all the subquery rows that match the next value in this row. This processing continues until all the qualifying rows from the main SELECT are satisfied.
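As a sketch, the two forms look like this (the employee and department tables and their columns are hypothetical, used only to illustrate the shapes):

```sql
-- Plain subquery: the inner SELECT is expanded into an IN list once,
-- then compared against the column named in the outer WHERE clause.
SELECT last_name
FROM employee
WHERE dept_no IN (SELECT dept_no FROM department WHERE budget > 1000000);

-- Correlated subquery: the inner SELECT references a column of the
-- current outer row (e.dept_no), so it is logically re-evaluated
-- for each row of the main query.
SELECT e.last_name, e.salary
FROM employee e
WHERE e.salary = (SELECT MAX(x.salary)
                  FROM employee x
                  WHERE x.dept_no = e.dept_no);
```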

Although this sounds terribly inefficient, and often is on other databases, it is extremely efficient in Teradata. This is due to the way the AMPs handle this type of request: the AMPs are smart enough to remember and share each value that is located.

Monday, March 22, 2010

Collect Statistics In Teradata

COLLECT STATISTICS is an important concept in Teradata.
Collected statistics help the Parsing Engine (PE) come up with the least-cost plan for a requested query.
Collected statistics define the confidence level of the PE in estimating how many rows it is going to access, how many unique values a column has, how many nulls, and so on; all of this information is stored in the data dictionary. Once you submit a query in TD, the Parsing Engine checks whether statistics are available for the requested table; if statistics were collected earlier, the PE generates a plan with "high confidence".
In the absence of collected statistics the PE produces a plan with "low confidence". However, Teradata's optimizer is robust and intelligent: even if you do not collect statistics on a table, column, or index, the PE performs "dynamic AMP sampling". It selects a random AMP, and that AMP reports information about the table data it holds; based on this sample (together with the known data demographics and available system components), the PE estimates the workload and generates a plan.
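A minimal sketch of collecting statistics (the table and column names are illustrative only; syntax as in the Teradata releases current when this was written):

```sql
-- Collect statistics on a column and on an index of a hypothetical table.
COLLECT STATISTICS ON emp_data COLUMN department_number;
COLLECT STATISTICS ON emp_data INDEX (employee_number);

-- See what statistics exist for the table.
HELP STATISTICS emp_data;
```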

How to find the list of views that are created on a table in Teradata

The following query gives the list of views that are created on a table.

select tablename from dbc.tables where TableKind='V' and RequestText LIKE '%Table%' group by 1 order by 1;

In place of Table, specify your table name.
For example, to find the list of views created on table test1, use the following query.

select tablename from dbc.tables where TableKind='V' and RequestText LIKE '%test1%' group by 1 order by 1;

The query displays

TableName
------------------------------
view1
view2
view3





Friday, March 19, 2010

How to remove duplicate rows from a Teradata table

This method uses a temporary table to remove the duplicate rows.
For a table to contain duplicate rows it must be:
1. a MULTISET table, and
2. a table that does not have a UPI.

Let us assume we need to remove duplicate rows from tab1.
The table is created as:
create multiset table tab1(col1 integer,col2 integer);
Populate the table with some duplicate rows.

STEP 1:
Create a temporary SET table with the following query.
create set table temp_tab1 as tab1 with no data;
This creates a temporary SET table, which does not allow duplicate rows.

STEP 2:
Populate the temporary SET table with the de-duplicated rows from the original table. Either of the following queries works.

insert into temp_tab1 select col1,col2 from tab1 group by col1,col2 having count(*) > 1;
(or)
insert into temp_tab1 select distinct * from tab1;

The first inserts one copy of only the rows that are duplicated; the second inserts every distinct row. Either way the SET table keeps a single copy of each row.

STEP 3:
Delete the duplicate rows from the original table (this query removes all copies of each duplicate, including the base row).
delete from tab1
where (col1,col2)
in (select col1,col2 from temp_tab1);

STEP 4:
Insert the rows from the temporary table back into the base table using the following query.
insert into tab1
sel * from temp_tab1;

There are two more methods to delete duplicate rows without using a temporary table, using the Teradata FastLoad utility and Teradata Warehouse Builder.

Method 1:
Use FastExport to write the rows from the source table into a file.
Then delete all rows from the source table.
Then use FastLoad to reload the table; the file to load the table is the one generated by FastExport.

Method 2:
Use Teradata Warehouse Builder (TWB) in 2 steps.
In the first step, use the DataConnector operator as consumer and the Export operator as producer.
In the second step, use the DataConnector operator as producer and the Load operator as consumer.

Both methods are based on the fact that FastLoad cannot load duplicate rows.


Thursday, March 18, 2010

Loading an empty table using FastLoad with data present in multiple files

This post shows how to load an empty table using FastLoad when the data is present in multiple files.
Let us do this with 3 files; each file contains 2 records.

For simplicity the table contains one column, and it is an INTEGER.

Data files: f1-2.dat contains the values 1 and 2.
f3-4.dat contains the values 3 and 4.
f5-6.dat contains the values 5 and 6.

Let us prepare 3 FastLoad scripts.

STEP1:

fload1.fl :-
.logon abc/xyz,xyz
drop table errtbl1;
drop table errtbl2;
drop table test1;
create table test1(col1 integer);

.BEGIN LOADING test1 ERRORFILES errtbl1,errtbl2;
DEFINE col1(integer)
FILE=f1-2.dat;
INSERT INTO test1 values(:col1);
.logoff

fload2.fl :-
.logon abc/xyz,xyz
.BEGIN LOADING test1 ERRORFILES errtbl1,errtbl2;
DEFINE col1(integer)
FILE=f3-4.dat;
INSERT INTO test1 values(:col1);
.logoff


fload3.fl :-
.logon abc/xyz,xyz
.BEGIN LOADING test1 ERRORFILES errtbl1,errtbl2;
DEFINE col1(integer)
FILE=f5-6.dat;
INSERT INTO test1 values(:col1);
.END LOADING
.logoff

STEP 2:
Execute the 3 FastLoad scripts in sequence:
first execute fload1.fl,
next fload2.fl,
then fload3.fl.


After executing fload3.fl, the table test1 will be loaded with 6 records.

Explanation:
After the execution of fload1.fl and fload2.fl, FastLoad is in a paused state.
When fload3.fl is executed, i.e. when FastLoad sees the .END LOADING statement, it updates the error tables and the target table and releases the locks.
Make sure that each .BEGIN LOADING statement specifies the same error tables.



Wednesday, March 17, 2010

Concept of indexes in Teradata

UPI, NUPI:
Teradata uses the primary index to distribute data across the AMPs.
PI access is always a one-AMP operation because Teradata knows exactly where the record is.
So both UPI and NUPI access result in a one-AMP operation.
A WHERE clause on a UPI may return 0 or 1 record.
A WHERE clause on a NUPI may return 0 to many records.

USI, NUSI:
A secondary index provides an alternate path to access the data.
Creating the index creates an index subtable on all AMPs.
Each subtable entry stores the index value, the index rowhash, and the rowid of the record, so whenever you insert data into a table that has a secondary index, the database also makes corresponding entries in the index subtables.
USI access is a two-AMP operation. Teradata first hashes the index value, goes to the AMP owning that rowhash, looks up the index row there, and obtains the rowid of the base record; it then goes to the AMP holding that record. This is definitely much faster than doing a full-table scan.
NUSI access is an all-AMP operation, but not a full-table scan. In the NUSI case, each AMP's index subtable stores only the values for the rows that AMP holds, so if you qualify on a NUSI column, each AMP searches its own subtable for the value you supplied and returns matching records if present.
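For example, assuming a hypothetical emp_data table with these columns:

```sql
-- USI: unique values, accessed in a two-AMP operation.
CREATE UNIQUE INDEX (ssn) ON emp_data;

-- NUSI: non-unique values, accessed in an all-AMP
-- (but not full-table-scan) operation.
CREATE INDEX (department_number) ON emp_data;
```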

Global Temporary Tables and Volatile Temporary Tables in Teradata

These are Teradata Specific concepts.
Both are temporary tables, normally used as a kind of "intermediate stage" for an SQL operation.

Global Temporary Tables (GTT):
1. A GTT takes up space from temporary space.
2. A GTT survives TD system restarts.
3. Indexes can be created on a GTT.
4. CHECK and BETWEEN constraints, COMPRESS columns, and the DEFAULT and TITLE clauses are supported by a GTT.
5. You can collect statistics on a GTT.
6. The table definition is stored in the DD for a GTT.
7. Up to 2000 global temporary tables can be materialized in a single session.
8. The materialized instance of a GTT is discarded at session end.

Volatile Temporary Tables (VTT):
1. A VT takes up space from spool and has no information in the DD,
so it is slightly faster than a GTT in that respect because no DD lookup is required.
2. A VT cannot survive TD system restarts.
3. Indexes cannot be created on a VT.
4. CHECK and BETWEEN constraints, COMPRESS columns, and the DEFAULT and TITLE clauses are not supported by a VT.
5. No collected statistics for a VTT.
6. The table definition is stored in the system cache for a VTT.
7. Up to 1000 volatile tables can be materialized in a single session.
8. Volatile tables are always qualified by the session's userid.

This is how we usually create the GTT and VTT.
GTT :
CREATE GLOBAL TEMPORARY TABLE gt_deptsal
( deptno SMALLINT
,avgsal DEC(9,2)
,maxsal DEC(9,2)
,minsal DEC(9,2)
,sumsal DEC(9,2)
,empcnt SMALLINT);

VTT :

CREATE VOLATILE TABLE vt_deptsal, LOG
(deptno SMALLINT
,avgsal DEC(9,2)
,maxsal DEC(9,2)
,minsal DEC(9,2)
,sumsal DEC(9,2)
,empcnt SMALLINT)
ON COMMIT PRESERVE ROWS;


ON COMMIT PRESERVE ROWS allows us to use the volatile table again for other queries in the session. The default is ON COMMIT DELETE ROWS, which means the data is deleted when the transaction is committed.

INSERT INTO gt_deptsal
SELECT dept ,AVG(sal) ,MAX(sal) ,MIN(sal) ,SUM(sal) ,COUNT(emp)
FROM emp
GROUP BY 1;

Problem:
Show all employees who make the maximum salary in their department.
Solution:
SELECT emp, last, dept, sal
FROM emp INNER JOIN gt_deptsal
ON dept=deptno
WHERE sal=maxsal
ORDER BY 3;

How to display a DATE in 'YY-MM-DD' format in SQL Assistant

creating table:
create table tab6(col1 integer,col2 date);

populate table as :
ins tab6(1,DATE);
ins tab6(2,DATE);
ins tab6(3,DATE);

In Bteq :

select col2 (format 'yy-mm-dd') from tab6;


*** Query completed. 3 rows found. One column returned.
*** Total elapsed time was 1 second.
col2
--------
08-09-14
08-09-14
08-09-14

But this won't work in SQL Assistant.

In SQL Assistant:
we need to use the following query instead.

select cast( (col2 (format 'YY-MM-DD')) as char(10)) from tab6;

This is because SQL Assistant connects to Teradata via ODBC while BTEQ uses CLI,
and ODBC does not honor the FORMAT clause.
That is why the results differ between SQL Assistant and BTEQ.

Select TOP n row(s) from a table

At the start the table contains the following rows.

select * from tab7 order by 1;

col1 col2
----------- -----------
1 2
2 2
3 2
4 4
5 5
6 6

Problem:
To Get 2nd row from the bottom.

select * from tab7 QUALIFY RANK(col1) =2;
col1 col2
----------- -----------

5 5


To Get Top 2nd row :

select * from tab7 QUALIFY RANK(col1 ASC) =2;

col1 col2
----------- -----------

2 2



To Get Top 4 rows :


select * from dup_tab1 QUALIFY RANK(col1 ASC) <= 4;


To get any 4 rows (TOP without an ORDER BY returns an arbitrary set):


select TOP 4 * from dup_tab1;

How do I update multiple rows of a table with different values in a single query?

select * from tab1;

*** Query completed. 2 rows found. 2 columns returned.
*** Total elapsed time was 1 second.

col1 col2
----------- ----------
101 vij
102 ven

UPDATE tab1
SET col2 = CASE WHEN col1 = 101 THEN 'XXXX'
WHEN col1 = 102 THEN 'YYYY'
END;

select * from tab1;
*** Query completed. 2 rows found. 2 columns returned.
*** Total elapsed time was 1 second.

col1 col2
----------- ----------
101 XXXX
102 YYYY
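One caution: a CASE expression without an ELSE returns NULL for rows that match no WHEN branch, so on a table with more rows the statement above would set col2 to NULL everywhere else. A safer form keeps those rows unchanged:

```sql
UPDATE tab1
SET col2 = CASE WHEN col1 = 101 THEN 'XXXX'
                WHEN col1 = 102 THEN 'YYYY'
                ELSE col2   -- leave non-matching rows as they are
           END;
```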

Concept of Primary Key and Primary Index in Teradata

In Teradata,
a primary key is used to uniquely identify each row in a table. PKs are used in conjunction with foreign keys to define the important column relationships in a database.

The concept of a primary index is completely different from that of a primary key.
The primary index is used to distribute and locate the rows of a table over the AMPs. The choice of primary index affects distribution, access, and performance.

Oftentimes, but not always, the primary index and primary key are the same.
Indexes (primary or secondary) may be used to enforce uniqueness (as in a PK) or to improve access.

Accessing rows via the primary index is always a one-AMP operation.

A PK is a relational-modeling convention, whereas a PI is a Teradata convention.


Example to create a table with Unique Primary Index (UPI) :
CREATE TABLE MJ1.emp_data,FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL
(
employee_number INTEGER NOT NULL,
department_number SMALLINT,
job_code INTEGER COMPRESS ,
last_name CHAR(20) NOT CASESPECIFIC NOT NULL,
first_name VARCHAR(20) NOT CASESPECIFIC,
street_address VARCHAR(30) NOT CASESPECIFIC TITLE 'Address',
city CHAR(15) NOT CASESPECIFIC DEFAULT 'Boise'
COMPRESS 'Boise',
state CHAR(2) NOT CASESPECIFIC DEFAULT ' ',
birthdate DATE FORMAT 'mm/dd/yyyy',
salary_amount DECIMAL(10,2),
sex CHAR(1) UPPERCASE NOT CASESPECIFIC)
UNIQUE PRIMARY INDEX ( employee_number )
INDEX ( department_number );

Teradata - Permanent Space VS Temporary Space VS Spool Space

Teradata space usage:
----------------------

3 kinds of Space(s) in Teradata database:

1. Permanent Space
2. Spool Space
3. Temporary Space

Example: create user xyz perm = 1e4, spool = 1e3, temporary = 1e3, password = xyz;

-> Permanent space is used to create database objects (tables, indexes, etc.) and to hold their data.

-> Spool space is used to hold intermediate query results.
It holds only data.
It is active only for the current session.
Unless specified, the spool allocation is inherited from the parent (root).

-> Temporary space is used to store global temporary table data.
The data is active only for the current session.

5 kinds of Tables in Teradata database:

1. SET Table
2. Multiset Table
3. Global Temporary Table (GTT)
4. Temporary Table
5. Volatile Temporary Table

Created tables go into the data dictionary (permanent space).

Tables
------

SET table:
----------
1. Table definition : in the data dictionary (permanent space)
2. Data storage : permanent space
3. After database restart: both table and data survive
4. Example : can be created directly with 'create set table ...'
5. Collect stats : yes

Multiset table:
---------------
1. Table definition : in the data dictionary (permanent space)
2. Data storage : permanent space
3. After database restart: both table and data survive
4. Example : can be created directly with 'create multiset table ...'
5. Collect stats : yes

Global Temporary Table (GTT):
-----------------------------
1. Table definition : in the data dictionary (permanent space)
2. Data storage : temporary space
3. After database restart: the table definition survives
4. Example : can be created directly with 'create global temporary table ...'
5. Collect stats : yes

Temporary Table:
----------------
1. Table definition : no table definition
2. Data storage : spool
3. After database restart: nothing survives
4. Example : created indirectly (derived tables)
5. Collect stats : no

Volatile Table:
---------------
1. Table definition : system cache (primary memory, i.e. RAM)
2. Data storage : spool
3. After database restart: nothing survives
4. Example : can be created directly with 'create volatile table ...'
5. Collect stats : no

To get the syntax for a command in Teradata

After logging on to Teradata through BTEQ or SQL Assistant,
type the following command:

exec dbc.getsyntax('sql','create table');

getsyntax is a macro that accepts 2 parameters.

The first parameter is the type of utility that the command belongs to.
Valid values include the following.

PMPC
BULKLOAD
DUMP
ARCHIVE
MULTILOAD
VERSION
SPL
FASTEXPORT
DATA_TYPE
HELP
SQL
FASTLOAD
TPCCONS

The second parameter is the command name for which we need to find the syntax.

Examples include:
exec dbc.getsyntax('FASTLOAD','DEFINE');
exec dbc.getsyntax('MULTILOAD','.IMPORT');
exec dbc.getsyntax('SQL','UPDATE');

Getting the count of data from n different tables in Teradata

I have n tables.
The tables need not have
1. the same number of columns, or
2. any common columns.

I need to get the total row count across all the tables.
For example, there are 3 tables a, b, c in which
table 'a' has 3 rows,
table 'b' has 4 rows, and
table 'c' has 5 rows.

Then the result should be count of a +
count of b + count of c,
i.e. 3 + 4 + 5 = 12.

Answer:
I will do this with 3 tables as an example;
it extends to any number of tables.

select * from a;

*** Query completed. 3 rows found. One column returned.
*** Total elapsed time was 1 second.

col1
-----------
3
1
2

select * from b;

*** Query completed. 4 rows found. 2 columns returned.
*** Total elapsed time was 1 second.

col1 col2
----------- -----------
3 3
1 1
4 4
2 2


select * from c;

*** Query completed. 5 rows found. 3 columns returned.
*** Total elapsed time was 1 second.

col1 col2 col3
----------- ----------- -----------
5 5 5
3 3 3
1 1 1
4 4 4
2 2 2



SELECT SUM(CountTables) FROM
(
SELECT COUNT(*) AS "CountTables" FROM a
UNION ALL
SELECT COUNT(*) AS "CountTables" FROM b
UNION ALL
SELECT COUNT(*) AS "CountTables" FROM c
)t;

(UNION ALL, not UNION, is essential here: a plain UNION would collapse equal counts, so two tables with 3 rows each would contribute 3 only once.)

*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.

Sum(CountTables)
----------------
12

Calculate the perm space taken by a table in Teradata

To find total permanent space taken by a table on all AMPs.

select databasename, tablename, sum(currentperm)
from dbc.tablesize where databasename = 'raja'
and tablename = 'testing' group by databasename , tablename;

DataBaseName     TableName    Sum(CurrentPerm)
------------     ---------    ----------------
raja             testing      1,536


Here 'testing' is the table name and 'raja' is the user/database name.

To find total permanent space taken by a table on each AMP.


select databasename, tablename, currentperm,Vproc from dbc.tablesize where databasename ='raja' and tablename = 'testing' ;

DataBaseName     TableName    CurrentPerm    Vproc
------------     ---------    -----------    -----
raja             testing      1,024          0
raja             testing      512            1