What is Data
Data is information that has been translated into a form that is more convenient to move or process.

What is Database
Database is an organized collection of related data.

What is Warehouse
Warehouse is the place where we store data.

What is Data Mart

A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups within an organization.
In other words, data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.

What is Data Warehouse
A data warehouse is a centralized repository that stores data from multiple information sources and transforms them into a common, multidimensional data model for efficient querying and analysis.
The most popular definition came from Bill Inmon, who provided the following:
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.”
Subject-Oriented:
A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.
Integrated:
A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant:
Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile:
Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.



OLAP vs OLTP

| Sr. No. | Data Warehouse (OLAP) | Operational Database (OLTP) |
|---------|-----------------------|------------------------------|
| 1 | Involves historical processing of information. | Involves day-to-day processing. |
| 2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals. |
| 3 | Useful in analyzing the business. | Useful in running the business. |
| 4 | Focuses on information out. | Focuses on data in. |
| 5 | Based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | Based on the Entity-Relationship Model. |
| 6 | Contains historical data. | Contains current data. |
| 7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data. |
| 8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data. |
| 9 | Number of users is in the hundreds. | Number of users is in the thousands. |
| 10 | Number of records accessed is in the millions. | Number of records accessed is in the tens. |
| 11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB. |
| 12 | Highly flexible. | Provides high performance. |


What is Fact Table
A fact table is a table that contains summarized numerical and historical data (facts) and a multipart index composed of foreign keys from the primary keys of related dimension tables.
Dimensions
In data warehousing, a dimension is a collection of reference information about a measurable event. These events are known as facts and are stored in a fact table.
Dimensions categorize and describe data warehouse facts and measures in ways that support meaningful answers to business questions. They form the very core of dimensional modeling.
What are Dimension Tables
Dimension tables are referenced by fact tables using keys. When creating a dimension table in a data warehouse, a system-generated key is used to uniquely identify a row in the dimension. This key is also known as a surrogate key. The surrogate key is used as the primary key in the dimension table. The surrogate key is also placed in the fact table, and a foreign key is defined between the two tables. When the tables are joined, the join behaves like any other join within the database.
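As a minimal sketch of this surrogate-key relationship (all table and column names below are illustrative, not from any particular warehouse):

-- A dimension table keyed by a system-generated surrogate key.
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,   -- surrogate key
    customer_id    VARCHAR(20),           -- natural key from the source system
    customer_name  VARCHAR(100),
    state          VARCHAR(50)
);

-- A fact table whose multipart key is composed of foreign keys
-- referencing the dimension tables.
CREATE TABLE fact_sales (
    date_key      INTEGER NOT NULL,
    customer_key  INTEGER NOT NULL REFERENCES dim_customer (customer_key),
    product_key   INTEGER NOT NULL,
    sales_amount  DECIMAL(12,2),          -- the measure (fact)
    PRIMARY KEY (date_key, customer_key, product_key)
);
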
Characteristics of Data Warehouse:

i. Subject-oriented:
The warehouse organizes data around the essential subjects of the business (customers and products) rather than around applications such as inventory management or order processing.
ii. Integrated:
It is consistent in the way that data from several sources is extracted and transformed. For example, coding conventions are standardized: M = male, F = female.
iii. Time-variant:
Data are organized by various time periods (e.g. months).
iv. Non-volatile:
The warehouse's database is not updated in real time. There is periodic bulk uploading of transactional and other data.

The Advantages of Data Warehousing
  • Clean data: required info
  • Indexes: multiple types
  • Query processing: multiple options, faster
  • Security: data and access (multiple views)
  • Easy report creation.
  • Enhanced access to data and information.



The Disadvantages of Data Warehousing

  • Preparation may be time-consuming.
  • Compatibility with existing systems.
  • Security issues.

Data Warehouse Architecture


[Diagram: data warehouse architecture]

Types of Dimensions

Dimension
A dimension table typically has two types of columns: a primary key that fact tables reference, and textual/descriptive data.
Eg: Time, Customer

Types of Dimensions
1. Slowly Changing Dimensions
2. Rapidly Changing Dimensions
3. Junk Dimensions
4. Inferred Dimensions
5. Conformed Dimensions
6. Degenerate Dimensions
7. Role Playing Dimensions
8. Shrunken Dimensions
9. Static Dimensions

Slowly Changing Dimensions 
Some attributes of a dimension undergo changes over time. Whether the history of changes to a particular attribute should be preserved in the data warehouse depends on the business requirement. Such an attribute is called a slowly changing attribute, and a dimension containing such an attribute is called a slowly changing dimension.

Rapidly Changing Dimensions
A dimension attribute that changes frequently is a rapidly changing attribute. If you don’t need to track the changes, the rapidly changing attribute is no problem, but if you do need to track the changes, using a standard slowly changing dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a rapidly changing dimension.

Junk Dimensions

A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by rapidly changing dimensions.

Inferred Dimensions
While loading fact records, a dimension record may not yet be ready. One solution is to generate a surrogate key with null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.


Conformed Dimensions
A dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.

Degenerate Dimensions
A degenerate dimension occurs when a dimension attribute is stored as part of the fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill-through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Role Playing Dimensions
A role-playing dimension is one where the same dimension key (along with its associated attributes) can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both ship date and delivery date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension takes on multiple roles to map ship date as well as delivery date, hence the name role-playing dimension.
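
For example, a query against such a fact table simply joins the one date dimension twice under different aliases (the table and column names below are hypothetical):

SELECT f.order_id,
       sd.calendar_date AS ship_date,
       dd.calendar_date AS delivery_date
FROM   fact_shipments f
JOIN   dim_date sd ON f.ship_date_key     = sd.date_key   -- role: ship date
JOIN   dim_date dd ON f.delivery_date_key = dd.date_key;  -- role: delivery date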

Shrunken Dimensions
A shrunken dimension is a subset of another dimension. For example, the orders fact table may include a foreign key for product, but the target fact table may include a foreign key only for productcategory, which is in the product table, but much less granular. Creating a smaller dimension table, with productcategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the product dimension is snowflaked, there is probably already a separate table for productcategory, which can serve as the shrunken dimension.

Static Dimensions
Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with status codes — or it can be generated by a procedure, such as a date or time dimension.
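
As an illustration of generating such a dimension by a procedure, here is an Oracle-style sketch that populates a date dimension with one row per day (dim_date and its columns are assumptions for the example, not from the text above):

-- Generate one row per day of 2020 using a row-generator query.
INSERT INTO dim_date (date_key, calendar_date, month_name)
SELECT TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')),  -- e.g. 20200131
       d,
       TO_CHAR(d, 'Month')
FROM  (SELECT DATE '2020-01-01' + LEVEL - 1 AS d
       FROM   dual
       CONNECT BY LEVEL <= 366);           -- 2020 is a leap year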

Types of Facts

There are three types of facts:
  • Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
  • Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
  • Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns: 
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
Say we are a bank with the following fact table: 
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.
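
The distinction shows up directly in the queries you can write. A sketch against the bank fact table above (the key column names and date values are assumptions for illustration):

-- Semi-additive: summing Current_Balance across ACCOUNTS for one day is meaningful.
SELECT date_key, SUM(current_balance) AS bank_total_balance
FROM   fact_daily_balance
WHERE  date_key = 20120518
GROUP  BY date_key;

-- Summing it across DAYS is not; an average over the period is used instead.
SELECT account_key, AVG(current_balance) AS avg_balance_for_month
FROM   fact_daily_balance
WHERE  date_key BETWEEN 20120501 AND 20120531
GROUP  BY account_key;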







The Types of Fact Table are

1. Snapshot
2. Cumulative
3. Factless Fact Table

Snapshot
This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.



Cumulative
This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
Eg: Sales fact

Factless Fact Table
In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "factless fact tables".
Eg: A fact table which has only a product key and a date key is a factless fact table. There are no measures in this table. But you can still get the number of products sold over a period of time.
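
For example (hypothetical table names), simply counting rows in the factless fact table answers "how many products were sold per month" even though no measure column exists:

SELECT d.month_name, COUNT(*) AS products_sold
FROM   factless_sales f
JOIN   dim_date d ON f.date_key = d.date_key
GROUP  BY d.month_name;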


Slowly Changing Dimension Types, Advantages & Disadvantages


There are many approaches to dealing with SCDs. The most popular are:


· Type 0 - The passive method
· Type 1 - Overwriting the old value
· Type 2 - Creating a new additional record
· Type 3 - Adding a new column
· Type 4 - Using a historical table
· Type 6 - Combining the approaches of types 1, 2 and 3 (1+2+3=6)

Type 0 - The passive method. In this method no special action is performed upon dimensional changes. Some dimension data can remain the same as when it was first inserted; other data may be overwritten.
Example:  Employee birth date.

Type 1: The new record replaces the original record. No trace of the old record exists. 
Type 2: A new record is added into the customer dimension table. Thereby, the customer is treated essentially as two people. 
Type 3: The original record is modified to reflect the change.

SCD Type 1, SCD Type 2, SCD Type 3, Slowly Changing Dimension Types, Advantages & Disadvantages


In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In our example, recall we originally have the following table:

| Customer Key | Name | State |
|--------------|------|-------|
| 1001 | Williams | New York |

After Williams moved from New York to Los Angeles, the new information replaces the original record, and we have the following table:

| Customer Key | Name | State |
|--------------|------|-------|
| 1001 | Williams | Los Angeles |

Advantages
· This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages
· All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Williams lived in New York before.

When to use Type 1
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Example 2:
Type 1 - Overwriting the old value. In this method no history of dimension changes is kept in the database. The old dimension value is simply overwritten by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g. removing special characters, correcting spelling errors).

Before the change: 
| Customer_ID | Customer_Name | Customer_Type |
|-------------|---------------|---------------|
| 1 | Cust_1 | Corporate |


After the change: 
| Customer_ID | Customer_Name | Customer_Type |
|-------------|---------------|---------------|
| 1 | Cust_1 | Retail |
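
In SQL, a Type 1 change is a plain overwrite of the dimension row. A minimal sketch against the columns above (the table name customer_dim is an assumption):

-- Type 1: overwrite in place; the old value 'Corporate' is lost.
UPDATE customer_dim
SET    customer_type = 'Retail'
WHERE  customer_id   = 1;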


SCD Type 2, Slowly Changing Dimension Use, Example, Advantage, Disadvantage


In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

| Customer Key | Name | State |
|--------------|------|-------|
| 1001 | Williams | New York |

After Williams moved from New York to Los Angeles, we add the new information as a new row into the table:

| Customer Key | Name | State |
|--------------|------|-------|
| 1001 | Williams | New York |
| 1005 | Williams | Los Angeles |

Advantages
· This allows us to accurately keep all historical information.
Disadvantages
· This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
· This necessarily complicates the ETL process.

When to use Type 2
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

Example 2:
Type 2 - Creating a new additional record. In this methodology all history of dimension changes is kept in the database. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain the natural key (or other durable identifier) as an attribute. 'Effective date' and 'current indicator' columns are also used in this method. There can be only one record with the current indicator set to 'Y'. For the 'effective date' columns, i.e. start_date and end_date, the end_date for the current record is usually set to the value 9999-12-31. Introducing changes to the dimensional model in type 2 can be a very expensive database operation, so it is not recommended in dimensions where a new attribute could be added in the future.

Before the change: 
| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date | Current_Flag |
|-------------|---------------|---------------|------------|------------|--------------|
| 1 | Cust_1 | Corporate | 22-07-2010 | 31-12-9999 | Y |


After the change: 
| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date | Current_Flag |
|-------------|---------------|---------------|------------|------------|--------------|
| 1 | Cust_1 | Corporate | 22-07-2010 | 17-05-2012 | N |
| 2 | Cust_1 | Retail | 18-05-2012 | 31-12-9999 | Y |
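
A Type 2 change is typically applied in two steps. A sketch against the columns above (the table name customer_dim is an assumption; in practice the new surrogate key would come from a sequence):

-- Step 1: expire the current version of the row.
UPDATE customer_dim
SET    end_date     = DATE '2012-05-17',
       current_flag = 'N'
WHERE  customer_name = 'Cust_1'
AND    current_flag  = 'Y';

-- Step 2: insert the new version with a new surrogate key.
INSERT INTO customer_dim
       (customer_id, customer_name, customer_type, start_date, end_date, current_flag)
VALUES (2, 'Cust_1', 'Retail', DATE '2012-05-18', DATE '9999-12-31', 'Y');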


SCD Type 3, Slowly Changing Dimension Use, Example, Advantage, Disadvantage

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:

| Customer Key | Name | State |
|--------------|------|-------|
| 1001 | Williams | New York |

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
· Customer Key
· Name
· Original State
· Current State
· Effective Date
After Williams moved from New York to Los Angeles, the original information gets updated, and we have the following table (assuming the effective date of change is February 20, 2010):

| Customer Key | Name | Original State | Current State | Effective Date |
|--------------|------|----------------|---------------|----------------|
| 1001 | Williams | New York | Los Angeles | 20-FEB-2010 |

Advantages
· This does not increase the size of the table, since the new information is updated in place.
· This allows us to keep some part of history.
Disadvantages
· Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Williams later moves to Texas on December 15, 2010, the Los Angeles information will be lost.

When to use Type 3
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

Example 2:
Type 3 - Adding a new column. In this type usually only the current and previous values of the dimension are kept in the database. The new value is loaded into the 'current/new' column and the old one into the 'old/previous' column. Generally speaking, the history is limited to the number of columns created for storing historical data. This is the least commonly needed technique.

Before the change: 
| Customer_ID | Customer_Name | Current_Type | Previous_Type |
|-------------|---------------|--------------|---------------|
| 1 | Cust_1 | Corporate | Corporate |


After the change: 
| Customer_ID | Customer_Name | Current_Type | Previous_Type |
|-------------|---------------|--------------|---------------|
| 1 | Cust_1 | Retail | Corporate |


Type 4 - Using a historical table. In this method a separate historical table is used to track each dimension attribute's historical changes. The 'main' dimension table keeps only the current data, e.g. the customer and customer_history tables.

Current table: 
| Customer_ID | Customer_Name | Customer_Type |
|-------------|---------------|---------------|
| 1 | Cust_1 | Corporate |


Historical table: 
| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date |
|-------------|---------------|---------------|------------|------------|
| 1 | Cust_1 | Retail | 01-01-2010 | 21-07-2010 |
| 1 | Cust_1 | Other | 22-07-2010 | 17-05-2012 |
| 1 | Cust_1 | Corporate | 18-05-2012 | 31-12-9999 |
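
A Type 4 change touches both tables. A sketch following the customer / customer_history example above (column lists are assumptions):

-- Step 1: close out the outgoing version in the historical table.
INSERT INTO customer_history
       (customer_id, customer_name, customer_type, start_date, end_date)
VALUES (1, 'Cust_1', 'Other', DATE '2010-07-22', DATE '2012-05-17');

-- Step 2: overwrite the 'main' table, which keeps only current data.
UPDATE customer
SET    customer_type = 'Corporate'
WHERE  customer_id = 1;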


Type 6 - Combining the approaches of types 1, 2 and 3 (1+2+3=6). In this type the dimension table has additional columns such as:
· current_type - keeps the current value of the attribute. All history records for a given item of the attribute have the same current value.
· historical_type - keeps the historical value of the attribute. All history records for a given item of the attribute can have different values.
· start_date - keeps the start date of the attribute's 'effective date' period.
· end_date - keeps the end date of the attribute's 'effective date' period.
· current_flag - marks the most recent record.
In this method, to capture an attribute change we add a new record, as in type 2. The current_type information is overwritten with the new value, as in type 1. We store the history in a historical column, as in type 3.
| Customer_ID | Customer_Name | Current_Type | Historical_Type | Start_Date | End_Date | Current_Flag |
|-------------|---------------|--------------|-----------------|------------|------------|--------------|
| 1 | Cust_1 | Corporate | Retail | 01-01-2010 | 21-07-2010 | N |
| 2 | Cust_1 | Corporate | Other | 22-07-2010 | 17-05-2012 | N |
| 3 | Cust_1 | Corporate | Corporate | 18-05-2012 | 31-12-9999 | Y |


WHAT IS ROW-LEVEL SECURITY?

Row-level security is about constraining access to data for a database user at a very granular level.
By default, the SQL Server relational engine can only limit user access to a certain level of granularity. For example, SQL Server can limit access to a table and its columns, but does not provide security within tables (i.e. based on the data contained within the table). Said another way, the grain of SQL Server security is at the object level. What we want is access control at a more granular level (at the row level).
Since SQL Server does not have native support for row-level security, we have to build a model to support the needs of the business. There are many different approaches to implementing row-level security, but the one we will cover in this article is the use of custom views.
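
A minimal sketch of the custom-view approach (all object names here, including dbo.Sales and the SalesUsers role, are hypothetical). A mapping table ties logins to the rows they may see, and the view applies that filter using the built-in SUSER_SNAME() function:

-- Mapping table: which login may see which region's rows.
CREATE TABLE dbo.SecurityMap (
    LoginName SYSNAME      NOT NULL,
    Region    VARCHAR(50)  NOT NULL
);
GO

-- The view returns only the rows mapped to the current login.
CREATE VIEW dbo.vSales
AS
SELECT s.*
FROM   dbo.Sales AS s
JOIN   dbo.SecurityMap AS m
  ON   m.Region    = s.Region
 AND   m.LoginName = SUSER_SNAME();
GO

-- Users get SELECT on the view only, never on the base table.
GRANT SELECT ON dbo.vSales TO SalesUsers;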

Gathering Statistics


Understanding Statistics
Statistics quantify the data distribution and storage characteristics of tables, columns, indexes, and partitions. 
You must gather statistics on a regular basis to provide the optimizer with information about schema objects. New statistics should be gathered after a schema object's data or structure is modified in ways that make the previous statistics inaccurate. For example, after loading a significant number of rows into a table, you should collect new statistics on the number of rows. After updating data in a table, you do not need to collect new statistics on the number of rows, but you might need new statistics on the average row length.
Statistics can be generated with the ANALYZE statement or with the package DBMS_STATS.
The statistics generated include the following:
  • Table statistics
    • Number of rows
    • Number of blocks
    • Number of empty blocks
    • Average row length
  • Column statistics
    • Number of distinct values (NDV) in column
    • Number of nulls in column
    • Data distribution (histogram)
  • Index statistics
    • Number of leaf blocks
    • Levels
    • Clustering factor
Generating Statistics
Because the cost-based approach relies on statistics, you should generate statistics for all tables and clusters and all types of indexes accessed by your SQL statements before using the cost-based approach. If the size and data distribution of your tables change frequently, then you should generate these statistics regularly to ensure the statistics accurately represent the data in the tables.
Oracle generates statistics using the following techniques:
  • Estimation based on random data sampling
  • Exact computation
  • User-defined statistics collection methods

Gathering Statistics with the DBMS_STATS Package


| Procedure | Description |
|-----------|-------------|
| GATHER_INDEX_STATS | Collects index statistics. |
| GATHER_TABLE_STATS | Collects table, column, and index statistics. |
| GATHER_SCHEMA_STATS | Collects statistics for all objects in a schema. |
| GATHER_DATABASE_STATS | Collects statistics for all objects in a database. |
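
For example, table statistics can be gathered with a call like the following (the schema and table names are placeholders):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'SALES_DW',                   -- schema
    tabname          => 'FACT_SALES',                 -- table
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,  -- let Oracle pick the sample size
    cascade          => TRUE                          -- also gather stats on the table's indexes
  );
END;
/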


Bulk load and normal load
Bulk loading improves the performance of a session that inserts a large amount of data into the target database. When bulk loading, the Integration Service bypasses the database log, which speeds performance. Without writing to the database log, however, the target database cannot perform rollback. As a result, you may not be able to perform recovery. Bulk load is used when you are loading a large amount of data (more than 100 GB). Bulk load makes the data loading faster.

In normal load the database log is not bypassed and therefore the target database can recover from an incomplete session.

[Informatica] Bulk Load vs Normal Load

While loading data to relational targets, Informatica provides an option to do either a Bulk load or a Normal load. Selecting the appropriate option can not only improve your session performance but also make your rollback strategy foolproof.
Bulk load should be used in cases where the data is huge in size (greater than 100 GB) and the operation is performance-intensive. Bulk load can greatly speed up your session performance in such a case.
However, the trade-off with Bulk load is that it bypasses database logs, and thus recovery in case of failure is not possible. On the other hand, Normal load logs each and every database transaction while loading, thus enabling rollback.
Also, while using Bulk load, one cannot have indexes defined on the target; Bulk load fails in such a case. One must drop the indexes, do the Bulk load, and then recreate the indexes. The other option is to use Normal load.

HINTS
In various SQL implementations, a hint is an addition to the SQL standard that instructs the database engine on how to execute the query. For example, a hint may tell the engine to use or not to use an index (even if the query optimizer would decide otherwise).

Types of hints:

Join Order Hints

In case you believe a different join order to be useful, you can use one of the join order hints: ORDERED or LEADING. The latter is more versatile and should thus be preferred.
ORDERED takes no parameters and instructs the optimizer to join the tables in the order as they appear in the FROM clause. Because the ORDERED hint is so basic and you do not want to move around tables in the FROM clause, Oracle has provided us with the LEADING hint. It takes the table names or aliases (if specified) as parameters, separated by spaces.
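
For example (using the familiar emp/dept demo tables as a sketch), LEADING tells the optimizer to start the join with dept and then join emp:

SELECT /*+ LEADING(d e) */
       e.ename, d.dname
FROM   dept d
JOIN   emp  e ON e.deptno = d.deptno;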


Join Operation Hints

Join operation hints are also paired:
· USE_HASH / NO_USE_HASH
· USE_MERGE / NO_USE_MERGE
· USE_NL / NO_USE_NL
These hints allow you to instruct the optimizer to use a hash join, a sort-merge join, or nested loops, respectively.
Hash joins support input swapping, which we have discussed when we talked about left-deep and right-deep join trees. This can be accomplished with SWAP_JOIN_INPUTS or prohibited with NO_SWAP_JOIN_INPUTS.
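
A short sketch (same demo tables as above): USE_HASH requests a hash join with emp as a hash-join input, and SWAP_JOIN_INPUTS flips which side is built first:

SELECT /*+ USE_HASH(e) SWAP_JOIN_INPUTS(e) */
       e.ename, d.dname
FROM   dept d
JOIN   emp  e ON e.deptno = d.deptno;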

Parallel Execution Hints

Not all SQL statements can be run in parallel. All DML statements, including subqueries, can be run in parallel, which means that multiple blocks can be selected, inserted, deleted, or updated simultaneously. For parallelized DDL statements, multiple blocks are being created/altered and written in parallel. The DDL statements that can be run in parallel are:
· CREATE INDEX
· CREATE TABLE ... AS SELECT
· ALTER INDEX ... REBUILD
· ALTER INDEX ... [ REBUILD | SPLIT ] PARTITION
· ALTER TABLE ... MOVE
· ALTER TABLE ... [ MOVE | SPLIT | COALESCE ] PARTITION
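
For queries, the PARALLEL hint requests parallel execution; a sketch (the table alias and the degree of 8 are only illustrations):

-- Ask the optimizer to scan the table with up to eight parallel execution servers.
SELECT /*+ PARALLEL(s 8) */
       COUNT(*)
FROM   sales s;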

Access Path Hints

Access path hints determine how Oracle accesses the data you require. They can be divided into two groups: access path hints for tables and access path hints for indexes.

Tables

The most prominent hint in this group is the FULL( tab_name ) hint. It instructs the optimizer to access a table by means of a full table scan. If the table you want Oracle to access with a full table scan has an alias in the SQL statement, you have to use the alias rather than the table name (without the schema name) as the parameter to FULL. For named query blocks you have to provide the query block’s name as discussed previously.
In this group are also the CLUSTER and HASH hints, but they apply only to tables in an indexed cluster and hash clusters respectively.

Indexes

The hints in this group all come in pairs:
· INDEX / NO_INDEX
· INDEX_ASC / INDEX_DESC
· INDEX_FFS / NO_INDEX_FFS
· INDEX_SS / NO_INDEX_SS
· INDEX_SS_ASC / INDEX_SS_DESC
· INDEX_COMBINE / INDEX_JOIN
All these hints take at least one parameter: the table name or alias in the SQL statement. A second parameter, the index name(s), is optional but often provided. If more than one index is provided, the indexes are separated by at least one space; the INDEX_COMBINE hint is recommended for this use case though.
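
Two short sketches (the table alias and index name below are placeholders):

-- Force a full table scan on emp:
SELECT /*+ FULL(e) */ *
FROM   emp e
WHERE  e.deptno = 10;

-- Ask the optimizer to use a specific index on emp:
SELECT /*+ INDEX(e emp_deptno_ix) */ *
FROM   emp e
WHERE  e.deptno = 10;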


SDE and SIL mapping


SDE (Source Dependent Extract)
SDE_* workflows load data into the staging tables. Staging tables do not have any indexes defined on them, and the load into a staging table is always a truncate and load. Since there are no indexes defined, 'Bulk' load mode is used.

SIL (Source Independent Loading)
You can find two SIL_* workflows in the Workflow Manager. During a full load, the workflow (SIL_*_Full) is invoked from DAC: the target table is truncated, indexes are dropped, data is loaded in 'Bulk' mode, and then the indexes are rebuilt. During an incremental load, the workflow (SIL_*) is invoked, which just inserts/updates the incremental data. Since the amount of data during an incremental load is minimal compared to a full load, it runs in 'Normal' mode. If indexes are present on a table, 'Bulk' mode does not work.
