Monday, December 29, 2008

Datawarehousing Concepts

1.What is a DataWareHouse?Explain The Characteristics?

Is a Subject –Oriented , Integrated , Non-Volatile and TimeVariant collection of data in support of management’s decision.

S.O : Focus on Particular Topic to analyze the data.

Integ : Data from separate sources must be put into a consistent format thru Data Transformations and Data Cleansing.

Non-Volatile : Data Once entered into the warehouse can’t be changed.

Time-Variant : Creates a time dimension to the Data ware house. Can view patterns and trends over time.

2.Diff. between OLTP and OLAP?

Online Transactional Processing

Online Analytical Processing

Functional : Day to Day Operations

Decision Support

Db Design : Application Oriented

Subject Oriented.

Data : Current Up to date

Historical Data

Detailed, Flat,Relational

Summarized, Isolated

Higly Normalized

Highly De-Normailized

Unit of work : Short, Simple, Transaction

Complex Query


3.Diff between Datawarehouse and DataMart?

A data warehouse is for very large databases (VLDBs)

A data mart is for smaller databases.

A data warehouse is a database designed to support a broad range of decision tasks in a specific organization.

A data mart is a selected part of the data warehouse which supports specific decision support application requirements of a company’s department

Is a Smaller Subset of Data ware House.

Has More Subject-Oriented Areas

Has Less Subject – Oriented Areas

Size is > 100 GB

Size is <>

Running Ad hoc(unplanned) Queries from a huge data warehouse takes long time.

Running ad hoc queries in data marts allow the efficient execution of predicted queries.

It is Enterprise Level One

It is Smaller Scale One.


4. What are Surrogate Keys?

A unique Identifier or Number for each row , which is used as a Substitute for Natural Key such as Customer no in Customer table.

Used as a primary key because normal natural key will be changing or updating.


5.What is Star Schema, Snow Flake Schema,Fact Constellation Schema/(Galaxy Schema)?

A Star Schema is a Central Fact table is Surrounded by De-Normalized Dimensions.

A Snow Flake Schema is a Central Fact table is surrounded by Normalized Dimensions, split Dimension table data into other tables.

A Fact Constellation Schema is the one in which multiple fact tables share the dimension tables. Can also be called collection of stars (Galaxy Schema).


6.Advantages Of Star Schema Over Snow Flake Schema, Drawbacks of Snow Flake Schema?

ADVANTAGES OF STAR SCHEMA

DRAWBACKS OF SNOW FLAKE

Easy to Understand

Time Consuming Joins

Easy to define Hierarchies

Report Generation Slow

Reduces No: of Physical Joins


7.What is Dimensional Modelling?

The way to identify which information belongs to a central fact table and which information belongs to its associated dimension tables.


8.What is a Junk Dimension?

It is a convenient way to separate the random attributes in a dimension into another dimension.

A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes. A good example would be a trade fact in a company that brokers equity trades.

The fact would contain several metrics (principal amount, net amount, price per share, commission, margin amount, etc.) and would be related to several dimensions such as account, date, rep, office, exchange, etc. This fact would also contain several codes and flags that were related to the transaction rather than any of the dimensions ... such as origin code (that indicates whether the trade was initiated with a phone call or via the Web), a reinvest flag (that indicates whether or not this trade as was the result of the reinvestment of a dividend payout) and a comment field for storing special instructions from the customer. These three attributes would normally be removed from the fact table and stored in a junk dimension ... perhaps called the trade dimension. In this way, the number of indexes on the fact table would be reduced, and performance (not to mention ease of use) would be enhanced. Hope this helps.

9.How do u load data into Date Dimension?

By using the Maxdate and MinDate.,with a stored procedure.

10.What are Additive Facts,Semi-Additive Facts,Non-Additive Facts?

Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

11. What is Staging Area?

Provides a place and an area with a set of functions to clean, change, combine, convert, deduplicate and prepare source data for storage and use in the data warehouse.

12.What is Factless Fact? What are different types?

A Fact table which do not have any facts.They may consist of nothing but keys.

Types :

*. Event Tracking. : It is a table that records an event. Eg: Student Tracking System that detects each student attendance event each day.

*. Coverage Tables : These are frequently needed when a primary fact table in a dimensional data warehouse is sparse(small).

13.What is ETL & CDC?

ETL-(Extraction,Transaction & Loading)Data Extraction takes data from source systems, necessary transformations takes place and data load takes extracted data and loads it into the data warehouse.(where the transformation takes place)

CDC-(Change Data Capture) Captures the data which is changing and will be used for transformations.(Which data should be picked up)

14.Explain the Datawarehouse LifeCycle?

Extraction - As a first step, heterogeneous data from different online transaction processing systems is extracted. This data becomes the data source for the data warehouse.

Cleansing/transformation - The source data is sent into the populating systems where the data is cleansed, integrated, consolidated, secured and stored in the corporate or central data warehouse.

Distribution - From the central data warehouse, data is distributed to independent data marts specifically designed for the end user.

Analysis - From these data marts, data is sent to the end users who access the data stored in the data mart depending upon their requirement.


15. What is the aim/objective of having a data warehouse? And who needs a data warehouse? Or what is the use of Data Warehousing?

Data warehousing technology comprises a set of new concepts and tools which support the executives, managers and analysts with information material for decision making.

The fundamental reason for building a data warehouse is to improve the quality of information in the organization.

The main goal of data warehouse is to report and present the information in a very user friendly form.

16. What is Confirmed Dimension ,Confirmed Fact , Stoved Pipe?

  • Dimensions which are 100% shared across Star Schema is called Confirmed Dimension.
  • Keeping the consistency of same unit of measurement and same background calculation across schema is called Confirmed Fact.
  • Any Data ware house that are not having Confirmed Fact/Dimension are called Stoved Pipes.
  • 100% shared---Tables Structure should be consistent.

17.What is S.C.D? What are different Types?

Slowly changing dimensions are dimension tables that have slowly increasing dimension data, as well as updates to existing dimensions.

Table 17-2. Slowly Changing Dimensions

Slowly Changing Dimensions Mapping

Target Table

History

Data Handling

Type 1 Dimension

Slowly Changing Dimension

None

Inserts new dimensions. Overwrites existing dimensions with changed dimensions.

Type 2 Dimension/Version Data

Slowly Changing Dimension

Full

Inserts new and changed dimensions. Creates a version number and increments the primary key to track changes.

Type 2 Dimension/Flag Current

Slowly Changing Dimension

Full

Inserts new and changed dimensions. Flags the current version and increments the primary key to track changes.

Type 2 Dimension/Effective Date Range

Slowly Changing Dimension

Full

Inserts new and changed dimensions. Creates an effective date range to track changes.

Type 3 Dimension

Slowly Changing Dimension

Partial

Inserts new dimensions. Updates changed values in existing dimensions. Optionally uses the load date to track changes.

Use the Type 3 Dimension mapping to update a slowly changing dimension table when you want to keep only current and previous versions of column data in the table. Both versions of the specified column or columns are saved in the same row.

18. What is S.P.T and S.G.D?

Table 17-1. Getting Started Mapping Types

Getting Started Mapping Type

Target Table Type

History

Data Handling

Simple Pass Through

Static Fact or Dimension

None

Inserts all source rows. Use the truncate target table option in the session properties, or use a pre-session shell command to drop or truncate the target before each session run.

Slowly Growing Target

Slowly Growing Fact or Dimension

Full

Flags and inserts new rows to the existing target.

19)what is degenerated dimension?

A dimension that has a key only, no additional attributes.

I have a fact table that stores insurance contracts and one important dimension is the year signed. So the fact table does have many columns, like CUSTOMER_ID, CONTRACT_ID, etc and one column YEAR_SIGNED as varchar(4). The CUSTOMER_ID is the foreign key column to the DIM_CUSTOMER with all the customer date, name address, .... CONTRACT_ID relates to the DIM_CONTRACT with all the contract specific information. Any YEAR_SIGNED? Should I really have a DIM_YEAR_SIGNED and it will have one column only. What other attributes should a year have?
Therefore, we do not create an explicit dimension table and call that YEAR_SIGNED column a degenerated dimension.

20)What is snowflaking?

Further normalization and expansion of the dimension tables in a star schemaresult in the implementation of a snowflake design. In other words, a dimensiontable is said to be snowflaked when the low-cardinality attributes inthe dimensionhave been removed to separate normalized tables and these normalized tablesare then joined back into the original dimension table.

21)When do you snowflake?

Snowflaking a dimension table can typically be performed under the following twoconditions:

**The dimension table consists of two or more sets of attributes which defineinformation at different grains.

** The sets of attributes of the same dimension table are being populated bydifferent source systems.

No comments:

Post a Comment