April 26, 2018

Data Warehouse - Mid Sem Excercise

Q.1   Indicate if the following statements are true/false with a brief justification  (1 X 5 marks)
a.       In type 3 changes of slowly changing dimensions, attribute values are overwritten.
b.      Fact table always contains explicit measures.
c.       Data Warehouses handle unstructured data seamlessly.
d.      In Federated DW architecture, information delivery appears centralized.
e.       Data transformation rules are contained in operational metadata.
Ans. a)
Reason: In type 3 changes of slowly changing dimensions there is adding a new column. In this type usually, only the current and previous value of the dimension is kept in the database.
•Attribute values are overwritten in type 1 changes of slowly changing dimensions.

Ans. b)
Reason: A fact table corresponds to a physical observable event, and not to the demands of a particular report. In a star schema, data is stored as either facts or dimensional attributes,.

Ans. c)
Reason: Data warehouses have relied on batch-processing Extract-Transform-Load jobs.

Ans. d)
Reason: A federated data warehouse is used to integrate key business measures and dimensions.
•The iterative manner of federated data warehouse approach helps reduce the implementation time and cost but the delivery reporting is standard and not centralized.

Ans. e)
Reason: Operational metadata contain all of the information about the operational data sources that allow us to trace back to the original source.
•Category of metadata which contains information about all the data transformations taking place in the data staging area is called Transformation Metadata.
Q.2  Write a brief note on the following                                                                             
    1. OLTP vs. OLAP
    2. ETL vs. Data Virtualization
    3. Semi-additive vs. Non-additive measures
    4. Periodic snapshot vs. Accumulating snapshot
    5. Time dimension
Ans.a)   OLTP vs OLAP:

Ans.b)  ETL vs. Data Virtualization: 

Extract, Transform, and Load (ETL) 
ETL tools that are designed to make bulk copies of large data sets, with millions of rows from large structured data sources.
Perform complex, multi-pass data changes and cleaning operations and load data into targeted data stores.
Creating historical records of data, e.g. snapshots at a particular time, to analyze how the dataset changes over time.

Data virtualization 
Acting as a virtual data source to augment an ETL process.
Federating multiple data warehouses.
Extending existing data warehouses with new data sources.
Separating applications with changes in the underlying data sources (e.g., migrating data warehouses.)

Ans.c)  Semi-additive vs. Non-additive measures

Semi-Additive measures
-These facts are the specific classes of measures that can be collected in all the dimensions and their hierarchy except for the time dimensions.
-Facts which can be summarized for certain dimensions in the facts table, but not others.
-Example: Daily balance can be understood through the dimensions of customers, but not through time dimension.

Non-additive measures
−These facts are those specific classes of measures that can not be integrated into all / any dimension and their hierarchy.
−Non-additives are facts which can not be summed up for any dimensions present in the facts table.
- Examples: The factors in which the percentages are, the ratios were calculated.

Ans.d)  Periodic snapshot vs. Accumulating snapshot

Periodic snapshot

These fact tables frequently comprise many facts because any measurement event reliable with the fact table grain is allowable
It captures the state of the measures based on the occurrence of a status events occurring over a standard period, such as a day, a week, or a month.
These are uniformly dense in their foreign keys because even if no action takes place through the period, a row is naturally inserted in the fact table having a zero or null for each fact

Accumulating snapshot
−A accumulation snapshot summarizes the measurement events occurring at the estimated steps between a row start and end of the process in the fact table.
Pipeline or workflow processes, such as claim processing or order fulfillment, in which there is an initial starting point, standard intermediate stage, and defined endpoint, this type of model can be modeled with the fact table.

−Snapshot Facts table is a foreign key for other dimensions, and alternatively, there are unalterable dimensions

Ans.e)  Time dimension

−The Date dimension is one of these dimension tables related to the Fact.
−Although the calendar date dimension is the most important time dimension, we also need a calendar month dimension when the fact table’s time grain is a month. In some environments, we may need to build calendar week, quarter, or year dimensions as well if there are fact tables at each of these grains.
−The most common and useful time dimension is the calendar date dimension with the granularity of a single day.  
The calendar month dimension should be a separate physical table and should be created by physically eliminating selected rows and columns from the calendar day dimension.

Q.3)What is a staging area? What is the purpose of a staging area? Do we need it?(1+2+2 marks)
  • Staging Area: It's a place where you hold temporary tables on the Data warehouse server. These tables are connected to fact tables.
  • Do we need it: We basically need the staging area to hold the data, and perform data cleansing and merging before loading the data into the warehouse. In the deficiency of a staging area, the data load will have to go from the OLTP system to the OLAP system directly, which in fact will rigorously hamper the performance of the OLTP system.
  • Purpose of Staging area
    • Do flat mapping i.e. Putting all the OLTP data in to it without applying any business rules. 
    • Used for cleaning and validating data using first logic.
    • Pushing data into the staging table will take fewer time because there are no business rules of alteration applied to it.
Q.4)A retail store chain operates at many locations and keeps track of individual customer information and their purchases. The company (that operates the retail store chain) is interested in the understanding effectiveness of promotions offered to their customers. Design a star schema to help with their analytical requirements.


Retail Store Star Schema

Q.5) Suppose that a data warehouse consists of the four dimensions, date, spectator, location, and game, and the two measures, count and charge, where the charge is the ticket price that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors,with each category having its own rate
(a) Draw a star schema diagram for the data warehouse.
Star Schema

(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to find the total charge paid by student spectators at Eden Gardens Stadium in 2016
Ans. b)The specific OLAP operations to be performed are:
Roll-up on date from date id to year.
Roll-up on a game from game id to all.
Roll-up on location from location id to location name.
Roll-up on spectator from spectator id to status.
Dice with status=“ students”, location name=“ Eden Gardens Stadium”, and year = 2016.