Introduction:
I know that the education system in India still follows once
in a 10 year syllabus appraisal system and students still think that computer
industry means C, C++, Java and .Net (in some colleges, in fact they still
teach COBOL.
So ultimately they fall in love with these languages and feel
disheartened if somebody asks them to work on something new. his is especially
true for BI & DWH as other IT work would have some amount of language
programming (e.g. in Mobile/Web development etc.) but BI & DWH is mostly
playing with software tools at junior level.
LinkedIn Reveals 5 Hottest Skills in Demand:
Professional networking website LinkedIn has revealed 5
hottest skills that were in demand in 2014 & 2015. LinkedIn analyzed the
skills and experience data in over 330 million member profiles to gather the
data.
The LinkedIn report shows that tech skills dominated the
'hottest skills' list. "Across the globe, Business intelligence and Data
Warehousing (Data analysis) skills were highly valued. In the US, India, and
France, cloud and distributed computing skills were in particularly high demand".
5 Hottest Skills in India on LinkedIn:
1.
Business intelligence/data analysis
2.
Data engineering and data warehousing
3.
Statistical analysis and data mining
4.
Market research and insights
5.
Social media marketing /Email Marketing
The demand of most businesses in today's
time is, in context of the question asked, data collection, data cleaning,
reporting and gathering insights/ answering business problems/ suggesting methods
and ways to boost business. And so
it comes down to two broad categories Business Intelligence and Data
Warehousing, and Business Consulting.
Job Requirements:
·
Candidate should be a
graduate or post graduate in the field of arts, science or engineering or MBA.
Final Year College students can apply.
·
Should have minimum 50%
marks in school and college.
·
Age should be less than
30 years for Fresher. No age restriction for experienced people.
Essential Skill :
Anybody
new to industry first need to understand what it is and then start working on
coding assignments. I would classify skill and knowledge in two categories, one
technical and others soft skills.
Technical SKILL:
1. Solid concepts of Data warehousing
2. Basic knowledge of Database concepts
3. Good knowledge of SQL
4. OLAP Tools
5. MS Excel
6. Basic knowledge on BI/ETL/Analytics tools.
Soft SKILL:
*Excellent in mathematics (numerical
aptitude)
* Enjoy Learning tools and new technologies
* Positive attitude
* Love for data, calculations and reports
Note:
I would recommend that you take some Software tool training Like
ETL courses (Not mandatory.) Some ETL Courses Like:
1. Informatica 2. Oracle Data
Integrator (ODI) 3. Oracle - Warehouse Builder 4. IBM - Cognos ,etc
Data warehousing (DWH):
Data warehousing is the combination of
Business Intelligence (BI) + Data Analyst (DA) + Master Data Management (MDD) + Mapping Analyst + ETL Developer + data steward +Database Marketing.
This person is going to perform almost all the tasks. They will
work on crude data (unclean) using the many tools available in the market to
clean data, generate reports/ dashboards.
Also, technically this profile is a part of Data Warehousing and
Business Intelligence domain.
Ø Data warehouse (DW or DWH),
also known as an enterprise
data warehouse (EDW),
is a system used for reporting and data analysis. DWs are central repositories of integrated
data from one or more disparate sources.
Ø They store current and historical data and are
used for creating trending reports for senior management reporting such as
annual and quarterly comparisons.
Ø The data stored in the warehouse is uploaded from the
operational systems (such as marketing, sales, etc., shown in the figure to the
left). The data may pass through an operational
data store for additional operations before it is used in
the DW for reporting.
Ø Any Enterprise must necessarily maintain a lot
of data about its operation. This is its "Operational Data".
Note:
Business Analyst is generalist and role varys from company to company based on
their nature of business or even on project level.
What is Data Analyst/Business intelligence/
DWH?
This is the set of techniques and tools for the
transformation of raw data into meaningful and useful information for business
analysis purposes.
BI technologies are capable of handling large amounts of
unstructured data to help identify, develop and otherwise create new strategic
business opportunities.
BI technologies provide historical, current
and predictive views of business operations. Common functions of business
intelligence technologies are reporting, online analytical processing,
analytics, data mining, process mining, complex event processing, business
performance management, benchmarking, text mining, and predictive analytics and
prescriptive analytics.
BI can be used
to support a wide range of business decisions ranging from operational to
strategic. Basic operating decisions include product positioning or pricing.
Strategic business decisions include.
1. What is Data?
Ans:
Data is a collection of facts, such as
numbers, words, measurements, observations or even just descriptions of things.
Data can be qualitative or
quantitative.
- Qualitative data is descriptive information (it describes something)
- Quantitative data is numerical information (numbers).
2. What is Data Cleansing?
Ans:
·
Data cleansing is the process of removing
incomplete, inaccurate or duplicated data from a database.
·
Data cleansing tools can help an organisation save time
and costs spent on fixing data errors manually, and, can lead to a more
efficient customer service.
3. What is raw data?
Ans: Raw data (sometimes called
source data or atomic data) is data that has not been processed for use. A
distinction is sometimes made between data and information to the effect that
information is the end product of data processing.
4. What is Datawarehousing?
A Datawarehouse is the repository of a data and it is used
for Management decision support system. Datawarehouse consists of wide variety
of data that has high level of business conditions at a single point in time.
In single sentence, it is repository of integrated
information which can be available for queries and analysis.
5.What are the benefits of data warehouse?
A data warehouse helps to
integrate data (see Data integration)
and store them historically so that we can analyze different aspects of
business including, performance analysis, trend, prediction etc. over a given
time frame and use the result of our analysis to improve the efficiency of
business processes.
cooked data.
6. What is Business Intelligence?
Business Intelligence is also known as DSS – Decision
support system which refers to the technologies, application and practices for
the collection, integration and analysis of the business related information or
data. Even, it helps to see the data on the information itself.
7. What is Dimension Table?
Dimension table is a table which contain attributes of
measurements stored in fact tables. This table consists of hierarchies,
categories and logic that can be used to traverse in nodes.
8. What is Fact Table?
Fact table contains the measurement of business processes,
and it contains foreign keys for the dimension tables.
Example – If the business process is manufacturing of bricks
Average number of bricks produced by one person/machine –
measure of the business process
9. What are the stages of Datawarehousing?
There are four stages of Datawarehousing:
- Offline Operational Database
- Offline Data Warehouse
- Real Time Datawarehouse
- Integrated Datawarehouse
10. What is Data Mining?
Data Mining is set to be a process of analyzing the data in
different dimensions or perspectives and summarizing into a useful information.
Can be queried and retrieved the data from database in their own format.
11. What is OLTP?
OLTP is abbreviated as On-Line Transaction Processing, and
it is an application that modifies the data whenever it received and has large
number of simultaneous users.
12. What is OLAP?
OLAP is abbreviated as Online Analytical Processing, and it
is set to be a system which collects, manages, processes multi-dimensional data
for analysis and management purposes.
13. What is the difference between OLTP and OLAP?
Following are the differences between OLTP and OLAP:
OLTP
|
OLAP
|
Data is from original data source
|
Data is from various data sources
|
Simple queries by users
|
Complex queries by system
|
Normalized small database
|
De-normalized Large Database
|
Fundamental business tasks
|
Multi-dimensional business tasks
|
14. What is ODS?
ODS is abbreviated as Operational Data Store and it is a
repository of real time operational data rather than long term trend data.
15. What is the difference between View and Materialized
View?
A view is nothing but a virtual table which takes the output
of the query and it can be used in place of tables.
A materialized view is nothing but an indirect access to the
table data by storing the results of a query in a separate schema.
16. What is ETL?
ETL is abbreviated as Extract, Transform and Load. ETL is a
software which is used to reads the data from the specified data source and
extracts a desired subset of data. Next, it transform the data using rules and
lookup tables and convert it to a desired state.
Then, load function is used to load the resulting data to
the target database.
17. What is VLDB?
VLDB is abbreviated as Very Large Database and its size is
set to be more than one terabyte database. These are decision support systems
which is used to server large number of users.
18. What is real-time datawarehousing?
Real-time datawarehousing captures the business data
whenever it occurs. When there is business activity gets completed, that data
will be available in the flow and become available for use instantly.
19. What are Aggregate tables?
Aggregate tables are the tables which contain the existing
warehouse data which has been grouped to certain level of dimensions. It is
easy to retrieve data from the aggregated tables than the original table which
has more number of records.
This table reduces the load in the database server and
increases the performance of the query.
20. What is factless fact tables?
A factless fact tables are the fact table which doesn’t
contain numeric fact column in the fact table.
21. How can we load the time dimension?
Time dimensions are usually loaded through all possible
dates in a year and it can be done through a program. Here, 100 years can be
represented with one row per day.
22. What are Non-additive facts?
Non-Addictive facts are said to be facts that cannot be
summed up for any of the dimensions present in the fact table. If there are
changes in the dimensions, same facts can be useful.
23. What is conformed fact?
Conformed fact is a table which can be used across
multiple data marts in combined with the multiple fact tables.
24. What is Datamart?
A Datamart is a specialized version of Datawarehousing and
it contains a snapshot of operational data that helps the business people to
decide with the analysis of past trends and experiences. A data mart helps to
emphasizes on easy access to relevant information.
25. What is Active Datawarehousing?
An active datawarehouse is a datawarehouse that enables
decision makers within a company or organization to manage customer
relationships effectively and efficiently.
26. What is the difference between Datawarehouse and
OLAP?
Datawarehouse is a place where the whole data is stored for
analyzing, but OLAP is used for analyzing the data, managing aggregations,
information partitioning into minor level information.
27. What is ER Diagram?
ER diagram is abbreviated as Entity-Relationship diagram
which illustrates the interrelationships between the entities in the database.
This diagram shows the structure of each tables and the links between the
tables.
28. What are the key columns in Fact and dimension
tables?
Foreign keys of dimension tables are primary keys of entity
tables. Foreign keys of fact tables are the primary keys of the dimension
tables.
30. What is SCD?
SCD is defined as slowly changing dimensions, and it applies
to the cases where record changes over time.
31. What are the types of SCD?
There are three types of SCD and they are as follows:
SCD 1 – The new record replaces the original record
SCD 2 – A new record is added to the existing customer
dimension table
SCD 3 – A original data is modified to include new data
32. What is called data cleaning?
Name itself implies that it is a self explanatory term.
Cleaning of Orphan records, Data breaching business rules, Inconsistent data
and missing information in a database.
33. What is Metadata?
Metadata is defined as data about the data. The metadata
contains information like number of columns used, fix width and limited width,
ordering of fields and data types of the fields.
34. What are loops in Datawarehousing?
In datawarehousing, loops are existing between the tables.
If there is a loop between the tables, then the query generation will take more
time and it creates ambiguity. It is advised to avoid loop between the tables.
35. What is the definition of Cube in Datawarehousing?
Cubes are logical representation of multidimensional data.
The edge of the cube has the dimension members,and the body of the cube
contains the data values.
36. What is called Dimensional Modelling?
Dimensional Modeling is a concept which can be used by
dataware house designers to build their own datawarehouse. This model can be
stored in two types of tables – Facts and Dimension table.
Fact table has facts and measurements of the business and
dimension table contains the context of measurements.
37. What are the types of Dimensional Modeling?
There are three types of Dimensional Modeling and they are
as follows:
- Conceptual Modeling
- Logical Modeling
- Physical Modeling
38. What is surrogate key?
Surrogate key is nothing but a substitute for the natural
primary key. It is set to be a unique identifier for each row that can be used
for the primary key to a table.
39. What is the difference between ER Modeling and
Dimensional Modeling?
ER modeling will have logical and physical model but
Dimensional modeling will have only Physical model.
ER Modeling is used for normalizing the OLTP database design
whereas Dimensional Modeling is used for de-normalizing the ROLAP and MOLAP
design.
40. What are the steps to build the datawarehouse?
Following are the steps to be followed to build the
datawaerhouse:
- Gathering business requirements
- Identifying the necessary sources
- Identifying the facts
- Defining the dimensions
- Defining the attributes
- Redefine the dimensions and attributes if required
- Organize the Attribute hierarchy
- Define Relationships
- Assign unique Identifiers
41. What are the different types of datawarehosuing?
Following are the different types of Datawarehousing:
- Enterprise Datawarehousing
- Operational Data Store
- Data Mart
42. What needs to be done while starting the database?
Following need to be done to start the database:
- Start an Instance
- Mount the database
- Open the database
43. What needs to be done when the database is shutdown?
Following needs to be done when the database is shutdown:
- Close the database
- Dismount the database
- Shutdown the Instance
44. Can we take backup when the database is opened?
No, We cannot take full backup when the database is opened.
45. What is defined as Partial Backup?
A Partial backup in an operating system is a backup short of
full backup and it can be done while the database is opened or shutdown.
46. What is the goal of Optimizer?
The goal to Optimizer is to find the most efficient way to
execute the SQL statements.
47. What is Execution Plan?
Execution Plan is a plan which is used to the optimizer to
select the combination of the steps.
48. What are the approaches used by Optimizer during
execution plan?
There are two approaches:
- Rule Based
- Cost Based
49. What are the tools available for ETL?
Following are the ETL tools available:
Informatica
Data Stage
Oracle
Warehouse Builder
Ab Initio
Data Junction
Data Stage
Oracle
Warehouse Builder
Ab Initio
Data Junction
50.What is the difference between metadata and data
dictionary?
Metadata is defined as data about the data. But, Data dictionary contain the information about the project information, graphs, abinito commands and server information.
Metadata is defined as data about the data. But, Data dictionary contain the information about the project information, graphs, abinito commands and server information.
What
Is Master Data Management?
Ans:
Ø Master Data Management (MDM) refers to the process of creating and
managing data that an organization must have as a single master copy, called
the master data. Usually, master data can include customers, vendors,
employees, and products, but can differ by different industries and even
different companies within the same industry.
Ø MDM is important because it offers the enterprise a single version
of the truth. Without a clearly defined master data, the enterprise runs the
risk of having multiple copies of data that are inconsistent with one another.
Ø MDM is typically more important in larger organizations. In fact,
the bigger the organization, the more important the discipline of MDM is,
because a bigger organization means that there are more disparate systems
within the company, and the difficulty on providing a single source of truth,
as well as the benefit of having master data, grows with each additional data
source.
Ø A particularly big challenge
to maintaining master data occurs when there is a merger/acquisition. Each of
the organizations will have its own master data, and how to merge the two sets
of data will be challenging. Let's take a look at the customer files: The two
companies will likely have different unique identifiers for each customer.
Addresses and phone numbers may not match. One may have a person's maiden name
and the other the current last name. One may have a nickname (such as
"Bill") and the other may have the full name (such as
"William"). All these contribute to the difficulty in creating and
maintain in a single set of master data.
Ø At the heart of the master data management program is the definition
of the master data. Therefore, it is essential that we identify who is
responsible for defining and enforcing the definition. Due to the importance of
master data, a dedicated person or team should be appointed. At the minimum, a
data steward should be identified. The responsible party can also be a group --
such as a data governance committee or a data governance council.