You design a Business Intelligence (BI) solution by using SQL Server 2008. The solution includes a SQL Server 2008 Analysis Services (SSAS) database. The database contains a data mining structure that uses a SQL Server 2008 table as a data source. A table named OrderDetails contains detailed information on product sales. The OrderDetails table includes a column named Markup.
You build a data mining model by using the Microsoft Decision Trees algorithm. You classify Markup as discretized content. The algorithm produces a large number of branches for Markup and results in low confidence ratings on predictable columns. You need to verify whether the Markup values include inaccurate data. What should you do?
A.
Modify the content type of Markup as Continuous.
B.
Create a data mining dimension in the SSAS database from OrderDetails.
C.
Create a data profile by using SQL Server 2008 Integration Services (SSIS).
D.
Create a cube in SSAS. Use OrderDetails as a measure group. Recreate the data mining structure and mining model from the cube data.
Explanation:
Tip: "inaccurate data" = "profile SSIS"Discretized The column has continuous values that are grouped into buckets. Each bucket is considered to have a specific order and to contain discrete values. Possible values for discretization method are automatic, equal areas, or clusters. Automatic means that SSAS determines which method to use. Equal areas results in the input data being divided into partitions of equal size. This method works best with data with regularly distributed values. Clusters means that SSAS samples the data to produce a result that accounts for clumps of data values. Because of this sampling, Clusters can be used only with numeric input columns. You can use the date, double, long, or text data type with the Discretized content type.
Microsoft Decision Trees Algorithm
Microsoft Decision Trees is probably the most commonly used algorithm, in part because of its flexibilitydecision trees work with both discrete and continuous attributesand also
because of the richness of its included viewers. Its quite easy to understand the output via these viewers. This algorithm is used to both view and to predict. It is also used (usually in
conjunction with the Microsoft Clustering algorithm) to find deviant values. The Microsoft Decision Trees algorithm processes input data by splitting it into recursive (related) subsets.
In the default viewer, the output is shown as a recursive tree structure.
If you are using discrete data, the algorithm identifies the particular inputs that are most closely correlated with particular predictable values, producing a result that shows which columns
are most strongly predictive of a selected attribute. If you are using continuous data, the algorithm uses standard linear regression to determine where the splits in the decision tree occur.
Clicking a node displays detailed information in the Mining Legend window. You can configure the view using the various drop-down lists at the top of the viewer, such as Tree,
Default Expansion, and so on. Finally, if youve enabled drillthrough on your model, you can display the drillthrough informationeither columns from the model or (new to SQL Server 2008) columns from the mining structure, whether or not they are included in this model.
Data Profiling
The control flow Data Profiling task relates to business problems that are particularly prominent in BI projects: how to deal with huge quantities of data and what to do when this data originates from disparate sources. Understanding source data quality in BI projectswhen scoping, early in prototyping, and during package developmentis critical when estimating the work involved in building the ETL processes to populate the OLAP cubes and data mining structures. Its common to underestimate the amount of work involved in cleaning the source data before it is loaded into the SSAS destination structures.The Data Profiling task helps you to understand the scope of the source-data cleanup involved in your projects. Specifically, this cleanup involves deciding which methods to use to clean up your data. Methods can include the use of advanced package transformations (such as fuzzy logic) or more staging areas (relational tables) so that fewer in-memory transformations are necessary during the transformation processes. Other considerations include total number of tasks in a single package, or overall package size.
(Smart Business Intelligence Solutions with Microsoft SQL Server 2008, Copyright 2009 by Kevin Goff and Lynn Langit)