Architecture and Star Schema
Architecture and Star Schema
by William D. Back, Nicholas Goodman, and Julian Hyde, authors of Mondrian in Action
Mondrian is expecting to perform its analytic magic on top of a star schema. In this article, based on chapter 3 of Mondrian in Action, the authors cover the general architecture of an analytic solution and then explore star schemas, the "best practice" database modeling technique for analytic systems. Click here for 40% savings on Mondrian in Action and other related titles.
Mondrian is an open source business analytics tool that enables organizations of any size to give business users access to the data for interactive analysis without needing support from IT or database administrators. In this article, you'll learn why Mondrian requires a particular database architecture and how it aids in analysis. By the end of the article, you should understand what a star schema is and how it supports the goals of online analytics processing.
Characteristics of analytic systems
As the database warehouse (DW) or business intelligence (BI) architect or developer of a Mondrian solution, you're charged with making sure the analytics presented to your users exhibit the following three characteristics:
- Fast—Users expect speed-of-thought results, and the fact that your database is scanning millions of transactions to present these results is irrelevant to them. If you haven't presented them the results in a period of time where they can continue their flow to ask their next question, refine their results, and continue exploring, you've lost them.
- Consistent and accurate—Nothing drives users more insane than running two reports in a system and getting results that don't match each other. For instance, if your online analytical processing (OLAP) system shows them a report that reports on Sales by Quarter and Year and the two figures don't match (refer to table 1), your users will lose confidence in how accurate the analytics presented by your system are. If your totals (quarter 1, quarter 2, quarter 3, and quarter 4) don't make up the total (the year at $350), your users will lose confidence.
- Information focused—Users don't care that, in source system XYZ, the SKU for the product is referred to as the UNIQUE_RESOURCE_LOCATOR field and that, in system ABC, the SKU is alternatively referred to as PLU_BASE_UNIT. Users care about information such as Sales, Customers, and not data such as column names LINE_TRANS_AMT and PLU_BASE_UNIT.
Fast rule of thumb
If the reporting results take long enough that the user considers a walk to refill their coffee, your solution is not fast enough.
Table 1 Inconsistent and inaccurate
The analytics you present in your OLAP system must focus on those information subjects that are analytically significant to your users. Users want to see the customer named Bob and state California instead of the transactional ids (100, 22CA1, and so on) for such values. Computers think in IDs and codes, and humans think in names and labels. Additionally, transactional systems (OLTP) change over time.
Consider how many Fortune 500 companies are, at the very moment you're reading this line, in the process of a huge implementation of an ERP, CRM, or HR application? The OLAP system needs to handle source-of-data changes with grace. Your company may someday move from Oracle Applications to Salesforce.com.
If you've worked with technology for any period of time, you're likely thinking that fast, consistent, and information focused are not implemented easily. These objectives are simply not low-hanging fruit that the installation of a single piece of software can achieve. We will now look at the architectures that can help us build a system that achieves these characteristics.
Fortunately for all of us, including you, building analytic systems to meet those needs is not new at all. Many thousands of professionals, numerous authors, industry experts, and vendors have been refining for years the tools, techniques, and best practices for implementing analytic systems that are fast, consistent, and information focused.
This collective wisdom boils down to two, almost universally accepted tenets of building analytic systems:
- Copy data to systems dedicated for analytics—Data, unlike physical assets, can be duplicated with relative ease. Your transaction systems are intricate systems that require advanced security, high availability, and are designed to be fully available to run your company. They are, generally speaking, ill-suited to performing aggregations across many areas, such as multi-table joins, and typically include analytic reporting as a "bolt-on" piece of the primary application. Doing analytics on these systems is a common source for poor system availability and high system loads. I'm guessing that, if you're reading this, there's a decent chance that you've received a nasty email from a DBA or System Administrator for running a query that is slowing down the whole system.
- Transform, clean, and enrich data for analytics—While these transactional systems tend to be flexible, they speak a foreign language of codes, effective dates, keys, composite keys, and joins. Data is what transactional systems are built to manage, not what matters to the analytic users. The industry knows that, in order for the data to be useful, it must undergo a transformation from data it enters, such as CSV files or raw tables in a database, into information subjects, such as Sales or Customers, if it's to be effective for analysis. Often, the data necessary to do an analysis isn't even present in the original data stream, and integrating, matching and enriching this data in the analytic system is necessary to present certain analytics. The act of moving data from the source system to the analytics system is called extract, transform, and load (ETL).
Whether it's called a data warehouse that serves as the long-term storage of company data, or a single analysis area commonly called a datamart, or an interdependent set of datamarts, the industry has determined that analytics should be done on separate compute resources and include data that has been cleaned, transformed, and enriched from multiple source systems. Figure 1 shows a typical analytic environment overview with a set of systems separated from the source systems. Data is then copied to the analytic environment via ETL, and users access their reports/data in the analytic environment rather than directly against the source systems.
Figure 1 Analytic architecture overview. Data is copied (and enriched) from the source systems to a dedicated analytic environment, which is where users (via Mondrian) access analytic data.
The common charter to create an analytic system that is fast, consistent, and information focused has led the industry to a widely accepted best practice of dimensional modeling, based on the physical star schema methods. We will briefly explain the basics of the star schema (also known as dimensional modeling), how it meets these goals, and compare it with third normal form modeling (3NF).
The star schema is an industry best practice modeling technique optimized for massive, dynamic aggregations. While 3NF modeling is the industry best practice for modeling transactional systems (OLTP), star schemas are the best practice for analytics (OLAP). The concepts and specifics are outlined comprehensively in the authoritative book on the topic by Ralph Kimball et al (1998), The Data Warehouse Lifecycle Toolkit.
For all intents and purposes, Mondrian expects your data to be in a relational database, in the star schema format (or one of its closely related permutations). The star schema as a set of relational database tables is what Mondrian uses as the basis for performing aggregations and analytics.
We're about to cover the general structure of star schema. It's worth noting that the specifics of each individual model for a particular business or need are driven by the analysis needs of that particular company, department, or user.
The desired analytics and model for Mondrian cubes drives the design of both the Mondrian schema and the star schema that supports it. To understand how the analytic needs of users drive the actual implemented model, we will consider the following example.
Adventure Works managers want to understand how much revenue they are receiving by customer type. They are looking to understand, first, their SALES by CUSTOMER STATE. Of course, they will eventually wish to look at additional attributes, such as SALES over TIME, but, for our first foray into star schemas, this basic SALES by CUSTOMER STATE will suffice.
Mondrian and star schemas
Mondrian expects your data to be in a relational database, in a star schema, which is an industry best practice modeling technique for OLAP systems.
A star schema consists of a fact table surrounded by multiple dimension tables. The shape of a fact surrounded by dimensions is how the star schema gets its name, based on its resemblance to celestial stars.
Figure 2 Star schema. Notice the fact that contains the WHAT you are trying to measure (sales and, more specifically, the column that has the data to be aggregated, sales_amount). The dimensions are the BY attributes that you are trying to segment and allocate the data to (customer and, more specifically, the state the customer is from customer_state).
Fact tables contain the stuff you are trying to aggregate, total, and measure. The numbers that are added to create the total sales number are contained in the fact table and are referred to as the measures in the cubes. It's the WHAT you are trying to measure and analyze. In our example and figure 2, sales is the WHAT we are trying to measure.
Dimensions tables contain the qualifying attributes that you want to split out those numbers by. In our example and figure 2, the users wish to split out the total sales (in the fact) by customer state so that you can see the total sales for each state individually, along with the total sales for all states. Customer state is the BY that you are trying to use for comparison and filtering.
WHAT you are trying to measure (revenue, web impressions, customer calls, and so on) is in the fact table. The things you are trying to split it out by (product, geography, and so on) are in the dimension tables.
When looking at the physical database model, a star schema consists of dimension tables that contain rows, independent of the transactions that have the attributes. For instance, a product dimension would contain a row per product and contain information on product categories, vendors, departments, and so on.
Typically, this foreign key is also non-nullable so that you can aggregate the table at any combination of dimensions and always get the same sum total.
Remember, consistency is one of our goals and this ability to aggregate at any combination of dimensions helps to keep the sum totals avoiding the results in table 1.
Dimension tables are highly normalized and now contain many attributes. Your source system may have included information about departments in a separate table, but, in the star schema, the department name is now part of the employee dimension.
A single fact table that contains a row for the individual transactions (order line items, individual clicks) matching the grain of the table. (See Kimball for more info on "grain.") The fact table contains a set of surrogate integer keys that easily join to the dimension tables for the attributes associated. Additionally, it will most often have one or more columns that contain the number (sales_amount) associated with that single transaction, which is to be aggregated .
The other thing to note is that using this technique means that fact tables typically contain at least ten times more but, more commonly, at least a thousand more records than the dimension tables. Fact tables contain millions to billions of rows, and dimension tables typically contain thousands to just a few million. This has important performance benefits and is a key enabler in how this modeling technique can deliver speedy results even when millions of transactions/facts are involved.
Star schema: dimensions and facts
- Are the WHAT you are trying to measure.
- Usually numeric and are aggregated (SUM, COUNT, or AVG).
- Contains millions (or more!) of "skinny" records, typically only integers and numbers.
- Uses many non-nullableE foreign keys to dimension records.
- Are the BY you use to allocate or split your numbers.
- Contain thousands (sometimes more) of "fat" records, typically with many varchar and descriptive attributes.
- Are highly denormalized; often contain typically separate items (customer and state names) together in a single table.
Comparing star schemas with 3NF
There's a high likelihood that you've designed, built, maintained, or optimized a database schema for an application. We'll examine the technique, as it's likely familiar to you, and then examine why we'll depart from it for our star schema.
As a brief refresher, third normal form (3NF) is a modeling technique in which redundancy is reduced and foreign keys are introduced so that additional attributes (such as the name of the state) are located in a different location and must be accessed in another table.
The 3NF model has typically and unequivocally been considered the "correct" database modeling technique with little discussion or questioning. This is true for transactional systems like your ERP or CRM, given that:
- Lots of concurrent users are reading/modifying—Keeping similar data together and factoring out and normalizing data that is repetitive (such as department names, locations, and so on) allow lots of users to operate on smaller sections of the dataset independently and without conflicts (or locks).
- Subprograms and people are accessing small slices of data—Typically, users of an HR system are not going to update the last name of every employee in the company. They are more likely accessing a single employee and updating the last name of a single record.
- Joins are inexpensive with a relatively small amount of data—Why duplicate the full department information (name and so on) with every employee when a simple join or lookup to another table in the database for a single department number is simple and fast?
The 3NF technique is not, however, a good model for doing large aggregations touching entire sets of data by a few users. Joining a single record to others (a small amount of data) tends to be efficient. Joining many tables to include all of the attributes used for qualification (department, and so on) requires much more work by the database to be able to do that same join for the entire database (large numbers of database rows). You have likely written a few SQL statements for your reports that are a page or two themselves and their database EXPLAIN is a small chapter of a book; these queries tend to perform poorly as the data set grows in size. We have written our share of these expansive, poorly performing queries before embarking on our OLAP adventures.
You should expect that, if you are accustomed to 3NF modeling, the first star schemas you design will not feel right. It will leave you with a strange, lingering feeling that you've just built a terrible data model. Over time, as the star schemas' fit to the use case becomes increasingly apparent, the modeling technique won't feel quite so strange.
Star schema benefits
The star schema has the following benefits:
- At most, one pass through the table—No need to look over millions of records time and again; the database will simply make one pass aggregating the dataset. The single remaining join path is centered on the largest table; database planners typically produce efficient executions when cardinality differences between tables are large. It makes the planner's job straightforward by easily identifying the tables that will be expansive (and drive the single pass approach) and which tables are smaller lookup tables.
- No sum total issues caused by missing JOIN keys—Consider the difficulty in balancing sum totals if some products are not assigned to categories. If you join via a key that isn't present and the join condition in SQL isn't satisfied, you typically lose records before doing the aggregations. In this situation, it's possible to do an aggregation without a GROUP BY statement and get one figure, and get different totals if you join to a table. With a star schema, you can mix and match and do aggregations at the intersection of any attributes and always come up with the same exact sum total of revenue. You can probably think back to a SQL report you've written that joins to an extra table that has an additional reporting field of interest and, all of a sudden, the sum totals are missing due to the missing join keys. Star schemas are better suited to avoid this pitfall; you must include a dimension record that serves as the star schema equivalent of NULL so that fact records that don't have the attribute (such as category) always join to every dimension.
- Variety of physical database optimizations—Most importantly, the star schema has eliminated the multi-table joins that are extremely inefficient and costly to perform on large sets of data. A single, easy-to-optimize physical structure (one large table, single key joins to surrounding smaller tables) is something that nearly every database can perform effectively. Further, expecting this particular modeling technique and seeing the physical tables organized as a star schema, some databases have features that provide even greater efficiencies and query speed improvements. Bitmap indexes, parallel query and partitioning, and sharded fact tables, are just a few of the techniques. In fact, there's an entire class of column storage databases that are purposely built to handle such schemas/workloads and provide blazing fast performance on top of star schemas.
- Easy for OLAP systems like Mondrian, but also now information focused—While the primary consumer for a star schema is an OLAP engine like Mondrian, the database and tables themselves represent an information-focused, easy-to-understand view of the data for reporting. A star schema has reduced the complexity and knowledge necessary to write plain old SQL reports against the data. Analysts that typically needed to remember complicated join rules (have to include effective date in this SQL join; otherwise, you get too many records and double-count your sales) have a simplified, easy-to-understand, information-focused model to report against.
We have looked at the basic structure and benefits of star schema design, and compared it to third normal form (3NF) techniques.
Here are some other Manning titles you might be interested in:
MongoDB in ActionKyle Banker
RabbitMQ in ActionAlvaro Videla and Jason J.W. Williams
Hadoop in ActionChuck Lam