Bigtable Stores in Practice
by Dan McCreary and Ann Kelly, Authors of Making Sense of NoSQL
Bigtable systems are important NoSQL data architecture patterns because they can quickly scale to manage large data volumes. They are also closely tied to many MapReduce systems. In this article from Making Sense of NoSQL, the authors discuss how Bigtable systems store data using row and column keys and how they are used in several business applications.
Bigtable systems are a type of database that uses row and column identifiers as general-purpose keys for data lookup. They are sometimes referred to as a data store rather than a database, since Bigtables lack features you may expect to find in a traditional database. For example, they lack typed columns, secondary indexes, triggers, and query languages. Systems introduced by Google, as well as HBase and Hypertable, are good examples of Bigtable systems.
Our first example of using rows and columns as a key is the spreadsheet. While most of us don’t think of spreadsheets as a NoSQL technology, they serve as an ideal way to visualize how keys can be built up from more than one value. Figure 1 shows a spreadsheet with a single cell at row 3 and column 2 (the B column) that contains text.
Figure 1 Using a row and a column to address a cell. The cell has an address of 3B and can be thought of as the lookup key in a sparse matrix system.
Figure 2 Spreadsheets use a row/column pair as a key to look up the value of a cell. This is similar to using a key-value system where the key has two parts. Like a key-value store, the value in a cell may take on many types such as strings, numbers, or formulas.
This is roughly the same concept in Bigtable systems. Each item of data can only be found by knowing information about the row and column identifiers. And, like a spreadsheet, you can insert data into any cell at any time. Unlike an RDBMS, you don’t have to insert all of the column data for each row.
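The spreadsheet analogy above can be sketched in a few lines. This is a minimal, illustrative model only (not any real Bigtable API): the lookup key is a (row, column) pair, and only cells that actually hold data consume storage.

```python
# Sparse matrix sketch: the lookup key is the (row, column) pair,
# and only cells that hold data consume any storage.
sheet = {}  # (row, column) -> value

# Insert data into any cell at any time; no schema declared up front.
sheet[(3, "B")] = "Net Profit"
sheet[(7, "D")] = 41500.00

def cell(row, col):
    """Look up a cell by its two-part key; empty cells simply have no entry."""
    return sheet.get((row, col))

print(cell(3, "B"))   # the value stored at address 3B: 'Net Profit'
print(cell(1, "A"))   # an empty cell: None, and no storage was ever used for it
```

Note that, as in a spreadsheet, nothing forced us to fill in every column for a row; cells simply exist or don't.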
Real Bigtable systems
Now that you are comfortable with slightly more complex keys, we will add two additional fields to the spreadsheet example. In figure 3, you can see that we have added a column family and timestamp to the key, which we will use to discuss Bigtable implementation.
Figure 3 The key structure in Bigtable stores is similar to a spreadsheet but has two additional attributes. In addition to the column name, a column family is used to group similar column names together. The addition of a timestamp in the key also allows each cell in a Bigtable store to store multiple versions of a value over time.
The key in the above figure is typical of a class of NoSQL systems called Column Stores or Bigtable implementations. Unlike the typical spreadsheet, which might have 100 rows and 100 columns, Bigtable systems are designed to be...well...very big. How big? Systems with billions of rows and hundreds or thousands of columns are not unheard of. For example, a Geographic Information System (GIS) like Google Earth might have a row id for the longitude of a portion of a map and use the column name for the latitude of the map. If you have one map for each square mile on Earth, you could have 15,000 distinct row ids and 15,000 distinct column ids.
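The four-part key described above can be sketched as follows. This is a simplified, hypothetical model (the names `put`/`get` and the row ids are illustrative, not a real product's API); it shows how keeping a timestamp in the key lets one cell hold multiple versions of a value.

```python
# Sketch of a Bigtable-style key: (row id, column family, column name)
# maps to a list of (timestamp, value) versions.
import time

store = {}

def put(row, family, column, value, ts=None):
    """Write a new version of a cell, stamped with a timestamp."""
    ts = ts if ts is not None else time.time()
    store.setdefault((row, family, column), []).append((ts, value))

def get(row, family, column):
    """Return the most recent version of the cell, or None if it is empty."""
    versions = store.get((row, family, column))
    if not versions:
        return None
    return max(versions)[1]  # the highest timestamp wins

put("row-17", "site", "homepage", "version 1", ts=100)
put("row-17", "site", "homepage", "version 2", ts=200)
print(get("row-17", "site", "homepage"))  # 'version 2' - the newest version
```

Older versions are still present under the same key, which is how Bigtable stores support reading a cell's history.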
What’s unusual about these large implementations is that, if you viewed them in a spreadsheet, you would see that very few cells contain data. This sparse matrix implementation is a grid of values where only a small percent of cells contain values. Unfortunately, relational databases are not efficient at storing sparse data; however, column stores are designed exactly for this purpose.
With a traditional relational database, you can use a simple SQL query to find all the columns in any table; with a sparse matrix system, however, you must scan every element in the database to get a full listing of all column names. With many columns, running reports that list a column and its related columns can be tricky unless you use a column family (a high-level category of data, also known as an upper-level ontology). For example, you may have groups of columns that describe a website, a person, a geographic location, and products for sale. To view these columns together, you would group them in the same column family to make retrieval easier.
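The grouping described above can be sketched simply. The family and column names here are hypothetical; the point is that a family acts as a prefix on the column part of the key, so all related columns can be fetched together without scanning every column name in the database.

```python
# One row's cells, keyed by (column family, column name) -> value.
row = {
    ("person", "first-name"): "Ann",
    ("person", "last-name"): "Kelly",
    ("geo", "city"): "Minneapolis",
    ("product", "sku"): "X-100",
}

def family(row_data, fam):
    """Return every column in one family - no scan of unrelated columns needed."""
    return {col: val for (f, col), val in row_data.items() if f == fam}

print(family(row, "person"))  # {'first-name': 'Ann', 'last-name': 'Kelly'}
```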
Not all column stores use a column family as part of their key. If they do, you will need to take this into account when storing an item, since the column family is part of the key and data cannot be retrieved without it. Because the API is so simple, NoSQL products can scale to manage large volumes of data, adding new rows and columns without needing to modify a data definition language.
Bigtable systems deal with the issues of partition tolerance as part of their core architecture. Although you can start your development on a single laptop, in production these systems are designed to store data on three distinct nodes in different geographic regions (geographically distinct data centers) to ensure high availability. Bigtable systems have built-in automatic failover to detect failing nodes and algorithms to identify corrupt data. They leverage advanced hashing and indexing tools such as Bloom filters to perform probabilistic analysis on large data sets; the larger the data set, the better these tools perform. Finally, Bigtable implementations are designed to work with MapReduce jobs, serving as either input or output, so be sure to consider these factors before you select a Bigtable implementation.
Benefits of Bigtable systems
The Bigtable approach of using a row id and column name as a lookup key is a flexible way to store data. It gives you the benefits of higher scalability and availability, and it saves you time and hassle when adding new data to your Bigtable store. As you read through these benefits, think about the data your organization collects to see if a Bigtable store would help you gain a competitive advantage in your market.
The word “Big” in the title of the original Google paper tells us that Bigtable systems are designed to scale beyond a single processor. At the core, Bigtable systems are noted for their scalable nature, which means that, as you add more data to your Bigtable system, your investment will be in the new nodes added to the computing cluster. With careful design, you can achieve a linear relationship between the way data grows and the number of processors you require.
The principal reason for this relationship is the simple way that row ids and column names are used to identify a cell. By keeping the interface simple, the back-end system can distribute queries over a large number of processing nodes without performing any join operations. With careful design of row ids and columns, you give the system enough hints to tell it where to get related data and avoid unnecessary network traffic, which is crucial to system performance.
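One common way to give the system those hints, sketched below with made-up row ids, is to design row ids so related records share a prefix. If rows are stored in sorted order, one range scan then retrieves everything about an entity; no join is needed.

```python
# Row-id design sketch: rows for one entity share a prefix, so a
# lexicographic range scan fetches them together without a join.
rows = sorted([
    "user1001:profile",
    "user1001:settings",
    "user1002:profile",
    "user1001:orders",
])

def range_scan(sorted_rows, prefix):
    """All rows for one entity come back together from one contiguous range."""
    return [r for r in sorted_rows if r.startswith(prefix)]

print(range_scan(rows, "user1001:"))
# ['user1001:orders', 'user1001:profile', 'user1001:settings']
```

Whether to cluster related rows (for locality) or spread them out (for load balancing) is exactly the kind of deliberate row-id design choice the paragraph above describes.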
By building a system that scales on distributed networks, we gain the ability to replicate data on multiple nodes in a network. Knowing that the architecture of Bigtable systems uses efficient communication, the cost of replication is lower and, because you do not need to join data, backup copies of any portion of a Bigtable matrix can be stored in remote computers. This means that, if the server that holds part of the sparse matrix crashes, other computers are standing by to provide the data service for those cells.
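The replication idea can be sketched as follows. This is purely illustrative (the node names are invented, and real systems also handle failover and consistency): each row is deterministically assigned to three distinct nodes, so losing any one node leaves two copies standing by.

```python
# Sketch: assign each row to three distinct nodes, deterministically,
# so any client can compute where the copies live.
import hashlib

NODES = ["dc-east", "dc-west", "dc-central", "dc-north", "dc-south"]

def replicas(row_id, count=3):
    """Pick `count` distinct nodes for a row id."""
    start = int(hashlib.md5(row_id.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(count)]

print(replicas("row-42"))  # three distinct nodes hold copies of this row
```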
Easy to add new data
Like the key-value and graph stores, a key feature of the column store is that you don’t need to fully design your data model before you begin inserting data. There are, however, a couple of constraints that you should know before you begin. Your groupings of column families should be known in advance. But, the names of a row id and a column name can be created at any time.
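The constraint just described can be made concrete in a small sketch. The class and names here are hypothetical; the point is that the set of column families is fixed when the table is created, while new row ids and column names can appear at any time.

```python
# Families are declared up front; columns within a family are ad hoc.
class Table:
    def __init__(self, families):
        self.families = set(families)  # fixed at table-creation time
        self.cells = {}

    def put(self, row, family, column, value):
        if family not in self.families:
            raise ValueError(f"unknown column family: {family}")
        self.cells[(row, family, column)] = value  # any new column is fine

t = Table(families=["contact", "prefs"])
t.put("user-1", "contact", "email", "a@example.com")  # brand-new column: OK
t.put("user-1", "prefs", "night-mode", True)          # also OK
# t.put("user-1", "billing", "card", "...")  # would raise: family not declared
```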
For all the good things that you can do with Bigtable systems, be warned that they are really designed to work on distributed clusters of computers and may not be appropriate for small datasets. You need at least five processors to justify a Bigtable cluster since many systems are designed to store data on three different nodes for replication. Bigtable systems don’t have general purpose SQL queries built in; they store the data into a file system and use external tools to generate reports.
In the next three segments, we’ll look at how Bigtable implementations have been efficiently used by companies like Google to manage analytics, maps, and user preferences.
Case Study: Storing analytical information in Bigtable
In Google’s Bigtable paper, the authors described how Bigtable is used to store website usage information in Google Analytics. The Google Analytics service allows you to track who is visiting your website. Every time a user clicks on a web page, the hit is stored in a single row-column entry that has the URL and a timestamp as the row id. The row ids are constructed so that all page hits for a specific user session are stored together.
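A row-id construction along those lines can be sketched as follows. The exact format is illustrative, not Google's actual encoding: zero-padded timestamps keep lexicographic order equal to time order, so all hits from one session end up adjacent in the sorted table.

```python
# Sketch: site + session start + hit time as the row id, zero-padded
# so string sort order matches time order.
def hit_row_id(site, session_start, hit_time):
    return f"{site}|{session_start:012d}|{hit_time:012d}"

hits = sorted([
    hit_row_id("example.com", 1700000000, 1700000031),
    hit_row_id("example.com", 1700000000, 1700000005),
    hit_row_id("example.com", 1700009999, 1700010002),
])

# All hits for the session that started at 1700000000 are adjacent,
# so one range scan retrieves the whole session in order.
session = [h for h in hits if h.startswith("example.com|001700000000|")]
print(session)
```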
As you can guess, viewing a detailed log of all the individual hits on your website would be a very long process. Google Analytics makes it simple by summarizing the data at regular intervals (such as once a day) and creating reports that allow you to see the total number of visits and the most popular pages requested on any given day.
Google Analytics is a good example of a large database that scales in a linear fashion as the number of users increases. As each transaction occurs, new hit data is immediately added to the tables, even if a report is running. The data in Google Analytics, like other logging-type applications, is generally written once and never updated. This means that once the data is extracted and summarized, the original data is compressed and put into an intermediate store until archived.
Once the data from event logs is summarized, tools like pivot tables can use the aggregated data. The events can be web hits, sales transactions, or any type of event monitoring systems. The last step will be to use an external tool to generate the summary reports.
In the case of using HBase as a Bigtable store, you will need to store the results in the Hadoop distributed file system (HDFS) and use a reporting tool such as Hadoop Hive to generate the summary reports. Hadoop Hive has a query language that looks similar to SQL in many ways, but it also requires you to write a MapReduce function to move data into and out of HBase.
Case Study: Google maps stores geographic information in Bigtable
Another example of using Bigtable to store large amounts of information is in the area of Geographic Information Systems (GIS). GIS systems like Google Maps store geographic points on Earth, the moon or other planets by identifying each location using its longitude and latitude coordinates. The system allows users to travel around the globe and zoom into and out of places using a 3D-like graphical interface. When viewing the satellite maps, you can then choose to display the map layers or points of interest within a specific region of a map. For example, if you post vacation photos from your trip to the Grand Canyon on the web, you can identify each photo’s location. Later when your neighbor, who heard about your awesome vacation, is searching for images of the Grand Canyon, they would see your photo as well as other photos with the same general location.
GIS systems store items once and then provide multiple access paths (queries) to let you view the data. They are designed to cluster similar row ids together and result in rapid retrieval of all images/points that are near each other on the map.
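The clustering of similar row ids can be sketched with a simplified, geohash-like encoding. The resolution and format here are invented for illustration: interleaving digits of latitude and longitude gives nearby points similar row-id prefixes, so a range scan pulls in everything in one area of the map.

```python
# Sketch: interleave lat/long digits so nearby points share a row-id prefix.
def geo_row_id(lat, lon, precision=6):
    # Shift coordinates into non-negative, zero-padded integer grids
    # (illustrative resolution, roughly thousandths of a degree).
    la = f"{int((lat + 90) * 1000):08d}"
    lo = f"{int((lon + 180) * 1000):08d}"
    # Interleave digits so close coordinates share a leading prefix.
    return "".join(a + b for a, b in zip(la[:precision], lo[:precision]))

grand_canyon = geo_row_id(36.106, -112.112)
nearby_photo = geo_row_id(36.107, -112.110)
minneapolis  = geo_row_id(44.977, -93.265)

# Nearby points share a longer row-id prefix than distant ones.
print(grand_canyon, nearby_photo, minneapolis)
```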
Case Study: Using Bigtable to store user preferences
Many websites let users store information about their preferences. This account-specific information can include privacy settings, contact information, and whether and how they want to be notified about key events on your website. A user preference page for a social networking site typically has fewer than 100 fields, many of which may be simple true/false values or a code selection. So, a 1K file is a reasonable first approximation for each user, as long as it does not include a photo.
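Such a preference store maps naturally onto the Bigtable key structure: one row per account, with many small columns under a single column family. The sketch below is illustrative (the field names and key format are hypothetical).

```python
# One preferences row per user: (row id, family, column) -> value.
prefs = {}

def save_pref(user_id, name, value):
    """Write a single small preference field for one account."""
    prefs[(f"user:{user_id}", "prefs", name)] = value

def load_prefs(user_id):
    """Read-mostly path: fetch the whole preferences row at login time."""
    key = f"user:{user_id}"
    return {col: v for (r, fam, col), v in prefs.items() if r == key}

save_pref(1001, "night-mode", True)
save_pref(1001, "notify-email", False)
save_pref(1001, "language", "en")
print(load_prefs(1001))
# {'night-mode': True, 'notify-email': False, 'language': 'en'}
```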
There are a few factors about user preferences that make them unique. There are very few transactional requirements. Generally, the user associated with the account is the only one who makes changes, and the changes occur infrequently. As a result, you don’t really need to worry about ACID transactions; your focus should be on making sure that, when the user presses the “Save” or “Update” button, the transaction is not blocked.
Other factors to consider are how many user preferences you need to store and how reliable the system needs to be. When a user logs in, you usually need to access the preferences to customize the user experience, so this read-mostly event needs to be fast and scalable even if you have many concurrent users.
You also may need some reporting tools, but Bigtable does not support standard SQL queries. What if your marketing department wants to get reports of how many people are using their “real name” vs. an alias from the user preferences? Using a key-value store may not be the best way to store preferences, although many sites take this route.
Bigtable systems may provide the ideal match for storing user preferences when combined with an external reporting system. These reporting systems can be set up to provide very high availability through redundancy and yet still allow reporting to be done on the user preference data. In addition, as the number of users grows, the size of the database can expand by adding new nodes to your system without changing the architecture. If you have very large data sets, Big Data stores may provide an ideal way to create reliable and yet scalable data services.
One of the challenges with NoSQL systems is there are many different architectural patterns from which to choose.
Here are some other Manning titles you might be interested in:
Big Data by Nathan Marz
Neo4j in Action by Jonas Partner and Aleksa Vukotic
Redis in Action by Josiah Carlson