Pierre Zemb's Blog

Hbase Data Model

Table of contents

๐Ÿ”—HBase?

hbase image

Apache HBaseโ„ข is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isnโ€™t an RDBMS which supports SQL as its primary access language. Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.

-- Hbase architecture overview

๐Ÿ”—Hbase data model

The data model is simple: it's like a multi-dimensional map:

Not as easy as you thought? Here's an example! Let's say that we're trying to save the whole internet. To do this, we need to store the content of each pages, and versioned it. We can use the page address as the row key and store the contents in a column called "Contents". Nowadays, website contents can be anything, from a HTML file to a binary such as a PDF. To handle that, we can create as many qualifiers as we want, such as "content:html" or "content:video".

{
  "fr.pierrezemb.www": {          // Row key
    "contents": {                 // Column family
      "content:html": {           // Column qualifier
        "2017-01-01":             // A timestamp
          "<html>...",            // The actual value
        "2016-01-01":             // Another timestamp
          "<html>..."             // Another cell
      },
      "content:pdf": {            // Another Column qualifier
        "2015-01-01": 
          "<pdf>..."  // my website may only contained a pdf in 2015
      }
    }
  }
}

๐Ÿ”—Key design

Hbase is most efficient at queries when we're getting a single row key, or during row range, ie. getting a block of contiguous data because keys are sorted lexicographically by row key. For example, my website fr.pierrezemb.www and org.pierrezemb.www would not be "near".

As such, the key design is really important:

As stated by the official documentation:

Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability.

As you may have guessed, this is why we are using the reverse address name in my example, because www is too generic, we would have hotspot the poor region holding data for www.

If you are curious about Hbase schema, you should have a look on Designing Your BigTable Schema, as BigTable is kind of the proprietary version of Hbase.

๐Ÿ”—Be warned

I have been working with Hbase for the past three years, including operation and on-call duty. It is a really nice data store, but it diverges from classical RDBMS. Here's some warnings extracted from the well-written documentation:

HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc. However, HBase has many features which supports both linear and modular scaling.

-- NoSQL?

If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.

-- When Should I Use HBase?


Thank you for reading my post! Feel free to react to this article, I am also available on Twitter if needed.

Tags: #hbase #modeling