How we leverage our document datastore (DynamoDB)

DynamoDB Document Data Stores

tl;dr Document datastores, like DynamoDB, offer unparalleled performance and bi-directional scalability, which enable us to exploit the cloud pricing model while providing (largely) transient storage within our serverless architecture.

In previous posts, I mentioned how we leverage document datastores, and specifically AWS DynamoDB. I'm a huge fan of key/value document datastores, and of DynamoDB in particular, because it offers "peaks of more than 20 million requests per second" and "single-digit millisecond response times at any scale". That said, like any technology, it needs to be positioned correctly within the overall architecture.

Within our recommendation engine application, we lean heavily on document datastores as the (semi-persistent) back-end data layer for the front-end application, and we use them mostly transiently, which I will discuss in greater detail throughout this post.

Let's begin by talking about what a document datastore is not: it is not our persistent data layer, and it is not the foundation for our analytics offerings, except as it pertains to real-time analytics of extremely recent activity. Document datastores are great, if you know the access pattern. To illustrate, let's take a simple JSON (key/value) object that represents a fictitious customer and their purchases:

{"customerKey" : 123,
   "customerName" : {"firstName": "Ann",
     "lastName" : "Example"},
   "customerAddress" : {"addressLine1": "123 Main Street",
     "addressLine2" : "Apartment B",
     "city" : "Wherever",
     "stateProvince": "OR",
     "country" : "USA",
     "latitude": "45.5155",
     "longitude": "122.6793"},
   "purchases" : [{"purchaseDate" : "20180927",
     "productId" : 123,
     "quantity" : 2,
     "purchasePrice" : 100},
     {"purchaseDate" : "20180927",
     "productId" : 456,
     "quantity": 1,
   "purchasePrice" : 45},
     {"purchaseDate" : "20190102",
     "productId" : 123,
     "quantity": 3,
     "purchasePrice" : 140},
     {"purchaseDate" : "20190103",
     "productId" : 862,
     "quantity": 1,
     "purchasePrice" : 15}]
}

The biggest piece of advice I could give regarding document datastores is that you have to understand the access pattern. Imagine we have a document like this for each customer, which adheres (generally) to the above document structure. If our system knows which customer has signed in, it is incredibly fast to retrieve information for that customer. You simply pass it the customerKey, and all of the information within that document is available for use.
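
To make that concrete, here is a minimal sketch of that single-key lookup in Python with boto3. The CUSTOMER table name and the customerKey partition key are assumptions carried over from the example above, not necessarily a production schema.

import boto3

dynamodb = boto3.resource("dynamodb")
customer_table = dynamodb.Table("CUSTOMER")  # illustrative table name

def get_customer(customer_key):
    # Single-item read: because we know the access pattern, this is one
    # constant-time lookup by key, not a query or a scan.
    response = customer_table.get_item(Key={"customerKey": customer_key})
    return response.get("Item")  # None if no document exists for that key

customer = get_customer(123)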

Alternatively, for an analytics system, we may want to answer questions like "how many purchases were made on 20180927?" Based on the structure of this document, a document datastore would be horrible for this. To answer that question, we would have to pull the document for every customer, extract the purchase key/value pairs from each one, filter for purchases on that specific date, and then summarize the results. A far superior approach would be to obtain that information from a relational database, which is built for aggregations, summarizations, and data exploration. A fair counter-argument might be to maintain a different document that contains purchases keyed by date, so that instead of accessing the document by customerKey, we could extract only the document pertaining to that specific date. Even so, summarizing information from the key/value pairs is still sub-optimal when compared to a traditional relational database.
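
To illustrate just how awkward that is, here is a rough sketch (Python/boto3) of what answering that question would require against the customer-keyed structure above. The table name and the client-side aggregation are illustrative, not how we actually do it; the point is the full table scan.

import boto3

dynamodb = boto3.resource("dynamodb")
customer_table = dynamodb.Table("CUSTOMER")  # illustrative table name

def count_purchases_on(purchase_date):
    # Count purchase line items across ALL customers for one date.
    count = 0
    scan_kwargs = {}
    while True:
        # A Scan reads (and bills for) every document in the table.
        response = customer_table.scan(**scan_kwargs)
        for item in response["Items"]:
            for purchase in item.get("purchases", []):
                if purchase.get("purchaseDate") == purchase_date:
                    count += 1
        # Scans are paginated; keep going until DynamoDB says we're done.
        if "LastEvaluatedKey" not in response:
            break
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return count

print(count_purchases_on("20180927"))  # 2, given the example document above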

Additionally, managing the size of this secondary document (with date at the top of the hierarchy) would be a challenging endeavor, to say the least, when compared against technologies built to do that type of work. Documents that accumulate many purchases, whether keyed by date or by customer, could become too large for DynamoDB, which has a maximum item size of 400 KB. A quick look at my Amazon purchases this month might put that limit to the test...
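
If you do go down that road, it's worth guarding against the limit before a write fails. A rough sketch follows; JSON length is only an approximation of DynamoDB's item-size accounting, and the threshold is arbitrary, but it is close enough to flag documents that are growing without bound.

import json

DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024  # DynamoDB's 400 KB maximum item size

def is_near_item_limit(document, threshold=0.8):
    # Approximate the item size by serializing the document to JSON.
    approx_size = len(json.dumps(document).encode("utf-8"))
    return approx_size > DYNAMODB_ITEM_LIMIT_BYTES * threshold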

With regard to our architecture, we also leverage these datastores transiently, which is to say that we lean heavily on TTL (Time to Live) so that, for the most part, data in any table is only present for a set number of hours or days. There are exceptions for small data that requires persistence and single-digit millisecond latency, like customers. When a customer signs in, we have to retrieve certain information about the customer (like name), so we persist just that information (not historic purchases, as in the previous example). That way we can retrieve what the application needs from the document datastore, while leaning on relational data, from a separate source, for analytics.
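
Here is a minimal sketch of the TTL pattern in Python/boto3, using the SEARCH_TERMS table that appears later in this post; its key schema, attribute names, and the 24-hour window are assumptions for illustration.

import time
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
client = boto3.client("dynamodb")

# One-time table configuration: tell DynamoDB which attribute holds the expiry.
client.update_time_to_live(
    TableName="SEARCH_TERMS",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expiresAt"},
)

def put_search_term(customer_key, term, ttl_hours=24):
    now = int(time.time())
    dynamodb.Table("SEARCH_TERMS").put_item(
        Item={
            "searchId": str(uuid.uuid4()),  # assumed partition key
            "customerKey": customer_key,
            "searchTerm": term,
            "searchedAt": now,
            # DynamoDB deletes the item (shortly) after this epoch time passes.
            "expiresAt": now + ttl_hours * 3600,
        }
    )

put_search_term(123, "running shoes")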

I said previously that we do retrieve certain real-time analytic information from our document datastores, and I want to expand on that with some more details. There are aspects of this that seem to mirror the Lambda architecture. This is not to be confused with AWS Lambda functions, which we use for nearly all of our computations; rather, the Lambda architecture is a way to combine real-time and batch processing to produce a full picture of state. To be clear, I'm not a proponent of the Lambda architecture, for a number of reasons that are beyond the scope of this document but can be found in this article that questions the Lambda architecture. I'm simply stating that we consume certain data from these transient datastores when the data is so fresh that it is not yet available in our relational databases. We don't maintain the same code in multiple places (a consistent complaint about the Lambda architecture), but we do produce real-time analytics, and for extremely fresh data that isn't available anywhere else, we can obtain it from DynamoDB. We could avoid this extremely fresh data entirely and provide our consumers with "near-real-time" analytics, where the data is only fifteen minutes old, but that isn't real time. While I would argue that data visualizations built only on data older than fifteen minutes suffice for probably 99% of all use-cases, we wanted to be able to handle the other 1%, too, and this approach allows us to do just that.

If we look at this from a data perspective, we can see that data flows into our document datastore in real time, and certain information is stored there transiently, by leveraging a TTL of a few hours, before flowing to our relational database, which is in S3. You may note that I am referring to a relational database and not an RDBMS; my April 2019 post speaks in greater detail to that aspect. For the sake of this conversation, we can consider them to be synonymous.

documentDataStore flow

In this image, I'm depicting a static site hosted on S3. The user interacts with the website (the same flow applies to a mobile app), which triggers RESTful API calls (backed by AWS Lambda functions) that read from and write to the CUSTOMER document datastore table. It also calls an AWS Lambda function that writes data to a transient document datastore table called SEARCH_TERMS. As that data is written to SEARCH_TERMS, a separate Lambda function fires that writes it to a different S3 bucket in its raw (key/value, JSON) form. Then, periodically (say, every 15 minutes), a CloudWatch Event fires that initiates yet another Lambda function (our mini-batch ETL processing) that consumes all JSON received since the last execution, cleanses it, models it relationally, and writes it to a new key on S3 for use in analytics and data visualizations.
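
As a sketch of that "separate Lambda function" step, here is roughly what the handler could look like, assuming the table-to-bucket hand-off is wired up with DynamoDB Streams (this post doesn't name the trigger mechanism; Streams is one common way to do it). The bucket and key names are illustrative.

import json
import boto3
from boto3.dynamodb.types import TypeDeserializer

s3 = boto3.client("s3")
deserializer = TypeDeserializer()
RAW_BUCKET = "example-search-terms-raw"  # illustrative bucket name

def handler(event, context):
    # Each record is a change captured from the SEARCH_TERMS table.
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]
        # Convert the DynamoDB wire format ({"S": "..."}) back to plain values.
        document = {k: deserializer.deserialize(v) for k, v in image.items()}
        key = "raw/{}.json".format(record["eventID"])
        s3.put_object(Bucket=RAW_BUCKET, Key=key,
                      Body=json.dumps(document, default=str))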

Now we have built out a Function-as-a-Service (FaaS) compute architecture in which our ETL runs in what we like to call "right time" increments. If we need the data for visualization or analytics purposes in (true) real time, we can consume it from a combination of our relational source and the document datastore (the Lambda architecture-esque aspect). At the same time, we minimize DynamoDB storage costs by placing a TTL of, say, 24 hours on every document in the SEARCH_TERMS table, and by persisting only the CUSTOMER data, which is small, has a very well understood access pattern, and contains no analytic data; it only contains data needed by the application itself.
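
A compressed sketch of that scheduled, "right time" ETL Lambda follows: list the raw JSON objects written since the last run, flatten them into a tabular shape, and write the result to a new key for the analytics layer. The bucket names, key prefixes, CSV output format, and the use of object timestamps to approximate "since the last execution" are all assumptions for illustration.

import csv
import io
import json
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "example-search-terms-raw"          # illustrative
CURATED_BUCKET = "example-search-terms-curated"  # illustrative

def handler(event, context):
    # Simplification: treat "since the last execution" as the last 15 minutes.
    since = datetime.now(timezone.utc) - timedelta(minutes=15)
    rows = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=RAW_BUCKET, Prefix="raw/"):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < since:
                continue  # already handled by a previous run
            body = s3.get_object(Bucket=RAW_BUCKET, Key=obj["Key"])["Body"].read()
            doc = json.loads(body)
            # "Model it relationally": pick out a fixed set of columns.
            rows.append([doc.get("customerKey"), doc.get("searchTerm"),
                         doc.get("searchedAt")])

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["customer_key", "search_term", "searched_at"])
    writer.writerows(rows)
    key = "curated/search_terms/{}.csv".format(
        datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S"))
    s3.put_object(Bucket=CURATED_BUCKET, Key=key, Body=out.getvalue())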

The other major benefit that document datastores offer is their flexibility. So long as all documents within a table contain the key (and sort key, if applicable), the remainder of the document can differ from one document to the next. For instance, if the previous JSON were keyed on customerKey, with no sort key, we could add another document like this:

{"customerKey" : 123,
  "eyeColor" : "brown"}

This is completely legal, although it makes using the data a bit more challenging, since attributes are inconsistent from one document to the next. That said, dynamic data structures are arguably the biggest strength of JSON, and taking advantage of that strength is highly advised and comes into play frequently. Whereas relational databases are far more rigid (for good reasons), document datastores are far more dynamic (also for good reasons), which further underscores my point about positioning technologies appropriately and playing to the strengths of each one.
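
In practice, that inconsistency means application code reads defensively: every document has the key, but any other attribute may or may not be present. A small sketch (Python/boto3), using the two example documents above; the fallback values are illustrative.

import boto3

table = boto3.resource("dynamodb").Table("CUSTOMER")  # illustrative table name

def describe_customer(customer_key):
    item = table.get_item(Key={"customerKey": customer_key}).get("Item", {})
    # Attributes beyond the key are optional, so fall back to defaults.
    name = item.get("customerName", {}).get("firstName", "unknown")
    eye_color = item.get("eyeColor", "not recorded")
    return "customer {}: {} (eye color: {})".format(customer_key, name, eye_color)

print(describe_customer(123))  # the full document: has a name, no eyeColor
print(describe_customer(456))  # the sparse document: has eyeColor, no name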

In this post, I quickly covered the strengths of document datastores (seemingly infinite bi-directional scalability, unparalleled performance, dynamic data structures), how to optimize their cost (TTLs and transient storage), how a relational database can be leveraged in conjunction with a document datastore to produce truly real-time analytics and data access, and one way in which a document datastore can be positioned in an overall FaaS architecture.