Distributing Data in a Distributed DBMS

Technology Horizontally Scalable Data

I'm the first to admit, I am not a proponent of RDBMS/DBMS systems in this day and age. And we don't have a RDBMS in our architecture at my current organization. To be clear: there is a relational database, just not a RDBMS. I'm not going to rant here about why (or at least not much), in part because I've already done that before.

What I do want to do is talk about the trick to doing distributed DBMS the right way.

There are still a ton of different organizations that are using DBMS systems out there, specifically for operational backend systems (which I usually find to be legacy systems, or at least, legacy thinking and not native cloud architectures), and more often for the backend of (big) data environments. It's really the latter that I want to focus on here.

All right, here's the biggest reasons I don't like DBMS in 2020: first, the reasons for DBMS, dating back 4+ decades, was largely because the cost of disk was expensive and limited. In 1980, it cost $233,000 USD per gigabyte. By 2016, the price had dropped to less than 3 cents.

Disk Cost

So of course you wanted to limit the replication of data on disk. But that restriction is gone. And while Moore's Law specifically talks about the number of transistors doubling every two years, because chips are becoming so specialized and powerful, it is now compute that now carries the major cost.

Secondly, I disagree with DBMS because cloud native architectures need to decouple compute and disk so that they can optimize for each, and compute should always be elastic (bi-directionally scalable, scaling to zero). DBMS applications don't allow that. You buy (or lease, in the case of the cloud) a cluster of nodes with the distributed DBMS application installed on them (classic master/worker architecture). You continue to take up more and more disk, which leads to bigger and bigger clusters, and your compute usage is obviously not always 100% (if it is, you needed a bigger cluster a long time ago), so you are inevitably paying for unused computational resources, and since that's the most expensive part, time to rethink the choice.

But hey, let's say we've got arguments that have led to us to go with a distributed DBMS like AWS Redshift, Teradata, or Snowflake. Perhaps we have an experienced staff that is really used to DBMS, and who would struggle to transition to something like Presto on EMR, with the data decoupled and stored on S3 (disclaimer: this has been my preferred big data architecture for many years). The point here isn't to debate alternatives. Rather, I'm just pretending we're using (e.g.) Redshift for whatever reasons, and I want to talk about a little trick to making these extremely powerful DBMS systems deliver on their promises.

In a nutshell, it's completely about data distribution, and the trick to figuring out how to distribute your data.

All right, now we have to do a bit of a walk down memory lane, so we've got context. I'll imagine that part of the reason you've got Redshift is because you want the cloud based data warehouse, and perhaps you've got some old people (like me) on staff, who have been doing data warehousing for many years, and who talk about dimensional modeling and star schemas and all the stuff Ralph waxed on about in The Data Warehouse Toolkit. So, you start to build your dimensional model as if it were the year 2000.

Conan

And what does that entail? Well, we're creating surrogate keys on our dimensions and facts, because we love surrogate keys (I do, too, but I contend that in a cloud architecture, they should be UUIDs, not integers, but I'm resisting the urge to go sideways on that rant right now), since they give us a nice way to join, assure uniqueness, and provide us a level of abstraction from the business keys. But here's the problem: when we (default) put that model into Redshift, it distributes by that key.

Seems innocent enough, sure, and it almost guarantees an even distribution (I'll return to this later). But as soon as you try and scale, it's going to destroy performance. Here's a picture to illustrate. We've got three simple tables here: dimensions CUSTOMER_D and PRODUCT_D and fact TRANSACTION_F.

Dimensional Model

Now, let's add some data to those tables:

tables with data

Now, let's pretend, for the sake of this conversation, that our Redshift cluster is four nodes. And we're distributing the data by the primary key of each table, which is to say CUSTOMER_D.customer_id, PRODUCT_D.product_id, and TRANSACTION_F.transaction_id. By "distributing the data", I mean on which node within the cluster the data is being stored. And yeah, I know Redshift is columnar, but that isn't going to change what I'm illustrating here, so chill. We'll pass the key into a modulus function and the remainder will determine which our data will be stored on (surrogate key : node on which it's stored):

for surrogate in range(1,11):
    str(surrogate) + ' : ' + str(surrogate % 4)

'1 : 1'
'2 : 2'
'3 : 3'
'4 : 0'
'5 : 1'
'6 : 2'
'7 : 3'
'8 : 0'
'9 : 1'
'10 : 2'

As depicted below:

data distribution

Now, we're going to do something simple, like get counts by product purchased, since each of the individual nodes in the cluster are unaware of what is going on on the other nodes, the data has to be redistributed all over the place, to make sure we get all the hats together, and all the shirts together, etc.

shuffling

And that's the big performance killer: shuffling data. Now, in certain cases, you can just replicate an entire dimension to each node and avoid the majority of those shuffles, but we're really talking about big data environments here. Not 5 and 10 row tables like I'm illustrating. I'm talking about terabytes and petabytes of data, millions of customers, maybe thousands of products, and billions of transactions, and instead of a little four node cluster, it's fifty or a hundred nodes, or more.

So, what's the secret? Simply, change your distribution keys. Instead of distributing on the primary key, distribute on the key that you normally want to look at data by. Usually, that's customer. You usually want to see count of customers that bought a thing, or count of customers that did a thing between this date and that date.

That's step one, and that's a no-brainer, but in our case here, it actually doesn't do the trick we need it to. Yes, it gets all of our customer keys co-located with their transactions (which is a really helpful), but we're still shuffling around those pesky products. I'll come back to this in a second. In the meantime, we've already made a significant gains by reducing our shuffling substantially, but there is a risk/price to pay here, which is the long tail.

When we distribute by the surrogate key (default), we are *almost* guaranteed an even distribution. Since the surrogate keys are incremental, the modulus will keep number of rows on each node about equal, which is good, because when we're running a query, nothing gets returned until the last node is complete, so you want to make sure that you don't end up with one node doing 10x the work and taking 10x as long as the other nodes. So this is something to be aware of, and if you suddenly start to see certain customers, say, purchasing massively more than others, their associated transactions (in my example here) will skew the amount of data co-located on that node. That said, even if you have certain customers that exhibit this type of behavior, the law of averages is going to say that those customers will be randomly distributed across all the nodes, too, so this is less of a concern practically than it is academically. But you should still be aware.

Returning to those pesky product shuffles...while it might not make sense to do this here, a great trick is to force the customer (in this case) onto whatever dimensions you have, even if that means you're creating more data. Remembering that disk is cheap and shuffling is the biggest performance killer, you could, for instance, create a PRODUCT_DAY_D table, which captures all products purchased at a day level and includes the customer_id on each row. Totally bastardized dimension, and it is going to feel like you need a shower the first time you do it, I know. But this brings us back to an earlier point, which is that the rules for data modeling have changed in the last four decades, and as long as we understand why the rules exist, we can break them when it makes sense.

To be fair, it's a little difficult to wrap our heads around appending the customer_id to the PRODUCT_D table (or creating PRODUCT_DAY_D, as previously mentioned) because this example is kind of shit, but that's partially because our fact table is likely incomplete and would, in real life have more attributes (surrogates) on it go give us better insights. And if it didn't, well, then we could just amount to PRODUCT_DAY_D and forgo the TRANSACTION_F table altogether.

I used to work at a media company that had thousands of titles of things for people to watch and millions and millions of customers watching those titles every minute of the day. We ended up appending the customer id to practically everything, then distributing by that key (even if the model itself would make Ralphie cringe) because it was the right model to avoid shuffling, and because we could restate dimensions easily in a horizontally scalable manner on a regular basis (e.g. PRODUCT_DAY_D). And while seemingly everyone outside of our organization was complaining about how their distributed DBMS was never living up to the promises, even with significantly less data than the ~500 terabytes we were processing regularly, we were as happy as pigs in shit with our performance, because of the application of better distribution, and the denormalization of attributes for the purposes of distribution.

Right up until we realized we didn't even need a DBMS to accomplish what we needed. Then we applied the same approach, but without the DBMS and saved 75% or more of the costs of the DBMS because, really, it's 2020 and you don't need one anymore.