After decades in the Data business (note: getting old sucks), over the last few years I've been working in a slightly different role: the technology behind products, and a more over-arching scope that still includes Data but shifts my focus so that it is one of many areas of responsibility. That's been really interesting, and it's given me an opportunity to grow (and fail!) in a number of new areas. But since Data has been such a huge part of my life, and based on that old saying about how everyone has an asshole and an opinion (which makes me wonder if this cat has an opinion), I still have lots to say about the subject.
Recently, I took on a task around building out an eighteen-month strategy for our organization. As I tend to do as a starting point, I tried to build a foundation (or build upon a foundation) based on lessons learned that others had taken the time to write down. In this case, that was in the form of a handful of books, including Good Strategy/Bad Strategy, How I Built This, Lead from the Future, The Future is Faster than You Think, Experimentation Works, and Competing in the Age of AI (full disclosure, I've not completed "The Future is Faster than You Think"; I got distracted with Numbers Don't Lie, which is amazing).
Part of that strategy was to put more focus on Experimentation (specifically A/B/n Testing), and to make Data a first-class citizen in our organization. As is often the case with startups, our focus has (rightfully) been on getting our product to market. Until we do that, we don't have a lot of data to do stuff with. We have made a small investment into that area, and we have a little Data Science and Engineering (DSE) team, but to date they've been focused largely on building out certain infrastructure, putting together a data model and some processes to ingest data, do some basic ETL and get some visualizations in place. And most importantly, to start to sell the vision and make sure the rest of the organization understands the value DSE is going to bring to the table as we launch (April!) and grow.
I was asked to give some thoughts to a subset of our Advisors around the topic of Data, and, as this site can attest, that started as a handful of sentences and ended as a harangue of randomness. Frankly, I left work at around 2 PM, came home and started typing, and wrapped it up at midnight, taking breaks only to refresh my pint glass, use the facilities, and scarf down a slice of pizza. I am a freakin' health nut. It's no wonder I get back sweat from walking up the stairs.
Anyway, since I did spend the better part of ten hours capturing my thoughts, and since I haven't posted anything in a while, I figured I'd post it here, so this post is my thoughts on a number of different things related to Data. I've changed certain things to keep our still-stealth startup anonymous (did I mention we're finally launching in April?), but the rest of the nonsense is still nonsense.
One more thing: the first three paragraphs are shamelessly stolen from my old company's website. It was marketing drivel, but I've always liked it (credit to my CEO at that organization; he was the wordsmith). The rest was from the afternoon and evening of February 15th, 2021.
Context
The advent of cloud computing unleashed the promise and potential of the Big Data revolution. The availability of on-demand, elastic computing and cheap, abundant storage has elevated the practice of data integration and analytics to a science. Data science and the big data that fuels it are the most pervasive and impactful technical trends of the early 21st century. No longer is the collection of operational data and its formulation into impactful business insights constrained by costly and measly data center storage limits or governed by small windows of overnight batch processing. The ability to manage Big Data has enabled massive innovation, giving rise to data-driven companies and products hitherto unimaginable. Yet what was until recently considered bleeding edge is now merely table stakes for any serious contender in the business of data. Data-driven organizations demand a dependable data infrastructure.
Problem Statement
Engineering teams building systems to support this data imperative face a bewildering array of difficult choices if they are to successfully exploit ubiquitous cloud resources. Lacking in-house expertise or a vision for data, the playground of cloud offerings and open-source tools oftentimes results in risky and fruitless PoCs, or unwieldy, unreliable Frankenstein solutions that fail to deliver the expected cost-savings and flexibility. The techniques required to confidently manage data assets and elastically provisioned resources that both scale out and scale in are non-trivial and not obvious; adopting them requires mastery of the underlying distributed architecture and technologies. Naive implementations that tightly bind storage and compute squander the cost-savings opportunity of the pay-as-you-go model. Moreover, much of the established playbook for data management no longer applies, but a professional appreciation for its legacy is necessary to recognize when to break the rules and when to establish new ones.
Solution
Cheap, abundant storage changes what is possible; a well-designed data management approach unlocks simplification in data processing strategies. On-demand compute resources are well-suited to dynamically manage highly variable and heterogeneous processing workloads, both batch and streaming. Balancing elasticity and cost reactively, providing fault-tolerance and recoverability, while guaranteeing responsiveness and data delivery SLAs requires that we separate the management of storage, compute, and dynamic provisioning. A successful data solution requires that all three elements be harmonized: storage, compute, elasticity. Not surprisingly, such a framework necessitates a managed approach in deploying the plethora of self-serve tools used to query, visualize, and analyze data, all the while masking this added complexity from the end-user.
The previous Solution paragraph is admittedly high level. The remainder of this paper is organized as a deep dive into the various components and motivations for the Anonymous Data Environment (ADE). See what I did there? Changing out my current organization's name with "Anonymous". So clever. Or, perhaps not at all clever.
The Anonymous Data Environment (ADE)
Generally, data environments can be split into a series of interconnected functional components, with the responsible roles indicated in square brackets:
- Data Integration [Data Engineer]
  - Batch, mini-batch, micro-batch, real-time and “right-time” data processing
- Data Visualization and Reporting [Data Visualization Specialist]
- Data Management [Data Engineer]
  - Data validation
  - Data profiling
  - Data modeling
- Data Analytics [Data Analyst]
  - ELT (usually, SQL is the language of choice)
  - Business and Source System Analysis
- Data Science and Machine Learning [Data Scientist]
- Experimentation [Product + Data Scientist]
  - A/B testing
Each of these functional areas is discussed in greater detail below. The roles that support these areas (with the exception of Product) are included within the Data Science and Engineering (DSE) team.
Spanners
This paper would be remiss if it failed to mention the role of Spanners. I credit a former boss of mine with introducing this as a specific role in DSE, as presented at a TDWI conference in August 2010. In short, it is a deliberate role that provides “full analytics stack” capabilities across the DSE functions, including Source System and Business Analysis, Data Visualization, Data Engineering, and Data Management.
Absent from the Spanner responsibilities are those of the Data Scientist, a role that often requires specialized mathematical (statistical) knowledge and specific training. However, inroads have been made, and are discussed herein, specifically with regard to Automated Machine Learning (AutoML). This is not to say a Spanner can’t include Data Science capabilities and qualifications; rather, just that it is less likely.
By embracing the role of Spanner, significant savings are realized with regard to staffing, to say nothing of the benefits of higher job satisfaction from autonomy, impact, and mastery.
Architecture
The architecture shown in the figure below encompasses a number of functional data requirements and capabilities, including (at a high level):
- Data Management
- Data Visualization
- Experimentation (A/B/n Testing)
- Batch, Mini-Batch and “Right Time” Processing
- Real-Time Processing
- Data Validation and Data Profiling
- 3rd Party Integrations
- Integrating with the Parent Company's EDW
- Predictive Analytics, Machine Learning and Automated Machine Learning (AutoML)
The above diagram is very involved, as it is designed around a number of motivations and learnings over the last two and a half decades, with special focus on the impacts made by cloud computing over the last decade. The subsections below describe those motivations in some detail.
Motivation: Self-Serve
Organizations often struggle with Data environments because of permission limitations and lack of accessibility. This ultimately results in a self-fulfilling prophecy of complexity and cost: permission limitations become layer upon layer of complexity, requiring more specialized staff to manage those permissions, as well as deeper knowledge just to understand where data exists, how it relates, and how to access it. By making “self-serve” a primary motivation of the architecture’s design, taking specific steps to accomplish it, and accepting some up-front pain, we create a simpler end-to-end solution with business value and the customer in mind.
As an example, data associated with analytics doesn’t require personally identifiable information (PII). As such, a decision has been made to obfuscate or otherwise remove any sensitive information. This allows all users access to all data, and eliminates practically all of the complexities associated with permissions.
The other aspect of self-serve is centralizing on as few tools as possible (but no fewer), to assure that we’re using the right application for the job. Additionally, it requires that dedicated staff provide some training, to assure that consumers of the data environment not only know the applications (tools) and when to use them, but also understand the underlying base data model, including any nuances associated with things like partitioning.
Motivation: Data as a Service (DaaS)
Data as a Service (DaaS) is an emerging topic, and one not without some challenges. Specifically, every organization’s data is unique, and in my experience, trying to make those things too static and reusable often leads to the resulting model becoming almost academic.
That aside, there are significant upsides to striving towards building DaaS, and the following list should be considered incomplete:
- Access to on-demand (cloud-based) data related infrastructure (including tools and applications)
- The data is accessible
More than anything else, we should continue to work towards those goals as a north star, with a product mindset in place, and a longer term opportunity for Data to become an enticing product offering and extension to the Anonymous architecture.
Motivation: Exploiting the Pricing Model of the Cloud
Exploiting the pricing model of the cloud takes many forms, but generally, it requires the knowledge and savviness to be able to understand when to leverage services or maintain your own, decoupling compute and storage so that you can pay for each independently, and taking advantage of the pay-for-what-you-use pricing model. This approach is in direct competition with the traditional approach of buying servers, as well as “renting” servers in the cloud and paying for them 24/7 (said differently, taking a lift-and-shift approach instead of building cloud-native designs).
Decoupled Compute and Storage
It is vital to recognize how advances in technology have impacted architectural decisions. Specifically, the massive decrease in the cost of disk, and the ramifications that has had: from "$500,000 [USD] per gigabyte in 1981 to less than $0.03 per gigabyte today." The cost of disk in the 1980s largely drove the technology decisions behind RDBMS architectures. That change, coinciding with the explosion of data volumes from the advent of the internet in the mid-1990s to the “Big Data” revolution of the last decade, means a change to how we store and process data.
Today, disk costs hover around zero while chip technology continues to improve; significant costs still exist, but for compute, not disk.
The cloud, and advancing technologies, provide us the capability to decouple compute and storage, and to manage and pay for them independently. This becomes even more pronounced when it comes to Data, since traditional analytics departments were built around Data Warehouses, which are RDBMS-based, with tightly coupled compute and storage. Imagine an Oracle Data Warehouse, for example, that houses a terabyte of data. Once the disk on that server is exhausted, you are forced to buy a bigger server, even though your compute is never consistently at or near 100% utilization (and if it is, you’ve got bigger problems!). This in turn means that you’re paying for a bigger server (with more disk and compute) even though your needs are simply for more disk (the thing that’s practically free).
Recognizing that disk costs are no longer restrictive (even though certain data-heavy technologies were built around that constraint), and that the cloud provides new and novel ways to build systems, we can decouple the disk necessary to house the data (which grows steadily, in ever-increasing amounts) from the compute (which is naturally spiky), and manage each independently.
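To make the decoupling concrete, here is a minimal sketch (table, column, and bucket names are all hypothetical) of how a table gets declared over data that already lives on S3. The metastore holds only the schema and the location; the bytes never move, and any cluster can attach to them:

```python
def external_table_ddl(table, columns, s3_location, partition_cols=None):
    """Build Hive DDL for an external table whose data lives on S3.

    Only metadata (schema, partitioning, location) goes into the
    metastore; the data itself stays on S3, decoupled from compute.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)"
    if partition_cols:
        parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_cols)
        ddl += f"\nPARTITIONED BY ({parts})"
    ddl += f"\nSTORED AS PARQUET\nLOCATION '{s3_location}'"
    return ddl
```

Dropping the table (or the whole cluster) removes only the metadata; the data on S3 survives, which is exactly the property that makes ephemeral compute workable.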
This does not mean that the value of relational databases should be thrown out; rather, I make the argument simply that RDBMS (usually) should be. I wrote an article back in April of 2019 on the same topic.
We can see this illustrated in the usage of S3 (disk) decoupled from EMR (compute), which uses a Hive Metastore to manage the metadata (describes the data on disk for usage on EMR, the AWS Hadoop cluster):
Finally, because the data is centralized, as is the metadata, we have “infinite” horizontal scalability: we can provision any number of EMR clusters, comprised of any number of EC2 instances, and they can all access the same data, as shown below:
Ephemeral Distributed Compute
This speaks to the approach of paying for what you use, which, as I mentioned previously, is in direct competition with the traditional approach of buying servers, as well as the cloud-based approach of “renting” servers (compute) and persisting them, in which case you are paying for them 24/7, even though compute needs are, at best, volatile.
The approach taken in the ADE is twofold: (1) provision clusters (horizontally scalable and elastic) on-demand for specific processing, while (2) leveraging spot instances, which provide savings of up to 90% over on-demand pricing. There are some nuances and challenges to taking this approach, specifically around writing idempotent ETL and dealing with the frustrations of occasionally losing a cluster and needing to provision a new one, but mastering this capability can have massive financial impacts, amongst other benefits.
To better understand the decision to utilize clusters (as compared to, for instance, a serverless data deployment), it essentially comes down to a combination of data volumes and the stateless nature of Lambda. I’ve written a post about this topic here. To be sure, serverless and FaaS architectures are in play within ADE, alongside EMR.
The sections below speak to staffing plus functionality. This is the meat of the componentry that ultimately comprises the ADE platform, as managed and maintained by the DSE team.
Role and Responsibilities of Data Engineering
A Data Engineer wears as many hats as any Engineering position I can imagine. On any given day, they’re expected to understand DevOps, Engineering (always across at least two languages), data modeling, data management, and enough about Data Science and Data Visualization to be able to support those requirements. One key component and focus of this role needs to be to minimize the amount of time those other roles spend either validating the data, or wrangling it. This section describes a number of areas that fall under the Data Engineering umbrella.
Managing the ephemeral nature of the ADE is involved: being able to provision and terminate clusters on-demand, with varying sets of applications per cluster, differing sizes of clusters, adjustable EC2 instance types, varying jobs to be run on each, differing schedules, automated backups, logging, restarts, etc., is far-from-trivial, and DSE has built this functionality in-house for ADE, leveraging Lambda functionality, S3 and DynamoDB.
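As a rough illustration of what that in-house provisioning code assembles, here is a sketch of a request for an ephemeral, spot-backed cluster. This is not our actual implementation; the release label, instance types, and role names are placeholders, and the real call would hand this payload to boto3’s `emr.run_job_flow(**request)`:

```python
def emr_cluster_request(name, instance_type="m5.xlarge", core_count=4,
                        applications=("Spark", "Hive")):
    """Build an illustrative run_job_flow payload for an ephemeral,
    spot-backed EMR cluster. All concrete values are placeholders."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.2.0",
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                # Keep the master on-demand so losing spot capacity
                # doesn't kill cluster coordination outright.
                {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": instance_type, "InstanceCount": 1},
                # Core nodes ride spot instances for the deep discount;
                # idempotent ETL makes the occasional loss tolerable.
                {"InstanceRole": "CORE", "Market": "SPOT",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            # Terminate the cluster when its steps finish: ephemeral compute.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

The `KeepJobFlowAliveWhenNoSteps: False` flag is what makes the cluster ephemeral: it exists only for the duration of its work, and you pay for nothing afterward.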
Data Integration
According to this Forbes article, "Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis."
Further, "76% of data scientists view data preparation as the least enjoyable part of their work". Comparatively, a Data Engineer will (should!) relish data preparation, where they can leverage their engineering skills to populate data models.
I argue for ETL over ELT for a couple of major reasons:
- ELT leverages declarative languages (SQL) which don’t allow for code reusability, which should be central to any repeatable data integration tasks
- SQL is sufficient (even preferred) for various data exploration tasks, but does not allow the Engineer to control the execution plan of the script, including performance improvements, or leveraging process parallelism
Thus, for repeatable data integration tasks, I prefer the use of ETL over ELT, where ETL leverages languages that can be adjusted to improve performance, encourage code reuse, and allow for process parallelism. It Extracts the data from the (logical) database, then Transforms it in a controllable manner that can process in parallel and leverage reusable libraries, and then Loads it back to the (logical) database.
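A toy sketch of those two advantages, reusability and process parallelism, neither of which plain SQL gives you (the field names and transform logic here are purely illustrative):

```python
from multiprocessing import Pool

def transform(record):
    """A reusable transform step. In real ETL this lives in a shared
    library, so every pipeline applies identical business logic."""
    return {
        "user_id": record["user_id"].strip().lower(),
        "amount_cents": round(record["amount"] * 100),
    }

def run_etl(records, workers=4):
    """Transform records across worker processes. Extract (reading from
    the logical database) happens upstream; Load happens downstream."""
    with Pool(workers) as pool:
        return pool.map(transform, records)
```

The Engineer, not the database’s query planner, decides how many workers run, how records are batched, and which shared code is applied, which is precisely the control argument for ETL above.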
Avoiding Orchestration
Shortly, I discuss the importance of idempotent ETL, with a subtle nuance to the traditional definition:
[Idempotency] means that when you run a process multiple times with the same parameters (even on different days), the outcome is exactly the same. You do not end up with multiple copies of the same data in your environment or other undesirable side effects.
That nuance is that the outcome doesn’t have to be exactly the same, but the outcome must always be complete and accurate.
By enforcing this in our ETL, we can adjust the cadence of execution at will. A weekly execution will produce complete results, as will an hourly execution of the same script. It further decouples execution cadence from orchestration tools, which are the bane of my existence. I have written a post that further discusses the risks and often unrecognized challenges of implementing orchestration applications, and specifically of leveraging scheduling applications within a decoupled-storage-and-compute, horizontally scalable architecture like the ADE.
Real-Time vs Right-Time
Within Data environments, in my experience, rarely is there a need for true real-time processing. Almost inevitably, data processed seconds or minutes later will suffice. In fact, when Apache Spark initially released its streaming capabilities, it was actually micro-batch, with the ability to process data as early as milliseconds after receiving it, but it wasn’t truly real-time.
A successful data platform can process data at different cadences, but the important piece is to identify actual needs and distinguish between them, and to agree on the semantics, as the smaller the necessary lag, the more complex and involved the solution often becomes. Again, if you’re building idempotent ETL, cadence shouldn’t matter, though it is admittedly significantly easier to go from daily to hourly executions than it is to go from hourly to sub-second, or real time.
Further, real time processing often requires persistent compute, which means you’re giving away the benefits of ephemeral compute. However, real-time and microbatch capabilities need to exist in many data environments, and ADE is designed in such a way as to support those requirements.
For more frequent processing, there are a couple of options, somewhat dependent on volume (more accurately, on the rate of data arrival). For large data volumes, Kinesis streams data to a persisted EMR cluster. While this is significantly more costly than spot instances, some savings can be realized by taking advantage of reserved pricing, which can still save upwards of 70% compared to on-demand pricing.
Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.
The Kinesis to reserved priced persistent EMR cluster approach is shown in the diagram below:
Alternatively, if real-time processing is needed, but the rate of data arriving is less than what KDS supports, a serverless architecture that better reflects what is happening on the Anonymous platform is a viable option, leveraging Lambda functionality, as shown in the diagram below:
A Function as a Service (FaaS) architecture, which is reflected in the image above, is beyond the scope of this document, but I’ve posted about it elsewhere.
Similarly, I think this lends itself well to the idea of “right-time” processing, where the cadence is (normally) intraday, but where the rate of data arrival adheres to the practical and actual limitations of (AWS) Lambda functions. In these scenarios, it is often a good practice to capture (buffer, if you will) the data in a queue temporarily, before further processing (by other Lambda functions, or on stateful compute, like EC2). This approach is illustrated in the diagram below:
This section presented ADE in terms of processing massive amounts of data on ephemeral, horizontally scalable and elastic server clusters; streaming significant volumes on persisted, reserved, horizontally scalable and elastic server clusters; real-time processing leveraging a FaaS approach; and a queue-based “right-time” approach, which could be minutes or hours. The ADE environment supports processing from minimal amounts out to multi-petabyte workloads, and any combination thereof.
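The queue-buffered “right-time” idea above can be sketched in a few lines. In practice a service like SQS holds the messages and a scheduled Lambda drains them; this toy stand-in (all names and thresholds illustrative) just shows the flush-on-size-or-age logic:

```python
import time

class QueueBuffer:
    """Accumulate events and flush a batch downstream when it is
    either full or old enough. A toy stand-in for an SQS-style buffer."""

    def __init__(self, flush_fn, max_batch=10, max_age_seconds=60):
        self.flush_fn = flush_fn          # downstream processing step
        self.max_batch = max_batch        # flush when this many events queue up
        self.max_age = max_age_seconds    # ...or when the oldest event is this old
        self.batch, self.first_seen = [], None

    def add(self, event, now=None):
        now = time.time() if now is None else now
        if not self.batch:
            self.first_seen = now
        self.batch.append(event)
        if len(self.batch) >= self.max_batch or now - self.first_seen >= self.max_age:
            self.flush_fn(self.batch)
            self.batch, self.first_seen = [], None
```

Tuning `max_batch` and `max_age_seconds` is exactly the “agree on the semantics” conversation from earlier: the smaller you make them, the closer you drift to real-time, and the more you pay for it.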
A Note about the Lambda Architecture
Nathan Marz is widely regarded as the creator of the Lambda Architecture (not to be confused with AWS Lambda functions, which are discussed throughout this paper). His approach promised to address the limitations of the CAP theorem, by implementing a combination of batch, streaming and serving layers. In short, and admittedly not doing it any justice here, data would be served up in real time and near real time, and would later be aggregated, summarized or otherwise processed in batch and replace the real time data. Arguably, certain conceptual components of the Lambda Architecture are present in the ADE, but (in my opinion) the problems with that architectural choice outweigh the actual benefits. This discussion goes beyond this paper, but O’Reilly has published a good article on the same topic.
Idempotent ETL
As discussed previously, idempotent ETL gives us the opportunity to adjust cadence at will, which is a compelling argument in and of itself, and with the nuance of “always as complete as possible”, it also provides correction-without-removal, which is equally valuable. Imagine a scenario where a bad code push doesn’t result in having to “back out” incorrect results: leveraging idempotent ETL, a rerun will overwrite anything that may have been done in error previously.
Finally, idempotency is a trademark best practice of basically anything built on distributed systems, so it becomes applicable for at-least-once delivery guarantees (common in distributed systems), which in turn makes further extending real-time and right-time data processing requirements more pattern-based and common to the DSE team.
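The core mechanic is simply replace-instead-of-append, keyed on something stable like a date partition. A minimal sketch (the dict-as-table is a stand-in for a partitioned table on S3):

```python
def idempotent_load(table, partition_key, rows):
    """Load a partition by replacing it wholesale, never appending.

    Running twice with the same inputs yields the same table state,
    so reruns, cadence changes, and at-least-once redelivery are all
    safe: a bad run is corrected by simply running again.
    """
    table[partition_key] = list(rows)  # replace, never append
    return table
```

Contrast with an append-based load, where every rerun duplicates the partition’s rows, and a bad code push means manually backing data out before reprocessing.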
Data Privacy: Ingestion and Obfuscation
Data privacy is the most important technology problem facing the world today. Companies are buying data legally and illegally, and individuals and organizations alike are leveraging data in nefarious ways, both inward-facing (intrusive advertising) and outward-facing, from identity theft to social bullying to impacting elections to inciting riots.
Beyond the idealistic, there is significant security risk (hackers) and a simplicity benefit to the business.
Traditionally, permissioning of data is an oft-overlooked risk. When sensitive data is allowed into a Data environment, there are role restrictions as to who can see what. This inevitably begins innocently enough: persons in role X and above have access to a superset of data, whereas everyone else has access to a subset. But, inevitably, those differences are not binary, and before you know it, there are a dozen different sets of permissions that are not static, tied to a set of people who are dynamic (people come and go in jobs), in roles that are dynamic too (people change roles in their jobs).
By obfuscating data at the initial point it enters the ADE, you can open up the entirety of the data model to everyone. This goes to usability, and it goes directly to self-serve. Within ADE, this is accomplished on raw (app/platform) data via S3 events triggering Lambda functions that publish to Firehose (in turn writing to S3), as illustrated in the diagram below:
Kinesis Firehose is leveraged here to assemble the output data into more efficiently consumable block sizes, as well as to leverage more modern compression formats, like Apache Parquet and/or ORC. The resulting data is ready for ETL, confident that all sensitive data, which is not applicable for analytics or data science anyway, has been removed, and it can be made blindly accessible to everyone in the organization.
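The obfuscation step inside the Lambda function can be as simple as the sketch below. The field names are illustrative, and the salted-hash choice is one option among several: a one-way hash keeps obfuscated values joinable across tables, while outright removal is the safer default for fields that will never be joined on:

```python
import hashlib

# Hypothetical set of sensitive field names; in practice this would be
# driven by a maintained data classification, not a hard-coded set.
PII_FIELDS = {"email", "surname", "phone"}

def obfuscate(record, salt="example-salt"):
    """Replace sensitive values with a truncated, salted one-way hash
    at the point of ingestion, so raw PII never lands in the ADE."""
    clean = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            clean[key] = digest[:16]
        else:
            clean[key] = value
    return clean
```

Because the hash is deterministic for a given salt, the same customer still obfuscates to the same token everywhere, which preserves join keys and counts for analytics while removing the permissioning problem entirely.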
Third Party Integrations (and the Backend-for-FrontEnd)
There are, at present, three disparate third party applications that we collect data from for analytics purposes: MixPanel, Google Analytics, and Salesforce.
At an (unfairly) high level, MixPanel captures application-level interactions, including things like user sessions and button clicks. The Venn diagram with Google Analytics undoubtedly has not-insignificant overlap (perhaps not surprisingly, they are often in disagreement with one another), along with more general information. Both sell themselves as Analytics Suites, much in the same way I can sell myself as a chef: I’m a better alternative than starving, but not by much.
The data from these two applications are brought into the ADE via Stitch, a simple and inexpensive 3rd party application that has connectors to each of them. Stitch writes data to S3, where it becomes available for processing and integration to the base data model on ephemeral, horizontally scalable, elastic EMR clusters running spot instances. These pipelines are shown below:
Salesforce provides information associated with cases and other customer-related data, since it is the company's CRM system. As Salesforce data needs to be integrated back into the platform (for instance, a customer speaks to an agent and updates their surname), a Change Data Capture (CDC) process has been set up for data originating in Salesforce: via a RESTful API call, data is written back to the application backend, which in turn goes into a communication hub (or, if it is data not needed in the backend, directly to the communication hub).
In turn, this data is written to a platform S3 bucket, replicated to the S3 environment in the ADE account, all in real-time. That data is then obfuscated, modeled, etc., and integrated into the base data model. This flow is shown below:
A note about Salesforce: Since writing this a couple weeks ago, we've moved to leveraging AWS AppFlow to integrate Salesforce into AWS.
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It's a fully managed, multi-region, multi-active, durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and can support peaks of more than 20 million requests per second.
Further, the BFF was implemented using another design approach, called Single-Table Design. These were all decisions that took into account the desire for (platform) reusability. That, coupled with GraphQL (AppSync) and an event-driven architecture, essentially means all the "real" application data and everything happening on the back-end (the main data flow) comes through this pipeline.
The pipeline is intentionally simple, and naturally real-time. Any information (in the form of a before/after JSON object) from the BFF table is automatically written to the communication hub, which in turn writes the data in real-time to S3, which triggers a real-time event to perform some processing, which writes it back to S3 in real time, at which point it is replicated to another bucket, which triggers a real-time event that triggers a Lambda function that obfuscates it and makes it available to downstream ETL and integration into the base data model. As a side note, S3 automatically archives that data long-term (for regulatory and legal reasons) in Glacier. This flow is shown below:
Role and Responsibilities of Data Science
I use the term "Data Scientist" vaguely. More recently, the term "Artificial Intelligence" (AI) has again become en vogue, and I’ll jump on that bus enough to say "Data Scientists build Artificial Intelligence". I’ll include predictive analytics, as well as machine learning and deep learning (e.g. neural networks).
Almost inevitably, a Data Scientist has some background (normally formal) in Statistics, which comes into play beyond AI, for things like Experimentation (discussed below).
Within the ADE, Data Scientists can leverage traditional EMR clusters (spot instances, ephemeral, leveraging Compute Management), or they can leverage AWS SageMaker, and access data from the ADE data model (decoupled disk, on S3, leveraging the Hive metastore/Glue catalog for relational database capabilities), as shown below:
Certainly, EMR will be significantly less expensive than SageMaker, and depending on the experience of the Data Scientist, EMR can be layered in more and more (for model development, data exploration, and model training on a regular cadence, to name a few); the trade-off is complexity and having to manage some functionality that SageMaker provides out-of-the-box.
But Data Scientists are expensive, and certain aspects of their day-to-day jobs have become increasingly componentized, or at least we’re seeing the Pareto Principle prove true again. Kaggle contests are most often won using random forests. Text indexing often leverages Lucene indexes. And so on. Enter the world of AutoML.
AutoML
In short, certain libraries have somewhat recently emerged that provide automated predictive analytics, including identifying the relevant attributes (and combinations thereof), then deciding among the following algorithms:
- Random Forests
- Extremely Randomized trees
- k-nearest neighbors
- LightGBM boosted trees
- CatBoost boosted trees
- A specific type of (tabular) deep neural networks
And, to quote the AutoGluon materials:
[Y]ou can train state-of-the-art machine learning models for image classification, object detection, text classification, and tabular data prediction with little to no prior experience in machine learning.
This becomes an incredibly sought-after capability and selling point for the ADE. For at least some of the needs of the organization, it can buy a technology solution and free up its Data Scientists for more meaningful research and work; needs such as Markov Chains and deep neural networks cannot be serviced through this approach (yet).
This capability is shown in the diagram below, again leveraging SageMaker.
The pipeline starts with an S3 bucket, which is where you upload the training data that AutoGluon uses to build your models, the testing data you want to make predictions on, and a pre-made package containing a script that sets up AutoGluon. After you upload the data to Amazon S3, a Lambda function kicks off an Amazon SageMaker model training job that runs the pre-made AutoGluon script on the training data. When the training job is finished, AutoGluon’s best-performing model makes predictions on the testing data, and these predictions are saved back to the same S3 bucket.
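Under the hood, that Lambda function ultimately assembles a CreateTrainingJob request for SageMaker. A hedged sketch of what that request might look like follows; the image URI, role ARN, bucket paths, and instance sizing are all placeholders, not values from this architecture.

```python
def build_training_job_request(bucket: str, job_name: str) -> dict:
    """Assemble a (simplified) SageMaker CreateTrainingJob request that
    points the pre-made AutoGluon training script at the uploaded data.
    All names and ARNs here are placeholders."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # The pre-made package containing the AutoGluon setup script.
            "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/autogluon:latest",
            "TrainingInputMode": "File",
        },
        "RoleArn": "arn:aws:iam::<account>:role/SageMakerExecutionRole",
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/train/",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "ResourceConfig": {
            "InstanceType": "ml.m5.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# The Lambda handler would then (roughly) call:
#   boto3.client("sagemaker").create_training_job(**build_training_job_request(...))
```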
Experimentation and A/B Testing

Mature technology companies often run hundreds (or more) of A/B tests daily. Decisions are made based on what the data tells us, not what our intuition does. Scores of books, like this one, have been written on the subject, all of which inevitably do a better job than I could. Instead, I'll speak here to the short/medium/long term goals of the ADE with regard to an experimentation framework.
Initially, we will leverage Google Firebase for A/B Testing, chosen because it has the most enticing price tag (though you get what you pay for). In short, it will provide some pay-for-what-you-need basic testing capabilities for the front end. Unfortunately, it has limitations that a more fully featured (and fully priced) alternative would not.
Importantly, the data from Firebase must be captured and persisted within the ADE, despite it being a third party application. This can in turn be leveraged for future tests, within ETL, potentially integrated within the data model, reported on, etc.
Mid-term, the ADE would prefer to upgrade to a more fully featured commercial Experimentation Framework application, such as Optimizely or Google Optimize. While these applications are significant upgrades over Firebase, not the least of which being "Full Stack" A/B testing capabilities, they start around $150,000 USD, and even then will fall short of an enterprise A/B and Experimentation framework customized to the needs of the organization (which would be the long-term goal).
An argument might be made to forego the mid-term if the long-term build cost (including time) is appealing when compared against the additional flexibility and functionality that it might provide. Time will tell.
Integrating Model Outputs to the Application (Applying the Model)

One nuance to integrating a model is applying it: essentially, making it accessible to the application that will execute the algorithm.
The two main candidates for Machine Learning and Predictive Analytics within the platform would be the mobile application and the money flow. At the risk of over-simplification, we can think of machine learning as determining which attributes or combinations of attributes (or actions) can predict something else. For instance, a Data Scientist may determine that height is a factor in predicting weight.
The resulting algorithm might be (again, a gross oversimplification):
2x + 3 = y
This is ultimately what a Data Scientist friend of mine calls the output of model training: "the coefficients" (his term). Now, when we apply the coefficients, perhaps passing in x as the height the user enters, the model predicts the person's weight, y.
Ultimately, then, it’s a matter of getting the resulting coefficients in an accessible place for (a) the application, (b) the money flow, or (c) something else, when those opportunities arise.
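Applying those coefficients is then trivial. A minimal sketch, assuming the coefficients arrive as a small JSON document (the payload shape and field names here are hypothetical):

```python
import json

# A hypothetical coefficients payload, as it might be read from S3 or the BFF table.
coefficients_json = '{"model": "height_to_weight", "slope": 2, "intercept": 3}'

def predict(x: float, coefficients: dict) -> float:
    """Apply the trained model's coefficients: y = slope * x + intercept."""
    return coefficients["slope"] * x + coefficients["intercept"]

coeffs = json.loads(coefficients_json)
# For the toy model 2x + 3 = y, a height of 70 predicts a weight of 143.
print(predict(70, coeffs))  # → 143
```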
The diagram below shows how the coefficients can be made accessible either via S3 or via the BFF table.
There is significantly more involved in properly deploying models, including but not limited to rollback, CI/CD, and versioning. But at the simplest level, applying the coefficients is the outcome of all the work that went into building and training the model. Having a solution for a step that has proven problematic for organizations in the past is, at the very least, worth noting.
Role and Responsibilities of Data Management

Wikipedia defines Data Management as "compris[ing] all disciplines related to managing data as a valuable resource." A slightly more detailed definition might be:
Modeling, profiling, validating, storing, and protecting data to ensure the accessibility, reliability, and actionable timeliness of the data for its users.
Data Management, a set of skills required for Spanners and Data Engineers alike, is the topic of this section.
Data Modeling

The Problem Statement of this paper stated:
[M]uch of the established playbook for data management no longer applies, but a professional appreciation for its legacy is necessary to recognize when to break the rules and when to establish new ones.
And nothing could be more applicable in the world of Data than modeling. Earlier in this paper, the price of disk was discussed. As a reminder, a gigabyte has gone from roughly $500,000 USD in 1980 to less than $0.03 in 2018. This has had an impact on RDBMS technologies and, amongst other things, a profound impact on relational data modeling.
For many years in the 1990s and early 2000s, Ralph Kimball was widely regarded as the father of data warehousing, or at least, the data models that warehouses leveraged. The Data Warehouse Toolkit could be found in the cubicle of seemingly every Data Warehousing and ETL practitioner. Entire careers were dedicated to the mastering of star and snowflake schemas.
Make no mistake, those lessons are largely still applicable today (a professional appreciation for its legacy), but the disk costs, immutable nature and characteristics of cloud storage, and understanding of horizontally scalable compute are all necessary to understand that new solutions are needed (recognize when to break the rules and when to establish new ones).
Because disk is now a commodity (its cost has decreased substantially), because we focus on idempotent ETL (which equates to many copies of the data, since each execution is complete), and because processing is distributed, the traditional star/snowflake is replaced with a somewhat more bastardized version of the same.
Specifically, dimensions and facts are still leveraged (textual vs transactional), but:
- Surrogates are UUIDs, which can be applied in parallel
- Denormalization is preferred to normalization, to an extent
- Reporting tables are the norm, built off the base model
- Dimensions are complete replacements
- Facts are append-only
- _DAY_D fact/dimension hybrids are used, when they make sense
- Data is batched for idempotency reasons. Batch identification is the topic of a 2015 post
Each of the bullets above is beyond the scope of this document, but each should be considered part of the maturity of the ADE and of the DSE individuals responsible for Data Management.
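As one small illustration of the first bullet above: UUID surrogates require no central sequence, so they can be assigned in parallel with no coordination. A minimal sketch (the column and table names are made up):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

def assign_surrogate(row: dict) -> dict:
    """Attach a UUID surrogate key. Unlike a monotonically increasing
    sequence, no central coordination is needed, so this is safe to run
    in parallel across partitions/executors."""
    return {**row, "customer_key": str(uuid.uuid4())}

rows = [{"name": n} for n in ("alice", "bob", "carol")]
with ThreadPoolExecutor(max_workers=3) as pool:
    keyed = list(pool.map(assign_surrogate, rows))

keys = [r["customer_key"] for r in keyed]
assert len(set(keys)) == len(keys)  # collisions are vanishingly unlikely
```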
Data Profiling and Validation

Functionality like data profiling and validation is significantly more complex to integrate after-the-fact. Every organization will choose to prioritize functionality, and seeing value from its Data investment, ahead of building out fancy frameworks to validate the data. Unfortunately, this is a mistake: quality is not an afterthought, and a failure to appreciate that can cost organizations millions of dollars, both in human power and in the fallout of leveraging invalid data to make predictions or decisions.
That said, there aren't enough hours in the day, or quality resources within a reasonable budget, to do everything at once (ah, the complexities of even trying!). The framework discussed herein is something I've PoC-ed with two clients, but those were just steel threads. More importantly for Anonymous, the architecture is flexible enough that integrating it at a future point will be much easier than in a traditional architecture where validation is an afterthought.
A Note about Profiling vs Validation

As noted previously in this paper, Data Scientists spend 80% of their time preparing and managing data for analysis, and part of that preparation is determining profiles (characteristics) of the data. I also used an example of using height to predict weight. That simplified model becomes moot if, say, 75% of the customers don't enter their height.
Determining and exposing these characteristics is profiling. It's capturing all sorts of information (null counts, row counts, valid values, min/max/mean/mode, etc.). These are the types of things Data Scientists, Data Visualization Engineers, and Data Analysts all do, over and over, instead of automating it, doing it once, and exposing the results.
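A stdlib-only sketch of what "doing it once" could look like for a single column; the profile produced here would then be written somewhere queryable rather than recomputed by every analyst (the function and field names are mine, not from any particular framework):

```python
import statistics

def profile_column(values: list) -> dict:
    """Compute the kind of column profile described above: null count,
    row count, and basic stats over the non-null values."""
    non_null = [v for v in values if v is not None]
    profile = {"row_count": len(values), "null_count": len(values) - len(non_null)}
    if non_null:
        profile.update(
            min=min(non_null),
            max=max(non_null),
            mean=statistics.mean(non_null),
            mode=statistics.mode(non_null),
        )
    return profile

# A toy height column, including the nulls that make profiling matter.
heights = [180, 165, None, 180, None, 172]
print(profile_column(heights))
```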
Data validation, on the other hand, might identify that 75% of customer heights are null and, if that is a bug, flag it as a validation failure (log and/or create an alert). But perhaps it isn't a bug. Perhaps height is an optional field and about 25% of customers opt in to providing that information, in which case validations should not flag it as an issue.
Additionally, data validations do type checks (e.g. is birthdate numeric?), row counts, etc. Perhaps a custom validation might be: "is the number of rows loaded yesterday to table x within 2 standard deviations of a rolling last 10 Sundays?". The value of the combination of profiling and validating is difficult to debate. The best simple implementation I’ve come up with thus far is shown in the diagram below:
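That custom validation can be sketched in a few lines (stdlib only; the row counts and threshold are made up for illustration):

```python
import statistics

def within_n_stddevs(history: list, observed: int, n: float = 2.0) -> bool:
    """Validation sketch: is the observed row count within n standard
    deviations of the rolling history (e.g. the last 10 Sundays)?"""
    mean = statistics.mean(history)
    stddev = statistics.stdev(history)  # sample stddev; needs >= 2 points
    return abs(observed - mean) <= n * stddev

# Hypothetical row counts loaded on the last 10 Sundays:
sundays = [1000, 1020, 980, 1010, 990, 1005, 995, 1015, 985, 1000]
print(within_n_stddevs(sundays, 1012))  # True: close to the rolling mean
print(within_n_stddevs(sundays, 5000))  # False: flag, log, and/or alert
```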
Here, ElasticSearch is introduced. ElasticSearch leverages the Lucene technology made famous by Doug Cutting (of Hadoop fame). In a nutshell, it's full-text search, so you can "ask" questions about the data via textual queries. In this example, perhaps validations and profiles are captured for the CUSTOMER_D (dimension) table. You could "search" for CUSTOMER_D, retrieve the statistics (profiles and validations) associated with (tagged with) that table name, and have the information served up via the open source Kibana.
A Note about the ELK Architecture

ElasticSearch and Kibana make up the "E" and "K" of the ELK architecture (the "L" being Logstash). Here, ElasticSearch is being used differently in two ways:
- It isn’t collecting log information (further, Logstash is not any part of the Anonymous architecture). Instead, it is capturing profiles and validations specifically for surfacing via ElasticSearch/Kibana
- Because of the nature of usage here, ElasticSearch could be used ephemerally, which would be a significant exploitation of the cloud pricing model, but:
- This increases complexities. For one, Compute Management would have to be extended (luckily, it was designed with this idea in mind)
- Data persisted to disk (S3 in the diagram above) would have to be reprocessed regularly, and thus, some data deletion would need to happen on the regular. Luckily, older data is less important (e.g. very rarely does anyone ask for profiles of data that is months old, except in summarization)
Finally, alerts can be done in various ways, and adding a subscription to SNS that checks for errors could pass them on to Slack, or to OpsGenie, for that matter (though the former would be more likely than the latter).
Role and Responsibilities of Data Visualization

Data Visualization, the act of visually identifying the signal in the noise, is a vital skill set that isn't appreciated nearly enough (the number of pie charts that blind me regularly in a startup is sufficient evidence of that).
In the ADE, data visualization is done via AWS QuickSight, as a stop-gap. As AWS products go, this is the one that likely made Bezos lose his hair; it's an abomination compared to many of their other offerings. However, the alternatives are not without cost, and not without needing to stand up additional (likely persistent) infrastructure, within VPCs but accessible internally, etc.
And with the DSE team being extremely small, the decision is to leverage an AWS service until a change is justified.
QuickSight connects to the Hive Metastore/Glue Catalog and retrieves its data from the base data model and/or reporting tables, as discussed previously. This connectivity is shown in the diagram below:
As the organization matures, the DSE team grows, and the organization begins to use the ADE, it is highly likely that the application itself will be upgraded from QuickSight to Tableau, Looker, or another tool.
Role and Responsibilities of Data Analysis

I make a slight distinction with Data Analysis, in part because it is only occasionally a dedicated role. Those individuals are, in my experience, usually from a math background, and spend time looking at the data in its raw form (as opposed to corporate visualizations; they may create a data visualization, but it's more transient in nature because it is specific to their current use case).
Almost inevitably, data analysis is performed leveraging SQL, so from a language standpoint, there is massive overlap with Data Visualization, Data Engineering, and Data Science (hello, Spanner).
When a number of other initiatives are complete, or well in hand, and the organization is using the ADE regularly, we will likely upgrade to Presto on EMR. Because of priorities and team sizes, however, the decision at this point is to provide that SQL interface to the ADE via AWS Athena (essentially Presto as a service, priced per query by the amount of data scanned). Again, the decision to leverage a service offering exploits the pricing model of the cloud. It won't once the number of queries executed against at-scale data volumes grows, but the architecture is flexible enough to easily switch (we can already run Presto on ephemeral, horizontally scalable, and elastic EMR clusters).
Like QuickSight, Athena leverages the Hive Metastore/Glue Catalog and data on S3 via the base data model, reporting tables, or transient/special purpose tables the user might create for task-at-hand usage. This is shown in the diagram below.
Integrating with the Parent Company Enterprise Data Warehouse (EDW)

Integrating data into the Parent Company is extremely valuable, as it can be used internally there, in conjunction with their existing data, as part of their Enterprise Data Warehouse (EDW).
Luckily, the Parent Company, like Anonymous, leverages AWS, and specifically Redshift, S3, EC2, and QuickSight. This makes integration relatively straightforward, via S3 replication across accounts. Where that data comes from, and how it is prepared (and explained), is the challenge here, not the technology or connecting platforms.
Generally speaking, what is paramount is sharing data in a modern, compressed, self-describing format (Parquet, ORC), and focusing on explanation, so that the team at the Parent Company knows what to expect and when, and the process to follow when that fails.
Conclusion

There was more to my presentation on Data, but for shit's sake, this is already coming up on ten thousand words, and let's be honest here, even I'm not going to take the time to read all of this. Suffice it to say, this stuff is really interesting to me, but there are a hundred ways to skin these cats. Snowflake is a big winner in the market right now, including a stock price currently above $250 (they must be doing something right!).
And there are a number of other competitors in the market. I have my reasons for my choices, and plenty of good arguments could be made for the alternatives.
I'll admit my approaches do introduce some additional complexities, but the end result is an incredibly flexible, scalable, and extensible Data architecture that exploits the pricing model of the cloud. At the end of the day, you can find/train/teach smart people to understand this shit, and doing so makes them the best Data Engineers/Spanners/Visualization Experts/Analysts/Data Scientists out there, which benefits them just as much as it benefits you. The technology is a means to an end, and it's the people that build the solutions. So, if I can check the boxes on good architecture, make Data accessible and self-serve, and help people be successful doing it this way, I think I'll stick to my guns for a while longer.
 In mature organizations that consider experimentation part of their corporate DNA, anyone across the company can recommend (and in many circumstances, implement) experiments (usually A/B tests). This specifically talks to ownership (Product: this role determines what should go into the offering) and specific required skills (Data Scientist: there is, e.g., a statistical component to teasing out p-values in multivariate experiments)
 With one exception: access to the raw and unadulterated data (pre-obfuscation) is limited to a small team of Data Engineers/Spanners.
 A topic I’ll return to in the section on Reusability.
 “Accessible” here means that it is understood, actionable from a timing perspective, and trustworthy
 Leveraging the AWS service, it is referred to as the “Glue Catalog”.
 I consider mini-batch to be anything executing intraday, down to about every 15 minutes, give or take
 In certain scenarios, I will recommend leveraging Kafka over Kinesis. Those decisions are almost inevitably about data volumes. At Box, for instance, we were processing nearly 3 petabytes of data daily. For that, Kafka made sense. Kinesis is the normal choice because streaming volumes are not often that significant, and the integrations of Kinesis in the overall AWS ecosystem are superior to Kafka, though AWS does now offer Kafka as a Service.
 Often, you’ll have to adjust a high watermark (or similar) to assure previously processed data is picked up in a fix run, but this should be relatively frictionless
 With extremely rare exception
 With total apologies to all my Data Scientist friends. You all are way smarter than I am. You always remind everyone.
 An extension, perhaps, of the definition found here.