By Ian Gorton
Senior Member of the Technical Staff
Software Solutions Division
In earlier posts on big data, I have written about how long-held design approaches for software systems simply don’t work as we build larger, scalable big data systems. Examples of design factors that must be addressed for success at scale include the need to handle the ever-present failures that occur at scale, assure the necessary levels of availability and responsiveness, and devise optimizations that drive down costs. Of course, the required application functionality and engineering constraints, such as schedule and budgets, directly impact the manner in which these factors manifest themselves in any specific big data system. In this post, the latest in my ongoing series on big data, I step back from specifics and describe four general principles that hold for any scalable, big data system. These principles can help architects continually validate major design decisions across development iterations, and hence provide a guide through the complex collection of design trade-offs all big data systems require.
First Principle: System Costs Must Grow More Slowly Than System Capacity
Imagine you’re asked to build a system that will initially manage and analyze 1 petabyte (PB) of data, and the budget to build and operate this system in the first year is $2 million. In addition, data and processing are expected to double in size every year, and performance and availability requirements are expected to remain stable. This growth pattern means that in four years, your system will be managing 16PB of data, and in six years, 64PB. For many development teams, this growth rate would be a daunting requirement.
Now imagine that your design can sustain this growth, but the associated costs will also double each year, which means that in four years the project will require a budget for development and operations of $32 million. That's probably not an estimate many clients will like. What is reasonable will of course depend on how the system functionality evolves, but linear cost growth estimates are far more likely to be acceptable. This principle, that costs must grow much more slowly than capacity, is depicted in Figure 1 below.
Figure 1. Costs grow slowly as capacity grows quickly
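The arithmetic behind this example can be sketched as a back-of-the-envelope projection. The dollar and petabyte figures are the hypothetical ones from the scenario above; the "linear" cost curve is simply one plausible target, not a prescription:

```python
# Compare exponential capacity growth with doubling vs. roughly linear
# cost growth, using the hypothetical figures from the example above.
def project(years, capacity_pb=1, cost_m=2):
    rows = []
    for y in range(years + 1):
        capacity = capacity_pb * 2 ** y   # data doubles every year
        cost_doubling = cost_m * 2 ** y   # unsustainable: costs double too
        cost_linear = cost_m * (1 + y)    # target: costs grow roughly linearly
        rows.append((y, capacity, cost_doubling, cost_linear))
    return rows

for year, pb, bad, good in project(6):
    print(f"year {year}: {pb:>3} PB, doubling cost ${bad}M, linear cost ${good}M")
```

By year four the doubling-cost design needs the $32 million budget from the example, while a linear-cost design needs $10 million for the same 16PB of capacity.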
Adhering to this principle requires the system architecture to directly address and minimize the costs associated with rapidly growing capacity. For example, choosing a database that can be expanded with minimal manual intervention (for example, one that distributes data using consistent hashing) will make increasing data capacity a fixed, low cost activity. If a request load is cyclic, experiencing peaks and troughs, you can create an elastic solution that handles peaks by provisioning new cloud-based resources as required, and tearing these down to save costs once the peaks are over. If historical data will be accessed infrequently, it may be possible to store it on slower, low-cost storage media and build summarized views of the data for rapid access. Or, if the access patterns are amenable, the most frequently accessed historical data could be cached in online data stores.
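The consistent-hashing approach mentioned above can be illustrated with a minimal sketch (not production code, and not tied to any particular database): keys are placed on a hash ring, and adding a node remaps only the small fraction of keys that fall into the new node's segments, which is what makes capacity growth a fixed, low-cost activity.

```python
# Minimal consistent-hashing sketch: adding a node moves only a small
# fraction of keys, so data capacity can be grown incrementally.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual nodes smooth the distribution
        self._ring = []            # sorted (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}:{i}"), node))

    def node_for(self, key):
        # A key belongs to the first ring point at or after its hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["db1", "db2", "db3"])
owner = ring.node_for("user:42")
ring.add("db4")   # only keys hashing into db4's segments change owner
```

With a naive `hash(key) % num_nodes` scheme, by contrast, adding one node remaps almost every key, which turns every capacity increase into a bulk data migration.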
Scalable software architectures therefore need to constantly seek out and implement efficiencies to ensure costs grow as slowly as feasible, while data and processing capacity grow as rapidly as requirements demand. These efficiencies pervade all layers of big data architectures, and by testing design options against this principle, scalable solutions are much more likely to emerge.
Second Principle: The More Complex a Solution, the Less Likely it Will Scale
Most of us are taught at an early age that if a deal sounds too good to be true, it probably is. Common sense tells us investments that are guaranteed to grow at 100 percent a year are almost certainly bogus or illegal, so we ignore them. Unfortunately, in building scalable software systems, common sense is frequently put on the back burner when competing design alternatives and products are evaluated as candidates for major components of big data systems.
Let’s take a simple example: Strong consistency in databases is the bedrock of transactional systems and relational databases. Implementing strong consistency, however, is expensive, especially in distributed databases. To build highly scalable and available systems, the NoSQL database movement has consequently weakened the consistency models we can expect from databases. This trend has occurred for a good reason: weak consistency models are inherently more efficient to implement because the underlying mechanisms required are simpler.
In response, relational databases and the NewSQL technologies are now turning to new implementation models that provide strong consistency. NewSQL solutions aim to achieve the same scalable performance of NoSQL systems for online transaction processing workloads, while simultaneously supporting the atomicity, consistency, isolation, and durability (ACID) properties found in traditional SQL database systems. This approach sounds attractive, and some of the new open source technologies that exploit main memory and single-threading show immense promise. But fundamentally, achieving strong consistency requires more complexity, and as the scale of the problem grows, it is almost certainly not going to scale as well as weak consistency models.
Of course, weak consistency models will give your application greater scalability, but there are trade-offs. You probably have to de-normalize your data model and, hence, manage any duplication this introduces. Application code has to handle the inevitable conflicts that arise with weak consistency models. As always, there’s no free lunch. But, if your data and workload are amenable to a weak consistency model (and many are, even ones we think of as needing strong consistency), it will be your path to scalability.
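One common pattern for handling those conflicts in application code is last-writer-wins resolution on a logical timestamp. The sketch below is illustrative only, with hypothetical names, and deliberately exposes the trade-off: concurrent updates can be silently lost, which a real system might address with vector clocks or application-specific merge logic instead.

```python
# Sketch: last-writer-wins conflict resolution, one simple way application
# code can reconcile conflicting versions returned by a weakly consistent
# store. Names are illustrative, not tied to any particular database.
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: int   # logical clock; a real system might use vector clocks

def resolve(replica_a: VersionedValue, replica_b: VersionedValue) -> VersionedValue:
    # Two replicas hold conflicting versions of the same key; keep the
    # newer write. Trade-off: a concurrent older write is discarded.
    return replica_a if replica_a.timestamp >= replica_b.timestamp else replica_b

winner = resolve(VersionedValue("shipped", 7), VersionedValue("pending", 5))
```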
Another example of the second principle is the scalability of many simple concurrent read requests versus concurrent queries that invoke complex statistical machine learning and data mining approaches to analyze and process tens of megabytes of data. We often see requirements that demand the latter, along with two-second response times. Common sense tells us that is unlikely to happen and certainly won’t scale (this is when to consider materialized views and caches), but often in these circumstances, common sense simply doesn’t seem to prevail. This principle is depicted below in Figure 2.
There is one more key point to remember. Even though one design mechanism may be fundamentally more scalable than another, the implementation of the mechanism, and how you use it in your applications, determine the scalability you can actually expect from your system. Poorly implemented scalable mechanisms will not scale well, and in our experience these are not uncommon. The same applies to inappropriate use of a scalable mechanism in an application design, such as trying to use a batch solution like Hadoop for real-time querying.
Adhering to the second principle requires thinking about the fundamental distributed systems and database approaches that underpin design decisions. Even simple rules of thumb can be enormously beneficial when considering how a design may scale. Ignore these details at your peril, as many have recently found out.
Third Principle: Avoid Managing Conversational State Outside the Data Tier
State management is a much debated and oft misunderstood issue. Many frameworks, for example the Java Enterprise Edition (JEE), support managing state in the application logic tier by providing explicit abstractions and simple application programming interfaces (APIs) that load the required state from the database into service instance variables, typically for user session state management. Once in memory, all subsequent requests for that session can hit the same service instance, and efficiently access and manipulate the data that’s needed. From a programming perspective, stateful services are convenient and easy.
Unfortunately, from a scalability perspective, stateful solutions are a bad idea for many reasons. First, they consume server resources for the duration of a session, which may span many minutes. Session lengths are often unpredictable, so having many (long-lived) instances on some servers and few on others may create a load imbalance that the system must somehow manage. When sessions do not terminate cleanly (e.g., a user does not log out), an instance remains in memory and consumes resources unnecessarily before some inactive timeout occurs and the resources are reclaimed. Finally, if a server becomes inaccessible, you need some logic, somewhere, to handle the exception and recreate the state on another server.
As we build systems that must manage many millions of concurrent sessions, stateful services simply do not scale. Stateless services, where any service instance can serve any request in a timely fashion, are the scalable solution. In a stateless architecture, the session state is securely passed as parameters from the client with each request, and the data is accessed from the persistent data tier and/or from caches, where a miss causes only a performance hit, not a failure. When a server fails, requests can simply be routed to another service instance, and new service nodes can be started at any time to add capacity. Passing state with each request does consume more network resources, but the amount is typically small because the state being communicated is conversational. Conversational state is needed only while a session (conversation) is in progress, to control a sequence of interactions from a single client, and hence is limited in scope and size.
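The stateless pattern can be sketched as follows: conversational state travels with each request as a compact, tamper-evident token, so any server instance can handle any request. The token format and secret handling here are deliberately simplified for illustration; a production system would manage and rotate keys properly and typically use an established token format.

```python
# Sketch: conversational state carried with each request as an HMAC-signed
# token, so any stateless server instance can serve the request.
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"   # in practice, managed and rotated securely

def issue_token(state: dict) -> str:
    payload = base64.urlsafe_b64encode(json.dumps(state).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"

def read_token(token: str) -> dict:
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered or invalid token")
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token({"user": "alice", "cart_items": 3})
state = read_token(token)   # any server instance can reconstruct the session
```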
Stateless solutions scale trivially by adding resources, simplifying application logic and system operations. The inevitable design trade-off is that stateless solutions place more load on your data tier, making the scalability of an application’s databases a crucial design factor. As Martin Kleppmann writes in his blog post, Six Things I Wish We Had Known About Scaling, “the hard part is scaling the stateful parts of your system: your databases.” For this reason, I am working with other SEI researchers to develop LEAP(4BD), to help organizations fully understand the implications of their data tier choices for big data systems.
Fourth Principle: You Can’t Manage What You Don’t Monitor
Big data systems rapidly attain deployment scales that change many accepted wisdoms of software engineering. Put simply, two of these challenges to accepted wisdom are:
- The more software and hardware components big data systems have, the higher the likelihood that failures occur.
- It is impractical to fully test new code because tests will become obsolete as soon as the size of the data on which the code operates grows. As Kleppmann explains, “realistic load testing is hard.”
The only feasible solution to these two problems is to weave powerful monitoring and analysis capabilities into your applications and deployment infrastructure. By carefully monitoring how systems behave as code and databases and deployments scale, it becomes possible to more easily respond to failures and even proactively take actions as pressure points build up.
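Weaving monitoring into application code can start very simply. The sketch below aggregates per-operation latencies in memory with a decorator; a real deployment would export these measurements to a metrics backend, but the instrumentation point (wrapping each significant operation) is the same. All names here are illustrative.

```python
# Sketch: lightweight latency monitoring woven into application code.
# A real system would ship these measurements to a metrics backend.
import time
from collections import defaultdict
from functools import wraps

metrics = defaultdict(list)   # operation name -> list of latencies (seconds)

def monitored(name):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:   # record latency even when the call fails
                metrics[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@monitored("lookup")
def lookup(key):
    return key.upper()    # stand-in for a real data-tier call

lookup("a")
lookup("b")
sample_count = len(metrics["lookup"])
```

Recording latencies on failures as well as successes matters at scale: failure-path behavior is exactly what you need visibility into when pressure points build up.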
This last topic is complex, and worthy of treatment by itself, which I’ll do in my next installment in this blog series. In summary though, deep and flexible monitoring and analysis capabilities are fundamental to the success of big data systems, and your architectures must be designed to take into account the costs of these capabilities, which can be considerable. As we’ll see, this requirement for production-time data capture and analysis has many implications for successful big data system deployments.
Final Thoughts and Looking Ahead
The four principles described above hold for any big data system, so adhering to them will always be a good thing. In contrast, unconsciously violating these principles is likely to lead a system into downstream distress, slowing capability delivery, massively inflating costs, and potentially leading to project failure. Of course, simple expediency may mean you have to violate a principle to meet a near-term deliverable. Violations are not a bad thing, as long as you recognize the technical debt that has been incurred and plan accordingly to pay this debt back before the interest becomes a major problem.
The next post in this ongoing series on big data will examine the challenges of scale and how observability is fundamental to the success of big data systems.
To listen to the podcast, An Approach to Managing the Software Engineering Challenges of Big Data, please visit
For more information about the Lightweight Evaluation and Architecture Prototyping (for Big Data) method, known as LEAP(4BD), please visit http://blog.sei.cmu.edu/post.cfm/challenges-big-data-294.