The Importance of Software Architecture in Big Data Systems

By Ian Gorton
Senior Member of the Technical Staff
Software Solutions Division

Many types of software systems, including big data applications, lend themselves to highly incremental and iterative development approaches. In essence, system requirements are addressed in small batches, enabling the delivery of a functional release of the system at the end of every increment, typically once a month. The advantages of this approach are many and varied. Perhaps foremost is that it constantly forces the validation of requirements and designs before too much progress is made in inappropriate directions. Ambiguity and change in requirements, as well as uncertainty in design approaches, can be rapidly explored through working software systems, not simply models and documents. Necessary modifications can be carried out efficiently and cost-effectively through refactoring, before the code becomes too ‘baked’ and complex to change easily. This posting, the second in a series addressing the software engineering challenges of big data, explores how the nature of building highly scalable, long-lived big data applications influences iterative and incremental design approaches.

Iterative, incremental development approaches are embodied in agile development methods, such as XP and Scrum. While the details of each approach differ, the notion of evolutionary design is at the core of each. Agile software architects eschew large, planned design efforts (also known as Big Design Up Front), in lieu of just enough design to meet deliverable goals for an iteration. Design modifications and improvements occur as each iteration progresses, providing on-going course corrections to the architecture and ensuring that only features that support the current targeted functionality are developed.

Martin Fowler provides an excellent description of the pros and cons of this approach. He emphasizes the importance of test-driven development and continuous integration as key practices that make evolutionary design feasible. In a similar vein, at the SEI we are developing an architecture-focused approach that can lead to more informed system design decisions that balance short-term needs with long-term quality.
Evolutionary, emergent design encourages lean solutions and avoids over-engineered features and software architectures. This design approach limits time spent on tasks such as updating lengthy design documentation. The aim is to deliver, in as streamlined a manner as possible, a system that meets its requirements.

There is, of course, an underlying assumption that must hold for evolutionary design to be effective: change is cheap. Changes that are fast to make can easily be accommodated within short development cycles. Not all changes are cheap, however. Cyber-physical systems, where hardware-software interfaces are dominant, offer prominent examples of systems in which hardware modifications or unanticipated deployment environments can lead to changes with long development cycles. Other types of change can be expensive in purely software systems, as well.

For example, poorly documented, tightly coupled, legacy code can rarely be successfully replaced in a single iteration. Incorporating new, third-party components or subsystems can involve lengthy evaluation, prototyping, and development cycles, especially when negotiations with vendors are involved. Likewise, architectural changes—for example, moving from a master-slave to a peer-to-peer deployment architecture to improve scalability—regularly require a fundamental and widespread re-design and refactoring that must be spread judiciously over several development iterations.

Evolutionary Design and Big Data Applications

As we described in a previous blog post, our research focuses on addressing the challenges of building highly scalable big data systems. In these systems, the requirements for extreme scalability, performance, and availability introduce complexities that require new design approaches from the software engineering community. Big data solutions must adopt highly distributed architectures with data collections that are (1) partitioned over many nodes in clusters to enhance scalability and (2) replicated to increase availability in the face of hardware and network failures. NoSQL distributed database architectures provide many of the capabilities that make scalability feasible at acceptable costs. They also introduce inherent complexities that force applications to perform the following tasks:

  • explicitly handle data consistency
  • tolerate a wide range of hardware and software faults
  • support component monitoring and performance measurement so that operators have visibility into the behavior of the deployment
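To make the consistency point concrete, here is a minimal sketch (plain Python, no real database client) of the quorum arithmetic that many NoSQL stores expose to applications: with N replicas, a write quorum W, and a read quorum R, a read is guaranteed to see the latest write only when R + W > N, because every read quorum then overlaps every write quorum.

```python
def is_strongly_consistent(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """R + W > N guarantees that any read quorum overlaps any write
    quorum, so at least one replica read holds the latest write."""
    return read_quorum + write_quorum > n_replicas

# N=3, W=2, R=2: overlapping quorums -> reads always see the latest write
print(is_strongly_consistent(3, 2, 2))   # True
# N=3, W=1, R=1: fast, but eventually consistent -- stale reads are possible
print(is_strongly_consistent(3, 1, 1))   # False
```

Applications built on stores that let these parameters be tuned per request must decide, operation by operation, whether to pay the latency cost of overlapping quorums or handle potentially stale data explicitly.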

Due to the size of their deployment footprint, big data applications are often deployed on virtualized, cloud platforms. Cloud platforms are many and varied in their nature, but generally offer a set of services for application configuration, deployment, security, data management, monitoring, and billing for use of processor, disk, and network resources. A number of cloud platforms from service providers, such as Amazon Web Services and Heroku, as well as open-source systems, such as OpenStack and Eucalyptus, are available for deploying big data applications in the cloud.

In this context of big data applications deployed on cloud platforms, it’s interesting to examine the notion of evolutionary system design in an iterative and incremental development project. Recall that evolutionary design is effective as long as change is cheap. Hence, are there elements of big data applications where change is unlikely to be a straightforward task, and that might, in turn, require major rework and perhaps even fundamental architecture changes for an application?

We posit that there are two main areas in big data applications where change is likely so expensive and complex that it warrants a judicious upfront architecture design effort. These two areas revolve around changes to data management and cloud deployment technologies:

  • Data Management Technologies. For many years, relational database technologies dominated data management systems. With a standard data model and query language, competitive relational database technologies share many traits, which makes moving to another platform or introducing another database into an application relatively straightforward. In the last five years, NoSQL databases have emerged as foundational building blocks for big data applications. This diverse collection of NoSQL technologies eschews standardized data models and query languages. Each technology employs radically different distributed data management mechanisms to build highly scalable, available systems. With different data models, proprietary application programming interfaces (APIs) and totally different runtime characteristics, any transition from one NoSQL database to another will likely have fundamental and widespread impacts on any code base.
  • Cloud Deployments. Cloud platforms come in many shapes and sizes. Public cloud services provide hosting infrastructures for virtualized applications and offer sophisticated software and hardware platforms that support pay-as-you-use cost models. Private cloud platforms enable organizations to create clouds behind their corporate firewalls. Again, private clouds offer sophisticated mechanisms for hosting virtualized applications on clusters managed by the development organization. Like NoSQL databases, little commonality exists between various public and private cloud offerings, making a migration across platforms a daunting proposition with pervasive implications for application architectures. In fact, a whole genre of dedicated cloud migration technologies, including Yuruware and Racemi, is emerging to address this problem. Where opportunities for new tools such as these exist, the problem they are addressing is likely not something that can be readily accommodated in an evolutionary design approach.
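To illustrate why moving from one NoSQL database to another ripples through a code base, the sketch below contrasts a key-value data model with a document data model using hypothetical in-memory stand-in classes (real client APIs diverge even further). The same record is stored in both, but what the application can ask of the store differs immediately.

```python
# Hypothetical stand-ins for illustration only; not real NoSQL client APIs.
class KeyValueStore:
    """Opaque values: the store can only get/put whole blobs by key;
    interpreting the value is entirely the application's job."""
    def __init__(self):
        self._data = {}
    def put(self, key, value_blob):
        self._data[key] = value_blob
    def get(self, key):
        return self._data[key]

class DocumentStore:
    """Structured documents: the store can match on individual fields."""
    def __init__(self):
        self._docs = []
    def insert(self, doc):
        self._docs.append(doc)
    def find(self, **criteria):
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

kv = KeyValueStore()
kv.put("user:42", '{"name": "Ada", "city": "London"}')  # app serializes/parses

docs = DocumentStore()
docs.insert({"_id": 42, "name": "Ada", "city": "London"})
# Query by a non-key field -- not expressible in the key-value model
# without scanning and deserializing every value in the application.
matches = docs.find(city="London")
```

Migrating from the document store to the key-value store would force every field-level query in the application to be rewritten, which is exactly the kind of pervasive change that resists a single iteration.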

LEAP4BD

Our Lightweight Evaluation and Architecture Prototyping for Big Data (LEAP4BD) method reduces the risk of needing to migrate to a new database management system by ensuring that a thorough evaluation of the solution space is carried out with minimal time and effort. LEAP4BD provides a systematic approach for a project to select a NoSQL database that can satisfy its requirements. This approach is amenable to iterative and incremental design, because it can be phased across one or more increments to suit the project’s development tempo.

A key feature of LEAP4BD is its NoSQL database feature evaluation criteria. This ready-made set of criteria significantly speeds up a NoSQL database evaluation and acquisition effort. To this end, we have categorized the major characteristics of data management technologies into the following areas:

  • Query Language—characterizes the API and specific data manipulation features supported by a NoSQL database
  • Data Model—categorizes core data organization principles provided by a NoSQL database
  • Data Distribution—analyzes the software architecture and mechanisms that are used by a NoSQL database to distribute data
  • Data Replication—determines how a NoSQL database facilitates reliable, high performance data replication
  • Consistency—categorizes the consistency model(s) that a NoSQL database offers
  • Scalability—captures the core architecture and mechanisms that support scaling a big data application in terms of both data and request load increases
  • Performance—assesses mechanisms used to provide high-performance data access
  • Availability—determines mechanisms that a NoSQL database uses to provide high availability in the face of hardware and software failures
  • Modifiability—assesses whether an application’s data model can be easily evolved and how that evolution impacts clients
  • Administration and Management—categorizes and describes the tools provided by a NoSQL database to support system administration, monitoring, and management

Within each of these categories, we have detailed evaluation criteria that can be used to differentiate big data technologies. For example, here’s an extract from the Data Model evaluation criteria:

  1. Data Model style
    a.    Relational
    b.    Key-Value
    c.    Document
    d.    Column
    e.    Graph
    f.    XML
    g.    Object
  2. Data item identification
    a.    Key-value for each field
    b.    Objects in the same store can have variable formats
    c.    Opaque data items that need application interpretation
    d.    Fixed or variable schema
    e.    Embedded hierarchical data items supported (e.g., sub-documents)
  3. Data Item key
    a.    Automatically allocated
    b.    Composite keys supported
    c.    Secondary indexes supported
    d.    Querying on non-key metadata supported
  4. Query Styles
    a.    Query by key
    b.    Query by partial key
    c.    Query by non-key values
           i.    Indexed
           ii.    Non-indexed
    d.    Text search in data items
           i.    Indexed
           ii.    Non-indexed

In LEAP4BD, we first work with the project team to identify features pertinent to the system under development. These features help identify a specific set of technologies that will best support the system. From there, we weight individual features according to system requirements and evaluate each candidate technology against these features.
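The weighting-and-scoring step can be sketched as a simple weighted sum. The features, weights, and candidate scores below are invented for illustration; they are not drawn from an actual LEAP4BD evaluation.

```python
def rank_candidates(weights, scores):
    """Compute a weighted-sum total per candidate and return
    (candidate, total) pairs, highest total first."""
    totals = {
        candidate: sum(weights[feature] * score
                       for feature, score in feature_scores.items())
        for candidate, feature_scores in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical feature weights (importance, 1-5) and per-feature
# candidate scores (1-10); "db_a" and "db_b" are placeholder names.
weights = {"scalability": 5, "consistency": 3, "query_language": 2}
scores = {
    "db_a": {"scalability": 9, "consistency": 5, "query_language": 6},
    "db_b": {"scalability": 6, "consistency": 8, "query_language": 7},
}
ranking = rank_candidates(weights, scores)
# ranking == [('db_a', 72), ('db_b', 68)]
```

The value of the method lies less in the arithmetic than in making the weights explicit and traceable to system requirements, so stakeholders can see exactly why one candidate outranks another.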

LEAP4BD is supported by a knowledge base that stores the results of our evaluations and comparisons of different NoSQL databases. We have pre-populated the LEAP4BD knowledge base with evaluations of specific technologies (e.g., MongoDB, Cassandra, and Riak) with which we have extensive experience. Each evaluation of a new technology adds to this knowledge base, making evaluations more streamlined as the knowledge base grows. Overall, this approach is systematic, quantitative, and highly transparent, quickly producing a ranking of the various candidate technologies according to project requirements.

As we have demonstrated thus far in this series, there are many facets to LEAP4BD. The next post in this series on big data will explain the prototyping phase. In the meantime, we’re keen to hear from developers and architects who are evaluating big data technologies, so please feel free to share your thoughts in the comments section below.

Additional Resources

To listen to the podcast, An Approach to Managing the Software Engineering Challenges of Big Data, please visit
http://url.sei.cmu.edu/iq

7 responses to “The Importance of Software Architecture in Big Data Systems”

  1. jens bylehn Says:
    Having lived with a big data application many years and seen the application evolve I can agree with many observations made here.
    Agile development is key to success; it is very difficult to envision how the data distribution and noise will affect the end result.

    But I have also found that the one key problem is dealing with data syntax and semantic shifts in the source. The semantics of fields change, new columns are added, redundant fields conflict, etc. Data preparation is the biggest challenge.

    I have enjoyed the capability of a relational DB not only to normalize input data to 3NF, but most importantly to update and extend the reference model used for the normalization.
    Thus I can keep old data and variations from different sources searchable with the same conformance to the semantic meaning of key values. "London" is not the same London in all sources...

    I find it difficult to find any such capabilities in NoSQL or map-reduce solutions, where the variations seem to have to be dealt with in the search query.

    This late-or-early approach is difficult to refactor and is tightly intertwined with the application. Or is it a misconception that time degenerates the queries?
  2. Ian Gorton Says:
    Jens - these are great observations. Schemaless NoSQL technologies certainly make data model evolution straightforward, but you are of course right that these can impact query code, putting greater complexity in the applications to trade off for the much simpler data models supported.

    In essence, there's a trade-off here to gain scalability. For example, a graph DB like Neo4j gives you a much richer data model and powerful query capabilities, but you can't currently scale it out horizontally like a typical key-value or column store. This is why mapping the needs of the application, especially in terms of scale, consistency and availability, to the capabilities of these technologies should be performed judiciously. There's no right answer, but with care you can minimize downstream 'surprises'.
  3. Gerry Ehlers Says:
    Hi Ian - I really like your work and ideas behind LEAP4DB for evaluating, currently, NoSQL technologies. I think that this prototyping model can be extended to incorporate NewSQL DB solutions that are cropping up now, as well. Dove-tailing off your response to Jens, NewSQL technologies aim to solve some of the issues found in the NoSQL world. While NoSQL technologies and their partners are still rapidly evolving in an effort to solve such problems, the options that attempt to do so seem to be wrapper-technologies (i.e., non-native) and in some cases are proprietary. This creates an entirely new set of problems, in some respects, in this rapidly evolving DB world.

    That said, the LEAP4DB model could also incorporate examples of which types of data are potentially most suitable for a given DB solution\technology and\or model (i.e., banking transactions, blogs, sensor data, customer relations data, node\device events, application logs, etc.). I would think that an important factor in the decision making process would be to determine, for any given DB solution or model, the level of effort an organization will assume if, for example, the requirements for one or more of their applications require more complex queries than they originally thought. Being able to make the most out of your data or "big data" in a timely fashion is essential - quicker is better.

    Since, as of right now, one size fits none and that a polyglot DB environment is seemingly inevitable, it will be interesting to see which solutions evolve in a manner that minimizes the need for a polyglot approach. It's a race between the old, current and new solutions to make this happen. Some are in better positions than others to make a "one size fits some or most", but it will likely all unfold over the next decade I imagine...
  4. venkata sree rama murty maddula Says:
    Nice topic... but if you floated LEAP4DB as an online course, it would be a great help for all.
    Somewhere I read that there are no standards for big data; what do you say? And what about documentation for these architectures: do you use ArchiMate, or are other tools available?
  5. Bas Geerdink Says:
    Ian, nice article. Please review my paper on this topic at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6750165 and http://gdink.com/data/sites/1/article_geerdink.pdf
  6. Ian Gorton Says:
    Gary - agreed - these are interesting times in the big data world. My hunch is this market is going to take quite a few years to settle and converge. Scale is driving so many innovations, and as you say, it's not clear right now which approaches will be around for the long term.
  7. Ian Gorton Says:
    Thx Bas - I had a quick glance and it looks interesting. I shall read ....
