A relational database management system (RDBMS), a program that lets you manage a relational database, can help ensure data consistency and compliance with a formal database schema. For enterprises that embrace data science, however, the restrictions of a RDBMS can actually hinder, rather than help, your team as it builds computationally expensive and data-heavy projects such as machine learning models.

With a RDBMS, new data or modifications to existing data are not accepted unless they satisfy the constraints represented in the schema, which are usually related to data types, referential integrity, etc. The way in which a RDBMS coordinates its transactions guarantees that the entire database is consistent at all times and adheres to the well-known ACID properties of database transactions: atomicity, consistency, isolation and durability. Consistency is, of course, a desirable property — you wouldn’t want erroneous data to enter the system.
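As a minimal sketch of these guarantees, the following uses Python's built-in sqlite3 module. The customer and account tables, their columns, and the helper functions are hypothetical and exist only for illustration. The sketch shows a referential-integrity violation being rejected and a transfer being rolled back as a whole when one of its updates breaks a constraint.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a small, constrained schema (illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    conn.executescript("""
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE account (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            balance     INTEGER NOT NULL CHECK (balance >= 0)
        );
        INSERT INTO customer VALUES (1, 'Alice');
        INSERT INTO account VALUES (10, 1, 100), (11, 1, 50);
    """)
    return conn


def insert_orphan_account(conn):
    """Referential integrity: an account for a non-existent customer is rejected."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute("INSERT INTO account VALUES (12, 999, 10)")
        return True
    except sqlite3.IntegrityError:
        return False


def transfer(conn, src, dst, amount):
    """Atomicity: both updates are committed together, or neither is."""
    try:
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:  # the CHECK on balance failed
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


conn = make_db()
print(insert_orphan_account(conn))  # False: customer 999 does not exist
print(transfer(conn, 10, 11, 500))  # False: account 10 would be overdrawn
print(balances(conn))               # {10: 100, 11: 50}: the credit was rolled back too
```

Note that the failed transfer had already credited the destination account before the debit was rejected; the rollback is what keeps the database consistent.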

However, sometimes this focus on consistency becomes a burden. It induces (sometimes unnecessary) overhead and hampers scalability and flexibility. A RDBMS is at its best when performing intensive read/write operations on small- or medium-sized data sets, or when executing larger-batch processes with only a limited number of simultaneous transactions. As the data volumes or the number of parallel transactions increase, capacity can be increased by vertical scaling (also called "scaling up"), i.e., by extending storage capacity and/or the CPU power of the database server. Obviously, however, there are hardware-induced limitations to vertical scaling.

Therefore, further capacity increases need to be realized by horizontal scaling (also known as "scaling out"), in which multiple DBMS servers are arranged in a cluster. The respective nodes in the cluster can balance workloads among one another — and scaling is achieved by adding nodes to the cluster, rather than extending the capacity of individual nodes. A clustered architecture is an essential prerequisite to cope with the enormous demands of recent data-driven evolutions at enterprise companies, such as big data, data science and machine learning, cloud computing, and all kinds of responsive web applications. It provides the necessary performance, which cannot be realized by a single server, while also guaranteeing availability by replicating data over multiple nodes and allowing other nodes to take over their neighbor’s workload if that node fails.
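The idea of spreading data over nodes and replicating it for availability can be sketched in a few lines of Python. The Cluster class, its method names, and the node names below are hypothetical, and the placement rule (a hash modulo the node count) is a deliberate simplification; production systems typically use schemes such as consistent hashing so that adding a node does not relocate most keys.

```python
import hashlib


class Cluster:
    """Toy cluster: each key is stored on `replicas` consecutive nodes."""

    def __init__(self, node_names, replicas=2):
        self.order = list(node_names)                   # fixed node order
        self.nodes = {name: {} for name in self.order}  # node name -> local store
        self.down = set()                               # nodes that have failed
        self.replicas = replicas

    def owners(self, key):
        # A stable hash, so every client maps the same key to the same nodes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.order)
        return [self.order[(start + i) % len(self.order)]
                for i in range(self.replicas)]

    def put(self, key, value):
        # Write to every owner that is currently reachable.
        for name in self.owners(key):
            if name not in self.down:
                self.nodes[name][key] = value

    def get(self, key):
        # Read from the first reachable owner that holds the key.
        for name in self.owners(key):
            if name not in self.down and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)

    def fail(self, name):
        self.down.add(name)


cluster = Cluster(["node-a", "node-b", "node-c"], replicas=2)
cluster.put("user:42", "Alice")
cluster.fail(cluster.owners("user:42")[0])  # the first owner goes down
print(cluster.get("user:42"))               # still served by the second owner
```

Capacity grows by adding names to the node list rather than by buying a bigger server, and a single node failure does not make the data unavailable.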

However, a RDBMS is not good at extensive horizontal scaling. The way these systems manage transactions, and their need to keep data consistent at all times, induces significant coordination overhead as the number of nodes increases. In addition, the systems' rich querying functionality may be overkill in settings where applications merely need high capacity to "put" and "get" data items and where there is no demand for complex data interrelationships or selection criteria. Also, in settings where big data is a focus, the emphasis is often on semi-structured data or on data with a very volatile structure (for instance, sensor data, images, audio data, and so on), where the rigid database schema of a RDBMS is a source of inflexibility.
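To make the schema point concrete, here is a small Python sketch that loads the same records into a fixed relational table and into a bare "put"/"get" store. The reading table, the sample records, and the KeyValueStore class are all hypothetical. The column names are spliced into the SQL string only to keep the sketch short; that is not something to do with untrusted input.

```python
import json
import sqlite3

readings = [
    {"sensor": "t-1", "temp_c": 21.5},
    {"sensor": "t-2", "temp_c": 19.0, "humidity": 0.41},         # extra field
    {"sensor": "cam-7", "frame": "0001.jpg", "width": 640},      # different shape
]


def load_relational(rows):
    """Insert into a fixed two-column table; return the sensors the schema rejects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reading (sensor TEXT NOT NULL, temp_c REAL NOT NULL)")
    rejected = []
    for row in rows:
        columns = ", ".join(row)                    # illustrative only, see lead-in
        placeholders = ", ".join("?" for _ in row)
        try:
            conn.execute(
                f"INSERT INTO reading ({columns}) VALUES ({placeholders})",
                tuple(row.values()))
        except sqlite3.Error:                       # e.g. a column the table lacks
            rejected.append(row["sensor"])
    return rejected


class KeyValueStore:
    """Schema-free store: values are opaque blobs addressed by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = json.dumps(value)

    def get(self, key):
        return json.loads(self._data[key])


print(load_relational(readings))  # ['t-2', 'cam-7']

store = KeyValueStore()
for r in readings:
    store.put(r["sensor"], r)     # every shape is accepted
print(store.get("cam-7"))
```

The relational table would need a schema change for each new record shape, while the key-value store accepts all of them, at the price of offering no way to query by anything other than the key.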

None of this means that relational databases will become obsolete anytime soon. However, the "one size fits all" era — in which these systems were used in nearly every data and processing context — seems to have come to an end. A RDBMS is still the way to go when storing medium-sized or smaller volumes of highly structured data and when your organization strongly emphasizes consistency and extensive querying facilities. Where massive volumes, flexible data structures, scalability, and availability are more important, other systems may be called for. This need has resulted in the emergence of NoSQL databases.

The Emergence of the NoSQL Movement 

The term “NoSQL” has become a loaded one in the past decade; it's now associated with many meanings and systems. The modern NoSQL movement describes databases that store and manipulate data in formats other than tabular relations, i.e., non-relational databases. It would have been more appropriate to call the movement NoREL, especially since some of these non-relational databases actually provide query language facilities close to SQL. Because of this, NoSQL now largely stands for "not only SQL" or "not relational" instead of "not SQL." 

What makes NoSQL databases different from other legacy, non-relational systems that have existed since the 1970s? The table below provides a detailed comparison of typical NoSQL databases and relational systems. Note that different categories of NoSQL databases exist and even members of a single category can be very diverse. No single NoSQL system will exhibit every property listed here.

 

| Property | Relational Databases | NoSQL Databases |
| --- | --- | --- |
| Data paradigm | Relational tables | Key value (tuple)-based; document-based; column-based; graph-based; XML, object-based; other: time series, probabilistic, etc. |
| Distribution | Single-node and distributed | Mainly distributed |
| Scalability | Vertical scaling; harder to scale horizontally | Easy to scale horizontally; easy data replication |
| Openness | Closed and open source | Mainly open source |
| Schema role | Schema-driven | Mainly schema-free or flexible schema |
| Query language | SQL as query language | No or simple querying facilities, or special-purpose languages |
| Transaction mechanism | ACID: atomicity, consistency, isolation, durability | BASE: basically available, soft state, eventually consistent |
| Feature set | Many features (triggers, views, stored procedures, etc.) | Simple API |
| Data volume | Capable of handling normal-sized data sets | Capable of handling huge amounts of data and/or very high frequencies of read/write requests |
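The BASE transaction mechanism listed in the comparison can be illustrated with a toy replicated store in Python. The class names, the single logical clock, and the last-writer-wins rule are assumptions chosen for brevity, not a description of any particular product. A write is acknowledged once one replica has it, other replicas may briefly return stale data, and all replicas agree after the queued updates have been delivered.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Last-writer-wins: keep the update with the highest version number.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = deque()  # updates not yet delivered to their replica
        self.clock = 0          # simplistic global version counter

    def write(self, key, value):
        # Acknowledge as soon as one replica has the write (basically available)...
        self.clock += 1
        self.replicas[0].apply(key, self.clock, value)
        # ...and queue delivery to the other replicas (soft state).
        for idx in range(1, len(self.replicas)):
            self.pending.append((idx, key, self.clock, value))

    def read(self, key, replica):
        return self.replicas[replica].read(key)

    def sync(self):
        # Deliver every queued update; afterwards all replicas agree
        # (eventually consistent).
        while self.pending:
            idx, key, version, value = self.pending.popleft()
            self.replicas[idx].apply(key, version, value)


store = EventuallyConsistentStore()
store.write("x", 1)
print(store.read("x", replica=0))  # 1
print(store.read("x", replica=2))  # None: this replica has not caught up yet
store.sync()
print(store.read("x", replica=2))  # 1
```

An ACID system would not let the second read observe the missing value; a BASE system accepts that window in exchange for availability and lower coordination cost.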

We should note, however, that the surge in popularity of NoSQL data storage layers needs to be put in perspective, given the limitations of these layers. Most NoSQL implementations have yet to prove their true worth in the field (most are very young and still in development). Many implementations sacrifice ACID guarantees in favor of being only eventually consistent, and the lack of relational support makes expressing some queries or aggregations particularly difficult; map-reduce interfaces are offered as a possible alternative, but they are harder to learn and use.
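To give a feel for the difference in effort, the sketch below computes the same per-customer total twice in Python: once with a single SQL GROUP BY (via the built-in sqlite3 module) and once with hand-written map, shuffle, and reduce steps. The orders data and all function names are hypothetical, and the map-reduce version runs in one process, whereas a real framework would distribute each phase across nodes.

```python
import sqlite3
from collections import defaultdict

orders = [("alice", 30), ("bob", 12), ("alice", 5), ("carol", 7), ("bob", 8)]


def totals_sql(rows):
    """Declarative version: one GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return dict(conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"))


def map_phase(row):
    """Emit (key, value) pairs for one input record."""
    customer, amount = row
    yield customer, amount


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def totals_mapreduce(rows):
    pairs = (pair for row in rows for pair in map_phase(row))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(totals_sql(orders) == totals_mapreduce(orders))  # True
```

Both routes give the same totals, but the map-reduce route requires the developer to spell out the grouping logic that SQL expresses in one clause.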

Because a RDBMS, by contrast, provides strong support for transactionality, durability, and manageability, quite a few early adopters of NoSQL learned some sour lessons. For instance, Digg struggled with the NoSQL database Cassandra after switching from MySQL, and Twitter faced similar issues that kept it on a MySQL cluster for longer than expected. Another example is the fiasco at HealthCare.gov, in which the IT team relied on a NoSQL database that was badly suited to the job of running a massive government healthcare website.

It would be an oversimplification to reduce the choice between a RDBMS and a NoSQL database to a choice between consistency and integrity on the one hand and scalability and flexibility on the other. The NoSQL systems market is far too diverse for that. Still, this tradeoff will often come into play when you decide to take the NoSQL route. Many NoSQL vendors are now focusing on robustness and durability, while RDBMS vendors are implementing features that let you build schema-free, scalable data stores inside a traditional RDBMS. These relational systems are often capable of storing nested, semi-structured documents, which remains the major selling point of most NoSQL databases, especially those in the document storage category. Some vendors have already adopted “NewSQL” as a term to describe modern relational database management systems that aim to blend the scalable performance and flexibility of NoSQL systems with the robustness guarantees of a traditional DBMS.

Expect the industry to continue moving towards adoption of “blended systems,” except in use cases that require specialized, niche database management systems. In such settings, the NoSQL movement has rightly taught users that the one-size-fits-all mentality of relational systems is no longer applicable and should be replaced by an approach that focuses on finding the right tool for the job. 

About the Authors

Wilfried Lemahieu is a professor at KU Leuven, Faculty of Economics and Business, where he also holds the position of Dean. His teaching, for which he was awarded a best teacher recognition, includes database management, enterprise information management, and management informatics. His research focuses on big data storage and integration, data quality, business process management and service-oriented architectures. In this context, he collaborates extensively with a variety of industry partners, both local and international. His research is published in renowned international journals and he is a frequent lecturer for both academic and industry audiences. See feb.kuleuven.be/wilfried.lemahieu for further details.

Bart Baesens is a professor of Big Data and Analytics at KU Leuven (Belgium) and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on big data, analytics, and credit risk modeling. He has written more than 200 scientific papers, some of which have been published in well-known international journals and presented at top conferences. He has received various best paper and best speaker awards. Bart is the author of eight books that have sold 20,000 copies worldwide, some of which have been translated into Chinese, Russian and Korean. His research is summarized at www.dataminingapps.com.

Seppe vanden Broucke works as an assistant professor at the Faculty of Economics and Business, KU Leuven, Belgium. His research interests include business data mining and analytics, machine learning, process management and process mining. His work has been published in well-known international journals and presented at top conferences. He is also the author of the book Beginning Java Programming (Wiley, 2015), of which more than 4,000 copies were sold and which was also translated into Russian. Seppe's teaching includes Advanced Analytics, Big Data and Information Management courses. He also frequently teaches for industry and business audiences. See seppe.net for further details.

Interested in reading more by these authors? You can pre-order their new book, "Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data," or sign up to be notified of its release on www.pdbmbook.com.

 

 

Wilfried Lemahieu, Bart Baesens, and Seppe vanden Broucke