Adv Radiat Oncol, v.5(6); Nov-Dec 2020

Artificial Intelligence Research: The Utility and Design of a Relational Database System

Although many researchers talk about a “patient database,” they typically are not referring to a database at all but rather to a spreadsheet of curated facts about a cohort of patients. This article describes relational database systems and how they differ from spreadsheets. At their core, spreadsheets are only capable of describing one-to-one (1:1) relationships. However, this article demonstrates that clinical medical data encapsulate numerous one-to-many relationships. Consequently, spreadsheets are very inefficient relative to relational database systems, which gracefully manage such data. Databases provide other advantages: the data fields are “typed” (that is, they contain specific kinds of data), which prevents users from entering spurious data during data import. Because each record contains a “key,” it becomes impossible to add duplicate information (ie, to add the same patient twice). Databases store data in very efficient ways, minimizing space and memory requirements on the host system. Likewise, databases can be queried or manipulated using a powerful, standardized language called SQL. Consequently, it becomes trivial to cull large amounts of data from a vast number of data fields on very precise subsets of patients. Databases can be quite large (terabytes or more in size) yet remain highly efficient to query. Consequently, with the explosion of data available in electronic health records and other data sources, databases become increasingly important to contain and organize these data. Ultimately, this will enable the clinical researcher to perform artificial intelligence analyses across vast amounts of clinical data in a way heretofore impossible. This article provides initial guidance in terms of creating a relational database system.

Introduction

This issue of Advances in Radiation Oncology presents a series of articles on applications of artificial intelligence (AI) in our field. One of the potential benefits of AI is that it can pore through large amounts of data to discover patterns not evident to clinicians. However, this vast volume of data cannot be accommodated within a single spreadsheet (which is how most clinical researchers work when conducting standard multivariable analyses). In fact, many researchers erroneously describe spreadsheets as databases. This article will demonstrate both what a relational database system is and how it is superior to a spreadsheet. It will also discuss considerations when implementing a relational database system (RDBS) for your own research purposes, using an actual lung cancer radiation therapy database as an example. I have also provided some excellent Wikipedia references that contain abundant additional information beyond what can be encapsulated in a single article. (These, in turn, reference the computer science literature, although such references may extend beyond all but the most technically inclined readers.)

One might question why a database system is necessary for AI research. Performing an AI analysis requires efficient storage of hundreds or thousands of data points on a single patient, or even on a single course of radiation therapy; this article will demonstrate that a database provides a multidimensional structure that cleanly and accurately contains such data. There is a famous illustration of the “data science hierarchy of needs” (Fig 1): AI sits at the apex of the pyramid, resting on the layers of data collection and storage beneath it. Creating an RDBS to serve as the storage mechanism for structured data constitutes a major part of the bottom row of that pyramid. Creating a database, then, will set the reader down the path toward conducting AI research at their own institution.

Figure 1. The data science hierarchy of needs. Used with permission of Monica Rogati (aipyramid.com). For details, see text. Abbreviation: AI = artificial intelligence.

Origin of Relational Databases

The concept of an RDBS was first described in a seminal article in 1970. 1 The theoretic construct was that all data could be defined or represented as a series of relations with or to other data. The article was quantitative in that it used relational algebra and tuple relational calculus to prove its points. 2 IBM used this theoretic framework to design what became the initial SQL (pronounced “see-quell” or “ess-cue-ell”) language, which it used to manipulate and retrieve data from IBM’s original RDBS. 3 Since that time, the American National Standards Institute and the International Organization for Standardization have deemed SQL the standard language for relational database communication. 2 Today, a wide variety of commercial and open-source relational database systems are available for use. These vary in their features and relative strengths and weaknesses, but, fundamentally, they all operate using the principles defined in the Codd article. 1 The SQL language is well defined and is used to write code to query (or update) the data within an RDBS.
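For a flavor of the language, the sketch below shows a minimal query; the table and column names here are hypothetical, not drawn from any system described in this article.

    SELECT Lname, Fname
    FROM Patients
    WHERE DeathDate IS NULL   -- retrieve only patients who are still alive
    ORDER BY Lname;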

Fundamental Disadvantage of Spreadsheets

Spreadsheets are designed to incorporate and analyze one-to-one (1:1) relationships (Fig 2a). Each patient has a single birth date and a single death date. However, medical records are rife with “one-to-many” relationships (Fig 2b). A patient might receive multiple different courses of radiation therapy treatment, as in the example provided, or might have multiple chemotherapy administrations. To accommodate these data, a spreadsheet quickly balloons in size (Fig 2c). Not only is this inefficient (duplication of data), but it also makes maintenance of the spreadsheet extremely cumbersome and prone to error. For instance, in this example, when patient “12345” passes away, the “DeathDate” needs to be updated in 5 rows of the spreadsheet (because she had 2 courses of radiation therapy and 4 cycles of chemotherapy). It is not difficult to imagine that a researcher could neglect to update the “DeathDate” in each place, introducing errors. To further expound upon the issue, imagine a patient who takes numerous medications or has variable numbers of comorbid illnesses; the rows required to encapsulate 1 patient explode. To use a data science term, the dimensionality of the data balloons. But, to reiterate the point, spreadsheets are designed to encapsulate only 1:1 relationships (2-dimensional data), whereas patient data are multidimensional.

Figure 2. (a) Spreadsheets are useful where there is a one-to-one correspondence of data. For instance, each unique medical record number (MRN) represents a single patient, with a single birth/death date and a single first and last name. (b) Spreadsheets “break down” when describing one-to-many correspondences. In this example, 2 patients have a total of 5 courses of radiation therapy treatment between them. (c) To accommodate all the data in our simple example, a spreadsheet needs to store redundant data (colored in red). The data storage requirements quickly balloon. Furthermore, as additional traits and factors are added to the spreadsheet, it becomes impossible to follow, as one patient will require untold numbers of rows to capture all relevant data concepts. Stated another way, the data are multidimensional. Maintenance and updating of fields become error-prone (see text). Abbreviations: DOB = date of birth; Fname = first name; Lname = last name; LUL = left upper lobe; MRN = medical record number; RLL = right lower lobe; RUL = right upper lobe; SBRT = stereotactic body radiotherapy.

Fundamental Advantage of Relational Databases

RDBS gracefully manage one-to-many relationships. They can do so because a database comprises numerous different tables, which are explicitly linked (Fig 3). Every table must also contain a key, which is a unique, required identifier for each row of data. Relationships between the tables are defined when creating the database tables or fields. In the “Demographics” table, the medical record number, “MRN,” is the key. For the “TreatmentCourse” and “Chemotherapy” tables, the keys are “TreatmentCourse” and “ChemotherapyID,” respectively. Note that “TreatmentCourse 1” in the “TreatmentCourse” table pertains to breast radiation therapy treatment. This, in turn, is linked to 4 cycles of chemotherapy in the “Chemotherapy” table, each of which is, in turn, uniquely identified in that table.

Figure 3. In a relational database, data are stored in multiple tables, which are joined via defined variables. In this fictitious example, note that the patient only has one “DeathDate” to update. Furthermore, each course of treatment (“TreatmentCourse”) can have multiple chemotherapy cycles associated with it. Note, too, that the medical record number (MRN) only appears in 2 of the 3 tables (it is not needed in the “Chemotherapy” table). If the researcher wishes to retrieve the MRN, it can be obtained via a SQL query, linking back to one of the tables that contains it. Abbreviation: DOB = date of birth.
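To make the structure of Fig 3 concrete, the sketch below shows how these three tables might be declared in SQL. This is illustrative only: MariaDB/MySQL syntax is assumed, the column names and types are invented to follow Fig 3, and the ON DELETE CASCADE clauses anticipate a behavior discussed in the next section.

    CREATE TABLE Demographics (
        MRN        VARCHAR(10) NOT NULL,  -- medical record number; unique per patient
        Lname      VARCHAR(50),
        Fname      VARCHAR(50),
        DOB        DATE,
        DeathDate  DATE,                  -- stored exactly once per patient
        PRIMARY KEY (MRN)
    );

    CREATE TABLE TreatmentCourse (
        TreatmentCourseID INT NOT NULL,
        MRN               VARCHAR(10) NOT NULL,
        Site              VARCHAR(50),    -- eg, 'Breast' or 'RUL'
        StartDate         DATE,
        EndDate           DATE,
        PRIMARY KEY (TreatmentCourseID),
        FOREIGN KEY (MRN) REFERENCES Demographics (MRN)
            ON DELETE CASCADE             -- deleting a patient removes her courses
    );

    CREATE TABLE Chemotherapy (
        ChemotherapyID    INT NOT NULL,
        TreatmentCourseID INT NOT NULL,   -- links each cycle to one treatment course
        Agent             VARCHAR(50),
        AdminDate         DATE,           -- each cycle keeps its own discrete date
        PRIMARY KEY (ChemotherapyID),
        FOREIGN KEY (TreatmentCourseID) REFERENCES TreatmentCourse (TreatmentCourseID)
            ON DELETE CASCADE
    );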

When comparing Fig 3 (a database) to Fig 2c (a spreadsheet), note that Fig 3 contains the same information as Fig 2c without the addition of repetitious information (colored red in Fig 2c). Alternately, in a spreadsheet, the researcher could manually aggregate and summarize the chemotherapy delivered into a single cell in a single row of the spreadsheet (ie, “flatten” the data, to use a data science term), but then some data would be lost. Using the chemotherapy administrations as an example, if one were to “flatten” the data down to a single spreadsheet cell stating “4 cycles of Adriamycin/Cytoxan,” one loses the dates of administration. If one summarizes the data as “4 cycles of Adriamycin/Cytoxan: <date1>, <date2>, <date3>, <date4>,” the dates and the chemotherapy occupy the same cell and the data are retained but are no longer discrete; one loses the ability to filter the spreadsheet by chemotherapy kind or by date.

Conversely, a SQL database cleanly encapsulates these multidimensional data (Fig 3). Each table is 2-dimensional in structure. But because it can contain multiple rows of data on 1 patient (chemotherapy administrations, to follow the same example), a multidimensional structure is created, as 4 chemotherapy cycles link to one of 2 courses of radiation therapy (“TreatmentCourse” table) for 1 patient (“Demographics” table). Now, imagine a clinical database with millions of rows of data spread across hundreds of tables, as in the real-life example described below. Clearly, a spreadsheet would not be adequate.
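To illustrate, a single SQL query can reassemble those dimensions on demand while keeping every agent and date discrete; this sketch reuses the hypothetical tables and columns declared above.

    SELECT d.MRN, tc.TreatmentCourseID, c.Agent, c.AdminDate
    FROM Demographics d
    JOIN TreatmentCourse tc ON tc.MRN = d.MRN
    JOIN Chemotherapy c ON c.TreatmentCourseID = tc.TreatmentCourseID
    WHERE d.MRN = '12345'
    ORDER BY c.AdminDate;   -- one row per chemotherapy cycle; nothing is "flattened"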

Additional Advantages of Relational Databases

  • 1. Each row of data in a table has a unique identifier (a key). Consequently, one cannot accidentally add a row of data into a database table twice.
  • 2. Each field in a database table is “typed”: it is defined to hold a specific kind of data, such as a date, an integer, or text of a given length (Fig 4). If an external data source supplies the wrong kind of data for a field (eg, text where a date is expected), the import fails, alerting the user to potentially erroneous source data.

Figure 4. Note that each field in this database table is specifically designed. It has a “type” (kind) and a “size” (length). When importing data from numerous external sources, these definitions can prevent erroneous imports (for details, see text). Note that the field “MRN” is the key for this table. All the data in this table refer back to “MRN” via a one-to-one relationship. MRN can be used as the key because no 2 patients have the same MRN. Abbreviations: DOB = date of birth; MRN = medical record number.

  • 3. Not only must the data types correspond, but the data lengths must be observed. If the database design states that a field is a decimal with 3 places to the right of the decimal point, then a fourth decimal place would be truncated at import. Alternately, the database could declare an error, which might also imply that the field contains erroneous data.
  • 4. A key from one table can be linked “backward” to a key from another table (termed a “foreign key”). As an example ( Fig 3 ), the database can be designed such that the MRN from “TreatmentCourse” must refer to an MRN already contained within the “Demographics” table. If one tried to import data into “TreatmentCourse” and it used an MRN not listed in “Demographics,” the import would fail. Such a situation might imply, for instance, that the MRN was incorrect in the external data source (or in the database). Or perhaps it relates to a patient who received prostate radiation therapy (but you have a breast cancer database).
  • 5. Foreign key relationships also work in the opposite direction: If one realizes that a patient is represented in the database who should not be, one can delete the patient from the “Demographics” table, and the database will automatically delete all data about that patient from all the other tables (see the sketch following this list).
  • 6. RDBS are specifically optimized to manage vast amounts of data. Large spreadsheets (containing thousands of rows and hundreds of columns) are extremely slow and memory intensive. However, one can query across or manipulate many gigabytes of data in fractions of a second in many RDBS, as the data stores are highly optimized and efficient from both a computational and memory utilization perspective.
  • 7. RDBS are much more secure than spreadsheets. An institution’s IT team might allow one to access some tables within the institutional data warehouse, but not others. One’s access could be restricted to defined subsets of patients. One might have “read” access to these data but not “write” access (or “write” access to only some subset of fields). Databases might likewise be set up such that only users from specific IP addresses or computers may access them. The login systems set up by IT departments for these purposes typically use state-of-the-art encryption algorithms, 2-factor authentication, and the like. In contrast, an Excel spreadsheet can be “locked” such that only some fields are editable, but it is not possible to restrict data access by user. Furthermore, this restriction applies to “write” access only, not to “read” access. It is true that one can “hide” columns in a spreadsheet and then lock it, to prevent a given user from viewing them, but the spreadsheet maintainer must do this manually before distributing the spreadsheet (time-consuming and prone to error).
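The sketch below illustrates how several of these protections behave in practice, using the hypothetical tables declared earlier (MariaDB/MySQL syntax, with strict mode assumed for the type check).

    -- Advantage 1: duplicate keys are rejected. If MRN '12345' already
    -- exists, this INSERT fails instead of silently adding a second copy.
    INSERT INTO Demographics (MRN, Lname, Fname, DOB)
    VALUES ('12345', 'Smith', 'Jane', '1950-01-01');

    -- Advantages 2 and 3: types and lengths are enforced. Importing text
    -- into a DATE field raises an error, flagging bad source data.
    INSERT INTO Demographics (MRN, Lname, Fname, DOB)
    VALUES ('99999', 'Jones', 'Mary', 'not-a-date');

    -- Advantage 4: foreign keys are enforced. A treatment course whose MRN
    -- is not already present in Demographics cannot be imported.
    INSERT INTO TreatmentCourse (TreatmentCourseID, MRN, Site)
    VALUES (7, '00000', 'RUL');

    -- Advantage 5: deletes cascade. Removing a patient automatically removes
    -- her treatment courses and chemotherapy cycles (per ON DELETE CASCADE).
    DELETE FROM Demographics WHERE MRN = '12345';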

Benefits of SQL

As described earlier, SQL is a defined, standardized language for composing queries within an RDBS and for manipulating and updating these data. Some database systems provide “extensions” to the SQL standard to provide additional, specific functionality (details are available in the vendors’ literature). It is beyond the scope of this article to teach SQL coding; however, many excellent online resources are available for the interested reader. Functionally, SQL allows one to search for any number of variables or combinations of variables across any number of tables, simultaneously. This can be extremely powerful and useful, both for retrieving and for manipulating and updating data. Queries can be saved for reuse or modification later. As stated above, these queries typically produce output in fractions of a second, even across vast pools of data.

Our institution has a database of patients who have received radiation therapy to the lung, whether for primary lung cancer or for metastatic disease to the lung. 4 The database and some details of its implementation are described below, but first, some “real-life” examples of what such an RDBS can do (not possible when using a spreadsheet):

  • Real-life example 1: Find patients who might be candidates for a certain lung cancer clinical trial. For this particular study, they must have previously received lung SBRT, have nonmetastatic disease, no evidence of recurrence, be alive (obviously), be at least 2 years out from the end of the prior SBRT treatment, and must have been seen in follow-up within the past 1.5 years. By constructing an appropriate SQL query (a sketch follows this list), 135 patients were found (out of more than 4600 in the database) to pass along to the PI for closer inspection.
  • Real-life example 2: It takes only a few minutes to set up very complex queries. If one has a basic facility with SQL, one can design a query such as: “Find all patients with stage II or III lung cancer treated with concurrent chemoradiotherapy who developed neutropenia during treatment, who are female, 70 years of age or older, and who take any antihypertensive medication (defined in a certain list).” Ultimately, such queries are only limited by one’s imagination (and the richness or completeness of the data coming from the source systems).
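A sketch of what the query behind real-life example 1 might look like follows. Every table and column name here (eg, Technique, Metastatic, LastFollowUp) is hypothetical, and the eligibility logic is deliberately simplified relative to a real trial screen.

    SELECT d.MRN, d.Lname, d.Fname
    FROM Demographics d
    JOIN TreatmentCourse tc ON tc.MRN = d.MRN
    WHERE tc.Technique = 'SBRT'
      AND d.DeathDate IS NULL                                       -- alive
      AND d.Metastatic = 'N'                                        -- nonmetastatic
      AND d.Recurrence = 'N'                                        -- no recurrence
      AND tc.EndDate <= DATE_SUB(CURDATE(), INTERVAL 2 YEAR)        -- 2+ years out
      AND d.LastFollowUp >= DATE_SUB(CURDATE(), INTERVAL 18 MONTH); -- recent follow-up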

It is true that one can “filter” data in Excel to rapidly find subsets. However, this filtering is limited to “true or false” matching. In this example, it would be impossible to discover the patients who developed neutropenia while undergoing radiation therapy unless one had a “neutropenia (yes or no)” column. One cannot perform the arithmetic “where date of neutropenia > RadiationStartDate and < RadiationEndDate” to filter the data without writing code in Visual Basic, which is likely beyond the ability of most researchers.
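In SQL, by contrast, that date arithmetic is an ordinary WHERE clause. The sketch below assumes a hypothetical LabResult table holding dated findings for each patient.

    SELECT DISTINCT d.MRN
    FROM Demographics d
    JOIN TreatmentCourse tc ON tc.MRN = d.MRN
    JOIN LabResult lr ON lr.MRN = d.MRN
    WHERE lr.Finding = 'neutropenia'
      AND lr.ResultDate > tc.StartDate   -- after radiation therapy began
      AND lr.ResultDate < tc.EndDate;    -- and before it ended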

Disadvantage of SQL

With SQL, it is possible to create highly complex queries; it is a rich and powerful language. However, these queries can be quite complicated and opaque to a nontechnical person. Some systems do provide graphical tools to help build SQL queries but, even so, for some users all but the simplest SQL queries will remain beyond their technical skills.

Database Implementation

Databases may, and often do, contain many thousands of tables and millions of rows of data. (In other words, they can contain data far in excess of the requirements of any one radiation oncologist or even any one radiation oncology department). In fact, some systems allow even single tables to contain terabytes or even petabytes of data. 5 Consequently, there are numerous systems available to accommodate any researcher’s needs. Some of the very best are open source (free). Software is available across a wide variety of operating systems. Wikipedia provides an outstanding overview of the topic. 5

To implement a database system, it is first necessary to have a discussion with the IT Department at your institution. There is no single solution for creating a data repository that holds true for researchers across all institutions. The solution can vary, depending on the resources at your institution and the level of access the IT Department has into the underlying patient data source systems (often defined in the institutional contracts signed with the individual vendors). Some large centers have elaborate data warehouse systems. Smaller centers might provide access to data from individual source systems but might not have compiled them into a single data warehouse. Some IT departments might have adequate resources to provide output data from their data warehouse to individual researchers, when needed. Others might not. Some might provide a dedicated research server on which the researcher can construct a database. Other researchers might need to rely instead upon existing servers within their department. I do not recommend that one set up a database system on a free-standing laptop or desktop machine, as there are Health Insurance Portability and Accountability Act concerns (the computer could be stolen). Data should be backed up across a secured network electronically.

Creation of a Lung Cancer Radiation Therapy Database

I began my own database several years ago. My need grew out of a sense of frustration regarding lack of access to clinical data. At the time, at my institution, it was a difficult (and somewhat mysterious) process to procure data from the data warehouse. However, data from Mosaiq (Elekta AB, Stockholm, Sweden), which is our department’s record-and-verify system, were available. These data formed the nucleus of the original database. Basic demographic information and radiation therapy prescriptions, dates of treatment, dose delivered, tumor stage, and the like were exported using custom software. Research IT provided a Linux server, on which I implemented the database. I chose to use MariaDB (MariaDB Foundation, DE), as it is a powerful, well-regarded, commercially supported, free, open-source database system whose SQL functions are congruent with those of Oracle (a database system I had used previously). Because my institution could not support an implementation of Oracle, MariaDB was an excellent alternative. MariaDB does include a Windows GUI database administration tool for administering its databases (creating tables, writing SQL code, importing or exporting data, and the like). I had previously used a commercial database administration product called Navicat (PremiumSoft CyberTech Ltd, Hong Kong), which provides similar functionality, so I elected to purchase that. In addition, I imported all data I had captured in spreadsheets for previous research projects. More recently, I have gained access to data from our data warehouse and so have created numerous additional tables to store the information. At present, the database contains more than 3 million rows of data on approximately 4800 patients, spread across more than 170 tables.

Incorporation of Data from Other Institutional Data Systems

To import data from outside source systems requires a multistep process, referred to as “ETL” (“Extraction, Transformation, and Loading”) in the data science literature. 6 The issues go far beyond the physical importation of data into the database; importing spreadsheets of data is a trivial task. There are numerous issues in ETL that are critical to consider when designing a database and importing data into it. Furthermore, many of these issues are not inherently obvious. In fact, a large proportion of the time required to create a database and fill it with clinically useable data derives from the ETL involved. The oft-reproduced “data science hierarchy of needs” illustrates this fact (Fig 1). Most of the discussion in this article addressed aspects of the bottom-most layer of the pyramid. ETL comprises the majority of the next 2 layers of the pyramid and is the topic of another article.
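For the physical loading step itself, MariaDB (like most RDBS) offers bulk-import statements. The sketch below is illustrative; the file name and column list are hypothetical, and everything difficult about ETL (reconciling identifiers, cleaning values, mapping source fields onto the database's types) happens before this statement is ever run.

    LOAD DATA LOCAL INFILE 'demographics_export.csv'
    INTO TABLE Demographics
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    IGNORE 1 LINES                        -- skip the header row
    (MRN, Lname, Fname, DOB, DeathDate);  -- fields must satisfy the table's types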

Research data are not available at this time.

Sources of support: none.

Disclosures: Dr Dilling reports personal fees and nonfinancial support from NCCN, personal fees from Varian, personal fees and nonfinancial support from Harborside Press, nonfinancial support from Astra Zeneca, all outside the submitted work.


Open access. Published: 14 August 2015.

Choosing the right NoSQL database for the job: a quality attribute evaluation

João Ricardo Lourenço, Bruno Cabral, Paulo Carreiro, Marco Vieira & Jorge Bernardino

Journal of Big Data, volume 2, Article number 18 (2015)


For over forty years, relational databases have been the leading model for data storage, retrieval and management. However, due to increasing needs for scalability and performance, alternative systems have emerged, namely NoSQL technology. The rising interest in NoSQL technology, as well as the growth in the number of use case scenarios over the last few years, has resulted in an increasing number of evaluations and comparisons among competing NoSQL technologies. While most research work focuses on performance evaluation using standard benchmarks, it is important to notice that the architecture of real-world systems is not only driven by performance requirements but has to comprehensively include many other quality attribute requirements. Software quality attributes form the basis from which software engineers and architects develop software and make design decisions. Yet, there has been no quality-attribute-focused survey or classification of NoSQL databases in which databases are compared with regard to their suitability for quality attributes common in the design of enterprise systems. To fill this gap, and to aid software engineers and architects, in this article we survey and create a concise and up-to-date comparison of NoSQL engines, identifying their most beneficial use case scenarios from the software engineer’s point of view and the quality attributes that each of them is most suited to.

Introduction

Relational databases have been the stronghold of modern computing applications for decades. ACID properties (Atomicity, Consistency, Isolation, Durability) made relational databases the solution for almost all data management systems. However, the need to handle data in web-scale systems [ 1 – 3 ], in particular Big Data systems [ 4 ], has led to the creation of numerous NoSQL databases.

The term NoSQL was first coined in 1998 to name a relational database that did not have a SQL (Structured Query Language) interface [ 5 ]. It was then brought back in 2009 to name an event highlighting new non-relational databases, such as BigTable [ 3 ] and Dynamo [ 6 ], and has since been used without an “official” definition. Generally speaking, a NoSQL database is one that uses a different approach to data storage and access when compared with relational database management systems [ 7 , 8 ]. NoSQL databases lose the support for ACID transactions as a trade-off for increased availability and scalability [ 1 , 7 ]. Brewer created the term BASE for these systems: they are Basically Available, have a Soft state (during which they are not yet consistent), and are Eventually consistent, as opposed to ACID systems [ 9 ]. This BASE model forfeits the essential ACID properties of consistency and isolation in order to favor “availability, graceful degradation, and performance” [ 9 ]. While originally the term stood for “No SQL,” it has more recently been restated as “Not Only SQL” [ 1 , 7 , 10 ] to highlight that these systems rarely fully drop the relational model. Thus, in spite of being a recurrent theme in the literature, NoSQL is a very broad term, encompassing very distinct database systems.

There are hundreds of readily available NoSQL databases, and each has different use case scenarios [ 11 ]. They are usually divided into four categories [ 2 , 7 , 12 ], according to their data model and storage: Key-Value Stores, Document Stores, Column Stores and Graph databases. This classification is due to the fact that each kind of database offers different solutions for specific contexts. The “one size fits all” approach of relational databases no longer applies.

There has been extensive research in the comparison of relational and non-relational databases in terms of their performance for different applications. However, when developing enterprise systems, performance is only one of many quality attributes to be considered. Unfortunately, there has not yet been a comprehensive assessment of NoSQL technology in what concerns software quality attributes. The goal of this article is to fill this gap by clearly identifying which NoSQL databases best promote each quality attribute, thus becoming a reference for software engineers and architects.

This article is a revised and extended version of our WorldCIST 2015 paper [ 13 ]. It improves and complements the former in the following aspects:

  • Three more quality attributes (Consistency, Robustness and Maintainability) were evaluated.
  • A new section describing the evaluated NoSQL databases was introduced.
  • The state of the art was extended to provide more up-to-date and thorough information.
  • All of the previously evaluated quality attributes were reevaluated in light of new studies and new developments in the NoSQL ecosystem.
  • New conclusions and insights were derived from the quality-attribute-based analysis of several NoSQL databases.

Hence, the main contributions of this article can be summarized as follows:

  • The development of a quality-attribute-oriented evaluation of NoSQL databases (Table 2 ). Software architects may use this information to assess which NoSQL database best fits their quality attribute requirements.
  • A survey of the literature on the evaluation of NoSQL databases from a historic perspective.
  • The identification of several future research directions towards the full coverage of software quality attributes in the evaluation of NoSQL databases.

The remainder of this article is structured as follows. In Section ‘ Background and literature review ’, we perform a review of the literature and evaluation surrounding NoSQL systems. In Section ‘ Research design and methodology ’, we introduce the methodology used to select the quality attributes and NoSQL databases that we evaluated, as well as the methodology used in that evaluation. In Section ‘ Evaluated NoSQL databases ’, we present and describe the evaluated NoSQL databases. In Section ‘ Software quality attributes ’, we analyze the different quality attributes and identify the best NoSQL solutions for each of these quality attributes according to the literature. In Section ‘ Results and discussion ’, a summary table and analysis of the results of this evaluation is provided. Finally, Section ‘ Conclusions ’ presents future work and draws the conclusions.

Background and literature review

The word NoSQL was re-introduced in 2009 during an event about distributed databases [ 5 ]. The event intended to discuss the new technologies being presented by Google (Google BigTable [ 3 ]) and Amazon (Dynamo [ 14 ]) to handle high amounts of data. Interest in the research of NoSQL technologies has bloomed since then and has led to the publication of works such as those by Stonebraker and Cattell [ 12 , 15 , 16 ]. Stonebraker began his research by describing different types of NoSQL technology and their differences when compared to relational technology. The author argues that the main reasons to move to NoSQL databases are performance and flexibility. Performance is mainly focused on the sharing and management of distributed data (i.e. dealing with “Big Data”), while flexibility relates to the semi-structured or unstructured data that may arise on the web.

By 2011, the NoSQL ecosystem was thriving, with several databases being the center of multiple studies [ 17 – 20 ]. These included Cassandra, Amazon SimpleDB, SciDB, CouchDB, MongoDB, Riak, Redis, and many others. Researchers categorized existing databases, and identified what kinds of NoSQL databases existed according to different architectures and goals. Ultimately, the majority agreed on four categories of databases [ 11 ]: Document Store, Column Store, Key-value Store and Graph-oriented databases.

Hecht and Jablonski [ 11 ] described the main characteristics offered by different NoSQL solutions, such as availability and horizontal scalability. Konstantinou et al. [ 19 ] performed a study based on the elasticity of non-relational solutions and compared HBase, Cassandra and Riak during execution of read and update operations. The authors concluded that HBase provided high elasticity and fast reads, while Cassandra was capable of delivering fast inserts (writes). On the other hand, according to the authors, Riak did not show good scaling or a high performance increase, regardless of the type of access. Many studies focused on evaluating performance [ 4 , 11 , 21 ].

Performance evaluations were made easier by the popularization of the Yahoo! Cloud Serving Benchmark (YCSB), proposed and implemented by Cooper et al. [ 21 ]. This benchmark, still widely used today, allows testing the read/write performance, latency and elasticity capabilities of any database, in particular NoSQL databases. The first studies using YCSB evaluated Cassandra, HBase, PNUTS and MySQL and concluded that each database offers its own set of trade-offs. The authors warn that each database performs at its best in different circumstances and, thus, a careful choice of the one to use must be made according to the nature of each project.

Since 2012, NoSQL databases have been most often evaluated and compared to RDBMSs (Relational Database Management Systems). A performance evaluation carried out by [ 22 ] compared Cassandra, MongoDB and PostgreSQL, concluding that MongoDB is capable of providing high throughput, but mainly when it is used as a single server instance. On the other hand, Cassandra was considered the best choice for supporting a large distributed sensor system, due to its horizontal scalability. Floratou et al. [ 4 ] used YCSB and TPC-H to compare the performance of MongoDB and MS SQL Server, as well as Hive. The authors state that NoSQL technology has room for improvement and should be further updated. Ashram and Anderson [ 7 ] studied the data model of Twitter and found that using non-relational technology creates additional difficulties on the programmers’ side. Parker et al. [ 23 ] also chose MongoDB and compared its performance with MS SQL Server using only one server instance. According to their results, when performing inserts, updates and selects, MongoDB is faster, but MS SQL Server outperforms MongoDB when running complex queries instead of simpler key-value access. In [ 24 ], Kashyap et al. compare the performance, scalability and availability of HBase, MongoDB, Cassandra and CouchDB by using YCSB. Their results show that Cassandra and HBase shared similar behaviour, but the former scaled better, and that MongoDB performed better than HBase by factors in the hundreds for their particular workload. The authors are prudent, and note that NoSQL is constantly evolving and that evaluations can quickly become obsolete. Rabl et al. [ 25 ] studied Cassandra, Voldemort, HBase, Redis, VoltDB and MySQL Cluster with regards to throughput, latency and scalability. Cassandra’s throughput is consistently better than that of the other databases, but it exhibits high latency. Voldemort, HBase and Cassandra all show linear scalability, and Voldemort has the most stable, lowest latency. Of the tested NoSQL databases, VoltDB has the worst results, and HBase also lagged behind the other databases in terms of throughput.

In 2013, with the research focus on performance, Thumbtack Technologies produced two white papers comparing Aerospike, Cassandra, Couchbase and MongoDB [ 26 , 27 ]. In [ 26 ], the authors compare the durability and performance trade-offs of several state-of-the-art NoSQL systems. Their results show, first, that Couchbase and Aerospike have good in-memory performance, and that MongoDB and Cassandra lagged behind in bulk loading capabilities. Regarding durability, Aerospike beats the competition on large balanced and read-heavy datasets. For in-memory datasets, Couchbase performed similarly to Aerospike as well. In their second paper [ 27 ], the authors study failover characteristics. Their results allow for many conclusions, but overall tend to indicate that Aerospike, Cassandra and Couchbase give strong availability guarantees.

In [ 28 ], MongoDB and Cassandra are compared in terms of their features and their capabilities by using YCSB. MongoDB is shown to be impacted by high workloads, whereas Cassandra seemed to experience performance boosts with increasing amounts of data. Additionally, Cassandra was found to be superior for update operations. In [ 29 ], the authors studied the applicability of NoSQL to RDF (Resource Description Framework) data processing, and make several key observations: 1) distributed NoSQL systems can be competitive with RDF stores with regards to query times; 2) NoSQL systems scale more gracefully than RDF stores when loading data in parallel; 3) complex SPARQL (SPARQL Protocol and RDF Query Language) queries, particularly with joins, perform poorly on NoSQL systems; 4) classical query optimization techniques work well on NoSQL RDF systems; 5) MapReduce-like operations introduce higher latency. As their final conclusion, the authors state that NoSQL represents a compelling alternative to native RDF stores for simple workloads. Several other studies were performed in the same year regarding the applicability of NoSQL to diverse scenarios, such as [ 30 – 32 ].

More recently, as of 2014, experiments have become less focused on performance and more focused on applicability. NoSQL has seen validation and widespread usage, and so, in [ 10 ], a survey of some of the most popular NoSQL solutions is described. The authors state some of the advantages and main uses according to the NoSQL database type. In another evaluation, the authors of [ 33 ] performed their tests on real medical scenarios using MongoDB and CouchDB. They concluded that MongoDB and CouchDB have similar performance and drawbacks and note that, while applicable to medical imaging archiving, NoSQL still has to improve. In [ 34 ], the Yahoo! Cloud Serving Benchmark is used with a middleware layer that allows translating SQL queries into NoSQL commands. They tested Cassandra and MongoDB with and without the middleware layer, noting that it was possible to build middleware to ease the move from relational data stores to NoSQL databases. In [ 35 ], a write-intensive enterprise application is used as the basis for comparing Cassandra, MongoDB and Couchbase with MS SQL Server. The results show that Cassandra outperforms the other NoSQL databases in a four-node setup, and that an MS SQL Server instance running on a single node vastly outperforms all NoSQL contenders for this particular setup and scenario.

The latest trends in NoSQL research, although still related to applicability and performance, have also concerned the validity of the benchmarking processes and tools used throughout the years. The authors of [ 36 ] propose an improvement of YCSB, called YCSB++, to deal with several shortcomings of the benchmark. In [ 37 ], the author proposes a method to validate previously proposed benchmarks of NoSQL databases, claiming that rigorous algorithms should be used for benchmarking methodology before any practical use. Chen et al., in [ 38 ], perform a survey of benchmarking tools, such as YCSB and TPC-C, list shortcomings and difficulties in implementing MapReduce and Big Data related benchmark systems, and propose methods for overcoming these difficulties. Similar work had already been done by [ 39 ], where benchmarks are reviewed and suggestions are given on building better benchmarks.

As we have seen, to the best of our knowledge, there are no studies focused on quality attributes and how each NoSQL system fits each of these attributes. Our work attempts to fill in this gap, by reviewing the literature, in Section ‘ Software quality attributes ’, with regards to the different quality attributes, finally presenting our findings in a summary table in Section ‘ Results and discussion ’.

It is important to notice that the analysis of NoSQL systems is inherently bound to the CAP theorem [ 40 ]. The CAP theorem, proposed by Brewer, states that no distributed system can simultaneously guarantee Consistency, Availability and Partition-Tolerance. In the context of the CAP theorem [ 40 , 41 ], consistency is often viewed as the premise that all nodes see the same data at the same time [ 42 ]. Indeed, of Brewer’s CAP theorem, most databases choose to be “AP”, meaning they provide Availability and Partition-Tolerance. Since Partition-Tolerance is a property that often cannot be traded off, Availability and Consistency are juggled, with most databases sacrificing more consistency than availability [ 43 ]. In Fig. 1 , an illustration of CAP is shown.

Fig. 1. CAP theorem with databases that “choose” CA, CP and AP.

Some authors (Brewer being one of them) have come to criticize the way the CAP theorem is interpreted and some have claimed that much has been written in literature under false assumptions [ 41 , 44 – 46 ]. The idea of CA (systems which ensure Consistency and Availability) is now most often looked at as a trade-off on a micro-scale [ 41 ], where individual operations can have their consistency guarantees explicitly defined. This means that some operations can be tied to full consistency (in the ACID semantics sense), or to one of a vast range of possible consistency options. Modern NoSQL systems allow for this consistency tuning and should therefore not be looked at under such a simplistic view which narrows the whole system to “CA”, “CP” or “AP”.

Research design and methodology

This work was developed to answer the following research question: “Is there currently enough knowledge on quality attributes in NoSQL systems to aid a software engineer’s decision process?” In our literature survey, we did not find any similar work attempting to provide a quality-attribute-guided evaluation of NoSQL databases. Thus, we devised a methodology to develop this work and answer our original research question.

We began by identifying several desirable quality attributes to evaluate in NoSQL databases. There are hundreds of quality attributes, yet some are nearly ubiquitous to every software project [ 47 ], and others are intimately tied to the topic of database systems, storage models and web applications (where the database backend often requires certain quality attributes) [ 48 ]. This led us to identify the following quality attributes to evaluate: Availability, Consistency, Durability, Maintainability, Read and Write performance, Recovery Time, Reliability, Robustness, Scalability and Stabilization Time. These attributes are interdependent and have impact on most software projects. Most of these attributes have also been the target (even if indirectly) of some studies [ 18 , 27 , 49 , 50 ], rendering them ideal picks for this work.

Once these quality attributes had been identified, we identified which NoSQL systems were more popular and used, so as to narrow our research to a fixed set of NoSQL databases. This search led us to select Aerospike, Cassandra, Couchbase, CouchDB, HBase, MongoDB and Voldemort as the systems to evaluate. These are often found in the literature [ 6 , 10 , 11 , 26 , 51 , 52 ] and other sources [ 53 ] as the most popular and used systems, as well as the most versatile or appropriate to certain scenarios. For instance, while Couchbase and CouchDB share source code and several similar original design goals, they have evolved into different systems, both with very high success and different characteristics. In much the same way, MongoDB and Cassandra, which are probably the most used NoSQL databases in the market, have fundamentally different approaches to their storage models. Thus, our selection of databases attempted to find not only the most popular and mature databases in general, but also those that find high applicability in specific areas.

We surveyed the literature to evaluate the selected quality attributes on the aforementioned databases. This survey took into account already available evaluations regarding certain quality attributes, such as performance [ 51 , 54 ], consistency [ 43 ] or durability [ 26 ]. Each of the surveyed papers was weighted according to the versions of the database tested (e.g. papers with outdated versions were given less relevance), the generality of its results and its overall relevance to this evaluation. The summary table presented in Section ‘ Results and discussion ’ is the result of this careful evaluation of the NoSQL literature, technical knowledge found in the NoSQL ecosystem, and expert opinions and positions. We also took into account the overall architecture of each NoSQL system (e.g. systems built with durability limitations are intrinsically limited in terms of this quality attribute). The result of this methodology is the aforementioned summary table, which we hope will aid software engineers and architects in their decision process when selecting a given NoSQL database according to a certain quality attribute.

In the following sections, we present the databases that we evaluated from the literature, as well as that evaluation.

Evaluated NoSQL databases

There are several popular NoSQL databases which have gained recognition and are usually considered before other NoSQL alternatives. We studied several of these databases (Aerospike, Cassandra, Couchbase, CouchDB, HBase, MongoDB and Voldemort) by performing a literature review, and we introduce the first quality-attribute-based evaluation of NoSQL databases. In this section, these selected databases are presented, with a summary table at the end (Table 1 ) detailing their characteristics.

Aerospike (formerly known as Citrusleaf [ 10 ] and recently open-sourced) is a NoSQL shared-nothing key-value database which offers mainly AP (Availability and Partition-Tolerance) characteristics. Additionally, the developers claim that it provides high consistency [ 55 ] by trading off availability and consistency at low granularity in specific subsystems, restricting communication latencies, minimizing cluster size, maximizing consistency and availability during failover situations, and resolving conflicts automatically. Consistency is guaranteed by using synchronous writes to replicas, guaranteeing immediate consistency. This immediate consistency can be relaxed if the software architects view that as a necessity. Durability is ensured by guaranteeing the use of flash/SSD on every node and performing direct reads from flash, as well as replication on several different layers.

Failover can be handled in two different ways [ 55 ]: focusing on high consistency in AP mode, or on availability in CP (Consistency and Partition-Tolerance) mode. The former uses techniques to “virtually eliminate network based partitioning,” including fast heartbeats and consistent Paxos-based cluster formation. These techniques favor Consistency over Availability to ensure that the system does not enter a state of network partition. If, however, partitioning occurs, Aerospike offers two conflict handling policies: one relies on the database’s auto-merging capabilities, and the other offloads the conflict to the application layer so that application developers can resolve the conflicts by themselves and write the correct data back to the database. The second way that Aerospike manages failover is to provide availability while in CP mode. In this mode, availability needs to be sacrificed by, for instance, forcing the minority quorum(s) to halt, thus avoiding data inconsistency if a network split occurs.

Aerospike is, in short, an in-memory database with disk persistence, automatic data partitioning and synchronous replication, offering cross data center replication and configurability in the failover handling mechanism, preferring full consistency or high consistency [ 10 , 52 , 55 ].

Cassandra is an open-source shared-nothing NoSQL column-store database developed and used at Facebook [ 10 , 52 , 56 ]. It is based on the ideas behind Google BigTable [ 3 ] and Amazon Dynamo [ 14 ].

Cassandra is similar to BigTable in what concerns the data model. The minimal unit of storage is a column, with rows consisting of columns or super columns (nested columns). Columns themselves consist of the name, value and timestamp, all of which are provided by the client. Since it is column-based, rows need not have the same number of columns [ 10 ].

Cassandra supports a SQL-like language called CQL, together with other protocols [ 10 ]. Indexes and secondary indexes are supported, and atomicity is guaranteed at the level of one table row. Persistence is ensured by logging. Consistency is highly tunable according to the desired operation – the application developer can specify the desired level of consistency, trading off latency and consistency. Conflicts are resolved based on timestamps (the newest record is kept). The database operates in master-master mode [ 52 ], where no node is different from another, and combines disk-persistence with in-memory caching of results, resulting in high write throughput operations [ 52 , 56 ]. The master-master architecture makes it easy for horizontal scalability to happen [ 56 ]. There are several different partitioning techniques and replication can be automatically managed by the database [ 56 ].
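As an illustration of the data model and per-operation consistency tuning, consider the minimal CQL sketch below; the table, columns and consistency level are hypothetical choices, not drawn from the cited studies.

    CREATE TABLE posts_by_user (
        user_id   text,
        post_id   int,
        body      text,
        posted_at timestamp,
        PRIMARY KEY (user_id, post_id)  -- partition key, then clustering column
    );

    -- Consistency is chosen per operation; QUORUM trades some latency for
    -- stronger guarantees (CONSISTENCY is a cqlsh-level setting).
    CONSISTENCY QUORUM;
    SELECT * FROM posts_by_user WHERE user_id = 'alice';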

Apache CouchDB is another open-source project, written in Erlang, and following a document-oriented approach [ 10 ]. Documents are written in JSON and are meant to be accessed with CouchDB’s specific implementation of MapReduce views written in Javascript.

This database uses a B-tree index [ 10 ], updated during data modifications. These modifications have ACID properties on the document level, and the use of MVCC (Multi-Version Concurrency Control) enables readers to never block [ 10 ]. CouchDB’s document manipulation uses optimistic locks by updating an append-only B-tree for data storage, meaning that data must be periodically compacted. This compaction, in spite of maintaining availability, may hinder performance [ 10 ].

Regarding fault-tolerant replication mechanisms [ 57 ], CouchDB supports both master-slave and master-master replication that can be used between different instances of CouchDB or on a single instance. Scaling in CouchDB is achieved by replicating data, a process which is performed asynchronously. It does not natively support sharding/partitioning [ 10 ]. Consistency is guaranteed in the form of strengthened eventual consistency [ 10 ], and conflict resolution is performed by selecting the most up to date version (the application layer can later try to merge conflicting changes, if possible, back into the document). CouchDB’s programming interface is REST-based [ 10 , 57 ]. Ideally, CouchDB should be able to fit the whole dataset into the RAM of the cluster, as it is primarily a RAM-based database.

Couchbase is a combination of Membase (a key-value system with memcached compatibility) and CouchDB. It can be used in key-value fashion, but is considered a document store working with JSON documents (similarly to CouchDB) [ 10 ].

Documents in Couchbase have an intrinsic unique id and are stored in what are called data buckets. Like CouchDB, queries are built using MapReduce views in Javascript. The optimistic locking associated with an append-only B-tree is also implemented as in CouchDB. The default consistency level is eventual consistency (due to MapReduce views being constructed asynchronously). There is also the option of specifying that data should be indexed immediately [ 10 ].

A major difference when comparing Couchbase with CouchDB regards sharding [ 10 ]. Whereas CouchDB does not natively support sharding (there are projects, such as CouchDB Lounge [ 10 ], which enable this), Couchbase comes with off-the-shelf sharding that is transparent to the application. Replication is also a major point of difference between the two databases, as Couchbase supports intercluster and intracluster replication. The latter is performed within a cluster, guaranteeing immediate consistency. The former kind of replication ensures eventual consistency and is made asynchronously between geographically distributed clusters (conflict resolution is performed in the same way CouchDB does it). This database is mostly intended to run in-memory, so as to hold the whole dataset in RAM [ 10 , 29 ].

HBase is an open-source database written in Java and developed by the Apache Software Foundation. It is intended to be the open-source implementation of the Google BigTable principles, and relies on the Apache Hadoop Framework and the Apache ZooKeeper projects. It is, therefore, a column-store database [ 10 ].

HBase’s architecture is highly inspired by Google’s BigTable [ 3 , 10 ] and, thus, their capabilities are similar. There are, however, certain differences. The Hadoop Distributed File System (HDFS) is used for distributed storage in place of the Google File System, although other backends can be used (e.g. Hadoop MapReduce). HBase also supports several master servers to improve system reliability, but does not support the concept of locality. Like Google BigTable, it does not support full ACID semantics, although several properties are guaranteed [ 58 ]. Atomicity is guaranteed within a row, and consistency ensures that no row results from interleaved operations (i.e. the row must have effectively existed at some point in time). Still on the topic of consistency, rows are guaranteed to only move forward in time, never backward, and scans do not exhibit snapshot isolation but, rather, the “read committed” isolation level. Durability is established in the sense that all data that is read has already been made durable (i.e. persisted to disk), and that all operations returning success have ensured this durability property. This can be configured so that data is only periodically flushed to disk [ 58 ]. HBase does not support secondary indexes, meaning that data can only be queried by the primary key or by a table scan. It is worth noting that data in HBase is also untyped (everything is a byte array) [ 52 ]. Regarding the programming interface, HBase can be interfaced using a Java API, a REST interface, and the Avro and Thrift protocols [ 10 ].
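
The byte-oriented, index-free access model can be sketched with the happybase Python library, which talks to HBase over the Thrift protocol mentioned above; the table and column family names here are assumptions:

import happybase

connection = happybase.Connection("localhost")  # Thrift gateway
table = connection.table("events")              # hypothetical table with family 'cf'

# Everything is a byte array: row keys, column qualifiers and values alike.
table.put(b"row-001", {b"cf:user": b"alice", b"cf:kind": b"login"})

# Reads go through the row key or a table scan; there are no secondary indexes.
row = table.row(b"row-001")
for key, data in table.scan(row_prefix=b"row-"):
    print(key, data)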

MongoDB is an open-source document-oriented database written in C++ and developed by the 10gen company. It uses JSON (data is stored and transferred in a binary, more compact form named BSON), allowing for a schemaless data model where the only requirement is that an id is always present [ 10 , 56 ].

MongoDB’s horizontal scalability is mainly provided through the use of automatic sharding [ 56 ]. Replication is also supported using locks and the asynchronous master-slave model, meaning that writes are only processed by the master node, while reads can be served both by the master and by one of the slave nodes. Writes are propagated to the slave nodes by reading the master’s oplog (operation log) [ 56 ]. Database clients can choose the consistency model they wish by defining whether reads from secondary nodes are allowed and from how many nodes write confirmation must be obtained, as the sketch below illustrates.
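
The following minimal sketch uses the pymongo driver against a hypothetical replica set named rs0 (database, collection and field names are illustrative): the write concern sets how many members must confirm a write, and the read preference decides whether secondaries may serve reads.

from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client["demo"].get_collection(
    "events",
    write_concern=WriteConcern(w="majority"),            # confirmed by a majority
    read_preference=ReadPreference.SECONDARY_PREFERRED)  # secondaries may answer

coll.insert_one({"user": "alice", "kind": "login"})
doc = coll.find_one({"user": "alice"})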

Document manipulation is a strong focus of MongoDB, as the database provides different frameworks (e.g. MapReduce and the Aggregation Framework) and ways of interacting with documents [ 10 ]. Documents can be queried, sorted, projected, iterated with cursors, and aggregated, among other operations. Changes to a document are guaranteed to be atomic. Indexing can be used on one or several fields (implemented using B-trees), with the possibility of using two-dimensional spatial indexes for geometry-based data [ 10 ]. There are many different programming interfaces supported by MongoDB, with most popular programming languages having native bindings. A REST interface is also supported [ 10 ].
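
For instance, a hedged sketch of these document operations with pymongo (the collection and field names are again illustrative assumptions):

from pymongo import ASCENDING, MongoClient

coll = MongoClient()["demo"]["events"]
coll.create_index([("user", ASCENDING)])             # secondary B-tree index

coll.insert_one({"user": "alice", "kind": "login"})
pipeline = [
    {"$match": {"kind": "login"}},                   # filter documents
    {"$group": {"_id": "$user", "n": {"$sum": 1}}},  # aggregate per user
]
for row in coll.aggregate(pipeline):                 # Aggregation Framework
    print(row["_id"], row["n"])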

Project Voldemort is an open-source key-value store implemented in Java which presents itself as an open-source implementation of the Amazon Dynamo database [ 10 , 14 , 59 , 60 ]. It supports scalar values, lists and records with named fields associated with a single key. Arbitrary fields can be used if they are serializable [ 10 ].

Operations on the data are simple and limited: there are put, get and delete commands [ 10 , 60 ]. In this sense, Voldemort can be considered (as the developers themselves put it) “basically just a big, distributed, persistent, fault-tolerant hash table” [ 59 ]. For data modification, the MVCC mechanism is used [ 10 ].

Replication is supported using the consistent hashing method proposed in [ 61 ] [ 10 , 60 ]. Sharding is implemented transparently, with support for adding and removing nodes in real time (although this feature was not always easily available [ 62 ]). Data is meant to stay in RAM as much as possible, with persistent storage provided by several mechanisms, such as Berkeley DB [ 60 ]. Voldemort uses a Java API [ 52 ].
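
Consistent hashing itself is simple enough to sketch in a few lines. The toy ring below (an illustration of the technique, not Voldemort code) maps each key to the first node encountered clockwise on a hash ring, using virtual nodes to smooth the distribution:

import bisect
import hashlib

class ToyHashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets several points on the ring (virtual nodes).
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # first node clockwise

ring = ToyHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # adding/removing a node remaps only nearby keys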

Table 1 summarizes the characteristics of the studied NoSQL databases, similar to the work seen in [ 1 , 11 , 17 , 49 , 63 ], but providing a broader and more up-to-date view of these characteristics. Its information is derived from the previous sections and additional relevant sources ([ 12 , 64 – 71 ]). Each NoSQL database is described according to key characteristics: category (e.g. Key-Value database), positioning in the context of the CAP theorem, consistency guarantees and configurability, durability guarantees and configurability, querying possibilities and mechanisms (i.e. how are queries made, and how complex can they be?), concurrency control mechanisms, partitioning schemes and the existence of native partitioning. It should be noted that, as we have previously discussed, modern NoSQL databases often allow for fine-tuning of consistency and availability properties on a per-query basis, making the CAP-based classification (“AP”, “CP”, etc.) overly simplistic [ 41 , 44 – 46 ].

Software quality attributes

In the previous section we identified and described several NoSQL databases. In this section, we survey the literature on NoSQL databases to find how each of them satisfies the software quality attributes that we selected. Each subsection explores the NoSQL literature on a given quality attribute, drawing conclusions regarding all of the evaluated NoSQL databases. This information is then summarized in the following section (Section ‘Results and discussion’), where a table is provided to aid software architects and engineers in their decision process.

Availability

Availability concerns what percentage of time a system is operating correctly [ 1 ]. NoSQL technology is inherently better positioned to provide availability than relational systems. In fact, given Brewer’s CAP theorem [ 40 ] and the presence of failures in real-world systems (whether related to the network or to an application crash), NoSQL databases oppose most relational databases by favoring availability over consistency. Thus, one can assert that the higher the availability of a NoSQL system, the less likely it is to provide high consistency guarantees. Several NoSQL databases provide ways to tune the trade-off between consistency and availability, including Dynamo [ 14 ], Cassandra, CouchDB and MongoDB [ 9 ].

Apache CouchDB uses a shared-nothing clustering approach, allowing all replica nodes to continue working even if they are disconnected, thus being a good candidate for systems where high availability is needed [ 9 ]. It is worth noting, however, that this database periodically requires a compaction step which may hinder system performance, but which does not affect the availability of its nodes under normal operation [ 3 ].

In 2013, the authors of [ 27 ] tested several NoSQL databases (Aerospike, Cassandra, Couchbase and MongoDB) concerning their failover characteristics. Their results showed that Aerospike had the lowest downtime, followed by Cassandra, with MongoDB having the least favorable downtime. One should note that the results shown in the paper are limited to RAM-only datasets and hence might not be the best source for real-world scenarios. MongoDB’s results are also not surprising: even though it allows for fine-tuning (to adjust the consistency-availability trade-offs), several tests have shown that it is not the best choice for a highly available system, in particular due to overhead when nodes rejoin the system (see, for instance, [ 1 , 9 ] and our section on reliability). Lastly, [ 5 ] tested several NoSQL databases on the Cloud and noted that Riak could not provide high availability under very high loads.

Thus, there is no obvious candidate for a highly available system, but there are several competing solutions, in particular when coupled with systems such as Memcached [ 2 ]. The specific architecture employed (number of replicas, consistency options, etc.) will play a major role, as pointed out by several authors [ 27 , 72 ]. Furthermore, the popular MongoDB and Riak databases seem less likely to be good picks for this use case scenario.

Consistency

Consistency is related to transactions and, although not universally defined, can be seen as the guarantee that transactions started in the future see the effects of transactions committed in the past, coupled with the enforcement of database constraints [ 73 – 75 ]. It is useful to recall that, in the context of the CAP theorem [ 40 , 41 ], consistency is often seen as the premise that all nodes see the same data at the same time [ 42 ] (i.e., the CAP version of consistency is merely a subset of the ACID version of the same property [ 41 ]). We have previously seen that consistency and availability are highly related properties of NoSQL systems.

Cassandra has several different consistency guarantees [ 76 ]. The database allows for tunable consistency at both the read and write level, even with near-ACID semantics if the consistency level “ALL” is picked. MongoDB, in spite of being generally regarded as a CP system, offers similar consistency options [ 76 , 77 ]. Couchbase offers strong consistency guarantees for document access, whereas query access is eventually consistent [ 67 ]. HBase offers no fine-grained tuning: there is only the choice between strong and eventual consistency [ 58 ]. CouchDB, being an AP system, fully relies on eventual consistency [ 78 ]. The Voldemort project puts more stress on application logic to deal with inconsistencies in data, by using read repair [ 60 ].

Regarding concrete experiments, not much has been done to study consistency as a property in itself. Recent work by Bermbach et al. [ 76 ] has seen the proposal of a general architecture for consistency benchmarking. The authors test their proposal on Cassandra and MongoDB, concluding that MongoDB performed better, but also noting that they are merely proposing an architecture and that their tests might have been negatively impacted by their testing environment. The authors of [ 54 ] study Cassandra and Couchbase in a real-world microblogging scenario, concluding that Couchbase provided consistent results faster (i.e. the same value took less time to reach all node replicas). In [ 79 ], the authors study Amazon S3’s consistency behavior and conclude that it frequently violates the monotonic read consistency property (other related work is presented by Bermbach et al. [ 76 ]). It seems that a general framework for testing consistency might provide more in-depth answers on the effectiveness of the consistency trade-offs and techniques provided by each NoSQL database.

In summary, as the NoSQL ecosystem matures, there is a tendency towards micromanagement of consistency and availability [ 41 ], with some solutions opting to provide consistency (withholding availability), others providing availability (withholding consistency), and another set, such as Cassandra and MongoDB, allowing for fine-tuning on a per-query basis.

Durability

Durability refers to the requirement that data be valid and committed to disk after a successful transaction [ 1 ]. As we have previously covered, NoSQL databases act on the premise that consistency does not need to be fully enforced in the real world, preferring to sacrifice it in adjustable ways to achieve higher availability and partition tolerance. This impacts durability: if a system suffers from consistency problems, its durability is also at risk, leading to potential data loss [ 26 ].

In [ 26 ], the authors test Aerospike, Couchbase, Cassandra and MongoDB in a series of tests regarding durability and performance trade-offs. Their results featured Aerospike as the fastest performing database by a factor of 5-10 when the databases were set to synchronous replication. However, most scenarios do not rely on synchronous replication, but rather on asynchronous replication (meaning that changes are not instantly propagated among nodes). In that sense, the same authors, who in [ 27 ] studied the same databases in the context of failover characteristics, show that MongoDB loses the least data upon node failure when asynchronous replication is used. Cassandra trails MongoDB by about a factor of 100, and Aerospike and Couchbase both lose very large amounts of data. In [ 1 ], MongoDB is found to have issues with data loss when compared to CouchDB, in particular during recovery after a crash. In the same paper, the authors highlight that CouchDB’s immutable, append-only B+ tree ensures that files are always in a valid state. CouchDB’s durability is also noticed and justified by the authors of [ 2 ]. It should be noted that document-based systems such as MongoDB usually use a single-versioning system, which is designed specifically to target durability [ 49 ]. HBase’s reliance on Hadoop means that it is inherently durable in the way requests are processed, as several authors have noted [ 80 – 82 ]. Voldemort’s continuing operation as the backend to LinkedIn’s service is backed by strong durability [ 83 ], although there is a lack of studies focusing specifically on Voldemort’s durability.

In conclusion, as with other properties, the durability of NoSQL systems can be fine-tuned according to specific needs. However, databases based on immutability, such as CouchDB, are good picks for a system with good durability due to their inherent properties [ 1 ]. Furthermore, single-version databases, such as MongoDB, should also be the focus of those interested in durability advantages.

Maintainability

Maintainability is a quality attribute that regards the ease with which a product can be maintained, i.e., upgraded, repaired, debugged and adapted to new requirements [ 84 ]. From an intuitive point of view, systems with many components (e.g. several nodes) should add complexity and hinder maintainability, and this is a view that several authors agree with [ 7 , 85 ]. On the other hand, as some have hypothesized, the benefits of thoughtful modularity and task division make the case for a more maintainable system [ 86 ]. Assessing maintainability is a difficult problem which has seen vast amounts of research throughout the years, but that research has seldom focused explicitly on the database itself (in particular due to the widespread usage of the relational model with similar database interfaces).

In spite of the perceived difficulty in assessing the maintainability of NoSQL systems, there has been some research on the subject. Dzhakishev [ 50 ] studied the usability and maintainability of several NoSQL solutions in a real enterprise scenario. The author notes how MongoDB and Neo4j have “great shell applications”, easing maintainability, and that Neo4j even has a web interface (other NoSQL databases have such software, e.g. Couchbase Server). The authors of [ 87 ] study social network system implementation processes and rely on their own application-specific code to ensure maintainability of their application. They claim that versioning the schema using Subversion is good for their goals. Throughout their work, maintainability seems to be moved more into the application layer and less into the database layer, possibly suggesting that NoSQL maintainability must be achieved with the help of the developer. In [ 29 ], another real-world experiment, the authors note the added maintainability difficulties in using HBase, Couchbase, Cassandra and CouchDB to replace their RDF data system. Similarly, Han [ 88 ] also faced maintainability problems with MongoDB when comparing it with the maintainability of relational alternatives. Although no particular study in the literature has focused on the maintainability of Voldemort, from the point of view of ease of use this database seems harder to configure (in particular in terms of node topology changes) than others [ 62 ].

It seems that most NoSQL systems offer limited maintainability when compared with traditional RDBMSs, but the literature has little to say as to which is the most maintainable. Some authors [ 50 , 87 ] point in the direction of the ease of use of web interfaces and the readiness of tools. In that sense, Couchbase and Neo4j are prominent examples of databases that are easy to use and set up. On the other hand, MongoDB and HBase are known to be hard to install [ 89 , 90 ] or to confuse first-time users, thus limiting ease of use. Further research can and should be developed in this area so as to be able to truly compare the maintainability of NoSQL solutions.

Performance

When it comes to the performance and execution of different types of operations, NoSQL databases are divided mostly into two categories: read-optimized and write-optimized [ 21 , 91 ]. That is, largely regardless of the system type and records, each database carries an optimization granted by the mechanisms it uses for storage, organization and retrieval of data. For example, Cassandra is known for being very highly optimized for write (insert) execution and is not able to show the same performance during reads [ 21 , 91 ]. The database achieves this by using a cache for write operations: updates are immediately written to a logfile, then cached in memory and only later written to disk, making the insertion process itself faster. In general, Column Store and Key-Value databases use more memory to store their data, with some of them being completely in-memory (and, hence, weaker with respect to other attributes such as durability).
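
The essence of this write-optimized path (sequential log append, in-memory table, deferred flush) can be sketched in a few lines of Python. The toy store below is an illustration of the technique, not Cassandra’s implementation; all names are invented:

import json

class ToyWriteOptimizedStore:
    def __init__(self, flush_threshold=1000):
        self.log = open("commitlog.jsonl", "a")  # sequential append is cheap
        self.memtable = {}                       # recent writes live in memory
        self.flushed = []                        # immutable, on-disk-style tables
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.log.write(json.dumps([key, value]) + "\n")  # 1. durability via the log
        self.log.flush()
        self.memtable[key] = value                       # 2. fast in-memory write
        if len(self.memtable) >= self.flush_threshold:   # 3. deferred bulk flush
            self.flushed.append(dict(self.memtable))
            self.memtable.clear()

    def get(self, key):
        # Reads may have to consult several tables, which is why this
        # design favors writes over reads.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.flushed):  # newest table first
            if key in table:
                return table[key]
        return None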

Document Stores, on the other hand, are considered more read-optimized. This behavior resembles that of relational databases, where data loading and organization is slower, with the advantage of better preparing the system for future reads. Examples of this are MongoDB and Couchbase. If one compares most Column Store databases, such as Cassandra, to the document-based NoSQL landscape with regard to read performance, the latter wins. This has been seen in numerous works, such as [ 51 , 54 ] and [ 26 ]. We should also consider that databases such as MongoDB and Couchbase are considered more enterprise-oriented solutions, offering a set of mechanisms and functionality beyond the traditional key-value retrieval used not only by Key-Value stores but also by Column Store databases. This impacts performance significantly, as the additional functionality is usually associated with high performance costs.

Much work has been done with regard to performance testing of databases. Since NoSQL is constantly changing, past evaluations quickly become obsolete, but recent evaluations have been performed, some of which we now enumerate. In [ 51 ], a performance overview is given for Cassandra, HBase, MongoDB, OrientDB and Redis. The conclusions are that Redis is particularly well suited for all kinds of workloads (although this result should be taken lightly, since the database has many trade-offs with other quality attributes), that Cassandra performs very well for write/update scenarios, that overall OrientDB performs poorly when tested in this scenario and that HBase deals poorly with update queries. In [ 33 ], MongoDB and CouchDB are tested in a medical archiving scenario, with MongoDB showing better performance. In [ 92 ], MongoDB is shown to perform poorly for CRUD (create, read, update and delete) bulk operations when compared with PostgreSQL. Regarding write-heavy scenarios, a real-world enterprise scenario is presented in [ 35 ], where Cassandra, Couchbase and MongoDB are compared with MS SQL Server. In a four-node environment, Cassandra greatly outperforms the NoSQL competition (which is expected, since it is a write-optimized database), but is outperformed by a single-node MS SQL Server instance. Less recent, but also relevant, is the work presented in [ 26 , 27 ], where Cassandra, Couchbase, MongoDB and Aerospike are tested. Aerospike is shown to have the best performance, with Cassandra coming in second in terms of read throughput, and Couchbase in terms of write throughput. Rabl et al. [ 25 ] compared Voldemort, Redis, HBase, Cassandra, MySQL Cluster and VoltDB with regard to throughput, scalability and disk usage, and noted that while Cassandra’s throughput dominated most tests, Voldemort exhibited a good balance between read and write performance, competing with the other databases. The authors also note that VoltDB had the worst performance and HBase’s throughput is low (although it scales better than most other databases).

In conclusion, performance depends highly on the database architecture. Column Store databases, such as Cassandra, are usually more oriented towards write operations, whereas document-based databases are more read-oriented. This last group of databases is also generally more feature-rich, bearing more resemblance to the traditional relational model, and thus tends to carry a bigger performance penalty. Experiments have validated this theory, and we can conclude that, in contrast with some of the other quality attributes studied in this article, performance is definitely not lacking in terms of research and evaluation.

Reliability

Reliability concerns the system’s probability of operating without failures for a given period of time [ 49 ]. The higher the reliability, the less likely it is that the system fails. Recently, Domaschka et al. [ 49 ] have proposed a taxonomy for describing distributed databases with regard to their reliability and availability. Since reliability is significantly harder to define than availability (as it depends on the context of the application requirements), the authors suggest that software architects consider the following two questions: “(1) How are concurrent writes to the same item resolved?; (2) What is the consistency experienced by clients?”. With these in mind, and by using their taxonomy, we can see that systems which use single-version techniques, such as Redis, Couchbase, MongoDB and Neo4j, all perform online write-conflict resolution, making them good picks for a reliable system in the sense that they answer question (1) with reliable options. Regarding question (2), MongoDB, CouchDB, Neo4j, Cassandra and HBase all provide strong consistency guarantees. Thus, in order to achieve strong consistency guarantees and good concurrent write-conflict resolution, as proposed by the authors, one should look at the systems which have both these characteristics: MongoDB and Neo4j.

In conclusion, in spite of reliability being an important quality attribute, we have found little focus on this topic in the current literature, and we are therefore limited in our answers to this research question.

Robustness

Robustness is concerned with the ability of the database to cope with errors during execution [ 93 ]. Relational technology is known for its robustness, but many questions still arise when this topic is discussed in the context of NoSQL [ 4 ]. If, from one point of view, one might consider that NoSQL databases are more robust due to their replication (i.e. crashes are “faded out” by appropriate replication and consensus algorithms [ 94 ]), from another, lack of code maturity and extensive testing might make NoSQL less robust in general [ 4 , 12 ]. Little has been written on this subject [ 12 , 95 ], although there have been some real-world studies where the impact of NoSQL on a system’s robustness was considered (even if only indirectly). In [ 88 ], Han experiments with MongoDB as a possible replacement for a traditional RDBMS in an air quality monitoring scenario. With regard to robustness, the author notes that as cluster scale and workloads increase, robustness becomes a more pressing issue (i.e. problems become more evident). Ranjan, in [ 95 ], studies Big Data platforms and notes that lack of robustness is an open question in Big Data scheduling platforms and, in particular, in the NoSQL (Hadoop) case. In 2011, the authors of [ 12 ] postulated that robustness would be an issue for NoSQL, as the technology was new and needed testing. Neo4j is seen by some as a robust graph-based database [ 96 , 97 ]. Lior et al. [ 98 ] reviewed security issues in NoSQL databases and found that Cassandra and MongoDB were subject to denial-of-service attacks, which can be seen as a lack of robustness. Similarly, Manoj [ 77 ] presents a comparative table of features for Cassandra, MongoDB and HBase, where HBase is identified as having an intrinsic single point of failure that needs to be overcome by explicitly using failover clustering. Lastly, in [ 87 ], the authors claim Cassandra is robust due to Facebook’s contribution to its development and the fact that it is used as one of the backends of the social network.

Overall, not much can be concluded for each individual database in terms of robustness. A benchmark for robustness is currently lacking in the NoSQL ecosystem, and software engineers looking for the most robust database would benefit from research into this area. The most up-to-date information and research indicates that the more popular and widely used databases are more robust, although in general these systems are seen as less robust than their relational counterparts when tested in practice.

Scalability

Scalability concerns a system’s ability to deal with increasing workloads [ 1 ]. In the context of databases, it may be defined as the change in performance when new nodes are added, or hardware is improved [ 99 ]. NoSQL databases have been developed specifically to target scenarios where scalability is very important. These systems rely on horizontal and “elastic” scalability, by adding more nodes to a system instead of upgrading hardware [ 3 , 4 , 9 ]. The term “elastic” refers to elasticity, which is a characterization of the way a cluster reacts to the addition or removal of nodes [ 99 ].

In [ 18 ], the authors compared Cassandra and HBase, improving upon previous work. They concluded that both databases scale linearly, with different read and write performances. They also provided a more in-depth analysis of Cassandra’s scalability, noticing how horizontal scaling on this platform leads to fewer performance problems than vertical scaling.

In [ 99 ], the authors measure the elasticity and scalability of Cassandra, HBase and MongoDB. They were surprised to identify “superlinear speedups for clusters of size 24” when using Cassandra, stating that “it is almost as if Cassandra uses better algorithms for large cluster sizes”. For clusters of sizes 6 and 12, their results show HBase as the fastest competitor, with stable performance. Regarding elasticity, they found that HBase gives the best results, stabilizing the database significantly faster than Cassandra and MongoDB.

Rabl et al. [ 25 ] studied the scalability (and other attributes) of Cassandra, Voldemort, HBase, VoltDB, Redis and MySQL Cluster. They noted the linear scalability of Cassandra, HBase and Voldemort, observing, however, that Cassandra’s latency was “peculiarly high” and that Voldemort’s was stable. HBase, while the worst of these databases in terms of throughput, scaled better than the rest. Regarding the scalability capabilities of the databases themselves, Cassandra, HBase and Riak all support the addition of machines during live operation. Key-value databases, such as Aerospike and Voldemort, are also easier to scale, as the data model allows for better distribution of data across several nodes [ 12 ]. In particular, Voldemort was designed to be highly scalable, being the major backend behind LinkedIn.

Further studies regarding scalability are needed in the literature. It is clear that NoSQL databases are scalable, but the question of which scales best, or with the best performance, is still left unanswered. Nevertheless, we can conclude that popular choices for highly scalable systems are Cassandra and HBase. One must also note that scalability will be influenced by the particular choice of configuration parameters.

Stabilization Time and Recovery Time

Besides availability, there are other failover characteristics which determine the behavior of a system and might impact system stability. In the study made in [ 27 ], which we have already covered, the authors measure the time it takes for several NoSQL systems to recover from a node failure (the recovery time), as well as the time it takes for the system to stabilize when that node rejoins the cluster (the stabilization time). They find that MongoDB has the best recovery time, followed by Aerospike (when in synchronous change propagation mode), with Couchbase having values an order of magnitude slower and Cassandra two orders of magnitude slower than MongoDB. Regarding the time to stabilize on node re-entry, all systems perform well (< 1 ms) with the exception of MongoDB and Aerospike: the former takes a long 31 seconds to stabilize on node re-entry, and Aerospike, in synchronous mode, takes 3 seconds. These results tend to indicate that MongoDB and Aerospike are good picks if one is looking for good recovery times, but these choices should be made with care, so that a node re-entering the system does not affect its stability.

Overall, the topic of failover is highly dependent on configuration and desired properties and should be studied more thoroughly (we note this as part of our future work). The current literature is limited and does not allow for very general and broad conclusions.

Results and discussion

We used the criteria described in each of the previous sections to grade the databases. Regarding availability, downtime was used as the primary measure, together with relevant studies [ 5 , 27 ]. Consistency was graded according to two essential criteria: 1) how far the database can provide ACID-semantics consistency and 2) how much consistency can be fine-tuned. Durability was measured according to the use of single- or multi-version concurrency control schemes, the way data are persisted to disk (e.g. if data is always asynchronously persisted, this hinders durability), and studies that specifically targeted durability [ 26 ]. Regarding maintainability, the criteria were the currently available literature studies of real-world experiments, the ease of setup and use, and the accessibility of tools to interact with the database. For read and write performance, we considered recent studies [ 27 ] and the fine-tuning options of each database, as noted in the previous sections. Reliability is graded according to the taxonomy presented in [ 49 ] and by looking at synchronous propagation modes (databases which do not support them tend to be less reliable, as Domaschka et al. note). Database robustness was assessed with the real-world experiments carried out by researchers, as well as the available documentation on the possible tendency of databases to have problems dealing with crashes or attacks (e.g. being subject to DoS attacks). With respect to scalability, we looked at each database’s elasticity, its increase in performance due to horizontal scaling, and the ease of on-line scalability (i.e. is the live addition of nodes supported?). For recovery time and stabilization time, both highly related to availability, we based our classification on the results shown in [ 27 ] (implying that our grading of these attributes is mostly limited to their particular study and should be interpreted with appropriate care). We looked at the databases described in Section ‘Evaluated NoSQL databases’.

By analyzing Table 2, we can see that Aerospike suffers from data loss issues, affecting its durability, and also has issues with stabilization time (in particular in synchronous mode). Cassandra is a multi-purpose database (in particular due to its configurable consistency properties) which mostly lacks read performance (since it is tuned for write-heavy workloads). CouchDB provides similar technology to MongoDB, but is better suited for situations where availability is needed. Couchbase provides good availability capabilities (coupled with good recovery and stabilization times), making it a good candidate for situations where failover is bound to happen. HBase has similar capabilities to Cassandra, but is unable to cope with high loads, limiting its availability in these scenarios, and it is also the worst database in terms of robustness (this is mostly due to the research seen in [ 77 , 95 ]). MongoDB is the database that most closely resembles the classical relational use case scenario: it is better suited for reliable, durable, consistent and read-oriented use cases. It is somewhat lacking in terms of availability, scalability and write performance, and it is greatly hindered by its stabilization time (which is also one of the reasons for its low availability). Lastly, we lack some information on Voldemort, but find it to be a poor pick in terms of maintainability. It is, however, a good pick for durable, write-heavy scenarios, and provides a good balance between read and write performance (in line with [ 25 ]). We should highlight that there are more quality attributes that deserve attention, which we intend to address in future work; thus, this table does not intend to show that “one database is better than another” but, rather, that a given database is better for a particular use case scenario where these attributes are needed.

Many software quality attributes are highly interdependent. For example, availability and consistency, in the context of the CAP theorem, are often polarized. Similarly, availability, stabilization time and recovery time are highly related, since long stabilization and recovery times are bound to hinder availability. With this in mind, there are several interesting findings in the summary table we have presented.

Availability, stabilization time and recovery time, as mentioned, are highly related software quality attributes. In this sense, it is interesting to note that polarizing results are found for different databases. Software engineers looking at the availability quality attribute should note that although Aerospike, Cassandra, Couchbase, CouchDB and Voldemort all provide high availability, some of these databases are not ideal picks for situations where a fast recovery time is needed. Indeed, only Aerospike and MongoDB have a “Great” recovery-time rating, with Cassandra having the worst possible grading. On the other hand, Aerospike and MongoDB have poor stabilization times, whereas Cassandra has a “Good” rating for this quality attribute. Couchbase, another highly available system, although not having a “Great” rating in either of these two quality attributes, has a “Good” rating in both. Thus, for systems which desire high availability with a balance of stabilization and recovery time, Couchbase is an ideal pick.

It is interesting to note that Aerospike and Cassandra achieve both high availability and high consistency ratings. A naive application of the CAP theorem to distributed systems and NoSQL systems would tend to indicate that these two quality attributes would ultimately have to be traded off. Nevertheless, as other authors have pointed out [ 41 , 44 – 46 ], this is not the case, and our table reflects it. Systems such as Cassandra allow these properties to be traded off on a per-query basis. This, combined with the other characteristics of each database, results in high ratings in both quality attributes.

Inspecting only the availability and scalability quality attributes, it becomes clear that they are highly correlated. Nearly all systems with high availability also have high scalability. The only exception to this observation is CouchDB (this database does not support native partitioning, hindering its scalability). In cases where availability is limited, results are somewhat polarized: HBase achieves high scalability, whereas MongoDB is also hindered in terms of its scalability (this can easily be traced to MongoDB’s locking mechanism; indeed, as we have mentioned, this database is the one most similar to the typical relational use case scenario).

There are other highly correlated quality attributes which can be surprising. For instance, there is a high correlation between scalability and write performance. When one of these quality attributes tends towards “Great”, the other does too; similarly, when one tends towards “Bad”, so does the other. This result provides insight into how scalability is achieved in many NoSQL systems: write optimizations (particularly found in column-store databases) help achieve scalability, and systems with poor write performance tend to be fairly limited in their scalability. Contrasting with this positive correlation, read performance seems slightly negatively correlated with scalability: databases with high scalability tend to have higher write performance than read performance. This could be because many of these databases rely on partitioning as an efficient way to scale, and partitioning improves write performance (through parallel writes) much more than it does read performance (nevertheless, this would be interesting to study as future work). Consistency and recovery time are also quality attributes that share a high degree of correlation. This result is also intuitive, since systems that react quickly to the loss of a node will tend to have fewer conflicts and, thus, fewer consistency problems. Still on the topic of consistency, it shares some similarity with robustness in our table. Indeed, robust systems tend to also be consistent ones (notable exceptions are HBase and MongoDB). This relationship, however, is probably due to the nature of each database rather than to any intrinsic link between consistency and robustness. Finally, reliability and write performance are often in polarized positions (e.g., Aerospike has “Good” write performance but low reliability).

There are some quality attributes for which no “Great” rating has been attributed. Indeed, in terms of durability, maintainability, robustness and stabilization time, no NoSQL system was found to achieve optimal results. This indicates directions of future work for NoSQL databases. Some of these ratings can be explained due to the infancy of these systems – robustness and maintainability are properties that evolve with time, as systems mature, bugs are found and new functionality is added. These two quality attributes have the worst overall ratings and reveal weaknesses of NoSQL systems.

Another point of interest that becomes clear with the summary table is that while some quality attributes do not have a “Great” rating, there are others for which no “Bad” rating is given: availability, consistency, maintainability, reliability, scalability, and read and write performance. Some of these quality attributes, such as consistency and scalability, are actually found to have generally high ratings. This implies that these quality attributes are among the key attributes offered by NoSQL databases. It is no surprise, then, that availability, consistency and scalability, three of the major reasons for the initial development of NoSQL databases [ 63 ], are among these attributes.

Although performance is often considered a single quality attribute, read and write performance can differ. This difference is reflected in our table, and it is interesting to analyze the performance quality attribute as a whole with the data presented there. Most NoSQL databases polarize on performance characteristics, having high ratings on either read or write performance (Cassandra, Couchbase, HBase and MongoDB), but there are some exceptions. Aerospike provides a balance between write and read performance without reaching the “Great” rating in either of these quality attributes. On the other hand, Voldemort provides high write performance (“Great”) and good read performance (“Good”), while Couchbase offers good write performance (“Good”) but high read performance (“Great”). This implies that software engineers can look to Voldemort, Couchbase and Aerospike as “balanced” systems in terms of performance, with Voldemort and Couchbase tending slightly towards write- or read-specific scenarios, respectively.

The only quality attributes with neither “Great” nor “Bad” ratings are durability and maintainability. Indeed, it would make little sense for a NoSQL system to have bad durability, since this is a key attribute of most database solutions. On the other hand, the trade-offs associated with NoSQL often mean that durability must be sacrificed, resulting in no system yet achieving the best durability (this is a clear area for future work in NoSQL systems).

Conclusions

In this article we described the main characteristics and types of NoSQL technology, while approaching different aspects that highly contribute to the adoption of these systems. We also presented the state of the art of non-relational technology by describing some of the most relevant studies and performance tests and their conclusions, after surveying a vast number of publications since NoSQL’s birth. This state of the art is also intended to give a time-based perspective on the evolution of NoSQL research, highlighting four clearly distinct periods: 1) database type characterization (when NoSQL was in its infancy and researchers tried to categorize databases into different sets); 2) performance evaluations, with the advent of YCSB and a surge in NoSQL popularity; 3) real-world scenarios and criticism of some interpretations of the CAP theorem; and 4) an even bigger focus on applicability and a reinvigorated focus on the validation of benchmarking software.

We concluded that although there have been a variety of studies and evaluations of NoSQL technology, there is still not enough information to verify how well suited each non-relational database is to a specific scenario or system. Moreover, each working system differs from another, and all the necessary functionality and mechanisms highly affect the database choice. Sometimes there is no possibility of clearly stating the best database solution. Furthermore, we tried to find the best databases from a quality attribute perspective, an approach still not found in the current literature; this is our main contribution. In the future, we expect that NoSQL databases will be more widely used in real enterprise systems, making more information and user experience available to determine the most appropriate use of NoSQL according to each quality attribute and to further improve this initial approach.

As we have seen, NoSQL is still an in-development field, with many questions and a shortage of definite answers. Its technology is ever-changing, rendering even recent benchmarks and performance evaluations obsolete. There is also a lack of studies which focus on use-case-oriented scenarios or software engineering quality attributes (we believe ours is the first work on this subject). All of these reasons make it difficult to find the best pick for each of the quality attributes we chose in this work, as well as others. The summary table we presented makes it clear that there is a current need for a broad study of quality attributes in order to better understand the NoSQL ecosystem, and it would be interesting to conduct research in this domain. When more studies, with more consistent results, have been performed, a more thorough survey of the literature can be done, with clearer and more concise results.

Software architects and engineers can look to the summary table presented in this article when seeking help in understanding the wide array of offerings in the NoSQL world from a quality-attribute-based overview. This table also brings to light some hidden or unexpected relationships between quality attributes in the NoSQL world. For instance, scalability is highly related to write performance, but not necessarily to read performance. Additionally, broad sets of quality attributes that are highly related (e.g. availability, stabilization time and recovery time) can be individually studied, so that the appropriate trade-offs can be selected for a candidate system architecture.

Our literature review allows us to establish future directions for research regarding a quality-attribute-based approach to NoSQL databases. It is our belief that the development of a framework for assessing most of these quality attributes would greatly benefit the lives of software engineers and architects alike. In particular, research is currently lacking in terms of reliability, robustness, durability and maintainability, with most work in the literature focusing on raw performance. Future work in this area, with the development of such a framework for quality attribute evaluation, would undoubtedly benefit NoSQL research in the long term.

References

Orend K (2010) Analysis and Classification of NoSQL Databases and Evaluation of their Ability to Replace an Object-relational Persistence Layer. Dissertation, Technische Universität München.

Leavitt N (2010) Will nosql databases live up to their promise? Computer 43(2): 12–14.


Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: A distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2): 4.

Floratou A, Teletia N, DeWitt DJ, Patel JM, Zhang D (2012) Can the elephants handle the nosql onslaught? Proc VLDB Endowment 5(12): 1712–1723.

Lith A, Mattson J (2013) Investigating storage solutions for large data: A comparison of well performing and scalable data storage solutions for real time extraction and batch insertion of data. Dissertation, Chalmers University of Technology.

Sadalage PJ, Fowler M (2012) NoSQL Distilled: a Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education, Upper Saddle River, NJ.


Schram A, Anderson KM (2012) Mysql to nosql: data modeling challenges in supporting scalability In: Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity, 191–202.. ACM, Tucson, Arizona, USA.

NoSQL. http://nosql-database.org/ . Accessed June, 2015.

Strauch C (2011) NoSQL Databases. Lecture: Selected Topics on Software-Technology Ultra-Large Scale Sites, Stuttgart Media University.

Kuznetsov S, Poskonin A (2014) Nosql data management systems. Program Comput Softw 40(6): 323–332.

Hecht R, Jablonski S (2011) Nosql evaluation In: International Conference on Cloud and Service Computing, 336–41.. IEEE, Hong Kong, China.

Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4): 12–27.

Lourenço JR, Abramova V, Vieira M, Cabral B, Bernardino J (2015) Nosql databases: A software engineering perspective In: New Contributions in Information Systems and Technologies, 741–750.. Springer, São Miguel, Azores, Portugal.

DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store In: ACM SIGOPS Operating Systems Review, 205–220.. ACM, Stevenson, Washington, USA.

Stonebraker M (2010) Sql databases v. nosql databases. Commun ACM. 53(4): 10–11.

Stonebraker M (2011) Stonebraker on nosql and enterprises. Commun ACM. 54(8): 10–11.

Tudorica BG, Bucur C (2011) A comparison between several nosql databases with comments and notes In: Roedunet International Conference (RoEduNet), 2011 10th, 1–5.. IEEE, Iasi, Romania.


Dory T, Mejías B, Van Roy P, Tran NL (2011) Comparative elasticity and scalability measurements of cloud databases In: Proc of the 2nd ACM Symposium on Cloud Computing (SoCC).. IEEE, Iasi, Romania.

Konstantinou I, Angelou E, Boumpouka C, Tsoumakos D, Koziris N (2011) On the elasticity of nosql databases over cloud management platforms In: Proceedings of the 20th ACM international conference on Information and knowledge management, 24–28, Glasgow.

Han J, Haihong E, Le G, Du J (2011) Survey on nosql database In: Pervasive Computing and Applications (ICPCA), 2011 6th International Conference On, 363–366.. IEEE, Port Elizabeth, South Africa.

Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R (2010) Benchmarking cloud serving systems with ycsb In: Proceedings of the 1st ACM Symposium on Cloud Computing, 143–154.. ACM, Indianapolis, Indiana, USA.

van der Veen JS, van der Waaij B, Meijer RJ (2012) Sensor data storage performance: Sql or nosql, physical or virtual In: Cloud Computing (CLOUD), 2012 IEEE 5th International Conference On, 431–438.. IEEE, Honolulu, HI, USA.

Parker Z, Poe S, Vrbsky SV (2013) Comparing nosql mongodb to an sql db In: Proceedings of the 51st ACM Southeast Conference, 5.. ACM, Savannah, Georgia, USA.

Kashyap S, Zamwar S, Bhavsar T, Singh S (2013) Benchmarking and analysis of nosql technologies. Int J Emerg Technol Adv Eng 3: 422–426.

Rabl T, Gómez-Villamor S, Sadoghi M, Muntés-Mulero V, Jacobsen HA, Mankovskii S (2012) Solving big data challenges for enterprise application performance management. Proc VLDB Endowment 5(12): 1724–1735.

Nelubin D, Engber B (2013) Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs. Thumbtack Technology, Inc., White Paper.

Nelubin D, Engber B (2013) Nosql failover characteristics: Aerospike, cassandra, couchbase, mongodb.

Abramova V, Bernardino J (2013) Nosql databases: Mongodb vs cassandra In: Proceedings of the International C* Conference on Computer Science and Software Engineering, 14–22.. ACM, New York, USA.

Cudré-Mauroux P, Enchev I, Fundatureanu S, Groth P, Haque A, Harth A, Keppmann FL, Miranker D, Sequeda JF, Wylot M (2013) Nosql databases for rdf: an empirical evaluation In: The Semantic Web–ISWC 2013, 310–325.. Springer, Berlin.

Yang CT, Liu JC, Hsu WH, Lu HW, Chu WC-C (2013) Implementation of data transform method into nosql database for healthcare data In: Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2013 International Conference On, 198–205.. IEEE, Taipei Taiwan.

Blanke T, Bryant M, Hedges M (2013) Back to our data-experiments with nosql technologies in the humanities In: Big Data, 2013 IEEE International Conference On, 17–20.. IEEE, Silicon Valley, CA, USA.

Fan C, Bai C, Zou J, Zhang X, Rao L (2013) A dynamic password authentication system based on nosql and rdbms combination In: LISS 2013, 811–819.. Springer, Berlin.

Silva LAB, Beroud L, Costa C, Oliveira JL (2014) Medical imaging archiving: A comparison between several nosql solutions In: Biomedical and Health Informatics (BHI), 2014 IEEE-EMBS International Conference On, 65–68.. IEEE, Valencia, Spain.

Rith J, Lehmayr PS, Meyer-Wegener K (2014) Speaking in tongues: Sql access to nosql systems In: Proceedings of the 29th Annual ACM Symposium on Applied Computing, 855–857.. ACM, Gyeongju, Korea.

Lourenço JR, Abramova V, Cabral B, Bernardino J, Carreiro P, Vieira M (2015) Nosql in practice: a write-heavy enterprise application In: IEEE BigData Congress 2015. New York, June 27-July 2, 2015.

Wingerath W, Friedrich S, Gessert F, Ritter N (2015) Who Watches the Watchmen? On the Lack of Validation in NoSQL Benchmarking. In: Seidl T, Ritter N, Schöning H, Sattler K-U, Härder T, Friedrich S, Wingerath W (eds)Datenbanksysteme für Business, Technologie und Web (BTW 2015), Hamburg.

George TB. A proposed validation method for a benchmarking methodology. Int J Sustainable Econ Manag (IJSEM) 3(4): 1–10.

Chen Y, Raab F, Katz R (2014) From tpc-c to big data benchmarks: A functional workload model In: Specifying Big Data Benchmarks, 28–43.. Springer, Berlin.

Qin X, Zhou X (2013) A survey on benchmarks for big data and some more considerations In: Intelligent Data Engineering and Automated Learning–IDEAL 2013, 619–627.. Springer, Berlin.

Brewer EA (2000) Towards robust distributed systems In: PODC.. IEEE, Portland, Oregon, USA.

Brewer E (2012) Cap twelve years later: How the “rules” have changed. Computer 45(2): 23–29.

Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2): 51–59.

Wada H, Fekete A, Zhao L, Lee K, Liu A (2011) Data consistency properties and the trade-offs in commercial cloud storage: the consumers’ perspective In: CIDR, 134–143.. ACM, Asilomar, California, USA.

Stonebraker M (2010) In search of database consistency. Commun ACM 53(10): 8–9.

Abadi D (2010) Problems with CAP, and Yahoo’s Little Known NoSQL System. DBMS Musings blog; online resource. http://dbmsmusings.blogspot.pt/2010/04/problems-with-cap-and-yahoos-little.html . Accessed June 2015.

Hale C (2010) You can’t sacrifice partition tolerance. http://codahale.com/you-cant-sacrifice-partition-tolerance/ . Accessed July 2015.

Clements P, Kazman R, Klein M (2003) Evaluating Software Architectures. Tsinghua University Press, Beijing.

Offutt J (2002) Quality attributes of web software applications. IEEE Softw 2: 25–32.

Domaschka J, Hauser CB, Erb B (2014) Reliability and availability properties of distributed database systems In: Enterprise Distributed Object Computing Conference (EDOC), 2014 IEEE 18th International, 226–233.. IEEE, Ulm, Germany.

Dzhakishev D (2014) Nosql databases in the enterprise. An experience with Tomra’s receipt validation system.

Abramova V, Bernardino J, Furtado P (2014) Which nosql database? a performance overview. Open J Databases (OJDB) 1(2): 17–24.

Gudivada VN, Rao D, Raghavan VV (2014) Nosql systems for big data management In: Services (SERVICES), 2014 IEEE World Congress On, 190–197.. IEEE, Anchorage, AK, USA.

DB-Engines Ranking: Knowledge Base of Relational and NoSQL Database Management Systems. http://db-engines.com/en/ranking . Accessed July, 2015.

Fonseca A, Vu A, Grman P (2013) Evaluation of NoSQL databases for large-scale decentralized microblogging, Universitat Politècnica de Catalunya.

Aerospike (2014) ACID Support in Aerospike. Aerospike, Mountain View, California.

Haughian G (2014) Benchmarking replication in nosql data stores. Dissertation, Imperial College London.

Nocuń Ł, Nieć M, Pikuła P, Mamla A, Turek W (2013) Car-finding system with couchdb-based sensor management platform. Comput Sci 14(3): 403–422.

Apache Hbase ACID Semantics. http://hbase.apache.org/acid-semantics.html . Accessed July, 2015.

Voldemort Project Github. https://github.com/voldemort/voldemort . Accessed July, 2015.

Voldemort: Design – Voldemort. www.project-voldemort.com/voldemort/design.html . Accessed July, 2015.

Karger D, Sherman A, Berkheimer A, Bogstad B, Dhanidina R, Iwamoto K, Kim B, Matkins L, Yerushalmi Y (1999) Web caching with consistent hashing. Comput Netw 31(11): 1203–1213.

Voldemort Rebalancing (as Seen in the Wayback Time Machine Archive in 2012). http://web.archive.org/web/20100923080327/http://github.com/voldemort/voldemort/wiki/Voldemort-Rebalancing . Accessed July, 2015.

Pokorny J (2013) Nosql databases: a step to database scalability in web environment. Int J Web Inf Syst 9(1): 69–82.

Aerospike Clustering. https://www.aerospike.com/docs/architecture/clustering.html . Accessed July, 2015.

MongoDB Concurrency FAQ. http://docs.mongodb.org/manual/faq/concurrency/ . Accessed July, 2015.

Couchbase Blog: Optimistic or Pessimistic Locking, Which One Should You Pick? http://blog.couchbase.com/optimistic-or-pessimistic-locking-which-one-should-you-pick . Accessed July, 2015.

10 Things Developers Should Know About Couchbase. http://blog.couchbase.com/10-things-developers-should-know-about-couchbase . Accessed July, 2015.

Cassandra Concurrency Control. http://teddyma.gitbooks.io/learncassandra/content/concurrent/concurrency_control.html . Accessed July, 2015.

Apache HBase Reference Guide. https://hbase.apache.org/book.html . Accessed July, 2015.

Apache HBase Durability Javadoc. https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Durability.html . Accessed July, 2015.

Sumbaly R, Kreps J, Gao L, Feinberg A, Soman C, Shah S (2012) Serving large-scale batch computed data with project voldemort In: Proceedings of the 10th USENIX Conference on File and Storage Technologies, 18–18.. USENIX Association, San Jose, CA, USA.

Beyer F, Koschel A, Schulz C, Schäfer M, Astrova I, Grivas SG, Schaaf M, Reich A (2011) Testing the suitability of cassandra for cloud computing environments In: CLOUD COMPUTING 2011, The Second International Conference on Cloud Computing, GRIDs, and Virtualization, 86–91.. IARIA XPS, Venice/Mestre, Italy.

Ports DR, Clements AT, Zhang I, Madden S, Liskov B (2010) Transactional consistency and automatic management in an application data cache In: OSDI, 1–15.. USENIX Association, Vancouver, BC, Canada.

Eswaran KP, Gray JN, Lorie RA, Traiger IL (1976) The notions of consistency and predicate locks in a database system. Commun ACM 19(11): 624–633.

Article   MATH   MathSciNet   Google Scholar  

Haerder T, Reuter A (1983) Principles of transaction-oriented database recovery. ACM Comput Surv (CSUR) 15(4): 287–317.

Article   MathSciNet   Google Scholar  

Bermbach D, Zhao L, Sakr S (2014) Towards comprehensive measurement of consistency guarantees for cloud-hosted data storage services In: Performance Characterization and Benchmarking, 32–47.. Springer, Berlin.

Manoj V (2014) Comparative study of nosql document, column store databases and evaluation of cassandra. Int J Database Manag Syst (IJDMS) 6: 11–26.

CouchDB Consistency. http://guide.couchdb.org/draft/consistency.html . Accessed July, 2015.

Bermbach D, Tai S (2011) Eventual consistency: How soon is eventual? an evaluation of amazon s3’s consistency behavior In: Proceedings of the 6th Workshop on Middleware for Service Oriented Computing, 1.. ACM, Lisbon, Portugal.

Konishetty VK, Kumar KA, Voruganti K, Rao G (2012) Implementation and evaluation of scalable data structure over hbase In: Proceedings of the International Conference on Advances in Computing, Communications and Informatics, 1010–1018.. USENIX Association, Chennai, India.

Harter T, Borthakur D, Dong S, Aiyer AS, Tang L, Arpaci-Dusseau AC, Arpaci-Dusseau RH (2014) Analysis of hdfs under hbase: a facebook messages case study In: FAST, 12.. USENIX Association, Santa Clara, CA, USA.

Konishetty VK, Kumar KA, Voruganti K, Rao GVP (2012) Implementation and evaluation of scalable data structure over hbase In: Proceedings of the International Conference on Advances in Computing, Communications and Informatics. ICACCI ’12, 1010–1018.. ACM, New York, NY, USA, doi:10.1145/2345396.2345559. http://doi.acm.org/10.1145/2345396.2345559 .

Chandra DG, Prakash R, Lamdharia S (2012) A study on cloud database In: Computational Intelligence and Communication Networks (CICN), 2012 Fourth International Conference On, 513–519.. IEEE, Mathura, India.

Riaz M, Mendes E, Tempero E (2011) Towards predicting maintainability for relational database-driven software applications: Extended evidence from software practitioners. Int J Softw Eng Appl 5(2): 107–121.

Roijackers J, Fletcher G (2012) Bridging sql and nosql. Master’s thesis, Eindhoven University of Technology.

Fujimoto R, McLean T, Perumalla K, Tacic I (2000) Design of high performance rti software In: Distributed Simulation and Real-Time Applications, 2000.(DS-RT 2000). Proceedings. Fourth IEEE International Workshop On, 89–96.. IEEE, San Francisco, CA, USA.

Škrabálek J, Kunc P, Nguyen F, Pitner T (2013) Towards effective social network system implementation In: New Trends in Databases and Information Systems, 327–336.. Springer, Berlin.

Han M (2015) The application of nosql database in air quality monitoring In: 2015 International Conference on Intelligent Systems Research and Mechatronics Engineering.. Atlantis Press, Zhengzhou, China.

Chodorow K (2013) MongoDB: the Definitive Guide. “O’Reilly Media, Inc.”, 103a Morris Street, Sebastopol, CA 95472, USA.

George L (2011) HBase: the Definitive Guide. “O’Reilly Media, Inc”, 103a Morris Street, Sebastopol, CA 95472, USA.

Gajendran SK (1998) A Survey on NoSQL Databases. Department of Computer Science, Donetsk.

Hammes D, Medero H, Mitchell H (2014) Comparison of NoSQL and SQL Databases in the Cloud. Proceedings of the Southern Association for Information Systems (SAIS), Macon, GA, 21-22 March, 2014.

Eager DL, Sevcik KC (1983) Achieving robustness in distributed database systems. ACM Trans Database Syst (TODS) 8(3): 354–381.

Feng H (2012) Benchmarking the suitability of key-value stores for distributed scientific data. Dissertation, The University of Edinburgh.

Ranjan R (2014) Modeling and simulation in performance optimization of big data processing frameworks. Cloud Comput IEEE 1(4): 14–19.

Huang H, Dong Z (2013) Research on architecture and query performance based on distributed graph database neo4j In: Consumer Electronics, Communications and Networks (CECNet), 2013 3rd International Conference On, 533–536.. IEEE, Xianning, China.

Schreiber A, Ney M, Wendel H (2012) The provenance store proost for the open provenance model In: Provenance and Annotation of Data and Processes, 240–242.. IEEE, Changsha, China.

Okman L, Gal-Oz N, Gonen Y, Gudes E, Abramov J (2011) Security issues in nosql databases In: Trust, Security and Privacy in Computing and Communications (TrustCom), 2011 IEEE 10th International Conference On, 541–547, IEEE, Changsha, China.

Kuhlenkamp J, Klems M, Röss O (2014) Benchmarking scalability and elasticity of distributed database systems. Proc VLDB Endowment 7(13): 1219–1230.

Download references

Acknowledgement

This research would not have been made possible without support and funding of the FEED - Free Energy Data and iCIS - Intelligent Computing in the Internet Services (CENTRO-07 - ST24 - FEDER - 002003) projects, to which we are extremely grateful.

Author information

Authors and affiliations.

CISUC, Department of Informatics Engineering, University of Coimbra, Pólo II – Pinhal de Marrocos, Coimbra, 3030-290, Portugal

João Ricardo Lourenço, Bruno Cabral, Marco Vieira & Jorge Bernardino

Critical Software, Parque Industrial de Taveiro, lote 49, Coimbra, 3045-504, Portugal

Paulo Carreiro

ISEC – Superior Institute of Engineering of Coimbra, Polytechnic Institute of Coimbra, Coimbra, 3030-190, Portugal

Jorge Bernardino

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to João Ricardo Lourenço .

Additional information

Competing interests.

The authors declare that they have no competing interests.

Authors’ contributions

JRL surveyed most of the literature. BC, JB and MV helped identifying and evaluating the quality attributes, as well as finding the appropriate NoSQL databases to study, guiding the research and iteratively reviewing and revising the work. PC provided an initial case study from which this work originally sprouted, and helped identifying the evaluated quality attributes. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( https://creativecommons.org/licenses/by/4.0 ), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Cite this article.

Lourenço, J.R., Cabral, B., Carreiro, P. et al. Choosing the right NoSQL database for the job: a quality attribute evaluation. Journal of Big Data 2 , 18 (2015). https://doi.org/10.1186/s40537-015-0025-0

Download citation

Received : 02 June 2015

Accepted : 27 July 2015

Published : 14 August 2015

DOI : https://doi.org/10.1186/s40537-015-0025-0

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • NoSQL databases
  • Document store
  • Software engineering
  • Quality attributes
  • Software architecture

research paper on relational database


Database Management Systems—An Efficient, Effective, and Augmented Approach for Organizations

Conference paper by Anushka Sharma, Aman Karamchandani, Devam Dave, Arush Patel, and Nishant Doshi. In: Senjyu, T., Mahalle, P.N., Perumal, T., Joshi, A. (eds) ICT with Intelligent Applications. Smart Innovation, Systems and Technologies, vol 248. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-4177-0_47

Big and small firms, organizations, hospitals, schools, and other commercial offices generate moderate to huge amounts of data regularly and need to constantly update and manage these data. These data are not only used at that instant; retrospective analysis of data generally helps tremendously to improve business strategies and marketing trends. Over time, these data may grow and become unmanageable if handled conventionally, as in a file system. These factors led to the introduction of the terms database and database management system. Hierarchical, network, relational, and object-oriented approaches to DBMS are discussed in this paper. The paper also highlights the new-generation database approach called NoSQL and provides an insight into augmented data management. A model based on the database design for the Study in India Program is discussed, followed by a graphical user interface developed in Java that ensures ease of access to the database.



Two-Bit History

Computing through the ages


Important Papers: Codd and the Relational Model

29 Dec 2017

It’s hard to believe today, but the relational database was once the cool new kid on the block. In 2017, the relational model competes with all sorts of cutting-edge NoSQL technologies that make relational database systems seem old-fashioned and boring. Yet, 50 years ago, none of the dominant database systems were relational. Nobody had thought to structure their data that way. When the relational model did come along, it was a radical new idea that revolutionized the database world and spawned a multi-billion dollar industry.

The relational model was introduced in 1970. Edgar F. Codd, a researcher at IBM, published a paper called “A Relational Model of Data for Large Shared Data Banks.” The paper was a rewrite of a paper he had circulated internally at IBM a year earlier. The paper is unassuming; Codd does not announce in his abstract that he has discovered a brilliant new approach to storing data. He only claims to have employed a novel tool (the mathematical notion of a “relation”) to address some of the inadequacies of the prevailing database models.

In 1970, there were two schools of thought about how to structure a database: the hierarchical model and the network model. The hierarchical model was used by IBM’s Information Management System (IMS), the dominant database system at the time. The network model had been specified by a standards committee called CODASYL (which also—random tidbit—specified COBOL) and implemented by several other database system vendors. The two models were not really that different; both could be called “navigational” models. They persisted tree or graph data structures to disk, using pointers to preserve the links between the data. Retrieving a record stored toward the bottom of the tree would involve first navigating through all of its ancestor records. These databases were fast (IMS is still used by many financial institutions partly for this reason; see this excellent blog post) but inflexible. Woe unto those database administrators who suddenly found themselves needing to query records from the bottom of the tree without having an obvious place to start at the top.
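To make the navigational idea concrete, here is a minimal sketch in Python (the nesting and field names are invented for illustration; real IMS or CODASYL systems chased on-disk pointers, not in-memory dictionaries):

# Hypothetical parts-and-suppliers data persisted as one tree, the way a
# hierarchical system might lay it out. A program must know this exact
# shape to retrieve anything.
db = {
    "project": "Assembly-7",
    "parts": [
        {"part_no": "P-100",
         "suppliers": [{"name": "Acme", "price": 9.50}]},
        {"part_no": "P-200",
         "suppliers": [{"name": "Bolt Co", "price": 4.25}]},
    ],
}

# Navigating top-down is easy: project -> part -> supplier.
for part in db["parts"]:
    if part["part_no"] == "P-200":
        print(part["suppliers"][0]["price"])  # 4.25

# The inverse question -- "which projects does Acme supply?" -- has no
# natural entry point here, and any change to the nesting breaks the
# lookup code above.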

Codd saw this inflexibility as a symptom of a larger problem. Programs using a hierarchical or network database had to know how the stored data was structured. Programs had to know this because they were responsible for navigating down this structure to find the information they needed. This was so true that when Charles Bachman, a major pioneer of the network model, received a Turing Award for his work in 1973, he gave a speech titled “The Programmer as Navigator.” Of course, if programs were saddled with this responsibility, then they would immediately break if the structure of the database ever changed. In the introduction to his 1970 paper, Codd motivates the search for a better model by arguing that we need “data independence,” which he defines as “the independence of application programs and terminal activities from growth in data types and changes in data representation.” The relational model, he argues, “appears to be superior in several respects to the graph or network model presently in vogue,” partly because, among other benefits, the relational model “provides a means of describing data with its natural structure only.” By this he meant that programs could safely ignore any artificial structures (like trees) imposed upon the data for storage and retrieval purposes only.

To further illustrate the problem with the navigational models, Codd devotes the first section of his paper to an example data set involving machine parts and assembly projects. This dataset, he says, could be represented in existing systems in at least five different ways. Any program \(P\) that is developed assuming one of the five structures will fail when run against at least three of the other structures. The program \(P\) could instead try to figure out ahead of time which of the structures it might be dealing with, but it would be difficult to do so in this specific case and practically impossible in the general case. So, as long as the program needs to know about how the data is structured, we cannot switch to an alternative structure without breaking the program. This is a real bummer because (and this is from the abstract) “changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.”

Codd then introduces his relational model. This model would be refined and expanded in subsequent papers: In 1971, Codd wrote about ALPHA, a SQL-like query language he created; in another 1971 paper, he introduced the first three normal forms we know and love today; and in 1972, he further developed relational algebra and relational calculus, the mathematically rigorous underpinnings of the relational model. But Codd’s 1970 paper contains the kernel of the relational idea:

The term relation is used here in its accepted mathematical sense. Given sets \(S_1, S_2, \ldots, S_n\) (not necessarily distinct), \(R\) is a relation on these \(n\) sets if it is a set of \(n\)-tuples each of which has its first element from \(S_1\), its second element from \(S_2\), and so on. We shall refer to \(S_j\) as the \(j\)th domain of \(R\). As defined above, \(R\) is said to have degree \(n\). Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree \(n\) n-ary.
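(A concrete instance, mine rather than Codd’s: take \(S_1 = \{100, 200\}\) to be part numbers and \(S_2 = \{\text{bolt}, \text{washer}\}\) to be part names. Then \(R = \{(100, \text{bolt}), (200, \text{washer})\}\) is a relation of degree 2 on \(S_1\) and \(S_2\); in today’s vocabulary, a two-column table with two rows.)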

Today, we call a relation a table, and a domain an attribute or a column. The word “table” actually appears nowhere in the paper, though Codd’s visual representations of relations (which he calls “arrays”) do resemble tables. Codd defines several more terms, some of which we continue to use and others we have replaced. He explains primary and foreign keys, as well as what he calls the “active domain,” which is the set of all distinct values that actually appear in a given domain or column. He then spends some time distinguishing between a “simple” and a “nonsimple” domain. A simple domain contains “atomic” or “nondecomposable” values, like integers. A nonsimple domain has relations as elements. The example Codd gives here is that of an employee with a salary history. The salary history is not one salary but a collection of salaries each associated with a date. So a salary history cannot be represented by a single number or string.

It’s not obvious how one could store a nonsimple domain in a multi-dimensional array, AKA a table. The temptation might be to denote the nonsimple relationship using some kind of pointer, but then we would be repeating the mistakes of the navigational models. Instead, Codd introduces normalization, which at least in the 1970 paper involves nothing more than turning nonsimple domains into simple ones. This is done by expanding the child relation so that it includes the primary key of the parent. Each tuple of the child relation references its parent using simple domains, eliminating the need for a nonsimple domain in the parent. Normalization means no pointers, sidestepping all the problems they cause in the navigational models.
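Here is a sketch of that normalization in a modern relational system; the schema is my rendering of Codd’s employee/salary-history example, not anything from the paper, written against Python’s built-in sqlite3 module:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Parent relation: simple, atomic domains only.
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL
    );
    -- Child relation: each salary tuple carries the parent's primary key,
    -- so the parent needs no nonsimple "salary history" domain and no pointers.
    CREATE TABLE salary_history (
        emp_id     INTEGER NOT NULL REFERENCES employee(emp_id),
        start_date TEXT    NOT NULL,
        salary     INTEGER NOT NULL,
        PRIMARY KEY (emp_id, start_date)
    );
""")
con.execute("INSERT INTO employee VALUES (1, 'E. F. Codd')")
con.executemany(
    "INSERT INTO salary_history VALUES (?, ?, ?)",
    [(1, "1970-01-01", 100), (1, "1971-01-01", 120)],
)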

At this point, anyone reading Codd’s paper would have several questions, such as “Okay, how would I actually query such a system?” Codd mentions the possibility of creating a universal sublanguage for querying relational databases from other programs, but declines to define such a language in this particular paper. He does explain, in mathematical terms, many of the fundamental operations such a language would have to support, like joins, “projection” (SELECT in SQL), and “restriction” (WHERE). The amazing thing about Codd’s 1970 paper is that, really, all the ideas are there—we’ve been writing SELECT statements and joins for almost half a century now.
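Here those operations are in modern SQL, in a self-contained sketch (tables and data invented for illustration, again using Python’s sqlite3):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
    CREATE TABLE salary (emp_id INTEGER, start_date TEXT, amount INTEGER);
    INSERT INTO employee VALUES (1, 'Ada', 'Research'), (2, 'Grace', 'Systems');
    INSERT INTO salary VALUES (1, '1970-01-01', 100), (2, '1970-06-01', 90);
""")

rows = con.execute("""
    SELECT e.name, s.amount                -- projection: a subset of columns
    FROM employee e
    JOIN salary s ON s.emp_id = e.emp_id   -- join through shared key values
    WHERE e.dept = 'Research'              -- restriction: a subset of rows
""").fetchall()
print(rows)  # [('Ada', 100)]

The join recombines normalized relations by matching key values, which is exactly what replaced the pointer-chasing of the navigational models.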

Codd wraps up the paper by discussing ways in which a normalized relational database, on top of its other benefits, can reduce redundancy and improve consistency in data storage. Altogether, the paper is only 11 pages long and not that difficult of a read. I encourage you to look through it yourself. It would be another ten years before Codd’s ideas were properly implemented in a functioning system, but, when they finally were, those systems were so obviously better than previous systems that they took the world by storm.


