Databases and NoSQL

NoSQL?

A NoSQL database provides a mechanism for the storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. NoSQL databases are often highly optimized key–value stores intended primarily for simple retrieval and appending operations, whereas an RDBMS is intended as a general-purpose data store. There will thus be some operations where NoSQL is faster and some where an RDBMS is faster. NoSQL databases are finding significant and growing industry use in big data and real-time web applications.

NoSQL systems are also referred to as “Not only SQL” to emphasize that they may in fact allow SQL-like query languages to be used. There is no standard definition of what NoSQL means. The term began with a workshop organized in 2009, but there is much argument about which databases can truly be called NoSQL.
But while there is no formal definition, there are some common characteristics of NoSQL databases:

  • They don’t use the relational data model, and thus don’t use the SQL language.
  • They tend to be designed to run on a cluster.
  • They tend to be open source.
  • They don’t have a fixed schema, allowing you to store any data in any record.
We should also remember Google’s Bigtable and Amazon’s SimpleDB. While these are tied to their host’s cloud service, they certainly fit the general operating characteristics.

Carlo Strozzi used the term NoSQL in 1998 to name his lightweight, open-source relational database that did not expose the standard SQL interface.[2] Strozzi suggests that, because the current NoSQL movement “departs from the relational model altogether, it should therefore have been called more appropriately ‘NoREL’.”[3]

Eric Evans (then a Rackspace employee) reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases.[4] The name attempted to label the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide atomicity, consistency, isolation and durability guarantees that are key attributes of classic relational database systems.[5]

Taxonomy

There have been various approaches to classify NoSQL databases, each with different categories and subcategories. Because of the variety of approaches and overlaps it is difficult to get and maintain an overview of non-relational databases. Nevertheless, the basic classification that most would agree on is based on data model. A few of these and their prototypes are:

  • Column: HBase, Accumulo, Cassandra
  • Document: MarkLogic, MongoDB, Couchbase
  • Key-value: Dynamo, Riak, Redis, MemcacheDB, Project Voldemort
  • Graph: Neo4J, OrientDB, Allegro, Virtuoso
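To make the data-model distinction concrete, here is an illustrative sketch (not tied to any specific product, all field names hypothetical) of how the same “user” record might be shaped in three of these model families:

```python
import json

# Key-value: the store sees only an opaque blob under a key; the
# application is responsible for interpreting the value.
kv_store = {"user:42": b'{"name": "Ada", "email": "ada@example.com"}'}

# Document: the store understands the structure and can query inside it.
document = {
    "_id": 42,
    "name": "Ada",
    "email": "ada@example.com",
    "orders": [{"sku": "A-1", "qty": 2}],  # nested data lives in one record
}

# Column-family: a row key groups named columns under column families.
column_family_row = {
    "row_key": "user:42",
    "profile": {"name": "Ada", "email": "ada@example.com"},
    "orders": {"A-1": 2},
}

# The key-value store can only fetch the whole blob and decode it:
profile = json.loads(kv_store["user:42"])
```

Graph databases are the outlier in this list: instead of nesting data under a key, they model it as nodes and edges, which is why they appear again later under data models with complex relationship structures.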

Reduce Development Drag
A lot of effort in application development is tied up in working with relational databases. Although object/relational mapping frameworks have eased the load, the database is still a significant source of developer hours.
Often we can reduce this effort by choosing an alternative database that’s more suited to the problem domain.
We often come across projects that use relational databases because they are the default, not because they are the best choice for the job. Often they are paying a cost, in developer time and execution performance, for features they do not use.

Embrace Large Scale
The large scale clusters that we can support with NoSQL databases allow us to store larger datasets (people are talking about petabytes these days) to process large amounts of analytic data.
Alternative data models also allow us to carry out many tasks more efficiently, allowing us to tackle problems that we would have balked at when using only relational databases
McLaren
Streaming of telemetry data into MongoDB for later analysis. Orders of magnitude faster than their relational option (SQL Server).
Guardian
New functionality uses MongoDB rather than a relational DB. They found MongoDB’s document data model significantly easier to interact with for their kind of application.
Danish Health Care
Centralized record of drug prescriptions. Previously held in MySQL databases, but concerned about scale for both response time and availability. Migrated the data to Riak.
Voter Search
Searching 300 million voters’ information for one person, with addresses, emails, and phones, is tough with a relational data store. MongoDB was used to store the documents about each person.

Polyglot Persistence?

Polyglot persistence means using multiple data storage technologies, chosen based upon the way data is being used by individual applications. Why store binary images in a relational database when there are better storage systems?
Polyglot persistence will occur across the enterprise as different applications use different data storage technologies. It will also occur within a single application as different parts of an application’s data store have different access characteristics.

There will still be large amounts of data managed in relational stores, but increasingly we’ll first ask how we want to manipulate the data and only then figure out what technology is the best bet for it.

This polyglot effect will be apparent even within a single application. A complex enterprise application uses different kinds of data, and already usually integrates information from different sources. Increasingly we’ll see such applications manage their own data using different technologies depending on how the data is used. This trend will be complementary to the trend of breaking up application code into separate components integrating through web services. A component boundary is a good way to wrap a particular storage technology chosen for the way its data is manipulated.

This will come at a cost in complexity. Each data storage mechanism introduces a new interface to be learned. Furthermore data storage is usually a performance bottleneck, so you have to understand a lot about how the technology works to get decent speed. Using the right persistence technology will make this easier, but the challenge won’t go away.

Many of these NoSQL options involve running on large clusters. This introduces not just a different data model, but a whole range of new questions about consistency and availability. The transactional single point of truth will no longer hold sway (although its role as such has often been illusory).

So polyglot persistence will come at a cost – but it will come because the benefits are worth it. When relational databases are used inappropriately, they exert a significant drag on application development. One application we came across was essentially composing and serving web pages: it only looked up page elements by ID, had no need for transactions, and no need to share its database. A problem like this is much better suited to a key-value store than the corporate relational hammer the team had to use. A good public example of using the right NoSQL choice for the job is The Guardian, who have felt a definite productivity gain from using MongoDB over their previous relational option.

Another benefit comes in running over a cluster. Scaling to lots of traffic gets harder and harder to do with vertical scaling – a fact we’ve known for a long time. Many NoSQL databases are designed to operate over clusters and can tackle larger volumes of traffic and data than is realistic with a single server. As enterprises look to use data more, this kind of scaling will become increasingly important. The Danish medication system described at GOTO Aarhus 2011 was a good example of this.

All of this leads to a big change, but it won’t be a rapid one – companies are naturally conservative when it comes to their data storage.

Why NoSQL?
  • Relational databases have been a successful technology for twenty years, providing persistence, concurrency control, and an integration mechanism.
  • Application developers have been frustrated with the impedance mismatch between the relational model and the in-memory data structures.
  • There is a movement away from using databases as integration points towards encapsulating databases within applications and integrating through services.
  • The vital factor for a change in data storage was the need to support large volumes of data by running on clusters. Relational databases are not designed to run efficiently on clusters.
  • NoSQL is an accidental neologism. There is no prescriptive definition—all you can make is an observation of common characteristics.
  • The common characteristics of NoSQL databases are
    • Not using the relational model
    • Running well on clusters
    • Open-source
    • Built for the 21st century web estates
    • Schemaless
  • The most important result of the rise of NoSQL is Polyglot Persistence.

Aggregate Data Models

  • An aggregate is a collection of data that we interact with as a unit. Aggregates form the boundaries for ACID operations with the database.
  • Key-value, document, and column-family databases can all be seen as forms of aggregate-oriented database.
  • Aggregates make it easier for the database to manage data storage over clusters.
  • Aggregate-oriented databases work best when most data interaction is done with the same aggregate; aggregate-ignorant databases are better when interactions use data organized in many different formations.
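The aggregate idea can be illustrated with a hypothetical “order” record (all field names are illustrative, not from any particular product): everything needed for one unit of work is stored and retrieved together, rather than spread over orders, line-items, and payments tables that are joined at query time.

```python
# One aggregate: the whole order travels as a single unit, which is what
# lets the database keep it on one node and apply ACID updates to it.
order_aggregate = {
    "order_id": 1001,
    "customer_id": 42,
    "line_items": [
        {"sku": "book-1", "qty": 1, "price_cents": 1999},
        {"sku": "pen-3", "qty": 4, "price_cents": 250},
    ],
    "payment": {"method": "card", "billing_zip": "12345"},
}

def order_total_cents(order: dict) -> int:
    """Total computed entirely from data inside one aggregate -- no joins."""
    return sum(i["qty"] * i["price_cents"] for i in order["line_items"])
```

The flip side, noted above, is that a query organized differently (say, “all orders containing sku pen-3”) cuts across aggregates and is harder for an aggregate-oriented store to answer.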

More Details on Data Models

  • Aggregate-oriented databases make inter-aggregate relationships more difficult to handle than intra-aggregate relationships.
  • Graph databases organize data into node and edge graphs; they work best for data that has complex relationship structures.
  • Schemaless databases allow you to freely add fields to records, but there is usually an implicit schema expected by users of the data.
  • Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations.

Distribution Models

  • There are two styles of distributing data:
    • Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data.
    • Replication copies data across multiple servers, so each bit of data can be found in multiple places.

    A system may use either or both techniques.

  • Replication comes in two forms:
    • Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads.
    • Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.

    Master-slave replication reduces the chance of update conflicts but peer-to-peer replication avoids loading all writes onto a single point of failure.

Consistency

  • Write-write conflicts occur when two clients try to write the same data at the same time. Read-write conflicts occur when one client reads inconsistent data in the middle of another client’s write.
  • Pessimistic approaches lock data records to prevent conflicts. Optimistic approaches detect conflicts and fix them.
  • Distributed systems see read-write conflicts due to some nodes having received updates while other nodes have not. Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.
  • Clients usually want read-your-writes consistency, which means a client can write and then immediately read the new value. This can be difficult if the read and the write happen on different nodes.
  • To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade off consistency versus latency.
  • The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency.
  • Durability can also be traded off against latency, particularly if you want to survive failures with replicated data.
  • You do not need to contact all replicas to preserve strong consistency with replication; you just need a large enough quorum.
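The quorum condition in the last bullet can be written down directly: with N replicas, a write quorum W and a read quorum R are “large enough” whenever R + W > N, because the two quorums must then overlap in at least one replica that has seen the latest write.

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Quorum rule: reads see the latest write whenever the read quorum
    and write quorum must overlap, i.e. R + W > N."""
    return r + w > n

# With N = 3 replicas, W = 2 and R = 2 overlap in at least one node,
# so strong consistency holds without contacting all three replicas
# on either operation.
```

Lowering W or R below the quorum threshold trades that consistency guarantee for lower latency, which is exactly the consistency-versus-latency trade-off described above.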

Version Stamps

  • Version stamps help you detect concurrency conflicts. When you read data, then update it, you can check the version stamp to ensure nobody updated the data between your read and write.
  • Version stamps can be implemented using counters, GUIDs, content hashes, timestamps, or a combination of these.
  • With distributed systems, a vector of version stamps allows you to detect when different nodes have conflicting updates.
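A sketch of how a vector of version stamps detects conflicts, assuming each stamp is a map from node name to an update counter (a simplified vector clock; field names are illustrative):

```python
def compare(v1: dict, v2: dict) -> str:
    """Compare two vector stamps {node: counter}. Returns 'descends' if
    v1 includes all of v2's history, 'precedes' if the reverse holds,
    'equal' if identical, or 'conflict' for concurrent updates."""
    nodes = set(v1) | set(v2)
    greater = any(v1.get(n, 0) > v2.get(n, 0) for n in nodes)
    lesser = any(v1.get(n, 0) < v2.get(n, 0) for n in nodes)
    if greater and lesser:
        return "conflict"   # neither stamp descends from the other
    if greater:
        return "descends"
    if lesser:
        return "precedes"
    return "equal"
```

A “conflict” result means two nodes updated the data independently, and the application (or the database) must merge or pick a winner.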

Map-Reduce

  • Map-reduce is a pattern to allow computations to be parallelized over a cluster.
  • The map task reads data from an aggregate and boils it down to relevant key-value pairs. Maps only read a single record at a time and can thus be parallelized and run on the node that stores the record.
  • Reduce tasks take many values for a single key output from map tasks and summarize them into a single output. Each reducer operates on the result of a single key, so it can be parallelized by key.
  • Reducers that have the same form for input and output can be combined into pipelines. This improves parallelism and reduces the amount of data to be transferred.
  • Map-reduce operations can be composed into pipelines where the output of one reduce is the input to another operation’s map.
  • If the result of a map-reduce computation is widely used, it can be stored as a materialized view.
  • Materialized views can be updated through incremental map-reduce operations that only compute changes to the view instead of recomputing everything from scratch.
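The map and reduce roles described above can be sketched with the classic word-count example, run single-process here for clarity (in a real cluster, each map call would run on the node storing its record, and reducers would be parallelized by key):

```python
from collections import defaultdict

def map_task(record: str):
    """Map: read one record and boil it down to key-value pairs."""
    for word in record.split():
        yield word.lower(), 1

def reduce_task(key, values):
    """Reduce: summarize all values for one key into a single output."""
    return key, sum(values)

def map_reduce(records):
    # Shuffle phase: group map output by key so each reducer
    # operates on the values for exactly one key.
    groups = defaultdict(list)
    for record in records:              # each map call is independent,
        for k, v in map_task(record):   # so maps can run in parallel
            groups[k].append(v)
    return dict(reduce_task(k, vs) for k, vs in groups.items())
```

Because `reduce_task` takes counts in and produces a count out, its outputs could themselves be re-reduced, which is what makes the pipeline composition mentioned above possible.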

Schema Migrations

  • Databases with strong schemas, such as relational databases, can be migrated by saving each schema change, plus its data migration, in a version-controlled sequence.
  • Schemaless databases still need careful migration due to the implicit schema in any code that accesses the data.
  • Schemaless databases can use the same migration techniques as databases with strong schemas.
  • Schemaless databases can also read data in a way that’s tolerant to changes in the data’s implicit schema and use incremental migration to update data.
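One way to realize the tolerant-reader and incremental-migration ideas is to upgrade old-format records as they are read, instead of rewriting the whole database at once. A sketch, with an entirely hypothetical customer record and version field:

```python
CURRENT_VERSION = 2

def read_customer(doc: dict) -> dict:
    """Incremental migration on read: records written under the old
    implicit schema are upgraded to the current one when touched.
    Field names here are illustrative, not from any real system."""
    version = doc.get("_schema", 1)
    if version < 2:
        # v1 stored a single "name" field; v2 splits it in two.
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
        doc["_schema"] = CURRENT_VERSION
    return doc
```

Writing the upgraded record back after reading gradually migrates the data, while code that only reads stays tolerant of both formats.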

Polyglot Persistence

  • Polyglot persistence is about using different data storage technologies to handle varying data storage needs.
  • Polyglot persistence can apply across an enterprise or within a single application.
  • Encapsulating data access into services reduces the impact of data storage choices on other parts of a system.
  • Adding more data storage technologies increases complexity in programming and operations, so the advantages of a good data storage fit need to be weighed against this complexity.

Beyond NoSQL

  • NoSQL is just one set of data storage technologies. As comfort with polyglot persistence increases, we should consider other data storage technologies whether or not they bear the NoSQL label.

Choosing Your Database

  • The two main reasons to use NoSQL technology are:
    • To improve programmer productivity by using a database that better matches an application’s needs.
    • To improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput.
  • It’s essential to test your expectations about programmer productivity and/or performance before committing to using a NoSQL technology.
  • Service encapsulation supports changing data storage technologies as needs and technology evolve. Separating parts of applications into services also allows you to introduce NoSQL into an existing application.
  • Most applications, particularly nonstrategic ones, should stick with relational technology—at least until the NoSQL ecosystem becomes more mature.

NoSQL databases can be run on-premises, but are also often run on IaaS or PaaS platforms like Amazon Web Services, Rackspace or Heroku. There are three common deployment models for NoSQL on the cloud:

  • Virtual machine image – cloud platforms allow users to rent virtual machine instances for a limited time. It is possible to run a NoSQL database on these virtual machines. Users can upload their own machine image with a database installed on it, use ready-made machine images that already include an optimized installation of a database, or install the NoSQL database on a running machine instance.
  • Database as a service – some cloud platforms offer options for using familiar NoSQL database products as a service, such as MongoDB, Redis and Cassandra, without physically launching a virtual machine instance for the database. The database is provided as a managed service, meaning that application owners do not have to install and maintain the database on their own, and pay according to usage. Some database as a service providers provide additional features, such as clustering or high availability, that are not available in the on-premise version of the database (see the table below for several examples).
  • Native cloud NoSQL databases – some providers offer a NoSQL database service which is available only on the cloud. A well-known example is Amazon’s SimpleDB, a simple NoSQL key-value store. SimpleDB cannot be installed on a local machine and cannot be used on any cloud platform except Amazon’s.

Notable examples of NoSQL databases available on the cloud in each of these deployment models:

Virtual machine image – MongoDB
  • Provider: MongoDB – machine images for Amazon EC2[14] and Windows Azure[15]
  • Cloud-specific features: none
  • Pricing: database and machine image are open source; Amazon/Azure instances are pay per use

Virtual machine image – Redis
  • Provider: Redis – standard open-source installation; script for installation on Amazon EC2[16]; recommended installation on Windows Azure[17]
  • Cloud-specific features: none
  • Pricing: database and machine image are open source; Amazon/Azure instances are pay per use

Virtual machine image – Cassandra
  • Provider: Apache Cassandra – machine image for Amazon EC2[18]
  • Cloud-specific features: none
  • Pricing: database and machine image are open source; Amazon instances are pay per use

Database as a service – MongoDB
  • Provider: MongoLab[19] – available on Amazon, Google, Joyent, Rackspace and Windows Azure
  • Cloud-specific features: managed service, high availability, automatic failover, pre-configured clustering
  • Pricing: free up to 500 MB on disk[20]; paid plans based on architecture and storage size

Database as a service – Redis/Memcached
  • Provider: Amazon Web Services – ElastiCache[21]
  • Cloud-specific features: managed service, automatic healing of failed nodes, resilience against overloaded databases, performance monitoring
  • Pricing: free for 750 hours on a micro instance[22]; pay per use for machine utilization, with no separate charge for data usage[23]

Database as a service – Redis
  • Provider: RedisToGo[24] – available on Amazon EC2, Rackspace, Heroku, AppHarbor, Orchestra
  • Cloud-specific features: managed service, daily backups, API enabling creation, deletion, or download of Redis instances
  • Pricing: free up to 5 MB of memory; paid plans based on memory usage

Database as a service – Redis
  • Provider: Redis Cloud (Garantia Data)[25] – available on Amazon EC2, Windows Azure, Heroku, Cloud Foundry, OpenShift, AppFog, AppHarbor
  • Cloud-specific features: managed service, automatic scaling with unlimited Redis nodes, high availability, built-in clustering
  • Pricing: free up to 25 MB of memory[26]; pay per use

Database as a service – Cassandra
  • Provider: InstaClustr[27] – available on Amazon EC2, Rackspace, Windows Azure, Joyent, Google Compute Engine
  • Cloud-specific features: managed service, performance tuning, monitoring, automated backups, DataStax OpsCenter for cluster administration
  • Pricing: paid plans based on disk storage, memory usage and CPU cores[28]

Native cloud NoSQL database – Amazon SimpleDB
  • Provider: Amazon Web Services
  • Cloud-specific features: managed service, high availability, unlimited scale, data durability
  • Pricing: free for 750 hours on a micro instance[29]; pay per use, with separate charges for machine utilization and data usage[29]

Native cloud NoSQL database – Google App Engine Datastore[30]
  • Provider: Google
  • Cloud-specific features: no planned downtime, atomic transactions, high availability of reads and writes
  • Pricing: free with a quota system limiting instance hours, storage and throughput[31]; pay per use based on instance hours, storage, throughput and other parameters

Native cloud NoSQL database – Salesforce Database.com[32]
  • Provider: Salesforce
  • Cloud-specific features: unlimited scale, access to Salesforce metadata, social API, support for mobile clients, multi-tenancy
  • Pricing: free up to 100K records and 50K transactions[33]; pay per use based on users, number of records and transactions