A report on Hadoop

A report on Hadoop

Takeaway: Hadoop has been helping analyze data for years now, but there are probably more than a few things you don’t know about it.

7 Things to Know About Hadoop

Source: Pressureua/Dreamstime.com

What is Hadoop? It’s a yellow toy elephant. Not what you were expecting? How about this: Doug Cutting – co-creator of this open-source software project – borrowed the name from his son who happened to call his toy elephant Hadoop. In a nutshell, Hadoop is a software framework developed by the Apache Software Foundation that’s used to develop data-intensive, distributed computing. And it’s a key component in another buzzword readers can never seem to get enough of: big data. Here are seven things you should know about this unique, freely licensed software.

How did Hadoop get its start?

Twelve years ago, Google built a platform to manipulate the massive amounts of data it was collecting. Like the company often does, Google made its design available to the public in the form of two papers: Google File System and MapReduce.

At the same time, Doug Cutting and Mike Cafarella were working on Nutch, a new search engine. The two were also struggling with how to handle large amounts of data. Then the two researchers got wind of Google’s papers. That fortunate intersection changed everything by introducing Cutting and Cafarella to a better file system and a way to keep track of the data, eventually leading to the creation of Hadoop.

What is so important about Hadoop?

Today, collecting data is easier than ever. Having all this data presents many opportunities, but there are challenges as well:

  • Massive amounts of data require new methods of processing.
  • The data being captured is in an unstructured format.

To overcome the challenges of manipulating immense quantities of unstructured data, Cutting and Cafarella came up with a two-part solution. To solve the data-quantity problem, Hadoop employs a distributed environment – a network of commodity servers – creating a parallel processing cluster, which brings more processing power to bear on the assigned task.

Next, they had to tackle unstructured data or data in formats that standard relational database systems were unable to handle. Cutting and Cafarella designed Hadoop to work with any type of data: structured, unstructured, images, audio files, even text.  Cloudera (Hadoop integrator) white paper explains why this is important:

    “By making all your data usable, not just what’s in your databases, Hadoop lets you uncover hidden relationships and reveals answers that have always been just out of reach. You can start making more decisions based on hard data, instead of hunches, and look at complete data sets, not just samples and summaries.”

What is Schema on read?

As was mentioned earlier, one of the advantages of Hadoop is its ability to handle unstructured data. In a sense, that is “kicking the can down the road.” Eventually the data needs some kind of structure in order to analyze it.

That is where schema on read comes into play. Schema at read is the melding of what format the data is in, where to find the data (remember the data is scattered among several servers), and what’s to be done to the data – not a simple task. It’s been said that manipulating data in a Hadoop system requires the skills of a business analyst, a statistician and a Java programmer. Unfortunately, there aren’t many people with those qualifications.

What is Hive?

If Hadoop was going to succeed, working with the data had to be simplified. So, the open-source crowd got to work and created Hive:

    “Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.”

Hive enables the best of both worlds: database personnel familiar with SQL commands can manipulate the data, and developers familiar with the schema on read process are still able to create customized queries.

Apace Hive is a data warehouse system that is often used with an open-source analytics platform called Hadoop. Hadoop has become a popular way to aggregate and refine data for businesses. Hadoop users may use tools like Apache Spark or MapReduce to compile data in precise ways before storing it in a file handling system called HDFS. From there, the data can go into Apache Hive for central storage.

Techopedia explains Apache Hive

Apache Hive and other data warehouse designs are the central repositories for data and play important roles in a company’s IT setup. They need to have specific goals for data retrieval, security and more.

Apache Hive has a language called HiveQL, which shares some features with the commonly popular SQL language for data retrieval. It also supports metadata storage in an associated database.

Apache Spark is an open-source program used for data analytics. It’s part of a greater set of tools, including Apache Hadoop and other open-source resources for today’s analytics community.

Experts describe this relatively new open-source software as a data analytics cluster computing tool. It can be used with the Hadoop Distributed File System (HDFS), which is a particular Hadoop component that facilitates complicated file handling.

Some IT pros describe the use of Apache Spark as a potential substitute for the Apache Hadoop MapReduce component. MapReduce is also a clustering tool that helps developers process large sets of data. Those who understand the design of Apache Spark point out that it can be many times faster than MapReduce, in some situations.

Those reporting on the modern use of Apache Spark show that companies are using it in various ways. One common use is for aggregating data and structuring it in more refined ways. Apache Spark can also be helpful with analytics machine-learning work or data classification.

Typically, organizations face the challenge of refining data in an efficient and somewhat automated way, where Apache Spark may be used for these kinds of tasks. Some also imply that using Spark can help provide access to those who are less knowledgeable about programming and want to get involved in analytics handling.

Apache Spark includes APIs for Python and related software languages.

Apache HBase is a specific kind of database tool written in Java and used with elements of the Apache software foundation’s Hadoop suite of big data analysis tools. Apache HBase is an open source product, like other elements of Apache Hadoop. It represents one of several database tools for the input and output of large data sets that are crunched by Hadoop and its various utilities and resources.

Apache HBase is a distributed non-relational database, which means that it doesn’t store information in the same way as a traditional relatable database setup. Developers and engineers run data from Apache HBase to and from Hadoop tools like MapReduce for data analysis. The Apache community promotes Apache HBase as a way to get direct access to big data sets. Experts point out that HBase is based on something called Google BigTable, a distributed storage system.

Some of the popular features of Apache HBase include some kinds of backup and failover support, as well as APIs for popular programming languages. Its compatibility with the greater Hadoop system makes it a candidate for many kinds of big data management problems in enterprise

What kind of data does Hadoop analyze?

Web analytics is the first thing that comes to mind, analyzing Web logs and Web traffic in order to optimize websites. Facebook, for example, is definitely into Web analytics, using Hadoop to sort through the terabytes of data the company accumulates.

Companies use Hadoop clusters to perform risk analysis, fraud detection and customer-base segmentation. Utility companies use Hadoop to analyze sensor data from their electrical grid, allowing them to optimize the production of electricity. An major companies such as Target, 3M and Medtronics use Hadoop to optimize product distribution, business risk assessments and customer-base segmentation.

Universities are invested in Hadoop too. Brad Rubin, an associate professor at the University of St. Thomas Graduate Programs in Software, mentioned that his Hadoop expertise is helping sort through the copious amounts of data compiled by research groups at the university.

Can you give a real-world example of Hadoop?

One of the better-known examples is the TimesMachine. The New York Times has a collection of full-page newspaper TIFF images, associated metadata, and article text from 1851 through 1922 amounting to terabytes of data. NYT’s Derek Gottfrid, using anEC2/S3/Hadoop system and specialized code,:

    “Ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 xml files mapping articles to rectangular regions in the TIFFs. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files.”

Using servers in the Amazon Web Services cloud, Gottfrid mentioned they were able to process all the data required for the TimesMachine in less than 36 hours.

Is Hadoop already obsolete or just morphing?

Hadoop has been around for over a decade now. That has many saying it’s obsolete. One expert, Dr. David Rico, has said that “IT products are short-lived. In dog years, Google’s products are about 70, while Hadoop is 56.”

There may be some truth to what Rico says. It appears that Hadoop is going through a major overhaul. To learn more about it, Rubin invited researchers to a Twin Cities Hadoop User Group meeting, and the topic of discussion was Introduction to YARN:

      “Apache Hadoop 2 includes a new MapReduce engine, which has a number of advantages over the previous implementation, including better scalability and resource utilization. The new implementation is built on a general resource management system for running distributed applications called


      Hadoop gets a lot of buzz in database and content management circles, but there are still many questions around it and how it can best be used.
Apache Slider is a new code base for the Hadoop data analytics tool set or ‘suite’ licensed by the Apache software foundation. This project should be released in the second half of 2014 and will help users to apply Hadoop and the YARN resource management tool to various goals and objectives.

Techopedia explains Apache Slider

Experts explain that Apache Slider will help to extend the reach of what Hadoop and YARN can do by allowing certain kinds of databases to run unmodified in the YARN resource management environment.
YARN is an existing Hadoop resource that focuses on resource management and complements other tools like MapReduce and the Hadoop HDFS file handling system. Apache Slider will make more different types of programs compatible with YARN and extend the ‘case uses’ that are possible.
Instead of modifying existing applications, say experts, Apache Slider will allow for a much broader and diversified application of database and data analytics platforms to Hadoop’s core software resources. Using Apache slider may also improve the efficiency of memory and processing resources for an entire project.
Another way to explain the use of Apache Slider and its development is that it can help YARN to eventually become the central software or “operating system” for a corporate data warehouse or other data center. For instance, tools like Apache HBase and Hive are often used in enterprise environments. Making these more compatible with Hadoop YARN can have some real impact on business process efficiency.
DWH and Hadoop
Big data analytics, advanced analytics (i.e., data mining, statistical analysis, complex SQL, and natural language processing), and discovery analytics benefit from Hadoop. HDFS and other Hadoop tools promise to extend and improve some areas within data warehouse architectures:
several DW teams that have consolidated and migrated their staging area(s) onto HDFS to take advantage of its low cost, linear scalability, facility with file-based data, and ability to manage unstructured data. Users who prefer to hand-code most of their ETL solutions will most likely feel at home in code-intense environments such as Apache MapReduce, Pig, and Hive.
They may even be able to refactor existing code to run there. For users who prefer to build their ETL solutions atop a vendor tool, the community of vendors for ETL and other data management tools is rolling out new interfaces and functions for the entire Hadoop product family.
Data archiving. When organizations embrace forms of advanced analytics that require detailed source data, they amass large volumes and retain most of the data over time, which taxes areas of the DW architecture where source data is stored. Storing terabytes of source data in the core EDW’s RDBMS can be prohibitively expensive, which is why many organizations have moved such data to less expensive satellite systems within their extended DW environments.
Similar to migrating staging areas to HDFS, some organizations are migrating their stores of source data and other archives to HDFS. This lowers the cost of archives and analytics while providing greater capacity.
Multi-structured data. : Relatively few organizations are currently getting BI value from semi- and unstructured data, despite years of wishing for it. HDFS can be a special place within your DW environment for managing and processing semi-structured and unstructured data. Hadoop users are finding this approach more successful than stretching an RDBMS-based DW platform to handle data types it was not designed for.
One of Hadoop’s strongest complements to a DW is its handling of semi- and unstructured data, but don’t go thinking that Hadoop is only for unstructured data: HDFS handles the full range of data, including structured forms. In fact, Hadoop can manage and process just about any data you can store in a file and copy into HDFS.
Processing flexibility. Given its ability to manage diverse multi-structured data, as just described, Hadoop’s NoSQL approach is a natural framework for manipulating nontraditional data types. Note that these data types are often free of schema or metadata, which makes them challenging for most vendor brands of SQL-based RDBMSs, although a few have functions for deducing, creating, and applying schema as needed. Hadoop supports a variety of programming languages (Java, R, C), thus providing more capabilities than SQL alone can offer. Again, a few RDBMSs support these same languages as a complement to SQL.
In addition, Hadoop enables the growing practice of “late binding.” With ETL for data warehousing, data is processed, standardized, aggregated, and remodeled before entering the data warehouse environment; this imposes an a priori structure on the data, which is appropriate for known reports, but limits the scope of analytic repurposing later. Data entering HDFS is typically processed lightly or not at all to avoid limiting its future applications. Instead, Hadoop data is processed and restructured at run time, so it can flexibly enable the open-ended data exploration and discovery analytics that many users are looking for today.
Hadoop and RDBMSs are complementary and should be used together
Hadoop’s help for data warehouse environments is limited to a few areas. Luckily, most of
Hadoop’s strengths are in areas where most warehouses and BI technology stacks are weak, such as unstructured data, very large data sets, non-SQL algorithmic analytics, and the flood of files that is drowning many DW environments. Conversely, Hadoop’s limitations are mostly met by mature functionality available today from a wide range of RDBMS types (OLTP databases, columnar databases, DW appliances, etc.), plus administrative tools. In that context, Hadoop and the average RDBMS-based data warehouse are complementary (despite some overlap), which results in a fortuitous synergy when the two are integrated.
The trick, of course, is making HDFS and an RDBMS work together optimally. To that end, one of the critical success factors for assimilating Hadoop into evolving data warehouse architectures is the improvement of interfaces and interoperability between HDFS and RDBMSs. Luckily, this is well under way due to efforts from software vendors and the open source community. Technical users are starting to leverage HDFS/RDBMS integration.
For example, an emerging best practice among DW professionals with Hadoop experience is to manage diverse big data in HDFS, but process it and move the results (via ETL or other data integration media) to RDBMSs (elsewhere in the DW architecture), which are more conducive to SQL-based analytics. Hence, HDFS serves as a massive data staging area and archive.
A similar best practice is to use an RDBMS as a front end to HDFS data; this way, data is moved via distributed queries (whether ad hoc or standardized), not via ETL jobs. HDFS serves as a large, diverse operational data store, whereas the RDBMS serves as a user-friendly semantic layer that makes HDFS data look relational.
Actian Corporation has accumulated a fairly comprehensive portfolio of platforms and tools for managing analytics, big data, and all other enterprise data, encompassing the full range of structured, semi-structured, and unstructured data and content types. The new Actian Analytics Platform includes connectivity to more than 200 sources, a visual framework that simplifies ETL and data science, high-performance analytic engines, and libraries of analytic functions.
The Actian Analytics Platform centers on Matrix (a massively parallel columnar RDBMS formerly called ParAccel) and Vector (a single-node RDBMS optimized for BI). Actian DataFlow accelerates ETL natively on Hadoop. Actian Analytics includes more than 500 analytic functions ready to run in-database or on Hadoop. Actian DataConnect connects and enriches data from over 200 sources on-premises or in the cloud. The Actian platform is integrated by a modular framework that enables users to quickly connect to all data assets for open-ended analytics with linear scalability.
Strategic partnerships include Hortonworks (for HDFS and YARN), Attivio (for big content), and a number of contributors to the Actian Analytics library.
Cloudera is a leading provider of Apache Hadoop–based software, services, and training, enabling Cloudera data-driven organizations to derive business value from all their data while simultaneously reducing  the costs of data management. CDH (Cloudera’s distribution including Apache Hadoop) is a  comprehensive, tested, and stable distribution of Hadoop that is widely deployed in commercial and  non-commercial environments. Organizations can subscribe to Cloudera Enterprise—comprising  CDH, Cloudera Support, and the Cloudera Manager—to simplify and reduce the cost of Hadoop configuration, rollout, upgrades, and administration. Cloudera also provides Cloudera Enterprise  Real-Time Query (RTQ), powered by Cloudera Impala, the first low-latency SQL query engine that  runs directly over data in HDFS and HBase. Cloudera Search increases data ROI by offering non- technical resources a common and everyday method for accessing and querying large, disparate big  data stores of mixed format and structure managed in Hadoop. As a major contributor to the Apache  open source community, with customers in every industry, and a massive partner program,  Cloudera’s big data expertise is profound.
Datawatch Corporation provides a visual data discovery and analytics solution that optimizes any data—regardless of its variety, volume, or velocity—to reveal valuable insights for improving business  decisions. Datawatch has a unique ability to integrate structured, unstructured, and semi-structured  sources—such as reports, PDF files, print spools, and EDI streams—with real-time data streams  from CEP engines, tick feeds, or machinery and sensors into visually rich analytic applications,  which enable users to dynamically discover key factors about any operational aspect of their business.
Datawatch steps users through data access, exploration, discovery, analysis, and delivery, all in a  unified and easy-to-use tool called Visual Data Discovery, which integrates with existing BI and big  data platforms. IT’s involvement is minimal in that IT sets up data connectivity; most users can  create their own reports and analyses, then publish them for colleagues to share in a self-service  fashion. The solution is suitable for a single analyst, a department, or an enterprise. Regardless of user  type, whether business or technical or both, all benefit from the high ease of use, productivity, and  speed to insight that Datawatch’s real-time data visualization delivers.
Dell Software For years, Dell Software has been acquiring and building software tools (plus partnering with leading vendors for more tools) with the goal of assembling a comprehensive portfolio of IT administration tools for securing and managing networks, applications, systems, endpoints, devices, and data. Within that portfolio, Dell Software now offers a range of tools specifically for data management, with a focus on big data and analytics. For example, Toad Data Point provides interfaces and administrative functions for most traditional databases and packaged applications, plus new big data platforms such as Hadoop, MongoDB, Cassandra, SimpleDB, and Azure. Spotlight is a DBA tool for monitoring DBMS health and benchmarking. Shareplex supports Oracle-to-Oracle replication today, and will soon support Hadoop. Kitenga Big Data Analytics enables rapid
transformation of diverse unstructured data into actionable insights. Boomi MDM launched in 2013. The new Toad BI Suite pulls these tools together to span the entire information life cycle of big data and analytics. After all, the goal of Dell Software is: one vendor, one tool chain, all data.
HP HP Vertica provides solutions to big data challenges. The HP Vertica Analytics Platform was purpose-built for advanced analytics against big data. It consists of a massively parallel database with columnar support, plus an extensible analytics framework optimized for the real-time analysis of data. It is known for high performance with very complex analytic queries against multi-terabyte data sets.
Vertica offers advantages over SQL-on-Hadoop analytics, shortening some queries from days to minutes. Although SQL is the primary query language, Vertica also supports Java, R, and C.
Furthermore, the HP Vertica Flex Zone feature enables users to define and apply schema during query and analysis, thereby avoiding the need to prepocess data or deploy Hadoop or NoSQL platforms for schema-free data.
HP Vertica is part of HP’s new HAVEn platform, which integrates multiple products and services into a comprehensive big data platform that provides end-to-end information management for a wide range of structured and unstructured data domains. To simplify and accelerate the deployment of an analytic solution, HP offers the HP ConvergedSystem 300 for Vertica—a pre-built and pre-tested turn-key appliance.
MapR provides a complete distribution for Apache Hadoop, which is deployed at thousands of organizations globally for production, data-driven applications. MapR focuses on extending and advancing Hadoop, MapReduce, and NoSQL products and technologies to make them more feature rich, user friendly, dependable, and conducive to production IT environments. For example, MapR is spearheading the development of Apache Drill, which will bring ANSI SQL capabilities to Hadoop in the form of low-latency, interactive query capabilities for both structured and schema-free, nested data. As other examples, MapR is the first Hadoop distribution to integrate enterprise-grade search;
MapR enables flexible security via support for Kerberos and native authentication; and MapR provides a plug-and-play architecture for integrating real-time stream computational engines such as Storm with Hadoop. For greater high availability, MapR provides snapshots for point-in-time data rollback and a No NameNode architecture that avoids single points of failure within the system and ensures there are no bottlenecks to cluster scalability. In addition, it’s fast; MapR set the Terasort, MinuteSort, and YCSB world records.

Mongo DB vs Couchbase

NoSQL showdown: MongoDB vs. Couchbase

Which NoSQL database has richer querying, indexing, and ease of use?

Document databases may be the most popular NoSQL database variant of them all. Their great flexibility — schemas can be grown or changed with remarkable ease — makes them suitable for a wide range of applications, and their object nature fits in well with current programming practices. In turn, Couchbase Server and MongoDB have become two of the more popular representatives of open source document databases, though Couchbase Server is a recent arrival among the ranks of document databases.

In this context, the word “document” does not mean a word processing file or a PDF. Rather, a document is a data structure defined as a collection of named fields. JSON (JavaScript Object Notation) is currently the most widely used notation for defining documents within document-oriented databases. JSON’s advantage as an object notation is that, once you comprehend its syntax — and JSON is remarkably easy to grasp — then you have all you need to define what amounts to the schema of a document database. That’s because, in a document database, each document carries its own schema — unlike an RDBMS, in which every row in a given table must have the same columns.

More about NoSQL

There’s more than one way to break the RDBMS mold, so start with an overview of the CAP theorem and the NoSQL variants seeking to resolve it. Then find out what makes MongoDB a favorite among developers, and read an in-depth review of Oracle’s distributed, key-value datastore.

The latest versions of Couchbase Server and MongoDB are both newly arrived. In December 2012, Couchbase released Couchbase Server 2.0, a version that makes Couchbase Server a full-fledged document database. Prior to that release, users could store JSON data into Couchbase, but the database wrote JSON data as a blob. Couchbase was, effectively, a key/value database.

10gen released MongoDB 2.4 just this week. MongoDB has been a document database from the get-go. This latest release incorporates numerous performance and usability enhancements.

Both databases are designed to run on commodity hardware, as well as for horizontal scaling via sharding (in Couchbase, the rough equivalent to a shard is called a partition). Both databases employ JSON as the document definition notation, though in MongoDB, the notation is BSON (Binary JSON), a binary-encoded superset of JSON that defines useful data types not found in JSON. While both databases employ JavaScript as the primary data manipulation language, both provide APIs for all the most popular programming languages to allow applications direct access to database operations.

Key differences
Of course there are differences. First, MongoDB’s handling of documents is better developed. This becomes most obvious in the mongo shell, which serves the dual purpose of providing a management and development window into a MongoDB database. Database, collections, and documents are first-class entities in the shell. Collections are actually properties on database objects.

This is not to say that Couchbase is hobbled. You can easily manage your Couchbase cluster — adding, deleting, and fetching documents — from the Couchbase Management GUI, for which MongoDB has no counterpart. Indeed, if you prefer management via GUI consoles, score one for Couchbase Server. If, however, you favor life at the command line, you will be tipped in MongoDB’s direction.

The cloud-based MongoDB Monitoring Service (MMS), which gathers statistics, is not meant to be a full-blown database management interface. But MongoDB’s environment provides a near seamless connection between the data objects abstracted in the mongo shell and the database entities they model. This is particularly apparent when you discover that MongoDB allows you to create indexes on a specific document field using a single function call, whereas indexes in Couchbase must be created by more complex mapreduce operations.

In addition, while Couchbase documents are described via JSON, MongoDB documents are described in BSON; the latter notation includes a richer number of useful data types, such as 32-bit and 64-bit integer types, date types, and byte arrays. Both support geospatial data and queries, but this support in Couchbase is currently in an experimental phase and likely won’t stay there long. New in version 2.4, MongoDB’s full text search capability is also integrated with the database. A similar capability is available in Couchbase Server, but requires a plug-in for the elasticsearch tool.

Both Couchbase Server and MongoDB provide data safety via replication, both within a cluster (where live documents are protected from loss by the invisible creation of replica documents) and outside of a cluster (through cross data-center replication). Also, both provide access parallelism through sharding. However, where both Couchbase and MongoDB support hash sharding, MongoDB supports range sharding and “tag” sharding. This is a two-edged sword. On the one hand, it puts a great deal of flexibility at a database administrator’s fingertips. On the other hand, its misuse can result in an imbalanced cluster.

Mapreduce is a key tool used in both Couchbase and MongoDB, but for different purposes. In MongoDB, mapreduce serves as the means of general data processing, information aggregating, and analytics. In Couchbase, it is the means of creating indexes for the purpose of querying data in the database. (We suspect that this, like the poorer document handling, is an effect of Couchbase’s only having recently morphed into a document database.) As a result, it’s easier to create indexes and perform ad hoc queries in MongoDB.

Couchbase’s full incorporation of Memcached has no counterpart in MongoDB, and Memcached is a powerful adjunct as general object caching system for high-throughput, data-intensive Internet and intranet applications. If your application needs a Memcache server with your database, then look no further than Couchbase.

In general, the two systems are neck-and-neck in terms of features provided, though the ways those features are implemented may differ. Further, the advantages that one might hold over the other will certainly come and go as development proceeds. Both provide database drivers and client frameworks in all the popular programming languages, both are open source, both are easily installed, and both enjoy plenty of online documentation and active community support. As is typical for such well-matched systems, the best advice anyone could give for determining one over the other will be that you install them both and try them out.

Couchbase Server vs. MongoDB at a glance
Document handling Couchbase’s document database characteristics were added to its existing key/value storage architecture with the version 2.0 release. MongoDB was designed as a document database from the get-go. MongoDB’s handling of documents is better developed.
Indexing With Couchbase, you use mapreduce to create additional indexes in the database; these accelerate queries. With MongoDB, you can specify that individual document fields be indexed. Mapreduce in MongoDB is used for general processing.
Memcached Couchbase includes a Memcached component that can operate independently (if you wish) from the document storage components. MongoDB has no counterpart.
Sharding Couchbase supports only hashed sharding. MongoDB supports hashed sharding and range sharding.
Geospatial MongoDB has geospatial capabilities. Couchbase does too, but they were added in version 2.0 and are considered experimental.

Couchbase Server

Couchbase promotes Couchbase Server as a solution for real-time access, not data warehousing. Nor is Couchbase Server suitable for batch-oriented analytic processing — it is designed to be an operational data store.

Though Couchbase Server is based on Apache CouchDB, it is more than CouchDB with incremental modifications. For starters, Couchbase is an amalgam of CouchDB and Memcached, the distributed, in-memory, key/value storage system. In fact, Couchbase can be used as a direct replacement for Memcached. The system provides a separate port that unmodified, legacy Memcached clients can use, as well as “smart SDK” and proxy tools that improve its performance as a Memcached server.

For example, you can use a “thick client” deployment model, which will place the continuously updated knowledge of Memcached node topology on the client tier. This speeds response, as any request for a particular Memcached object will be sent from the client directly to the caching node for that object. This thick-client approach also plays an important role in the Couchbase system’s resilience to node crashes (described later).

Couchbase includes its own object-level caching system based on Memcached, though with enhancements. For example, Couchbase tracks working sets (the documents most frequently accessed on a given node) in its object cache using NRU (not recently used) algorithms. All I/O operations act on this in-memory cache. Updates to documents in the cache are eventually persisted to disk. In addition, for updates, locking is employed at the document level — not at the node, database, or partition level (which would hobble throughput with numerous I/O waits), nor at the field level (which would snarl the system with memory and CPU cycles required to track the locks).

Couchbase accelerates access by using “append only” persistence. This is used not only with the data, but with indexes as well. Updated information is never overwritten; instead, it is appended to the end of whatever data structure is being modified. Further, deleted space is reclaimed by compaction, an operation that can be scheduled to take place during times of low activity. Append-only storage speeds updates and allows read operations to occur while writes are taking place.

Couchbase scaling and replication
To facilitate horizontal scaling, Couchbase uses hash sharding, which ensures that data is distributed uniformly across all nodes. The system defines 1,024 partitions (a fixed number), and once a document’s key is hashed into a specific partition, that’s where the document lives. In Couchbase Server, the key used for sharding is the document ID, a unique identifier automatically generated and attached to each document. Each partition is assigned to a specific node in the cluster. If nodes are added or removed, the system rebalances itself by migrating partitions from one node to another.

There is no single point of failure in a Couchbase system. All partition servers in a Couchbase cluster are equal, with each responsible for only that portion of the data assigned to it. Each server in a cluster runs two primary processes: a data manager and a cluster manager. The data manager handles the actual data in the partition, while the cluster manager deals primarily with intranode operations.

System resilience is enhanced by document replication. The cluster manager process coordinates the communication of replication data with remote nodes, and the data manager process shepherds whatever replica data the cluster has assigned to the local node. Naturally, replica partitions are distributed throughout the cluster so that the replica copy of a partition is never on the same physical server as the active partition.

Like the documents themselves, replicas exist on a bucket basis — a bucket being the primary unit of containment in Couchbase. Documents are placed into buckets, and documents in one bucket are isolated from documents in other buckets from the perspective of indexing and querying operations. When you create a new bucket, you are asked to specify the number of replicas (up to three) to create for that bucket. If a server crashes, the system will detect the crash, locate the replicas of the documents that lived on the crashed system, and promote those replicas to active status. The system maintains a cluster map, which defines the topology of the cluster, and this is updated in response to the crash.

Note that this scheme relies on thick clients — embodied in the API libraries that applications use to communicate with Couchbase — that are in constant communication with server nodes. These thick clients will fetch the updated cluster map, then reroute requests in response to the changed topology. In addition, the thick clients participate in load-balancing requests to the database. The work done to provide load balancing is actually distributed among the smart clients.

Changes in topology are coordinated by an orchestrator, which is a server node elected to be the single arbiter of cluster configuration changes. All topology changes are sent to all nodes in the cluster; even if the orchestrator node goes down, a new node can be elected to that position and system operation can continue uninterrupted.

Couchbase supports cross-data-center replication (XDCR), which provides live replication of database contents of one Couchbase cluster to a geographically remote cluster. Note that XDCR operates simultaneously with intracluster replication (the copying of live documents to their inactive replica counterparts on other cluster members), and all systems in an XDCR arrangement invisibly synchronize with one another. However, Couchbase does not provide automatic fail-over for XDCR arrangements, relying instead on techniques such as using a load-balancing mechanism to reroute traffic at the network layer, in which case the XDCR group will have been set up in a master-master configuration.

Couchbase indexing and queries
Queries on Couchbase Server are performed via “views,” Couchbase terminology for indexes. Put another way, when you create an index, you’re provided with a view that serves as your mechanism for querying Couchbase data. Views are new to Couchbase 2.0, as is the incremental mapreduce engine that powers the actual creation of views. Note that queries really didn’t exist prior to Couchbase Server 2.0. Until this latest release, the database was a key/value storage system that simply did not understand the concept of a multifield document.

To define a view, you build a specific kind of document called a design document. The design document holds the JavaScript code that implements the mapreduce operations that create the view’s index. Design documents are bound to specific buckets, which means that queries cannot execute across multiple buckets. Couchbase’s “eventual consistency” plays a role in views as well. If you add a new document to a bucket or update an existing document, the change may not be immediately visible.

The map function in a design document’s mapreduce specification filters and extracts information from the documents against which it executes. The result is a set of key/value pairs that comprise the query-accelerating index. The reduce function is optional. It is typically used to aggregate or sum the data manipulated by the map operation. Code in the reduce function can be used to implement operations that correspond roughly to SQL’s ORDER BY, SORT, and aggregation features.

Couchbase Server supplies built-in reduce functions: _count, _stats, and _sum. These built-in functions are optimized beyond what would be possible if written from scratch. For example, the _count function (which counts the number of rows returned by the map function) doesn’t have to recount all the documents when called. If an item is added to or removed from the associated index, the count is incremented or decremented appropriately, so the _count function need merely retrieve the maintained value.

Query parameters offer further filtering of an index. For example, you can use query parameters to define a query that returns a single entry or a specified range of entries from within an index. In addition, in Couchbase 2.0, document metadata is available. The usefulness of this becomes apparent when building mapreduce functions, as the map function can employ metadata to filter documents based on parameters such as expiration date and revision number.

Couchbase indexes are updated incrementally. That is, when an index is updated, it’s not reconstructed wholesale. Updates only involve those documents that have been changed or added or removed since the last time the index was updated. You can configure an index to be updated when specific circumstances occur. For example, you might configure an index to be updated whenever a query is issued against it. That, however, might be computationally expensive, so an alternative is to configure the index to be updated only after a specified number of documents within the view have been modified. Still another alternative is to have the view updated based on a time interval.

Whatever configuration you choose, it’s important to realize that a design document can hold multiple views and the configuration applies to all views in the document. If you update one index, all indexes in the document will be updated.

Finally, Couchbase distinguishes between development and production views, and the two are kept in separate namespaces. Development views can be modified; production views cannot. The reason for the distinction arises from the fact that, if you modify a view, all views in that design document will be invalidated, and all indexes defined by mapreduce functions in the design document will be rebuilt. Therefore, development views enable you to test your view code in a kind of sandbox before deploying it into a production view.

NoSQL deep dive: MongoDB vs. Couchbase

You can manage Couchbase Server via the Web-based management console. The view of active servers, shown above, is open to a single member of the cluster. Memory cache and disk storage usage information is readily available.

Managing Couchbase
For gathering statistics and managing a Couchbase Server cluster, the Couchbase Web Console — available via any modern browser on port 8091 of a cluster member — is the place to go. It provides a multitab view into cluster mechanics. The tabs include:

  • Cluster overview, which has general RAM and disk usage information (aggregated for the whole cluster). Also, operations per second bucket usage. The information is presented in a smoothly scrolling line graph.
  • Server nodes, which provides information similar to the above, but for individual members of the cluster. You can also see CPU usage and swap space usage. On this tab, you can add a new node to the cluster: Click the Add Server button and you’re prompted for IP address and credentials.
  • Data buckets, which shows all the buckets on the cluster. You can see which nodes participate in the storage of a given bucket, how many items are in each bucket, RAM and disk usage attributed to a bucket, and so on.

The Couchbase Web Console provides much more information than can be covered here. An in-depth presentation of its capabilities can be found in Couchbase Server’s online documentation.

For administrators who would rather perform their management duties on the metal, Couchbase provides a healthy set of command-line tools. General management functions are found in the couchbase-cli tool, which lists all the servers in a cluster, retrieves information for a single node, initiates rebalancing, manages buckets, and more. The cbstats command-line tool displays cluster statistics, and it can be made to fetch the statistics for a single node (the variety of statistical information retrieved is too diverse to list here). The cbepctl command lets you modify a cluster’s RAM and disk management characteristics. For example, you can control the “checkpoint” settings, which govern how frequently replication occurs.

Other command-line tools include data backup and restore, a tool to retrieve data from a node that has (for whatever reason) stopped running, and even a tool for generating an I/O workload on a cluster member to test its performance.

Couchbase Server is available in both Enterprise and Community editions. The Enterprise edition undergoes more thorough testing than the Community edition, and it receives the latest bug fixes. Also, hot fixes are available, as is 24/7 support (with the purchase of an annual subscription). Nevertheless, the Enterprise edition is free for testing and development on any number of nodes or for production use on up to two nodes. The Community edition, as you might guess, is free for any number of production nodes.

Pros and cons: Couchbase Server 2.0
  • Provides legacy Memcached capabilities
  • Supports spatial data and views
  • Now a true document database
  • Indexing mechanisms not well developed
  • JSON support is relatively immature
  • Does not support range sharding
Review: Visual Studio 2012 shines on Windows 8


MongoDB is about three years old, first released in late 2009. The goal behind MongoDB was to create a NoSQL database that offered high performance and did not cast out the good aspects of working with RDBMSes. For instance, the way that queries are designed and optimized in MongoDB is similar to how that would be done in an RDBMS. MongoDB’s designers also wanted to make the database easier for application developers to work with — for example, by allowing developers to change the data model quickly. MongoDB, whose name is short for “humongous,” stores documents in BSON (Binary JSON), an extension of JSON that allows for the use of data types such as integers, dates, and byte arrays.

Two primary processes are at work in a MongoDB system, mongod and mongos. The mongod process is the real workhorse. In a sharded MongoDB cluster, mongod can be found playing one of two roles: config server or shard server. The config server tracks the cluster’s metadata. (In a sharded MongoDB cluster, there must be at least three config servers for redundancy’s sake.) Each config server knows which server in the cluster is responsible for a given document or, more precisely, where a given contiguous range of shard keys (called a chunk) belongs in the cluster.

Other mongod processes in the cluster run as shard servers, and these handle the actual reading and writing of the data. For fail-over purposes, two instances of a mongod process on a given cluster member run as shard servers. One process is primary, and the other is secondary. All write requests go to the primary, while read requests can go to either primary or secondary.

Secondaries are updated asynchronously from the primary so that they can take over in the event of a primary’s crash. This, however, means that some read requests (sent to secondaries) may not be consistent with write requests (sent to primaries). This is an instance of MongoDB’s “eventual consistency.” Over time, all secondaries will become consistent with write operations on the primary. Note that you can guarantee consistent read/write behavior by configuring a MongoDB system such that all I/O — reads and writes — go to the primary instances. In such an arrangement, secondaries act as standby servers, coming online only when the primary fails.

The mongos process, which runs at a conceptually higher level than the mongod processes, is best thought of as a kind of routing service. Database requests from clients arrive first at a mongos process, which determines which shard(s) in a sharded cluster can service each request. The mongos process dispatches I/O requests to the appropriate mongod processes, gathers the results, and returns them to the client. Note that in a nonsharded cluster, clients talk directly to a mongod process.

MongoDB scaling and replication
MongoDB doesn’t have an explicit memory caching layer. Instead, all MongoDB operations are performed through memory-mapped files. Consequently, MongoDB hands off the chore of juggling memory caching versus persistence-to-disk to the operating system. You can tweak various flush-to-disk settings for optimal performance, however. For example, MongoDB maintains a journal of database operations (for recovery purposes) that is flushed to the disk every 100ms. Not only is this interval configurable, but you can configure the system so that write operations return only after the journal has been written to disk.

Documents are placed in named containers called collections, which are roughly equivalent to Couchbase’s buckets. A collection serves as a means of partitioning related documents into separate groups. The effects of many multidocument operations in a MondoDB database are restricted to the collection in which those operations are performed. MongoDB supports sharding at the collection level, which means — should requirements dictate — you could construct a database with unsharded and sharded collections. Of course, only a sharded collection is protected against a single point of failure.

A document’s membership in a particular shard is determined by a shard key, which is derived from one or more fields in each document. The exact fields can be specified by the database administrator. In addition, MongoDB provides autosharding, which means that, once you’ve configured sharding, MongoDB will automatically manage the storage of documents in the appropriate physical location. This includes rebalancing shards as the number of documents grows or the number of mongod instances changes.

As of the 2.4 release, MongoDB supports both hash-based sharding and range-based sharding. As you might guess, hash-based sharding hashes the shard key, which creates a relatively even distribution of documents across the cluster. With range-based sharding (the sole sharding type prior to 2.4), a given member of a MongoDB sharded cluster will store all the documents within a given subrange of the shard key’s overall domain. More precisely, MongoDB defines a logical container, called a chunk, which is a subset of documents whose shard keys fall within a specific range. The mongos process then dictates which mongod process will manage a given chunk.

Typically, you permit the load balancer to determine which cluster member manages a given shard range. However, with version 2.4, you can associate tags with shard ranges (a tag being nothing more than an identifying string). Once that’s done, you can specify which member of a cluster will manage any shard ranges associated with a tag. In a sense, this lets you override some of the load balancer’s decision making and steer identifiable subsets of the database to specific servers. For example, you could put the data most frequently accessed from California on the cluster member in California, the data most frequently accessed from Texas on the cluster member in Texas, and so on.

MongoDB’s locking is on the database level, whereas it was global prior to version 2.2. The system implements shared-read, exclusive-write locking (many concurrent readers, but only one writer) with priority given to waiting writers over waiting readers. MongoDB avoids contentions via yield operations within locks. Predictive coding was added to the 2.2 release; if a process requests a document that is not in memory, it yields its lock so that other processes — whose documents are in memory — can be serviced. Long-running operations will also periodically yield locks.

You’ll find no clear notion of transactions in MongoDB. Certainly, you cannot perform pure ACID transactions on a MongoDB installation. Database changes can be made durable if you enable journaling, in which case write operations are blocked until the journal entry is persisted to disk (as described earlier). And MongoDB defines the $atomic isolation operator, which imposes what amounts to an exclusive-write lock on the document involved. However, $atomic is applied at the document level only. You cannot guard multiple updates across documents or collections.

MongoDB indexing and queries
MongoDB makes it easy to create secondary indexes for all document fields. A primary index always exists on the document ID. As with Couchbase Server, this is automatically generated for each document. However, with MongoDB, you can specify a separate field as being the document’s unique identifier. For example, a database of bank accounts might use the bank’s generated account number as the document ID field. Indexes exist at the collection level, and they can be compound — that is, created on multiple fields. MongoDB can also handle multikey indexes. If you index a field that includes an array, MongoDB will index each value in that array. Finally, MongoDB supports geospatial indexes.

MongoDB’s querying capabilities are well developed. If you’re coming to MongoDB from the RDBMS world, the online documentation shows how SQL queries might be mapped to MongoDB operations. For example, in most cases, the equivalent of SQL’s SELECT can be performed by a find() function. The find() function takes two arguments: a query document and a projection document. The query document specifies filter operations on specific document fields that are fetched. You could use it to request that only documents with a quantity field whose contents are greater than, say, 100 be returned. Therefore, the query document corresponds to the WHERE clause in an SQL statement. The projection document identifies which fields are to be returned in the results, which allows you to request that, say, only the name and address fields of matching documents be returned from the query. The sort() function, which can be executed on the results of find(), corresponds to SQL’s ORDER BY statement.

You can locate documents with the command db.<collection>.find(), possibly the simplest query you can perform. The find() command will return the first 20 members of the result, but it also provides a cursor, which allows you to iterate through all the documents in the collection. If you’d like to navigate the results more directly, you can reference the elements of the cursor as though it were an array.

More complex queries are possible thanks to MongoDB’s set of prefix operators, which can describe comparisons as well as boolean connections. MongoDB also provides the $regex operator in case you want to apply regular expressions to document fields in the result set. These prefix operators can be used in the update() command to construct the MongoDB equivalent of SQL’s UPDATE ... WHERE statement.

In the 2.2 release, MongoDB added the aggregation framework, which allows for calculating aggregated values without having to resort to mapreduce (which can be overkill if all you want to do is calculate a field’s total or average). The aggregation framework provides functionality similar to SQL’s SUM and AVG functions. It can also calculate computed fields and mimic the GROUP BY operator. Note that the aggregation framework is declarative — it does not employ JavaScript. You define a chain of operations, much in the same way you might perform Unix/Linux shell programming, and these operations are performed on the target documents in stream fashion.

One of the more significant new features in MongoDB’s 2.4 release is the arrival of text search. In the past, developers accomplished this by integrating Apache Lucene with MongoDB, which piled on considerable complexity. Adding Lucene in a clustered system with replication and fault tolerance is not an easy thing to do. MongoDB users now get text search for free. The new text search feature is not meant to match Lucene, but to provide basic capabilities such as more efficient Boolean queries (“dog and cat but not bird”), stemming (search for “reading” and you’ll also get “read”), and the automatic culling of stop words (for example, “and”, “the”, “of”) from the index.

You can define a text index on multiple string fields, but there can be only a single text index per collection, and indexes do not store word proximity information (that is, how close words are to one another, which can affect how matches are weighted). In addition, the text index is fully consistent: when you update data, the index is also updated.

Ease-of-use features have been added to version 2.4 as well. For example, you can now define a “capped array” as a data element, which works sort of like an ordered circular buffer. If, for example, you’re keeping track of the top 10 entries in a blog, using a capped array will allow you to add new entries, and (based on the specified ordering) previous entries will be removed to cap the array at 10 or whatever number you specify.

MongoDB 2.4 also has improved geospatial capabilities. For example, you can now perform polygon operations, which would allow you to determine if two regions overlap. The spherical model used in 2.4 is improved too; it now takes into account the fact that the earth is not perfectly spherical, so distance calculations are more accurate.

MongoDB mapreduce
In Couchbase Server, the mapreduce operation’s primary job is to provide a structured query and information aggregation capability on the documents in the database. In MongoDB, mapreduce can be used not only for querying and aggregating results, but as a general-purpose data processing tool. Just as a mapreduce operation executes within a given bucket in Couchbase Server, mapreduce executes within a given collection in a MongoDB database. As in Couchbase Server, mapreduce functions in MongoDB are written in JavaScript.

You can filter the documents passed into the map function via comparison operators, or you can limit the number of documents to a specific number. This allows you to create what amounts to an incremental mapreduce operation. Initially, you run mapreduce over the entire collection. For subsequent executions, you add a query function that includes only newly added documents. From there, set the output of mapreduce to be a separate collection, and configure your code so that the new results are merged into the existing results.

Further speed/size trade-offs are possible by choosing whether the intermediate results (the output of the map function, sent to the reduce function) are converted to BSON objects or remain JavaScript objects. If you choose BSON, the BSON objects are placed in temporary, on-disk storage, so the number of items you can process is limited only by available disk space. However, if you choose JavaScript objects, then the system can handle only about 500,000 distinct keys emitted by the map function. But as there is no writing to disk, the processing is faster.

You have to be careful with long-running mapreduce operations, because their execution involves lengthy locks. As mentioned earlier, the system has built-in facilities to mitigate this. For example, the read lock on the input collection is yielded every 100 documents. The MongoDB documentation describes the various locks that must be considered — as well as mechanisms to relieve the possible problems.

Managing MongoDB
Management access with the MongoDB database goes through the interactive mongo shell. Very much a command-line interface that lets you enter arbitrary JavaScript, it is nonetheless surprisingly facile. The MongoDB related commands are uncomplicated, but at the price of being dangerous if you’re careless. For example, to select a database, you enter use <databasename>. But that command doesn’t check for the presence of the specific database; if you mistype it and proceed to enter documents into that database, you might not know what’s going on until you’ve put a whole lot of documents into the wrong place. The same goes for collections within databases.

Other useful command-line utilities are mongostat, which returns information concerning the number of operations — inserts, updates, deletes, and so on — within a specific time period. The mongotop utility likewise returns statistical information on a MongoDB instance, this time focusing on a specific collection. You can see the amount of time spent reading or writing in the collection, for instance.

In addition, 10gen provides the free cloud-based MongoDB Monitoring Service (MMS) which provides a monitoring dashboard for MongoDB installations. Built on the SaaS model, MMS requires you to run a small agent on your MongoDB cluster that communicates with the management system.

NoSQL deep dive: MongoDB vs. Couchbase

10gen’s MongoDB Monitoring Service shows statistics — in this case, for a replica set — but management of the database is done from the command line.

In addition to the new text search and geospatial capabilities discussed above, MongoDB 2.4 comes with performance and security improvements. The performance enhancements include the working set analyzer. The idea is that you want to configure your system so that the working set — that subset of a databas accessed most frequently — fits entirely in memory. But it was not easy to figure out your working set or how much memory you need. The working set analyzer, which operates like a helper function, provides diagnostic output to aid you in discovering the characteristics of your working set and tuning your system accordingly. In addition, the JavaScript engine has been replaced by Google’s open source V8 engine. In the past, the JavaScript engine was single-threaded. V8 permits concurrent mapreduce jobs, as well as general speed improvements.

Finally, the Enterprise edition welcomes Kerberos-based authentication. In all editions, MongoDB now supports role-based privileges, which gives you finer-grained control over users’ access and operations on databases and collections.

10gen’s release of MongoDB 2.4 is accompanied by new subscription levels: Community, Basic, Standard, and Enterprise. The Community subscription level is free, but it’s also free of any support. The other subscription levels provide varying support response times and hours of availability. In addition, the Enterprise subscription level comes with the Enterprise version of MongoDB, which has more security features and SNMP support. It has also undergone more rigorous testing.

Pros and cons: MongoDB 2.4
  • New release incorporates text search
  • New release adds improved JavaScript engine
  • Free MongoDB training courseware available from 10gen
  • Text index doesn’t store proximity information
  • No GUI-based management console
  • Kerberos authentication available in Enterprise edition only

Mongo or Couch?
As usual, which product is the best choice depends heavily on the target application. Both are highly regarded NoSQL databases with outstanding pedigrees. On the one hand, MongoDB has spent much more of its lifetime as a document database, and its support for document-level querying and indexing is richer than that in Couchbase. On the other hand, Couchbase can serve equally well as a document database, a Memcached replacement, or both.

Happily, exploring either Couchbase or MongoDB is remarkably simple. A single-node system for either database server is easily installed. And if you want to experiment with a sharded system (and have enough memory and processor horsepower), you can easily set up a gang of virtual machines on a single system, and lash them together via a virtual network switch. The documentation for both systems is voluminous and well maintained. 10Gen even provides free online MongoDB classes for developers, complete with video lectures, quizzes, and homework.