A report on Hadoop

Takeaway: Hadoop has been helping analyze data for years now, but there are probably more than a few things you don’t know about it.

7 Things to Know About Hadoop

What is Hadoop? It’s a yellow toy elephant. Not what you were expecting? How about this: Doug Cutting – co-creator of this open-source software project – borrowed the name from his son, who happened to call his toy elephant Hadoop. In a nutshell, Hadoop is a software framework developed by the Apache Software Foundation for data-intensive, distributed computing. And it’s a key component in another buzzword readers can never seem to get enough of: big data. Here are seven things you should know about this unique, freely licensed software.

How did Hadoop get its start?

Twelve years ago, Google built a platform to manipulate the massive amounts of data it was collecting. Like the company often does, Google made its design available to the public in the form of two papers: Google File System and MapReduce.

At the same time, Doug Cutting and Mike Cafarella were working on Nutch, a new search engine. The two were also struggling with how to handle large amounts of data. Then the two researchers got wind of Google’s papers. That fortunate intersection changed everything by introducing Cutting and Cafarella to a better file system and a way to keep track of the data, eventually leading to the creation of Hadoop.

What is so important about Hadoop?

Today, collecting data is easier than ever. Having all this data presents many opportunities, but there are challenges as well:

  • Massive amounts of data require new methods of processing.
  • Much of the data being captured is in unstructured formats.

To overcome the challenges of manipulating immense quantities of unstructured data, Cutting and Cafarella came up with a two-part solution. To solve the data-quantity problem, Hadoop employs a distributed environment – a network of commodity servers – creating a parallel processing cluster, which brings more processing power to bear on the assigned task.

Next, they had to tackle unstructured data – data in formats that standard relational database systems were unable to handle. Cutting and Cafarella designed Hadoop to work with any type of data, structured or unstructured: images, audio files, even plain text. A white paper from Cloudera, a Hadoop integrator, explains why this is important:

    “By making all your data usable, not just what’s in your databases, Hadoop lets you uncover hidden relationships and reveals answers that have always been just out of reach. You can start making more decisions based on hard data, instead of hunches, and look at complete data sets, not just samples and summaries.”

What is schema on read?

As was mentioned earlier, one of the advantages of Hadoop is its ability to handle unstructured data. In a sense, that is “kicking the can down the road”: eventually, the data needs some kind of structure before it can be analyzed.

That is where schema on read comes into play. Schema on read is the melding of three things: what format the data is in, where to find the data (remember, the data is scattered among several servers) and what’s to be done to the data – not a simple task. It’s been said that manipulating data in a Hadoop system requires the skills of a business analyst, a statistician and a Java programmer. Unfortunately, not many people have all three.
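The idea can be sketched in a few lines of plain Python (the delimited field layout and sample records here are invented for illustration): the stored lines carry no declared structure, and the schema lives entirely in the reading code.

```python
# Schema on read: raw records are stored as-is; structure is imposed
# only when the data is read for analysis.
raw_lines = [
    "2024-01-15|alice|login",
    "2024-01-15|bob|purchase",
    "2024-01-16|alice|logout",
]

# The "schema" lives in the reading code, not in the storage layer.
def read_with_schema(line):
    date, user, action = line.split("|")
    return {"date": date, "user": user, "action": action}

events = [read_with_schema(line) for line in raw_lines]
logins = [e["user"] for e in events if e["action"] == "login"]
print(logins)  # ['alice']
```

If tomorrow’s question needs a different structure, only the reading function changes; the stored data is untouched.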

What is Hive?

If Hadoop was going to succeed, working with the data had to be simplified. So, the open-source crowd got to work and created Hive:

    “Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.”

Hive enables the best of both worlds: database personnel familiar with SQL commands can manipulate the data, and developers familiar with the schema on read process are still able to create customized queries.
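A rough illustration of the convenience Hive provides, using Python’s built-in sqlite3 as a stand-in SQL engine (Hive itself runs HiveQL over files in HDFS; the table and data here are hypothetical): structure is projected onto raw delimited text, after which plain SQL does the work.

```python
# What Hive does in spirit: project a table structure onto delimited
# text, then let analysts query it with SQL instead of Java MapReduce.
import sqlite3

raw = "alice,34\nbob,28\ncarol,41"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
for line in raw.splitlines():
    name, age = line.split(",")
    conn.execute("INSERT INTO people VALUES (?, ?)", (name, int(age)))

# The analyst writes ordinary SQL over what was raw text a moment ago.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name").fetchall()
names = [r[0] for r in rows]
print(names)  # ['alice', 'carol']
```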

Apache Hive is a data warehouse system that runs on top of Hadoop. Hadoop has become a popular way for businesses to aggregate and refine data. Hadoop users may use tools like Apache Spark or MapReduce to compile data in precise ways before storing it in the Hadoop Distributed File System (HDFS). From there, the data can go into Apache Hive for central storage.

Apache Hive and other data warehouse designs are the central repositories for data and play important roles in a company’s IT setup. They need to have specific goals for data retrieval, security and more.

Apache Hive has a query language called HiveQL, which shares many features with SQL, the widely used language for data retrieval. It also stores its metadata in an associated database.

Apache Spark is an open-source program used for data analytics. It’s part of a greater set of tools, including Apache Hadoop and other open-source resources for today’s analytics community.

Experts describe this relatively new open-source software as a data analytics cluster computing tool. It can be used with the Hadoop Distributed File System (HDFS), the Hadoop component that stores and manages large files across a cluster.

Some IT pros describe the use of Apache Spark as a potential substitute for the Apache Hadoop MapReduce component. MapReduce is also a clustering tool that helps developers process large sets of data. Those who understand the design of Apache Spark point out that it can be many times faster than MapReduce, in some situations.
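The map/reduce model that both MapReduce and Spark build on can be sketched in plain Python (the documents are invented sample data): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase sums each group.

```python
# A toy word count in the MapReduce style.
from itertools import groupby
from operator import itemgetter

docs = ["big data big insight", "data data everywhere"]

# Map: each document yields (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the pairs by key (the word).
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts within each group.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'big': 2, 'data': 3, 'everywhere': 1, 'insight': 1}
```

In a real cluster the map and reduce phases run in parallel across many machines; Spark’s speed advantage comes largely from keeping intermediate results like `mapped` in memory rather than writing them to disk between phases.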

Those reporting on the modern use of Apache Spark show that companies are using it in various ways. One common use is aggregating data and structuring it in more refined ways. Apache Spark can also be helpful with machine learning and data classification work in analytics.

Typically, organizations face the challenge of refining data in an efficient and somewhat automated way; Apache Spark may be used for these kinds of tasks. Some also suggest that Spark can open up analytics work to those with less programming knowledge.

Apache Spark includes APIs for Python and other programming languages.

Apache HBase is a database tool written in Java and used with elements of the Apache Software Foundation’s Hadoop suite of big data analysis tools. Like other elements of Apache Hadoop, Apache HBase is an open-source product. It is one of several database tools for the input and output of large data sets that are crunched by Hadoop and its various utilities and resources.

Apache HBase is a distributed non-relational database, which means that it doesn’t store information in the same way as a traditional relational database. Developers and engineers move data between Apache HBase and Hadoop tools like MapReduce for data analysis. The Apache community promotes Apache HBase as a way to get direct access to big data sets. Experts point out that HBase is based on Google BigTable, a distributed storage system.
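The BigTable-style data model behind HBase can be sketched as nested dictionaries in Python (the row keys, column families and values here are invented; real HBase also versions every cell by timestamp and distributes rows across servers):

```python
# Sketch of the BigTable/HBase model: row key -> column family ->
# column -> value. Missing cells simply return None.
table = {}

def put(row_key, family, column, value):
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get(row_key, family, column):
    return table.get(row_key, {}).get(family, {}).get(column)

put("user:1001", "info", "name", "alice")
put("user:1001", "info", "email", "alice@example.com")
put("user:1001", "activity", "last_login", "2024-01-15")

print(get("user:1001", "info", "name"))  # alice
print(get("user:1002", "info", "name"))  # None
```

Unlike a relational table, each row can carry a different set of columns, which is what makes this layout a fit for sparse, loosely structured big data.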

Popular features of Apache HBase include backup and failover support, as well as APIs for common programming languages. Its compatibility with the greater Hadoop system makes it a candidate for many kinds of big data management problems in enterprise settings.

What kind of data does Hadoop analyze?

Web analytics is the first thing that comes to mind, analyzing Web logs and Web traffic in order to optimize websites. Facebook, for example, is definitely into Web analytics, using Hadoop to sort through the terabytes of data the company accumulates.

Companies use Hadoop clusters to perform risk analysis, fraud detection and customer-base segmentation. Utility companies use Hadoop to analyze sensor data from the electrical grid, allowing them to optimize the production of electricity. And major companies such as Target, 3M and Medtronic use Hadoop to optimize product distribution, business risk assessment and customer-base segmentation.

Universities are invested in Hadoop too. Brad Rubin, an associate professor at the University of St. Thomas Graduate Programs in Software, mentioned that his Hadoop expertise is helping sort through the copious amounts of data compiled by research groups at the university.

Can you give a real-world example of Hadoop?

One of the better-known examples is the TimesMachine. The New York Times has a collection of full-page newspaper TIFF images, associated metadata and article text from 1851 through 1922, amounting to terabytes of data. Using an EC2/S3/Hadoop system and specialized code, NYT’s Derek Gottfrid:

    “Ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 xml files mapping articles to rectangular regions in the TIFFs. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files.”

Using servers in the Amazon Web Services cloud, Gottfrid mentioned they were able to process all the data required for the TimesMachine in less than 36 hours.

Is Hadoop already obsolete or just morphing?

Hadoop has been around for over a decade now. That has many saying it’s obsolete. One expert, Dr. David Rico, has said that “IT products are short-lived. In dog years, Google’s products are about 70, while Hadoop is 56.”

There may be some truth to what Rico says. It appears that Hadoop is going through a major overhaul. To learn more about it, Rubin invited researchers to a Twin Cities Hadoop User Group meeting, and the topic of discussion was Introduction to YARN:

      “Apache Hadoop 2 includes a new MapReduce engine, which has a number of advantages over the previous implementation, including better scalability and resource utilization. The new implementation is built on a general resource management system for running distributed applications called YARN.”


Apache Slider is a new code base for the Hadoop data analytics suite, licensed by the Apache Software Foundation. The project is expected to be released in the second half of 2014 and will help users apply Hadoop and the YARN resource management tool to a wider range of goals and objectives.

Experts explain that Apache Slider will help to extend the reach of what Hadoop and YARN can do by allowing certain kinds of databases to run unmodified in the YARN resource management environment.
YARN is an existing Hadoop resource that focuses on resource management and complements other tools like MapReduce and the HDFS file system. Apache Slider will make more types of programs compatible with YARN and extend the use cases that are possible.
Instead of modifying existing applications, say experts, Apache Slider will allow a much broader and more diversified range of database and data analytics platforms to run on Hadoop’s core software resources. Using Apache Slider may also improve the efficiency of memory and processing resources for an entire project.
Another way to explain the use of Apache Slider and its development is that it can help YARN to eventually become the central software or “operating system” for a corporate data warehouse or other data center. For instance, tools like Apache HBase and Hive are often used in enterprise environments. Making these more compatible with Hadoop YARN can have some real impact on business process efficiency.
DWH and Hadoop
Big data analytics, advanced analytics (i.e., data mining, statistical analysis, complex SQL and natural language processing) and discovery analytics all benefit from Hadoop. HDFS and other Hadoop tools promise to extend and improve some areas within data warehouse architectures:
Data staging. Several DW teams have consolidated and migrated their staging areas onto HDFS to take advantage of its low cost, linear scalability, facility with file-based data and ability to manage unstructured data. Users who prefer to hand-code most of their ETL solutions will most likely feel at home in code-intense environments such as Apache MapReduce, Pig and Hive.
They may even be able to refactor existing code to run there. For users who prefer to build their ETL solutions atop a vendor tool, the community of vendors for ETL and other data management tools is rolling out new interfaces and functions for the entire Hadoop product family.
Data archiving. When organizations embrace forms of advanced analytics that require detailed source data, they amass large volumes and retain most of the data over time, which taxes areas of the DW architecture where source data is stored. Storing terabytes of source data in the core EDW’s RDBMS can be prohibitively expensive, which is why many organizations have moved such data to less expensive satellite systems within their extended DW environments.
Similar to migrating staging areas to HDFS, some organizations are migrating their stores of source data and other archives to HDFS. This lowers the cost of archives and analytics while providing greater capacity.
Multi-structured data. Relatively few organizations are currently getting BI value from semi- and unstructured data, despite years of wishing for it. HDFS can be a special place within your DW environment for managing and processing semi-structured and unstructured data. Hadoop users are finding this approach more successful than stretching an RDBMS-based DW platform to handle data types it was not designed for.
One of Hadoop’s strongest complements to a DW is its handling of semi- and unstructured data, but don’t go thinking that Hadoop is only for unstructured data: HDFS handles the full range of data, including structured forms. In fact, Hadoop can manage and process just about any data you can store in a file and copy into HDFS.
Processing flexibility. Given its ability to manage diverse multi-structured data, as just described, Hadoop’s NoSQL approach is a natural framework for manipulating nontraditional data types. Note that these data types are often free of schema or metadata, which makes them challenging for most vendor brands of SQL-based RDBMSs, although a few have functions for deducing, creating, and applying schema as needed. Hadoop supports a variety of programming languages (Java, R, C), thus providing more capabilities than SQL alone can offer. Again, a few RDBMSs support these same languages as a complement to SQL.
In addition, Hadoop enables the growing practice of “late binding.” With ETL for data warehousing, data is processed, standardized, aggregated, and remodeled before entering the data warehouse environment; this imposes an a priori structure on the data, which is appropriate for known reports, but limits the scope of analytic repurposing later. Data entering HDFS is typically processed lightly or not at all to avoid limiting its future applications. Instead, Hadoop data is processed and restructured at run time, so it can flexibly enable the open-ended data exploration and discovery analytics that many users are looking for today.
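The contrast between an a priori ETL schema and late binding can be sketched in plain Python (the event format and field names are invented for illustration): the ETL path keeps only a pre-chosen summary, while the late-binding path re-reads the raw data to answer a question nobody anticipated.

```python
# Raw sales events, stored unprocessed: date, store, SKU, quantity.
raw_events = [
    "2024-01-15,store-7,sku-12,2",
    "2024-01-15,store-7,sku-99,1",
    "2024-01-16,store-3,sku-12,5",
]

# Schema on write: ETL aggregates before loading, so only the
# pre-chosen summary (units per store) survives in the warehouse.
units_per_store = {}
for line in raw_events:
    _, store, _, qty = line.split(",")
    units_per_store[store] = units_per_store.get(store, 0) + int(qty)

# Late binding: a new question (units per SKU) is answered by
# re-reading the raw data with a different structure at run time.
units_per_sku = {}
for line in raw_events:
    _, _, sku, qty = line.split(",")
    units_per_sku[sku] = units_per_sku.get(sku, 0) + int(qty)

print(units_per_store)  # {'store-7': 3, 'store-3': 5}
print(units_per_sku)    # {'sku-12': 7, 'sku-99': 1}
```

Had the raw events been discarded after the first aggregation, the per-SKU question could not have been answered; keeping detail in HDFS is what makes the second query possible.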
Hadoop and RDBMSs are complementary and should be used together
Hadoop’s help for data warehouse environments is limited to a few areas. Luckily, most of Hadoop’s strengths are in areas where most warehouses and BI technology stacks are weak, such as unstructured data, very large data sets, non-SQL algorithmic analytics, and the flood of files that is drowning many DW environments. Conversely, Hadoop’s limitations are mostly met by mature functionality available today from a wide range of RDBMS types (OLTP databases, columnar databases, DW appliances, etc.), plus administrative tools. In that context, Hadoop and the average RDBMS-based data warehouse are complementary (despite some overlap), which results in a fortuitous synergy when the two are integrated.
The trick, of course, is making HDFS and an RDBMS work together optimally. To that end, one of the critical success factors for assimilating Hadoop into evolving data warehouse architectures is the improvement of interfaces and interoperability between HDFS and RDBMSs. Luckily, this is well under way due to efforts from software vendors and the open source community. Technical users are starting to leverage HDFS/RDBMS integration.
For example, an emerging best practice among DW professionals with Hadoop experience is to manage diverse big data in HDFS, but process it and move the results (via ETL or other data integration media) to RDBMSs (elsewhere in the DW architecture), which are more conducive to SQL-based analytics. Hence, HDFS serves as a massive data staging area and archive.
A similar best practice is to use an RDBMS as a front end to HDFS data; this way, data is moved via distributed queries (whether ad hoc or standardized), not via ETL jobs. HDFS serves as a large, diverse operational data store, whereas the RDBMS serves as a user-friendly semantic layer that makes HDFS data look relational.
Actian Corporation has accumulated a fairly comprehensive portfolio of platforms and tools for managing analytics, big data, and all other enterprise data, encompassing the full range of structured, semi-structured, and unstructured data and content types. The new Actian Analytics Platform includes connectivity to more than 200 sources, a visual framework that simplifies ETL and data science, high-performance analytic engines, and libraries of analytic functions.
The Actian Analytics Platform centers on Matrix (a massively parallel columnar RDBMS formerly called ParAccel) and Vector (a single-node RDBMS optimized for BI). Actian DataFlow accelerates ETL natively on Hadoop. Actian Analytics includes more than 500 analytic functions ready to run in-database or on Hadoop. Actian DataConnect connects and enriches data from over 200 sources on-premises or in the cloud. The Actian platform is integrated by a modular framework that enables users to quickly connect to all data assets for open-ended analytics with linear scalability.
Strategic partnerships include Hortonworks (for HDFS and YARN), Attivio (for big content), and a number of contributors to the Actian Analytics library.
Cloudera is a leading provider of Apache Hadoop–based software, services, and training, enabling data-driven organizations to derive business value from all their data while simultaneously reducing the costs of data management. CDH (Cloudera’s distribution including Apache Hadoop) is a comprehensive, tested, and stable distribution of Hadoop that is widely deployed in commercial and non-commercial environments. Organizations can subscribe to Cloudera Enterprise – comprising CDH, Cloudera Support, and Cloudera Manager – to simplify and reduce the cost of Hadoop configuration, rollout, upgrades, and administration. Cloudera also provides Cloudera Enterprise Real-Time Query (RTQ), powered by Cloudera Impala, the first low-latency SQL query engine that runs directly over data in HDFS and HBase. Cloudera Search increases data ROI by offering non-technical resources a common and everyday method for accessing and querying large, disparate big data stores of mixed format and structure managed in Hadoop. As a major contributor to the Apache open source community, with customers in every industry and a massive partner program, Cloudera’s big data expertise is profound.
Datawatch Corporation provides a visual data discovery and analytics solution that optimizes any data – regardless of its variety, volume, or velocity – to reveal valuable insights for improving business decisions. Datawatch has a unique ability to integrate structured, unstructured, and semi-structured sources – such as reports, PDF files, print spools, and EDI streams – with real-time data streams from CEP engines, tick feeds, or machinery and sensors into visually rich analytic applications, which enable users to dynamically discover key factors about any operational aspect of their business.
Datawatch steps users through data access, exploration, discovery, analysis, and delivery, all in a unified and easy-to-use tool called Visual Data Discovery, which integrates with existing BI and big data platforms. IT’s involvement is minimal in that IT sets up data connectivity; most users can create their own reports and analyses, then publish them for colleagues to share in a self-service fashion. The solution is suitable for a single analyst, a department, or an enterprise. Regardless of user type, whether business or technical or both, all benefit from the high ease of use, productivity, and speed to insight that Datawatch’s real-time data visualization delivers.
Dell Software has spent years acquiring and building software tools (plus partnering with leading vendors for more tools) with the goal of assembling a comprehensive portfolio of IT administration tools for securing and managing networks, applications, systems, endpoints, devices, and data. Within that portfolio, Dell Software now offers a range of tools specifically for data management, with a focus on big data and analytics. For example, Toad Data Point provides interfaces and administrative functions for most traditional databases and packaged applications, plus new big data platforms such as Hadoop, MongoDB, Cassandra, SimpleDB, and Azure. Spotlight is a DBA tool for monitoring DBMS health and benchmarking. Shareplex supports Oracle-to-Oracle replication today, and will soon support Hadoop. Kitenga Big Data Analytics enables rapid transformation of diverse unstructured data into actionable insights. Boomi MDM launched in 2013. The new Toad BI Suite pulls these tools together to span the entire information life cycle of big data and analytics. After all, the goal of Dell Software is: one vendor, one tool chain, all data.
HP Vertica provides solutions to big data challenges. The HP Vertica Analytics Platform was purpose-built for advanced analytics against big data. It consists of a massively parallel database with columnar support, plus an extensible analytics framework optimized for the real-time analysis of data. It is known for high performance with very complex analytic queries against multi-terabyte data sets.
Vertica offers advantages over SQL-on-Hadoop analytics, shortening some queries from days to minutes. Although SQL is the primary query language, Vertica also supports Java, R, and C.
Furthermore, the HP Vertica Flex Zone feature enables users to define and apply schema during query and analysis, thereby avoiding the need to preprocess data or deploy Hadoop or NoSQL platforms for schema-free data.
HP Vertica is part of HP’s new HAVEn platform, which integrates multiple products and services into a comprehensive big data platform that provides end-to-end information management for a wide range of structured and unstructured data domains. To simplify and accelerate the deployment of an analytic solution, HP offers the HP ConvergedSystem 300 for Vertica—a pre-built and pre-tested turn-key appliance.
MapR provides a complete distribution for Apache Hadoop, which is deployed at thousands of organizations globally for production, data-driven applications. MapR focuses on extending and advancing Hadoop, MapReduce, and NoSQL products and technologies to make them more feature rich, user friendly, dependable, and conducive to production IT environments. For example, MapR is spearheading the development of Apache Drill, which will bring ANSI SQL capabilities to Hadoop in the form of low-latency, interactive query capabilities for both structured and schema-free, nested data. As other examples, MapR is the first Hadoop distribution to integrate enterprise-grade search;
MapR enables flexible security via support for Kerberos and native authentication; and MapR provides a plug-and-play architecture for integrating real-time stream computational engines such as Storm with Hadoop. For greater high availability, MapR provides snapshots for point-in-time data rollback and a No NameNode architecture that avoids single points of failure within the system and ensures there are no bottlenecks to cluster scalability. In addition, it’s fast; MapR set the Terasort, MinuteSort, and YCSB world records.