Hadoop is an open source suite of programs, procedures, and tools maintained by the Apache Software Foundation, designed to facilitate the analysis of very large datasets of structured and unstructured data.
Hadoop makes it possible to work on thousands of nodes involving many terabytes, or even petabytes, of data, which makes analysis of big data substantially easier. It also has rapid data transfer rates among nodes, meaning if one node goes down, its work can be transferred to another. This gives Hadoop a high degree of fault tolerance, reducing the risk of failures slowing processing.
So, how does it work? At its simplest, Hadoop takes a large data analysis problem and breaks it down into smaller problems. It then distributes those smaller problems to inexpensive distributed computers or servers for parallel processing, and combines the results for easy analysis or further processing.
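That split/process/combine idea can be sketched in a few lines of plain Python. This is a toy illustration of the pattern only (Hadoop's real API is Java, and it ships each chunk to a different machine rather than processing in one process), using the classic word-count example:

```python
from collections import defaultdict

# Toy sketch of the MapReduce pattern in plain Python, not Hadoop's Java API.
# A real cluster would send each chunk to a different node for parallel work.

def map_chunk(chunk):
    """Map step: emit (word, 1) pairs for one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def reduce_pairs(mapped_chunks):
    """Reduce step: combine the per-chunk results into final counts."""
    counts = defaultdict(int)
    for pairs in mapped_chunks:
        for word, n in pairs:
            counts[word] += n
    return dict(counts)

def word_count(lines, workers=3):
    # Break the big problem into smaller ones...
    chunks = [lines[i::workers] for i in range(workers)]
    # ...process each one independently (in parallel on a real cluster)...
    mapped = [map_chunk(c) for c in chunks]
    # ...then combine the results.
    return reduce_pairs(mapped)

print(word_count(["big data big", "data analysis"]))
```

Because the map step for each chunk is independent, adding more machines scales the work almost linearly, which is the whole point of the design.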
Hadoop was inspired by Google’s MapReduce white paper, which described why Google created MapReduce: to be able to index the huge increase in data on the internet. Hadoop was released in 2005 by the Apache Software Foundation as an open source product. One of the more interesting facts about its history is the name: Doug Cutting, one of the creators of Hadoop, named it after his child’s stuffed toy elephant. Digital marketers all over the world are despairing as a result…
Hadoop was originally composed of two core components: MapReduce and HDFS (the Hadoop Distributed File System). However, as further components were added for specific needs, the number grew, and the collection is now generally referred to as the Hadoop Ecosystem.
Let’s look at the base modules:
Hadoop Distributed File System (HDFS)
A file system that keeps track of data across a large number of linked storage devices. It can be accessed by any computer running a supported operating system. It will accept any type of data: you just put it in the cluster and leave it there until you decide how you want to process it.
Hadoop actually supports many different file systems; for example, the Amazon Web Services integration lets Hadoop use Amazon’s own S3 file system. But HDFS is the Hadoop-native version.
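The way HDFS spreads a file across linked storage can be sketched in miniature. The sketch below is illustrative Python, not HDFS code (the real bookkeeping is done by the HDFS NameNode); the 128 MB block size and replication factor of 3 are HDFS's defaults:

```python
# Toy sketch of HDFS-style placement: split a file into fixed-size blocks,
# then store each block on several different nodes (replication), so losing
# one node doesn't lose any data. Illustrative only; the real logic lives
# in the HDFS NameNode. 128 MB blocks and 3x replication are HDFS defaults.

BLOCK_SIZE_MB = 128
REPLICATION = 3

def place_file(file_size_mb, nodes):
    """Return {block_index: [nodes holding a replica of that block]}."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Rotate through the nodes so each block's replicas land on
        # REPLICATION different machines.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
print(place_file(300, nodes))  # 300 MB -> 3 blocks, each replicated 3 times
```

Replication is what gives Hadoop its fault tolerance: if node2 dies, every block it held still has two live copies elsewhere.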
MapReduce
This is the default data processing engine, and it’s Java-based. As Hadoop is not a relational database (indeed, not a database at all), you cannot use SQL to get answers from the data. Instead, you use NoSQL-style approaches.
Hadoop Common
The tools and libraries needed by the other Hadoop modules.
YARN (Yet Another Resource Negotiator)
Manages the resources of the systems which store the data and run the analysis.
There are now also a number of additional modules you can use for different kinds of processing on top of these. For example:
- Hive for SQL (or a version of SQL they call HiveQL)
- HBase for NoSQL
- Spark for in-memory processing
- Pig for Scripting
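To see why layers like Hive exist: a query such as `SELECT word, COUNT(*) ... GROUP BY word` is really just a map step (project the column) plus a reduce step (count per key). The sketch below shows that translation in plain Python; it is illustrative only, not Hive's actual query planner:

```python
from collections import Counter

# Toy illustration of what SQL-on-Hadoop layers like Hive do: a
# "SELECT col, COUNT(*) ... GROUP BY col" query compiles down to a map
# phase (emit the grouping column) and a reduce phase (count per key).
# Not Hive's real planner, just the idea behind it.

def group_by_count(rows, column):
    mapped = (row[column] for row in rows)  # map: project the grouping column
    return Counter(mapped)                  # reduce: count occurrences per key

rows = [{"word": "hadoop"}, {"word": "spark"}, {"word": "hadoop"}]
print(group_by_count(rows, "word"))
```

The appeal is obvious: analysts who already know SQL can express the query in one line of HiveQL instead of writing map and reduce functions by hand.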
Advantages:
– Very flexible: easy to scale up or down on inexpensive computers or servers.
– Popular: it’s widely used in the big data industry, and as a mature product, support is available.
– Free, as it’s open source. This also means that when software experts make enhancements, they are fed back into the development community for general use.
– It will process any amount of data, petabytes and above. The data can be in any form: structured, unstructured, emails, patents, voicemails, etc.
Disadvantages:
– In its basic state, Hadoop can be complex to use. As a result, commercial distributions with simplified interfaces have been created, for example Cloudera, Hortonworks, and MapR. You’ll pay for support and consultation with many of these.
– MapReduce is not a good match for all analysis. It’s quite complicated to work with, and can often only handle one problem at a time. This means you either need experts to run it, or need to look into alternative options, for example Hive (which has a broader range of skilled practitioners available).
– Hadoop is a more mature option when it comes to open source big data processing, but it’s not necessarily the best. A lot is being said about Spark these days, which is quickly finding its place in industry. Just because Hadoop was out first does not make it the best for your requirements. Maybe you need both? Either way, it’s a decision that needs thought put into it: what do you actually want to process, and how?
First off, a disclaimer: I’ve never used Hadoop. My opinion is formed from articles on the internet and the content of textbooks. So, feel free to disagree.
It seems to me Hadoop was the best option at the time, and it still does some things well. However, it’s difficult to use, and as a result people are adding bits on top which go against the original premise. For example, putting relational technology (SQL) on top of Hadoop, which is neither a database nor relational, seems daft. The whole premise of big data is that existing models of processing can’t cope. Why then try to restrict it backwards with processing we’re already up to speed with? If training, and a lack of experts, is the problem, then train them up. IT professionals of any reasonable skill level are adept at learning new computer languages and platforms; they are inherently interested in learning new ways to do cool things. (And yes, that includes myself.) If big data processing is the means to the future of data analytics, then be open to the alternative ways of working that come with it.
Secondly, it’s all very well loading data into your Hadoop cluster until you’re ready to work with it, but that brings security risks, and expense if you’re on cloud computing services. It’s hard to see how data governance can be enforced if you’re not even sure where your data is, or what is in it. We’re told again and again that cybercrime is on the rise, and to do as much as you can to protect your data, especially customer and core organisational data. However, the risk with big data is that we get overwhelmed by it, and as a result don’t go to the same lengths to protect it. This is more a big data issue than a Hadoop issue. But reading these articles, it’s seen as a major advantage of Hadoop to be able to throw anything into the cluster until you’re ready to use it. One even suggested thinking of Hadoop as a big bucket! As long as you remember to lock that big bucket in a safe, inside a safe, inside a castle with a big moat around it, you may be OK.
I’m a big fan of open source programs. Apart from the fact that they’re free, there is generally a whole community of developers willing to help you use them to the fullest. The downside is that they’re often not as user-friendly as software created for profit, and I believe this is part of the issue with Hadoop. The original developers came up with a great solution. The add-ons since then have either solved other problems using the core functionality of Hadoop as their foundation, or made more user-friendly options available. At its core, Hadoop is a great solution to the processing of very large data. We shouldn’t forget the core value of that solution when it’s overtaken by all the add-ons.