Hadoop is an open source suite of programs, procedures & tools created by the Apache Software Foundation. They are designed to facilitate the analysis of very large datasets of structured & unstructured data.

Hadoop makes it possible to work on thousands of nodes involving many terabytes, or even petabytes, of data. This makes analysis of big data substantially easier. It also has rapid data transfer rates among nodes, meaning if one node goes down, its work can be transferred to another. This gives it a high degree of fault tolerance, reducing the risk of failures slowing processing.

So, how does it work? At its simplest, Hadoop takes a big data analysis problem and breaks it down into smaller problems. It then distributes the smaller problems to inexpensive distributed computers or servers for parallel processing, and finally combines the results for easy analysis or further processing.
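To make that split-process-combine idea concrete, here is a toy sketch written in R (the language used later in this blog). It is purely illustrative and is not actual Hadoop code: the "map" step counts words in each line independently, and the "reduce" step merges the partial counts.

```r
# Toy illustration of the MapReduce idea (not real Hadoop code):
# split the work into pieces, process each piece independently, then combine.
lines <- c("big data is big",
           "hadoop splits big problems",
           "data data everywhere")

# "Map": count the words in each line separately
# (in Hadoop these pieces would run in parallel on different nodes)
partial_counts <- lapply(strsplit(lines, " "), table)

# "Reduce": merge the partial counts into one overall word count
combine <- function(a, b) {
  all_words <- union(names(a), names(b))
  sapply(all_words, function(w) {
    sum(a[names(a) == w]) + sum(b[names(b) == w])
  })
}
total_counts <- Reduce(combine, partial_counts)
total_counts
```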


Hadoop was inspired by Google's MapReduce white paper, which described why Google created MapReduce: to be able to index the huge increase in data on the internet. Hadoop was released in 2005 by the Apache Software Foundation as an open source product. One of the more interesting facts about its history is the name: Doug Cutting, one of the creators of Hadoop, named it after his child's stuffed toy elephant. Digital marketers all over the world are despairing as a result…


Hadoop Ecosystem:

Hadoop was originally composed of the core components MapReduce and HDFS (Hadoop Distributed File System). However, as further components were added for specific needs, the number of components increased. These are now generally referred to as the Hadoop Ecosystem.

Let’s look at the base modules:

Hadoop Distributed File System (HDFS)

A file system that keeps track of data across a large number of linked storage devices. It can be accessed by any computer using a supported operating system. It will accept any type of data: you just put it in the cluster and leave it there until you decide how you want to process it.

Hadoop actually supports many different file systems; for example, Amazon Web Services integrates Hadoop with its own S3 file system. But HDFS is the Hadoop version.


MapReduce

This is the default data processing system. It's Java-based. As Hadoop is not a relational database, or indeed a database at all, you cannot use SQL to get answers from the data. As a result, you use NoSQL.

Hadoop Common

Tools and libraries needed for other Hadoop modules.


YARN (Yet Another Resource Negotiator)

Manages the resources of the systems which store the data & run the analysis.


[Diagram: the Hadoop ecosystem]

There are now also a number of additional modules you can use for different kinds of processing on top of these base modules; Hive, mentioned below, is one example.


Advantages:

– Very flexible. Easy to scale up or down on inexpensive computers or servers.

– Popular. It's widely used in the Big Data industry, and as a mature product offering, support is available.

– Free, as it's open source. This also means that if software experts make enhancements, they are fed back into the development community for general use.

– It will process any amount of data, petabytes and above. The data can be in any form: structured, unstructured, emails, patents, voicemails, etc.


Disadvantages:

– In its basic state, Hadoop can be complex to use. As a result, commercial versions with simplified use are being created, for example Cloudera, Hortonworks and MapR. You'll pay for support & consultation with many of these.

– MapReduce is not a good match for all analysis. It's quite a complicated piece to work with, and can often only handle one problem at a time. This means you either need experts to run it, or you need to look into alternative ways to run it, for example Hive (which has a broader range of skilled practitioners available).

– Hadoop is a more mature option when it comes to open source Big Data processing, but it's not necessarily the best. A lot is being said about Spark these days, which is quickly finding its place in industry. Just because Hadoop was out first does not make it the best for your requirements. Maybe you need both? It's a decision that needs thought put into it: what do you actually want to process, and how?


First off, a disclaimer: I've never used Hadoop. My opinion is formed from articles on the internet and the content of textbooks, so feel free to disagree.

It seems to me that Hadoop was the best option at the time, and it still does some things well. However, it's difficult to use, and as a result people are adding bits on top which go against the original premise. For example, putting relational technology (SQL) on top of Hadoop, which is neither a database nor relational, seems daft. The whole concern of big data is that existing models of processing can't cope, so why try to restrict it backwards with processing we're already up to speed with? If training, and a lack of experts, is the problem, then train them up. IT professionals of any reasonable skill level are adept at learning new computer languages and platforms. They are inherently interested in learning new ways to do cool things. (And yes, that includes myself.) If big data processing is the means to the future of data analytics, then be open to the alternative ways of working that come with it.

Secondly, it's all very well loading data into your Hadoop cluster until you're ready to work with it, but that brings security risks, and expense if you're on cloud computing services. It's hard to see how data governance can be enforced if you're not even sure where your data is, or what is in it. We're told again and again that computer piracy is on the up, and to do as much as we can to protect our data, especially customer and core organisational data. However, the risk with big data is that we get overwhelmed by it, and as a result don't go through the same rounds of protecting it. This is more a big data issue than a Hadoop issue, but in these articles it's seen as a major advantage of Hadoop to be able to throw anything into the cluster until you're ready to use it. One even mentioned thinking of Hadoop as a big bucket! As long as you remember you need to lock that big bucket in a safe, inside a safe, inside a castle with a big moat around it, you may be OK.

I'm a big fan of open source programs. Apart from the fact that they're free, there is generally a whole community of developers willing to help you use them to the fullest. The downside is that they're often not as user-friendly as those created to make a profit, and I believe this is part of the issue with Hadoop. The original developers came up with a great solution. The add-ons since then have either been solutions to other problems, using the core functionality of Hadoop as their foundation, or have made more user-friendly options available. At its core, Hadoop is a great solution to the processing of very large data. We shouldn't forget the core value of that solution when it's overtaken by all the add-ons.




What is Data?

According to Wikipedia, data is “uninterpreted information”. It then goes on to ask what type of data you are interested in. Options include ‘a fictional android from Star Trek’, a book by Euclid, a moth, a British drum & bass musician, and a ‘non-governmental organisation founded by Bono’! 1  For this blog I'm interested in computing data, although the android lived an interesting life too…

Data is raw facts: numbers, text, images, symbols, etc. On its own, data means nothing. Data needs to be interpreted, grouped & processed in order to become usable information. After all, “Colonel Mustard, in the library, with the candlestick” doesn't mean anything. However, if you know you're playing Cluedo, it makes all the difference!


What is information?

This leaves us with the understanding that information is interpreted data.  Information is what we use to make decisions & decide future actions.

If you've ever applied for a mortgage, you know the amount of financial information you're required to hand in to the bank: salary information; bank accounts, balances & statements for 6 months; any loans; details of current rent, etc. At its basic level it's just figures on a page. The bank then groups this data together into earnings & outgoings. This information is then used to decide if a mortgage approval will be given, and for how much.

In order for the decision to be useful, the processed data must be timely, accurate and complete.

If we go back to our mortgage example: if the information is 2 years old, it's not useful for making an informed decision on a customer's current financial situation. We may not know about a new gambling addiction that would rule out approving a mortgage. In the same way, if we believe a customer has an outstanding loan of €1,000, but it's actually €10,000, that makes a big difference to the decision. Lastly, if the bank only records one salary, rather than two for a couple, the financial picture will not be complete and the decision cannot be correct.

In days gone by, all of the above mortgage decision-making was done on paper in your bank, where they knew your name. If they were still unsure, you'd get an important member of the community to give you a reference, i.e. a priest, doctor, etc.! These days, it's all on computers, for good or ill! As an IT professional, I'd say it's progress, and point out the amazing things we can now do with IT. But then, you couldn't have put a country billions in debt the old way…

What is Knowledge?

Knowledge is the understanding of the context of information: how to interpret the information.



Reverting back to our mortgage example for the last time: looking at one person's income & expenditure only gives the scoring team so much information to make a decision. However, if they have the same information for 1,000 previous scoring decisions, and the outcomes of the mortgages granted, then they will be able to look at the information with experienced eyes. Looking at your expenditure and seeing a BetFair or PaddyPower online account regularly used is just information. However, if the last 20 people they gave a mortgage to with the same outgoings history got into financial difficulties, then that's a concern. Only by having that context on the information can the bank acquire the knowledge to look for the red flags.


We now know that data is the building block for knowledge.  So, how does a computer store data?

That is to be continued in a later blog…





1. https://en.wikipedia.org/wiki/Data_%28disambiguation%29

Can R tell us who is the best Female Tennis Player of all time?

What is R?

First off, let's look at R itself. R is a programming language that was, and is, designed specifically for data analysis. It allows you to manipulate, calculate and graph data. It allows you to model statistics and save your results to many standard file types, e.g. PDF, JPEG, metadata, etc.
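As a flavour of how little code that takes, here is a minimal sketch of my own (not from any course or from the tennis analysis below): it draws a simple chart and saves it as a PDF.

```r
# Minimal example: draw a simple chart and save it as a PDF
pdf("example_plot.pdf")            # open a PDF graphics device
plot(1:10, (1:10)^2, type = "b",
     xlab = "x", ylab = "x squared",
     main = "A first plot in R")
dev.off()                          # close the device and write the file
```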

It's completely open source, i.e. free, unlike equivalent statistical packages, e.g. SAS. There's also a thriving R community to support your use of it, and to expand what's possible in R. 1

Where to start with R?

I started out with the free Try R course from www.codeschool.com, who give you a lovely badge on completion of the course.

The R Graphics Cookbook, or its website (http://www.cookbook-r.com/), was also useful when starting off.

However, I found the only way to get going was to choose some data, download R and try it out.

Stackoverflow.com, and many other R advice websites, were hugely beneficial in helping me out if I got stuck.  The main thing I discovered is that there is never just one way to do something in R, so keep trying and learning!

Let’s try it!

As I write this blog (05/08/2015), Serena Williams has recently won Wimbledon again, and the US Open is coming up. The question is being asked in numerous media outlets & blogs: who is the greatest female tennis player of all time? The consensus seems to be that it is between Serena Williams and Steffi Graf, but are they right? 2 3 4

Generally, we can't judge people of different eras together. They may not have had the same opportunities to travel and compete, or maybe their competitors were not as good as in other eras. However, we have win records; maybe the statistics can help throw light on the argument? Or could they muddy the waters further? Let's see what R can tell us…

To start, I needed data. I searched online, and came up with the names of 11 female tennis players who regularly appear in ‘Top 10 Best Female Players’ type posts. Using that list, I went to the WTA (Women's Tennis Association) website and retrieved player stats for the players I'd chosen. These were then saved to a CSV file and loaded into R.

[Code screenshot: loading the data into R]
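For illustration, here is a minimal sketch of that loading step; the file name and any column names are assumptions, not the original script.

```r
# Load the player stats exported from the WTA website
# (the file name here is an assumption for illustration)
players <- read.csv("wta_player_stats.csv", stringsAsFactors = FALSE)

# Quick check of what was loaded
str(players)
head(players)
```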

What can we tell from the data?

How do we classify “the best”?

Singles Slam Titles

Let's look first at who won the most Grand Slam singles titles. These are generally considered the benchmark for the best players. So, who won the most?

[Chart: Grand Slam singles titles by player]
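A bar chart like this can be drawn with the ggplot2 package; here is a sketch, assuming the data frame loaded earlier has Player and SinglesSlams columns (the column names are my assumption).

```r
library(ggplot2)

# Bar chart of Grand Slam singles titles per player,
# ordered from most to fewest titles (column names assumed)
ggplot(players, aes(x = reorder(Player, -SinglesSlams), y = SinglesSlams)) +
  geom_bar(stat = "identity") +
  labs(x = "Player", y = "Grand Slam singles titles") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```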

Margaret Court won 24, ahead of all the competition. Case closed, correct? The data is irrefutable?! However, she won the majority of her titles at the Australian Open, in an era when the majority of her competition didn't travel that far. So, if the best of her competition wasn't even playing, does her record still hold up to the current standard? Possibly not. So, what else can we look at?

Total Prize Winnings

In this age, who won the most prize money could be considered a true marker of greatness. In golf they even have the ‘Race to Dubai’ every year, which is based purely on prize money won during the year. Let's see if that gives us a true answer.

[Chart: total prize winnings by player]

Serena Williams is the outright winner. Which, let's face it, is another unfair chart. The prize money on offer was nowhere near current levels when Billie Jean King was playing. Even in the era of Steffi Graf, female tennis winnings were not on a par with male winnings. The only legitimate comparison we could make in this case is Serena against her sister, Venus, as they are competing in the same era prize-money wise.

Career Win Stats

Another line of comparison is the career win stats for each player, that is, their win/loss record expressed as a percentage.
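That percentage is simple to derive in R; a sketch, assuming the data frame has Wins and Losses columns (again, assumed names).

```r
# Career win percentage from wins and losses (column names assumed)
players$WinPct <- with(players, 100 * Wins / (Wins + Losses))

# Players ordered from highest to lowest win percentage
players[order(-players$WinPct), c("Player", "Wins", "Losses", "WinPct")]
```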

[Chart: career win percentage by player]

Which, as you can see, proves Venus Williams is the worst player on the list! Obviously, the statistics don't lie. But… how can we stand over that? Especially when Venus has won more slams & more prize money than Martina Hingis, her closest competitor here. It doesn't seem right. Maybe there is no “best” player in this case.

However, let’s have one more stab at this.

All round player

As it currently stands, fewer and fewer players are competing in doubles matches as well as the singles competitions. 5  There are a few reasons for this: less prize money, concentrating effort on the singles prizes, and a different skillset required to play. There are certainly more serve-and-volley skills required in doubles than in the average singles match. There is an argument which says that to be the best all-round tennis player you should be winning both singles & doubles matches. Have we a player who stands out in both?

This graph groups the slam wins by player:

[Chart: total slams won, grouped by player]

Finally, this graph shows the slams stacked for each player:

[Chart: total slams won, stacked by player]
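Both of these charts can come from essentially the same ggplot2 call; a sketch, assuming a ‘long’ data frame called slams with one row per player per slam and columns Player, Slam and Titles (all assumed names). Switching position between "dodge" and "stack" gives the grouped and stacked versions.

```r
library(ggplot2)

# slams is assumed to be in "long" format:
# one row per player per slam, with columns Player, Slam and Titles

# Grouped bars: one bar per slam, side by side for each player
ggplot(slams, aes(x = Player, y = Titles, fill = Slam)) +
  geom_bar(stat = "identity", position = "dodge")

# Stacked bars: the slams piled on top of each other for each player
ggplot(slams, aes(x = Player, y = Titles, fill = Slam)) +
  geom_bar(stat = "identity", position = "stack")
```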

So, Martina Navratilova is the winner for best player, with Serena Williams coming a close second; and as she is still playing, she could still reach the top. I'm happy with that result, but then that shows my personal preferences.


As you can see, depending on how we ask the question, we get a different answer. The phrase “lies, damned lies and statistics” comes to mind. Let's look at a summary of the players:
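One way to pull a summary like that together in R is to ask which player comes out on top in each numeric category; a sketch, with the category column names assumed.

```r
# For each numeric statistic, find which player comes out on top
# (the column names here are assumptions)
stats <- c("SinglesSlams", "DoublesSlams", "PrizeMoney", "WinPct")
leaders <- sapply(stats, function(col) players$Player[which.max(players[[col]])])
leaders
```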


In this case, there is no correct answer with the data I've entered. Martina Navratilova comes out tops in more categories than Serena Williams, with Chris Evert and Margaret Court coming next behind them. Surprisingly, Steffi Graf is a little behind. However, that could say more about the fact that she gave up tennis at a relatively young age, or about the quality of her opposition, but who's to say. There are alternative possible means of getting an answer in this case; however, I won't be continuing with this analysis further.

Alternative Suggestions

1. You could look at total career titles, not just the slams. This would cover their entire careers, and not just the headline-grabbing main competitions.

2. You could look at the players' whole careers, rank the quality of their opposition, and, using the resulting quality scores, analyse who was more successful.

3. You could look at match stats, such as unforced errors, serving stats, etc.


I may not have confirmed who the best ever female tennis player is, but I acquired a good understanding of a subsection of R. The Try R course was a good starting point, but I didn't feel very confident with my knowledge immediately afterwards. As with most programming languages, actually working with real-world data makes it easier to learn. In addition, you gain from working through the frustration of figuring out something that won't work. The community sharing help for R makes it even easier, as long as you put the work in.

I feel I've only scratched the surface of what is possible in R. It's worth considering other R courses, or training available online, to advance my knowledge. For example, there is a free R Programming course from Johns Hopkins on Coursera to learn more of the options available within the environment. An interesting assignment; thank you.



1. http://www.inside-r.org/why-use-r

2. http://www.forbes.com/sites/davidlariviere/2015/01/31/steffi-graf-remains-best-womens-tennis-player-of-all-time-over-serena-williams/

3. http://theconversation.com/is-serena-williams-the-greatest-female-tennis-player-of-all-time-44527

4. https://sport.bt.com/more-sport-hub/women-in-sport/serena-williams-is-now-the-greatest-female-tennis-player-of-all-time-S11363959338693

5. http://www.tennis.com/pro-game/2013/12/john-mcenroe-doubleswhy-are-we-even-playing-it/49913/#.Vcj0OPm8Dl5