Can R tell us who is the best Female Tennis Player of all time?

What is R?

First off, let’s look at R itself. R is a programming language that was, and is, designed specifically for data analysis. It allows you to manipulate, calculate and graph data. It also allows you to model statistics and save your results to many standard file types, e.g. PDF, JPEG, Metafile, etc.

It’s completely open source, i.e. free, unlike equivalent statistical packages, e.g. SAS.  There’s also a thriving R community to support your use, and expand what’s possible in R. 1

Where to start with R?

I started out with the free Try R course, which gives you this lovely badge on completion: TryRCertification

The R Graphics Cookbook, or its website, was also useful starting off.

However, I found the only way to get going was to choose some data, download R and try it out. R advice websites were hugely beneficial in helping me out when I got stuck. The main thing I discovered is that there is never just one way to do something in R, so keep trying and learning!

Let’s try it!

As I write this blog (05/08/2015), Serena Williams has recently won Wimbledon again, and the US Open is coming up. The question being asked in numerous media & blogs is: who is the greatest Female Tennis Player of all time? The consensus seems to be that it is between Serena Williams and Steffi Graf, but are they right? 2 3 4

Generally, we can’t judge people of different eras together. They may not have had the same opportunities to travel and compete, or maybe their competitors were not as good as those of other eras. However, we have win records; maybe the statistics can help throw light on the argument? Or could they muddy the waters further? Let’s see what R can tell us…

To start, I need data. I searched online, and came up with the names of 11 female tennis players who regularly appear in ‘Top 10 Best Female Players’ type posts. Using that list, I went to the WTA (Women’s Tennis Association) website, and retrieved the stats for the players I’d chosen. These were then saved to a CSV file, and loaded into R.

Load Data
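As a rough sketch of that loading step, the snippet below writes a tiny CSV to a temporary file and reads it back with `read.csv`. The file location, the column names (`Player`, `Wins`, `Losses`) and the placeholder rows are all assumptions for illustration; they are not the WTA figures used in this post.

```r
# Placeholder rows standing in for the real WTA stats (values are made up).
csv_lines <- c(
  "Player,Wins,Losses",
  "Player A,700,150",
  "Player B,650,200"
)

# Hypothetical file location; in practice this would be your own saved CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# Load the CSV into a data frame and take a first look at it.
players <- read.csv(csv_path, stringsAsFactors = FALSE)
str(players)
head(players)
```

`str()` is a handy first check that the numeric columns really were read as numbers rather than text.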

What can we tell from the data?

How do we classify “the Best”?

Singles Titles

Let’s look first at who won the most Singles Slam titles. These are generally considered the benchmark for the best players. So, who won the most?

Single Slams Bar

Margaret Court won 24, ahead of all the competition. Case closed, correct? The data is irrefutable?! However, she won the majority of her titles at the Australian Open in an era when the majority of her competition didn’t travel that far. So, if the best of her competition was not even playing, can her titles still hold up to the current standard? Possibly not. So, what else can we look at?

Total Prize Winnings

In this age, who won the most prize money could be considered a true marker of greatness.  In Golf they even have the ‘Race to Dubai’ every year, which is purely based on Prize money during the year.  Let’s see if that gives us a true answer.

Total Prize Winnings

Serena Williams is the outright winner. Which, let’s face it, makes this another unfair chart. Prize money was nowhere near current levels when Billie Jean King was playing. Even in Steffi Graf’s era, female tennis winnings were not on a par with male winnings. The only legitimate comparison we could make here is between Serena and her sister, Venus, as they are competing in the same era as far as prize money goes.

Career Win Stats

Another line of comparison is career win stats for each player.  That is their Win/Loss record expressed as a percentage.
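A minimal sketch of that calculation in R, assuming a data frame with hypothetical `Wins` and `Losses` columns. The counts are placeholders, not real career records.

```r
# Hypothetical win/loss counts (placeholders, not real career records).
records <- data.frame(
  Player = c("Player A", "Player B"),
  Wins   = c(700, 650),
  Losses = c(150, 200)
)

# Win percentage = wins / matches played * 100, rounded to one decimal place.
records$WinPct <- round(100 * records$Wins / (records$Wins + records$Losses), 1)

# Order from the best record to the worst.
records[order(-records$WinPct), ]
```

Note that this treats every match equally, which is part of why the resulting ranking can look odd next to the slam counts.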

Win Stats

Which, as you can see, proves Venus Williams is the worst player on the list! Obviously, the statistics don’t lie. But… how can we stand over that? Especially when Venus has won more slams & more prize money than Martina Hingis, her closest competitor. It doesn’t seem right. Maybe there is no “best” player in this case.

However, let’s have one more stab at this.

All round player

As it currently stands, fewer and fewer players are competing in Doubles matches as well as the Singles competitions. 5 There are a few reasons for this: less prize money, concentrating effort on the Singles prizes, and the different skillset required to play. There are certainly more Serve & Volley skills required in Doubles than in the average Singles match. There is an argument which says that to be the best all round tennis player you should be winning both Singles & Doubles matches. Have we a player who stands out for both?

This graph groups the slams wins by player:

Total Slams Won

Finally, this graph shows the slams stacked for each player:

Total Slams Won
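The two views above, grouped and stacked, can be sketched with base R’s `barplot`. The matrix layout (title types as rows, players as columns) and the counts are assumptions for illustration, not the figures behind the charts in this post.

```r
# Placeholder slam counts (made up): rows are title types, columns are players.
slams <- matrix(
  c(10, 5, 2,
     8, 6, 1),
  nrow = 3,
  dimnames = list(c("Singles", "Doubles", "Mixed"), c("Player A", "Player B"))
)

# Grouped: one bar per title type, side by side for each player.
barplot(slams, beside = TRUE, legend.text = TRUE, main = "Slams won by type")

# Stacked: one bar per player, with segments stacked so the height is the total.
barplot(slams, beside = FALSE, legend.text = TRUE, main = "Total slams won")

# The stacked bar heights are simply the column totals.
colSums(slams)
```

The only difference between the two calls is the `beside` argument, which makes it easy to flip between the views while exploring.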

So, Martina Navratilova is the winner for Best Player! Serena Williams comes in a close second and, as she is still playing, could still reach the top. I’m happy with that result, but then that shows my personal preferences.


As you can see, depending on how we ask, we get a different answer. The phrase “Lies, damned lies, and statistics” comes to mind. Let’s look at a summary of the players:


In this case, there is no correct answer with the data I’ve entered. Martina Navratilova comes out top in more categories than Serena Williams, with Chris Evert and Margaret Court coming up next behind them. Surprisingly, Steffi Graf is a little behind. However, that could say more about the fact she gave up Tennis at a relatively young age, or the quality of her opposition, but who’s to say? There are alternative possible means of getting an answer in this case; however, I won’t be continuing with this analysis further.

Alternative Suggestions

1. You could look at total career titles, not just the slams.  This would be over their entire career, and not just the headline grabbing main competitions.

2. You could look at each player’s whole career, rank the quality of their opposition, and, using the resulting quality scores, analyse who was more successful.

3. You could look at match stats, such as unforced errors, serving stats, etc.


I may not have confirmed who the best ever female tennis player was, but I acquired a good understanding of a subsection of R. The Try R course was a good starting point, but I didn’t feel very confident with my knowledge immediately afterwards. As with most programming languages, actually working with real world data makes it easier to learn. In addition, you gain from working through the frustration of figuring out something that won’t work. The community sharing help for R makes it even easier, as long as you put the work in.

I feel I’ve only scratched the surface of what is possible in R. It’s worth considering other R courses, or available training online, to advance the knowledge. For example, there is a free R Programming course from Johns Hopkins on Coursera to learn more of the options available within the environment. Interesting assignment. Thank you.










Google Fusion Tables

Fusion Tables is an experimental offering from Google for use in Data Management. It allows you to gather, visualise & share data 1.

Data can be converted into Charts, Network graphs, Scatterplots, Timelines, & Geographical Maps.

For this assignment, we were asked to use Fusion Tables to create a Heat Map of the Republic of Ireland using the 2011 census data.


So, how do we go about this task?

First we have to find the data. The population data came from the Central Statistics Office website. This I saved as a comma delimited CSV file.

The Irish KMZ Datafile which gave the borders for each county came from

I loaded both files into Google Drive, and cleaned up any anomalies in the data which weren’t helpful in this case. For example, we were only interested in Dublin county as a whole, not each council area. In addition, Galway City & County would not match to Galway on the KML file, and we were only interested in the entire County Galway population. Another anomaly, which I didn’t spot until I’d created my Geographic map, was Tipperary. Tipperary North & South were on the CSV file, but only the boundary for the entire county of Tipperary was in the KML file. This meant I had a blank area for the whole of Tipperary. Once I merged the two Tipperary figures, the gap in the image was filled in.

Anyway, once you’re happy with the data, you need to merge the tables. To do this, open one of the spreadsheets. Then click File>Merge…, which will open up another window where you can choose the name of the other table you want to merge. Then choose how you want to merge the data: in our case, the Name field from the map_lead table and the County field in the County Population table. You’ll then have a merged table, which I’d advise renaming to something more appropriate.
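For comparison, the same clean-up and join can be sketched in base R. This is only an illustration of the idea, not what Fusion Tables does internally; the population figures are placeholders and the `Geometry` column is a hypothetical stand-in for the KML polygons.

```r
# Placeholder population rows (figures are made up for illustration).
population <- data.frame(
  County     = c("Tipperary North", "Tipperary South", "Leitrim"),
  Population = c(100, 200, 50),
  stringsAsFactors = FALSE
)

# Boundary table keyed by Name; Geometry stands in for the KML polygons.
boundaries <- data.frame(
  Name     = c("Tipperary", "Leitrim"),
  Geometry = c("<polygon 1>", "<polygon 2>"),
  stringsAsFactors = FALSE
)

# Clean-up: collapse the two Tipperary rows into a single county row.
population$County <- sub("^Tipperary (North|South)$", "Tipperary", population$County)
population <- aggregate(Population ~ County, data = population, FUN = sum)

# Join the two tables: Name on the map side, County on the population side.
merged <- merge(boundaries, population, by.x = "Name", by.y = "County")
merged
```

By default `merge` keeps only rows whose keys match in both tables, so a county with a mismatched name silently drops out, which is the same kind of gap that showed up for Tipperary on the map.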

In the ‘Map of geometry’ tab you’ll now see the merged data, albeit all in one colour.  To make this a useful image, we’ll want to graduate the colours used, based on population counts.  To do this, choose ‘Change feature styles …’ from the left hand side of the screen.  On the subsequent pop up window, choose ‘Fill Color’ under Polygons, and the ‘Buckets’ tab.  I divided this into 5 buckets.  The system will automatically break up the population figures, if you allow it.  However, if you try it, you’ll end up with Dublin as one colour, and the rest of the country lagging behind.  While this may be a fair reflection, it doesn’t make a great image.  So, I broke the bands up as follows: 0 – 70,000: Blue; 70,000 – 100,000: Green; 100,000 – 140,000: Yellow; 140,000 – 190,000: Orange; 190,000 – 1,273,070: Red.
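The manual banding above can be sketched in R with `cut()`. The band edges and colours are the ones listed; the sample populations are placeholders, and putting each boundary value into the lower band is my assumption, since the listed bands share their edges.

```r
# Band edges and colours as listed above.
breaks  <- c(0, 70000, 100000, 140000, 190000, 1273070)
colours <- c("Blue", "Green", "Yellow", "Orange", "Red")

# Placeholder county populations (made up), one per band.
population <- c(50000, 85000, 120000, 150000, 1273070)

# cut() assigns each value to its band; intervals are closed on the right,
# so 70,000 falls in the first band and 1,273,070 in the last.
bucket <- cut(population, breaks = breaks, labels = colours, include.lowest = TRUE)
data.frame(population, bucket)
```

Choosing the breaks by hand like this is what keeps one very large value from flattening every other county into a single colour.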

I then made the map Public on the internet. This allowed me to publish the map on this blog.

All of which leaves you with this:



The map dramatically shows the distribution of the Republic’s population around the big cities: Dublin, Cork, Limerick & Galway. There are whole swathes of Connaught and the Border areas which are underpopulated as a result. Even without knowing anything about the country’s infrastructure, you could surmise that most of the jobs are in these areas.

So, how could we make the map more useful?  What could we add?


It’s apparent that the majority of the motorways start in Dublin and spread out to the main cities like the spokes of a bicycle wheel. Limerick to Cork or Galway is not motorway the whole way. Going from Galway to anywhere north of the Galway-Dublin line is also not on motorways. Maybe if the infrastructure existed for industry to get in and out of Sligo or Leitrim, for example, we wouldn’t have population black spots there.


What percentage of each county is unemployed? Or which has suffered the most from emigration? Say, for the sake of argument, that there is 5% unemployment in Dublin and 25% in Leitrim. Using these figures, Enterprise Ireland or IDA Ireland could prioritise the areas with the worst employment figures for future employment opportunities.


Where are hospitals based around the country?  And what specialities do they have?  If you’re injured in Roscommon on a Sunday at 8pm, where do you have to go for A&E?  If you’re living in Donegal, where do you have to go for Cancer treatment?  Are these distances appropriate for Ireland, and their populations?  Or are all the resources being spent in the capital?


Fusion Tables is a really easy product to use. There is no need to know anything about mapping tools to use it. It’s part of the Google suite of applications, and if you’re already working with those tools, it’s handy to take data you’ve already stored on Google Drive and work with it.

However, I’m not a fan for one reason: there are so many better applications out there at the moment.

What I mean by that is, at one end of the user spectrum, it’s not as fun, interactive or useful for routes as, say, Google Earth & Google Maps 4. At the other end, it isn’t capable of handling sizable data: it currently has a size limit of 250MB per table and 1GB in total, and it can’t handle more than 350,000 features on a map. That rules it out for big data, or any serious data analysis. Students studying for a PhD in Science, for example, would often have more than that, never mind industry.

In my personal opinion, even publishing your map onto another webpage, or even a WordPress blog (!), is more difficult than with QGIS, which is Open Source.

To be fair to Google, it is documented as an ‘Experimental’ application. So, maybe it’s the starting point for something that will eventually take on ArcGIS, QGIS, or many of the other mapping applications, open source or not 3, not forgetting the functionality currently available in Excel, e.g. Pivot tables, charts, etc.

It’ll be interesting to see where they go with it into the future.








Tesco Clubcard Data

Tesco was one of the leading lights of Customer data analytics in the world of Retail.  In 1994 they first looked into the Clubcard idea.  They trialled it in a few stores, working with a Customer science company called Dunnhumby.  Reportedly, the then Chairman of Tesco, Lord MacLaurin, commented “What scares me about this is that you know more about my customers after three months than I know after 30 years.” 1

Based on the feedback and information they accumulated from the trials, they rolled it out throughout the Tesco business in 1995. This data was then utilised to target customer marketing, and for supply chain optimisation, to drive profits across the company. Up until the early 2010s this was the textbook case study for customer data analysis driving profit:



As late as 2013, Tesco “stressed that they do not sell their loyalty data to third parties”. 3

In January 2015, Goldman Sachs were brought on board to prepare the Tesco-owned Dunnhumby for sale, with a reported £2billion price tag attached. 4

So, what happened?

Tesco announced a Pre-Tax loss of £6.38billion up to the end of February 2015.  5

Depending on what you’re reading, various areas are blamed:

  • £250million profit overstatement,
  • Competitive Marketplace,
  • An unsuccessful attempt to expand into the American market with ‘Fresh & Easy’ convenience stores,
  • Property & Share losses,
  • Tesco themselves blame a shift online which, while successful in itself, has seen profits fall.

I suspect a perfect storm of all of the above, the worldwide recession, and Tesco taking their eye off the core business was to blame.

So, heads have rolled, and non-core businesses are up for sale or sold, including the Clubcard consumer data.



Tesco reportedly have a database of one billion worldwide shoppers, and the experience to analyse & use this data for customer marketing.

“Major food and drinks companies like Coca-Cola, Unilever and Nestle, are all willing to pay Tesco for access to the data and the consumer insight and shopping habits it provides.” 6

Google & WPP, an advertising and marketing company, are also in the mix.

At this point, you’ll read all over the internet about how Tesco were so focused on Data they lost sight of their customers, and maybe that’s true.

Harvard Business Review

LinkedIn Post

Tesco will also have to try and learn the lessons and turn the business around fast.

However, I’d like to cover two other points instead:

  1. Who would want to buy the data?
  2. Who owns the data?


Tesco have been working with this data for 15 to 20 years now. They’ve maximised savings and learned quite a bit about consumer habits from it. But it didn’t stop them from falling & losing customers. So, is it worth spending £2billion to buy it?

The low-cost competitors who are receiving some of the blame don’t have loyalty systems. They spend a fortune on marketing to get customers through the doors, but they don’t market directly to specific segments of their customers. Even so, I don’t think many people would believe they’re doing no customer analytics just because they lack loyalty schemes. If they can do it alone, why can’t others? So, what will the new buyers of the database gain from buying the company?

They’ll get expertise in the form of expert staff. They’ll get access to a database they’ve never seen before, and may be able to derive new data from it. However, the data is in the past. Technology and recession have changed buyer behaviour. So, is the data valid into the future?

Is it possible that the data itself was part of Tesco’s downfall? Was it prepared for the fact that I can order coffee capsules in bulk online, and will never need coffee from the supermarket? Or that if my next door neighbour does their shopping online, they’re not tempted by items they don’t need, or by pester power from their child? What about the statistical increase in one and two person households, in Ireland at least? Their shopping baskets are bound to be smaller than expected. 7

I agree that if the new owners ask the right questions of the data, they may get new answers that hadn’t been considered before.  However, from my perspective, the employees may be the greatest asset in the sale.


OK, let’s be clear: Tesco own the data, for now. As a customer, if you had a Clubcard, you were clearly swapping your purchasing habit data for ‘Points’. (The fact that supermarkets, credit card companies, etc. collect it regardless of this permission is for another rant another day.)

However, I believe this was on the understanding that it stayed with Tesco for their use. The terms may now say they can anonymise the data and share it with third parties, but I suspect that wasn’t there in 1995/96!

Is this a breach of customer trust? Can I refuse to have my data transferred? Opt out? Or should I just accept it’s going to happen and forget about it? Frankly, I shop in Tesco about twice a year, so I’m not all that concerned on my own behalf. However, it’s the principle of a ‘fair’ swap that seems to be getting phased out in this world of consumer marketing.

Of course, as a Data Analyst, I’d love to see what I could find in the data!


From one of the original innovators in Retail customer data analysis, Tesco are now having to go a new route.  The Data will move on, and both Tesco and the new owners will have to re-think existing marketing strategies.

I think one of the clearest lessons to come out of this whole thing is that customers know what they want. When you stop delivering it, they’ll move on. You need to keep up with the changing world, but ensure you’re continuing to put customers first & foremost in your plans. Investing in side lines, and things your customers don’t value, is a recipe for failure.

In the end, data can only try and give you an answer to a question. If it’s the correct question, you’re on a winner. If it’s not, you can go down the wrong path completely.