I’m quietly quite proud of the data-driven content I’ve produced over the last few years, and I think there are still deep veins of data-rich content waiting to be tapped by organisations that want to experiment.
Data is, of course, de rigueur amongst journalists after the Nate Silver figured out how to use it to create to news stories and David McCandless figured out how to make it pretty. Here’s the Google Trends graph for data journalism, presented in a data journalismy format:
And as we all know, where journalists go, content marketers follow.
The problem with “doing data” is you need a fairly broad range of skills. First and foremost you really need a head for figures. You also need a basic understanding of how data is published online. You really need to know how to use a spreadsheet. And know how to make graphs and charts. And lastly you need to know how to tell a good story in a way your audience will understand.
Data is like a box of chocolates
You never know what you’re gonna get
The data you need is out there somewhere – but until you stumble upon it it’s impossible to know what format it’ll be in. Most of the time, though, you’ll get it in three ways – as a raw data file (CSV, XLS or something a little more exotic like XML), published on a web page (HTML), or just by asking (awkward human contact).
CSVs and XLS files
The data you need for your stories will usually come from a public body such as the Office for National Statistics or be published somewhere online, either as an HTML page or as a CSV or XLS.
CSVs and XLS files are usually the easiest. The ONS website is a bit of a pig to navigate so I usually find the data I want using some Google search with a few parameters. For example, if I want to find the raw Excel file for the ONS’ Labour Market Statistics, I could tell Google to look only on the ONS site using the site: parameter, and only Excel files using the filetype: parameter. Thus, this search:
site:ons.gov.uk filetype:xls labour market
Returns exactly the data I want. Generally speaking if you know where the data is, and what format it’s in, Google and a few search parameters are your best friend. You can also throw in a date range if you want to make sure you get the latest data, or historic data from a specific time.
From a website
If the data you need is published on a website, you have a few options. The first and most obvious is to just copy and paste it into a spreadsheet (protip: use CTRL+SHIFT+V to paste with correct formatting).
If you want a slightly cleaner option, try the Google Spreadsheets IMPORTHTML function, which allows you to pull HTML lists or tables into a spreadsheet automatically. This is especially useful for large tables or data which might change regularly (when the page updates, so does the data in your spreadsheet).
Select the cell you want to be the upper-left corner of the table in Google Spreadsheets and use for following format:
=IMPORTHTML(“Full URL of where the data is“, “table or list“, Element number)
For example if I wanted to grab the big 2010 Election results table from this Wikipedia page (which is the eleventh table on the page – the variables in the function above) I would do the following:
The one consideration with this method is that if you’re pulling the data from a small or unreliable website it may disappear, breaking your spreadsheet – so keep a hard copy too. You can do this by selecting everything the IMPORTHTML function imports, copying it, creating a new sheet, and using Edit -> Paste Special -> Paste values only.
You’ll be surprised how generous people are with their data. Unless it’s commercially sensitive or cannot be released for legal reasons often all you have to do to get your hands on it is ask. Shoot someone an email and you’ll be surprised how often you can get what you’re after.
Basic spreadsheet skills
If you’re a functional human who works on a computer I would hope that you have a basic knowledge of how spreadsheets work. You put numbers in cells A1 and B1, then in C1 you type =A1+B1 and it adds them up. Basic arithmetic is important, but most of the data you’ll get when doing data-driven content will be quite expansive – so knowing how to play with large datasets to pull out the data you want is just as important as doing sums.
For these examples I’ll be using Google Spreadsheets, as it’s my spreadsheet app of choice (and it should be yours, too).
A good rule of thumb when putting together big spreadsheets is that programmatic or function-based methods of manipulating data beat out manual methods every day of the week.
For example, if you’re doing some analysis of data grouped by UK postcode and want to separate out the data associated with each postcode group into different sheets, you could copy and paste each group, but there are 127 postcode groupings so that shit will get old real fast.
A better approach would be to isolate each group with a formula, do your first bit of analysis, then replicate the whole sheet and apply the same analysis to the next grouping.
Basically, if you find yourself doing lots of manual actions in a spreadsheet you’re doing it wrong – it’ll be well worth the half an hour it takes to learn a new function to save yourself hours down the line.
The FILTER function will become your bread and butter method for chopping and grouping data for analysis. If you’ve ever used an Excel filter to distil your data down, the Google Spreadsheets FILTER function achieves the same thing, but allows you to pull the data into other sheets where you can do your analysis. I prefer to do my number crunching this way – one sheet containing the master data, and extra sheets where I can pull out and dig into each set of data.
Lets say after I imported the Election data above, I wanted to do some analysis on just the parties which fielded fewer than 50 candidates. I can use a FILTER function to isolate that data like this:
A closer look at that function:
=FILTER(The data the function should analyse and return, The data the function should apply the variable to & the variable)
Or, put another way:
Show me this data if the values here equal this.
So in this case the function is looking at columns B to K in the sheet called Results (Results!B:K), and returning rows where column C (the number of candidates fielded) is less than 50 (Results!C:C<50). You can apply extra filters too, so if I wanted to look for parties which fielded fewer than 50 candidates and received fewer than 20,000 votes, I could use the following:
=FILTER(Results!B:K, Results!C:C<50, Results!D:D<20000)
Filter is, for my money, the best function for chopping data into usable portions ready for analysis. I won’t go into each and every way you can analyse the data (because that would require a book), but your favourites will be:
- Mean and Median for averages
- Equals (=) and Does Not Equal (<>) for direct comparisons
- UNIQUE for de-duplicating data
- COUNT for, erm, counting data
- COUNTIF for counting with conditions
- COUNTBLANK for counting empty cells
Mama always had a way of explaining things so I could understand them
Graphs are your best good friend
Once you’ve chopped, normalised and analysed your data, it’s time to make it meaningful for humans. A picture is worth a thousand words, and a good graph is worth a hundred thousand spreadsheet cells.
The first hurdle is deciding what kind of graph or chart to use. Line? Bar? Bubble Candlestick? Your choices are endless – but thankfully, some smart person (Andrew Abela of the Catholic University of America, to be precise) has done a lot of the legwork for you.
(Full PDF here)
So once your know how you’re presenting your data, you need to figure out how to build it. Thankfully with the proliferation of data journalism has come an equivalent avalanche of self-service graphing tools. Here are just a few.
I’ve used all of the above and eventually settled on WP Charts as I think it offers the best mix of customisation while still being relatively straightforward. As the charts are native to your site there are also fewer outside dependencies.
Making your data and your story go together
Like peas and carrots
One of the best things about using data in your content is it can challenge your assumptions. For example, this piece I wrote last year about footballers wages began life as a fairly cheap “footballers are overpaid crybabies” pitch, but as I ran the numbers I actually discovered that footballers make incredibly high tax contributions and it’s difficult to conclude their high wages are bad for wider society. The data became the foundation of the piece.
Elsewhere, I started this piece about Amazon’s tax payments knowing exactly what the situation was – the data aspect served to back up the argument and visualise the point I was trying to make.
Sometimes the data you have and the way you present it can be so amazing that it becomes the whole piece – this awesome New York Times budget visualisation, for example.
Data can serve many purposes but – much like that graph I put up at the top of this piece – it shouldn’t just be there as pointless adornment.