Getting There: Australian Politician Flight Expense Data, Part 1

Citizen journalism or free lunch? Admirable voluntarism or crowdfunded advantage-taking? It’s your call, but the Guardian wants you to donate your – and your, and your – analytical smarts to the data on nine spreadsheets itemizing Australian politicos’ expenses, in a collective scanfest to see what sort of discrepancies and pause-givers might be worthy of investigative pursuit.

The sheets, which can be downloaded en masse in zipped form by clicking here on the page linked above:

au1

report travel allowance information, the locations of and costs attending politicians’ sorties away from their district, and data about trips taken by tandems of elected officials who happened to belong to the same party, that latter compendium edging towards the question as to whether they all had to come along for the ride.

The two largest workbooks uncover flight and associate cost data for 2012-2013 (combined) and 2014; and the nearly 45,000 trips filling the two books pump out enough raw materials to perhaps substantiate a story by themselves. After all, the earlier sheet’s 22,044 flights (770 of which predate 2012, by the way) stand atop a tottering denominator of only 244 discrete office-holders, beating out an average of 91.47 flights per politician – or one flight every four days. That’s 90 movies, too. And the 2014 average of 88.63 is also very much up in the air.

Because the data across the two workbooks in effect mean to tell us the same sorts of things, a consolidation might be in order, in the interests of treating the data all of a piece. I’ve used the verb “consolidation” irresponsibly, though, because it implies a turning toward Excel’s long-standing Consolidate feature, which doesn’t seem to be the tool of choice here; Consolidate acts mathematically upon the data to be consolidated (i.e., by adding like items), but all we want to do is effect a merger of sorts, such that the 45,000 records in the two files line up obediently in their appointed fields in a unified workbook. It seems to me that a workaday copy-and-paste of the flight data from one book (probably from 2012-2013 to 2014) to the other will work, once you understand – and you will – two disparities: First, that the first five fields comprise comparable information that nevertheless find themselves in different columns, because the Date field waves at us from columns B and D in the 2012-13 and 2014 books respectively. And second, the electorate, state, party, partygroup, and house fields are quite peculiar to the latter file. There’s nothing there in 2012-2013.

Moreover, the 2012-13 workbook enrolls names that don’t appear in the 2014 edition, e.g. Patrick Secker, and as such no state/party/partygroup information about these one-timers could avail at all. That means of course that assorted blanks and fellow-traveling error messages will sully any attempt to composit cross-workbook breakouts along party and state lines and the like. My first assessment of the discrepancies, worked out after redeploying the 2014 data as a de facto lookup table, turns up about 6% worth of unusables, or about 11.5% of the 2012-13 records – again, meaning that these names just can’t be looked up in 2014, because they’re not in that workbook. That disconnect might be prohibitive, particularly because party expense correlations call so much reportorial attention to themselves. The simple, if only partially satisfactory way out, then, would be to confine some analyses to the 2013-14 information.

What are universally available across the sheets are the date data. You’ll be pleased to know that every one of the 44,911 two-workbook records is in fact a date; but you’ll be a little less pleased to discover that the amount data a) aren’t numeric at all, but rather have been straitened into text mode, and b) about 9,000 of the amounts shed any pretence to numeracy at all, substituting hyphens or absolutely nothing for the impostor values. And that is quite beside the 16 entries whose incipient values are freighted with asterisks, e.g. 528.81**.

Now, ok – the asterisks, and all the other text-presenting values in the field, could be righted by couching the data in the VALUE function, e.g.

=VALUE(E2)

Or, in the case of the asterisk-ridden:

=VALUE(LEFT(E39326,LEN(E39326)-2))

That expression imparts value to its referenced cell after shearing its pair of asterisks and rescuing the numeric remainder via a LEFT. (Note by the way the dollar signs, reflective of the indigenous Australian currency. And note as well the negative amount entries, apparent reimbursement figures. Thanks to the Guardian’s Nick Evershed for that clarification.)

But with 20% of its flight-charge amount records calling in absent, confident conclusions about who’s done what won’t scamper in your direction, though of course, some findings do remain to be plausibly plumbed, e.g. this pivot-table matrix (I’ve assumed you’ve combined the two workbooks to the extent possible):

Row Labels: Location-from

Column Labels: Location-to (in fact there are 792 blanks in there, and two in Location-From. I’ve renamed the Row and Column Label headers To and From respectively.)

Values: Date (Count; the field was chosen because it is accounted for in every record)

Slicer: politician

You’ll also note this Location-from misfit:

aust2

Thus Australian Prime Minister Tony Abbott’s itinerary looks like this:

au3

That sort of tableau may prove instructive, but with so much important data simply not there the limitations here are real.

But we have seven other zipped spreadsheets to look at.

Posted in Benefits of open data, Data Journalism, Informing Decision-making, Journalism, Posts from feeds, Transparency Tagged with: , , , , , ,