Getting There: Australian Politician Housing Expense Data, Part 2

We’re following the money down under, via the conduit of expenses-plotting spreadsheets flowing our way via the Guardian‘s and its piece on the spending activity of Australian politicians. Having spent last week counting the beans doled out for politicians’ flight outlays, we’ve disembarked and booked a look at the hotels at which they’ve pulled in for the night, or nights. In short, time to take a look at quarters for quarters, in the Travel Allowance 2014 workbook wrapped up in the zip file we downloaded last week.

You’ll recall that last week’s encounter with the Australia data flew into a cloud cover of data quality issues, conspicuous among these spending figures suited up in textual format. That turbulence has cleared here; all the rate amounts in column G and all but 24 of the amount entries in J are properly numerated. Those 24 appear to be simultaneously encumbered by footnoting asterisks and HTML-code snippets, e.g.

$1,072.00 <i>**</i>

It seems the following value-restorative tack works, which I’ve penned in U8888 (note that before broaching the formula below I sorted the amounts Smallest to Largest, a directionality that ground-floors text entries to the bottom of the stack, crowding them into the J8888:J8910 range. Thus my first formula writes


That expression flags the position of the first instance of the “<” via FIND and comes away with all the characters in J8888 stationed to the left of that symbol, less 1. It finally metamorphoses J8888’s textual tenant into an upstanding value, while interestingly spurning the $ currency sign in its course. Once done, a standard Copy > Paste Values addressed to J8888:J8910 and a deletion of the formulas from U will complete the repair.

I can’t explain, however, the values’ several formatting incarnations, e.g. -85, -$104.00, and the reddened -$328.00 (the latter not hued by any conditional format, but rather by a Format Cells option). As a functional proposition the diversity is meaningless; as Ms. Stein put it, a number is a number is a number, particularly if the plan is to feed them all to a pivot table in which a new, all-of-a-piece format is made to devolve upon them. But the inconsistencies stoke questions about the data quality that shouldn’t have to be asked.

And a question or two could likewise be asked about the yes/no data marking the withinelectorate field T column. Its header, I’m assuming, wants you to know which lodgings were contracted within and without the office-holder’s district. But a question need be entered about the spate of blanks washing across the field. Since the respective answers to the within/without question should be irrefutably binary, it becomes fair to say that an absence of answer means nothing but an absence of information. But blanks aren’t pivot-table friendly; break out the yes/nos this way:

Row Labels: withinelectorate

Values: withinelectorate (necessarily by Count)

and you get


That is, the blanks remain uncounted, and given their unmissable prevalence I think you’d want the associated number of blanks right in there. The standard workaround: just select T2:T8911 and run a Find and Replace through the cells: Find (leave blank) and Replace with NA, or something like it. But that routine flung my laptop into something like an infinite loop, however, and after having put the Esc to its endless fulminations I bowed to Plan B, whereupon I wrote


to U2, copied down the column, and plugged a Copy > Paste Values of the results back into the T column and deleted the U formulas. Something about discretion and the better part of valor.

Then refresh the pivot table and jigger the Values into % of Column Total mode and:


I think all those NAs matter, standing in the way of a definitive understanding of these home-and-away travels.

Now back to those amounts, 24 of which are negative. We learned last week that these two dozen sub-zero totals attest reimbursements, presumably for trips not taken or ones errantly billed to the reimburser – that is, the Australian government. Reasons aside, their numbers are indeed few, but they need to be dealt with just the same. I’d allow that they be sorted to the bottom of the field and pushed away from the operative data set via our storied blank-row interpolation, because these negatives report what in are effect non-bookings, for which the politician can be neither credited – or blamed, for that matter.

Hewing to my own advice, I’m left with 8885 records worth of expenses, which we can, for example, pivot table by party, number of bookings, and amount:

Row Labels: party

Values: party (Count; in effect, any field all of whose cells are occupied could be counted)

Amount (Sum, formatted with commas and to two decimals). I get


Jogging both sets of values into % of Column Totals we see:


(You don’t need the redundant 100% Grand Totals here,) The party-amount proportions line up, more or less, but they need to be proportioned in turn to the numbers of sectarians per party. We’ve discussed this problem elsewhere (trust me, because I can’t yet track down the post) – in which politician names and their parties visit the data set repeatedly, even as we want to count the records of each party member and his/her party uniquely. Again, the most limpid instruction set would advise us to construct this pivot table:

Row Labels: poliitican

Party (Tabular design without subtotals and grand totals)

That understated table could be selected (via , PivotTable Tools > Options > Select > Entire PivotTable in the Options button group) copied, and subjected to a Paste Values (and a and it could be pasted atop itself, in effect replacing the pivot table), thus in effect carving a new, lean data set which in turn could itself be pivot tabled:

Row Labels: party

Values: party (Count, along with % of Column Total)


Thus we see that with 43.70% of the recorded politicians and 47.00% of the declared expenses the Liberals seem slightly more indulgent than their contingent size would project, while Labor’s 37.39%/31.80% enrollment/expenditure split speaks to their apparent frugality. But of course neither inference can serve as anything resembling the last word on the subject. Any impugning of legislator or party intents would require a far finer sifting of the nature and geographical plotting of the work to be done. But it’s a start.

A raft of other analytical permutations are in there to be spun, of course, but here’s just one more:

Row Labels: details

Values: details (Count)

amount (Sum)

I get, in excerpt:


Scroll down and you’ll see 3760 charges totalling over $4,000,000 for Sittings of Parliament – whatever that might mean, though it likely refers for housing costs accumulated during the body’s sessions – or about 60% of all expenses. But don’t get excited over the Abbott Government’s Budget Cuts and Commission of Audit; they mean Prime Minister Tony Abbott – not me.

Posted in Benefits of open data, Data Journalism, Informing Decision-making, Journalism, Posts from feeds, Transparency Tagged with: , , , , , ,