Hacking the Hacks, Part 2: A Week’s Worth of New York Taxi Rides

I said it before, and I’m going to have to say it again. I closed last week’s installment on the data for fare activity between January 1 and 7 of New York’s green cab armada with a reminder (largely to myself): that additional data vetting and cleansing needs to come along with the analytical ride. So here’s some more.
Because I had sorted the data by our manufactured Duration field guided by the intention to depose various unsuitable records from the serviceable data set, I found myself with the briefest durations at the bottom of set (if you’re not there right now you can just re-sort the newly-calculated durations Largest to Smallest). And there you’ll note some exceedingly brief putative rides, lasting a couple of seconds or so. But again plausibility takes a hit.
Consider, for example, the ride rolling out at 1:38:00 AM on January 3, and coming to a halt at 1:38:01 AM (remember a bit of rounding-off informs the durations) on January 3. Fare: $50, either giving new meaning to the term “rip-off”, or more intelligibly, flagging a mistake, as does a ride on the 6th commencing at 7:04:41 AM, pulling in to the curb at 7:04:43 AM, a two-or-so-second swing through the Apple that cost its passenger $150. The distances travelled by these quicksilver fares seem predominantly, but not invariably, plausible, though explaining a four-second cab ride does remain an expository challenge. In the interests of making a decision, one whose intention you may not share, I’d sever all less-than-a-minute durations from the data set, conceding again a loss of some bona fide data; but a record-by-record judgement of these fare discrepancies requires the scrutiny of someone who’s actually getting paid to the work.
And of course none of this means that still other hard-to-reconcile fare-to-duration relationships and the like don’t persist in the data set, e.g. the 1.58-minute ride on New Year’s day that exacted a toll charge of $869.88. Those six passengers must have crossed an awfully long bridge.
Still, we’re left with 278,675 records to play with, and the statistical gravitas of large numbers should enable some half-way meaningful yields. Start with a simple but integral pivot table, one breaking out numbers of rides by hours of the day:

Row Labels: lpep_pickup_datetime (grouped by hours only; remember our data confine themselves to the first week of January)

Values: lpep_pickup_datetime (Count, then by % of Column Total)


But that’s not right at all. Can each one of the 278,675 rides have been hailed in the morning hours? That’s a rhetorical question, you’ve gathered.

What we’re witnessing again is a misconstruing of ride times, as perpetrated by the source data. Every time, excepting those that fit themselves between the noon-1 PM swath, is understood in AM terms by a failure of differentiation, by whatever time-reading system was put in place by the data, and I’m at a loss as to how the failure might be redressed. There appears no way of knowing which rides were in actuality booked in the afternoon, no handle around which to curl one’s grasp of the times.

Accountings aside, the error is remarkable in its sweep, rendering time-driven breakouts of the ride data all but pointless. But durations, which are after all times-independent – a five-minute ride counts itself off identically irrespective of the hour from which it proceeded – might continue to be available to us, once the error checks have run their way through the data.

And that takes us here: Last week’s data survey came across a spate of negative durations, which we treated per the understanding that the AM readings on some drop-off times wrongly submitted what in fact appeared to be a PM time-stamp, thus reporting drop-offs that preceded their pick-ups. Now we need to confront more than 2000 rides of doubtfully extended durations that seem to have started in one day and finished on the next day. While in principle that possibility is eminently real, of course, far too many of the trips are of the 11:58 start and 12:04 finish stripe that nevertheless have registered 1445-minutes.

But because it’s existentially possible to engage your driver for 1445 minutes there’s nothing “wrong” with the records – no negative values emergent, no formulas gone wrong. As such, no programmatic means comes to mind and hand that would conduce toward a decisive paring of the faulty records from the faultless, I think.

But let’s see if anything at all can be done. First, I’d isolate these two-day discrepancies by claiming the next free column (for me X), head it Days, and enter in X2:


We’ve thus recalled TRUNC, the function we first aimed at the data set last week. Here TRUNC determines if in effect the drop-off day exceeds the pick-up day by one. If so, the cell in X receives a 2. We can copy the above formula down X and sort Smallest to Largest, gathering all two-day rides at the bottom of the data set. I get 2356 such trips, many of which exhibit a time in the drop-off field that exceeds, that is, occurs later, than its counterpart in the pick-up field, and many others of which disclose earlier times that its pick-up companion – a formidable discrepancy in its own right. It seems to me, then, that if we could simply subtract the earlier time from the later one – wherever those respective values fall – we’d come away with a pretty workable set of durations.

All of which inspires a rethink of our Duration formula of last week, which to date looks like this:


That expression meant to deal with times that fell on the wrong side of the noon divide, but a simpler resolution suggests itself now. If in fact we’re happy with the subtraction tack described above, this formula might replace the one we’ve just reviewed:


Much simpler, and I think ultimately more effective. The formula subtracts the modulus – that is, the remainder of the numbers, both of which here are divided by one, then realizes their absolute result – the value that appears when the relevant – or + sign is ignored – and finally multiplies that by the 1440 we discussed last week. Thus both negative and positive subtraction results are treated here as positive, turning out what at first scan appears to be a set of usable durations.

That’s all pretty neat so far as it goes, but it doesn’t seem to go far enough. That’s because some of the durations we’ve just recast continue to give off large numbers. Sort the nascent durations largest to smallest and rides of 718 or so minutes rise to the top – because a previous issue, an afternoon time that presents itself as say 1:00 instead of 13:00 begs our attention anew. What do we do now?

I’m thinking about it.

Posted in Benefits of open data, Data, Data Journalism, Journalism, Posts from feeds, Smart communities, Visualising data Tagged with: , , , , ,