Hey Fitbit, my data belong to me!

(This article was first published on Data Until I Die!, and kindly contributed to R-bloggers)

When you go on Fitbit’s website and want to download your own Fitbit data, you will find your way to their Data Export utility on their settings page. Once you get there, you are greeted with what looks like a slogan: “Your data belongs to you!”. Having seen the very high-level data that they provide you with when you click download, I don’t believe them.

How Orwellian!

It’s nice that they let you download even this much data (I really like the sleep tracking feature), and I can imagine myself being interested in my daily stats as I become more competitive with myself, but:

A) Where’s the heart rate data? I got a Fitbit Charge HR for my birthday, and I happen to like the heart-rate monitor function!
B) What about the intraday data? I am interested in doing further analysis of my heart rate based on time of day.
C) What about the specific activities that I log by pressing the button on the side of my Charge HR and annotate/entitle on the website?

For these reasons, I’m convinced that they don’t really believe that my data belong to me. That’s why it was nice to find out about Corey Nissen’s fitbitScraper package. At the current time, it allows you to get steps, distance, active minutes, floors, calories burned on a 15 minute interval, and heart rate on a 5 minute interval. It basically logs on to the Fitbit website and scrapes your data from the page source code!

Below, you will see the code I cooked up showing how I used his package to start a running dataset, spanning as many days as I want, on what my heart rate is doing throughout each day. If you want to adapt this code to download the other available categories of data, note that any references to heart rate will have to be changed accordingly :)

Initializing the dataframe

library(lubridate)
library(plyr)
library(dplyr)
library(fitbitScraper)

hr_data = list(time = c(), hrate = c())

cookie = login("my@email.com", "mypassword", rememberMe = TRUE)
startdate = as.Date('2015-08-07', format = "%Y-%m-%d")
enddate = today()
s = seq(startdate, enddate, by="days")

for (i in seq_along(s)) {
  df = get_intraday_data(cookie, "heart-rate", date = sprintf("%s", s[i]))
  names(df) = c("time", "hrate")
  hr_data = rbind(hr_data, df)
  rm(df)
}

This code might need some explanation (feel free to skip this text if you understand it just fine). First you’ll notice the login function, where you’ll need to enter the email and password with which you registered for your Fitbit account. Then you’ll notice that I’ve specified a ‘startdate’ variable and have put ‘2015-08-07’ as the value within the as.Date function. This was just the first full day that I spent with the Fitbit, and you can change it to whatever your first Fitbit day was. Then, the seq function conveniently creates a vector containing a series of date stamps starting from the specified ‘startdate’ and ending with ‘enddate’, which you can see is whatever date today happens to be.

Finally, we have the for loop, which iterates over the indices of the date sequence ‘s’ so that each day’s worth of 5-minute data can be saved to a temporary data frame and appended to the ‘hr_data’ object (which turns from a list into a data frame on the first rbind). After all that is said and done, you now have your first volley of your own Fitbit data.
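The same loop pattern should work for the other categories the package scrapes. Here is my sketch of a steps version, not code from the package documentation: I’m assuming "steps" is the type string get_intraday_data accepts for the 15-minute step counts, and the column name ‘steps’ is my own choice.

```r
library(lubridate)
library(fitbitScraper)

# Replace with your real Fitbit credentials.
cookie = login("my@email.com", "mypassword", rememberMe = TRUE)
s = seq(as.Date("2015-08-07"), today(), by = "days")

steps_data = data.frame()
for (i in seq_along(s)) {
  # "steps" is an assumed type string; the article only says step
  # data are available, not what the argument value is.
  df = get_intraday_data(cookie, "steps", date = sprintf("%s", s[i]))
  names(df) = c("time", "steps")
  steps_data = rbind(steps_data, df)
  rm(df)
}
```

Note that this needs a live login to run, so treat it as a template rather than something to paste verbatim.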

But wait, there’s more! What about when you want to download more data, several days from now? Do you have to run this same code again, overwriting data from days that have already been processed? You could do that, or you could use the more complex code below, which only scrapes the Fitbit website for data from days that aren’t complete, or aren’t there at all!
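One thing the incremental approach takes for granted is that ‘hr_data’ from your last run is still around. A simple way to carry it between R sessions (my addition, not something from the original scripts) is to serialize it to disk with saveRDS/readRDS; the toy ‘hr_data’ below exists only so the snippet runs on its own.

```r
# Fabricated one-row stand-in for a real scraped data frame.
hr_data = data.frame(
  time  = as.POSIXct("2015-08-07 00:00", tz = "UTC"),
  hrate = 70
)

path = "hr_data.rds"
saveRDS(hr_data, path)     # run this after each scraping session
hr_data = readRDS(path)    # run this before the incremental update
```

RDS round-trips the data frame exactly (classes, timestamps and all), which a CSV export would not.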

Code for when you want more and more and more fitbit data!!!

library(lubridate)
library(plyr)
library(dplyr)
library(fitbitScraper)

cookie = login("my@email.com", "mypassword", rememberMe = TRUE)
startdate = as.Date('2015-08-07', format = "%Y-%m-%d")
enddate = today()
s = seq(startdate, enddate, by="days")

completeness = hr_data %>% group_by(dte = as.Date(time)) %>% summarise(comp = mean(hrate > 0))

# Positions in 's' of days that are present but incomplete,
# and of days that are missing from hr_data entirely.
incomp.days = which(s %in% completeness$dte[completeness$comp < 0.9])
missing.days = which(!(s %in% completeness$dte))
days.to.process = c(incomp.days, missing.days)

for (i in days.to.process) {
  df = get_intraday_data(cookie, "heart-rate", date = sprintf("%s", s[i]))
  names(df) = c("time", "hrate")

  # If the newly downloaded data are for a day already in
  # the pre-existing dataframe, then the following applies:

  if (all(df$time %in% hr_data$time)) {

    # Get pct of nonzero hrate values in the pre-existing dataframe
    # for the day currently being processed in the for loop.

    pct.complete.of.local.day = mean(hr_data$hrate[as.Date(hr_data$time) == s[i]] > 0)

    # Get pct of nonzero hrate values in the temporary dataframe
    # for the day currently being processed in the for loop.

    pct.complete.of.server.day = mean(df$hrate[as.Date(df$time) == s[i]] > 0)

    # If the newly downloaded data are more complete for this day
    # than what is pre-existing, overwrite the heart rate data
    # for that day.

    if (pct.complete.of.local.day < pct.complete.of.server.day) {
      rows = which(hr_data$time %in% df$time)
      hr_data$hrate[rows] = df$hrate
    }
  } else {

    # If the newly downloaded data are for a day not already in
    # the pre-existing dataframe, then use rbind to just add them!

    hr_data = rbind(hr_data, df)
  }
  rm(df)
}

The first thing I’d like to draw your attention to in this code is the block beginning with the definition of the ‘completeness’ object and ending with the ‘days.to.process’ object. What I’m trying to accomplish with these objects is:

A) To get a list of which days in my pre-existing data frame might benefit from a complete data refresh due to too much missing data (you’ll notice I’ve defined an incomplete day as one with less than 90% of non-zero data), and
B) Which days worth of data I am missing because I just didn’t download data on that day.

The ‘days.to.process’ object is just a vector that puts together the positions in ‘s’ of any days that are incomplete and any days that are missing from my data frame.
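To make that bookkeeping concrete, here is a toy run with fabricated heart-rate values (not real Fitbit output): one complete day, one mostly-zero day, and one day absent from ‘hr_data’ altogether.

```r
library(dplyr)

s = seq(as.Date("2015-08-07"), as.Date("2015-08-09"), by = "days")

# Day 1: both readings non-zero; day 2: half the readings are zero;
# day 3 never made it into hr_data at all.
hr_data = data.frame(
  time  = as.POSIXct(c("2015-08-07 00:00", "2015-08-07 00:05",
                       "2015-08-08 00:00", "2015-08-08 00:05"), tz = "UTC"),
  hrate = c(70, 72, 0, 68)
)

completeness = hr_data %>%
  group_by(dte = as.Date(time, tz = "UTC")) %>%
  summarise(comp = mean(hrate > 0))        # comp: 1.0 for day 1, 0.5 for day 2

incomp.days  = which(s %in% completeness$dte[completeness$comp < 0.9])
missing.days = which(!(s %in% completeness$dte))
days.to.process = c(incomp.days, missing.days)
# days.to.process is c(2, 3): day 2 needs a refresh, day 3 was never fetched
```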

In the for loop, I loop through the indices collected in the ‘days.to.process’ object, and proceed like I did before at first, but then a few new things happen:

I check to see if the date-time stamps in the temporary data frame are in my pre-existing data frame. If they are, I then do a comparison of the percent of data that is non-zero for that day in my pre-existing data frame (that’s the purpose of the ‘pct.complete.of.local.day’ variable) against the percent of data that is non-zero in the temporary data frame (hence the ‘pct.complete.of.server.day’ variable).

If there is more non-zero data in the temporary data frame for that day, I then find the row indices in the pre-existing data frame for the day in question, and then use them to update the data in place with the new data from the temporary data frame.  This particular comparison of non-zero data in the pre-existing vs the temporary data frame is probably redundant in most cases.
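The overwrite step itself is easy to see on fabricated numbers (again, not real data): match the day's rows by timestamp, then replace the heart-rate column in place.

```r
# Pre-existing copy of a day with one dropped reading (hrate == 0)...
hr_data = data.frame(
  time  = as.POSIXct(c("2015-08-08 00:00", "2015-08-08 00:05"), tz = "UTC"),
  hrate = c(0, 68)
)

# ...and the freshly scraped copy of the same day, now complete.
df = data.frame(
  time  = hr_data$time,
  hrate = c(71, 69)
)

# Find the matching rows by timestamp and update in place.
rows = which(hr_data$time %in% df$time)
hr_data$hrate[rows] = df$hrate
# hr_data$hrate is now c(71, 69)
```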

It would be nice to be able to easily/automatically get the physical activities that I have logged through the website (rowing machine, stationary bike, treadmill, etc.) so that I could correlate them with my heart rate at the time, but I guess I have to do that manually for now. Eventually, I’m interested in a somewhat deeper analysis of my heart rate at different times of the day.

At least now I feel more like my data belong to me, even though I had to resort to making use of someone else’s very smart coding (thank you, Corey!) to do it!
