This project originally appeared on Inertia7, a social platform of open source computer and data science projects. It was republished with permission from the author.
Thanks to the popularity of wearable technology, data scientists and health professionals have found a mutually beneficial tool for gaining insight into a powerful data generator: the human body. Because data from wearable devices are so readily available, many organizations and researchers have taken steps toward understanding and utilizing them to improve the human experience, for example by evaluating a user's health history.
First, a few examples: Evidation Health has run studies that use longitudinal wearable data to predict a binary anxiety diagnosis with the help of temporal convolutional neural networks. These networks outperform baseline models because they use data collected at fine time granularities, such as steps and sleep recorded in minute-long intervals. Meanwhile, researchers in San Francisco have created a semi-supervised deep learning model that can predict medical conditions from wearable app data collected by commercial offerings like FitBit and Apple Watch. By utilizing long short-term memory (LSTM) networks, the model yields better predictive power than baseline models (classification models not considered deep learning, such as random forest and multilayer perceptron).
Wearable apps are playing an interesting role in the evolution of data science for a few reasons, including:
- The quantity of data available
- The ease of access of mHealth data
- The relationship between app data and other data sources relevant to the user
Of course, there are many challenges associated with data collection and privacy that can make users wary of researchers who would utilize their data. Understanding the importance of wearable apps within the context of medical advances, however, can help foster trust and, ultimately, the growth of artificial intelligence in the healthcare industry.
One of the first challenges is making sense of the vast amount of data available. In this post, I will conduct an exploratory analysis of data from a friend's Apple Watch. The data is anonymized and I will focus on exploring one factor that was measured by the app: active energy burned (although, in future iterations, I would like to include other categories gathered by the watch).
I extracted the data from my friend's Apple Watch as an XML file. I then used a Python script created by Nicholas J. Radcliffe, which parses XML files and creates CSV files. The repository for that project can be found here.
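If you would rather stay in R for that conversion step, a rough sketch using the xml2 package is below. This is not the script used for this project, just an illustration; it assumes the standard Apple Health export layout, in which each measurement is a `Record` node whose fields are stored as XML attributes.

```r
# Illustrative alternative to the Python converter: read Apple Health's
# export.xml in R and flatten each <Record> node's attributes to a row.
library(xml2)
library(purrr)

export  <- read_xml("export.xml")            # path is illustrative
records <- xml_find_all(export, "//Record")  # one node per measurement

health_df <- map_dfr(records, ~ as.list(xml_attrs(.x)))
write.csv(health_df, "data.csv", row.names = FALSE)
```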
This walkthrough uses tidyverse, lubridate, and ggplot2 to provide elegant visuals that showcase our exploratory analysis of a single person's Apple Watch data.
Here, we load the appropriate libraries into our R environment:
library(tidyverse)
library(lubridate)
library(here)

here::here()
source(here::here("src", "helper_functions.R"))
##  "/Users/ravi/Desktop/my_projects/apple_health_data"
When utilizing functions in the here package, we use the `here::here()` notation shown above. This is due to conflicts between functions named `here` in the lubridate and here packages. I also created a script called `helper_functions`, which contains two functions: one that creates metadata related to the day of the week and month (called `extract_date_data`), and one that handles data type conversions for the date columns (called `convert_date`).
We will load the data with readr's CSV file reader, `read_csv`, which places the data into a `tibble`. There are a few other columns related to the metadata of the Apple Watch, but they have been removed for the purposes of this analysis.
# Load Data
anon_data <- read_csv(
  here::here("data/processed", "data.csv")
)
## Parsed with column specification:
## cols(
##   sourceName = col_character(),
##   type = col_character(),
##   unit = col_character(),
##   creationDate = col_datetime(format = ""),
##   startDate = col_datetime(format = ""),
##   endDate = col_datetime(format = ""),
##   value = col_double()
## )
Next, use the helper functions to convert the date columns and extract the date metadata:
anon_data <- anon_data %>%
  convert_date(creationDate) %>%
  convert_date(startDate) %>%
  convert_date(endDate) %>%
  extract_date_data(creationDate)
The `convert_date` function converts the given column to a date/time type; because the functions were written with non-standard evaluation, they can be chained with pipe operators as above.
The `extract_date_data` function uses lubridate to create columns of date metadata, including the day of the week and the month in chronological order (January to December).
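The helper script itself isn't reproduced in this post, but here is a minimal sketch of what the two functions might look like. This is an assumption on my part: the column names (`week_days`, `months`, `year`) match the ones that appear in the data frame later, and the implementation uses rlang's curly-curly operator for the non-standard evaluation.

```r
# Hypothetical reconstruction of helper_functions.R; the real script is
# not shown in the post. Assumes rlang's {{ }} (curly-curly) injection.
library(dplyr)
library(lubridate)

# Convert a character column to a POSIXct date-time, in place.
convert_date <- function(df, col) {
  df %>% mutate({{ col }} := as_datetime({{ col }}))
}

# Add chronologically ordered day-of-week, month, and year metadata
# derived from a date-time column.
extract_date_data <- function(df, col) {
  df %>% mutate(
    week_days = wday({{ col }}, label = TRUE, abbr = FALSE),
    months    = month({{ col }}, label = TRUE, abbr = FALSE),
    year      = year({{ col }})
  )
}
```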
More Data Cleaning
Another transformation I included was the difference in minutes between the start date and end date of each measurement. I did this using dplyr's `mutate` and the function `difftime`, which computes the difference between two date-times in the units you specify (here, minutes).
# Find difference in start and end date in minutes
anon_data <- anon_data %>%
  mutate(time_diff_mins = as.numeric(
    difftime(endDate, startDate, units = "mins")
  ))
Since I am not too familiar with Apple Watches or their data collection processes, I wanted to include these values to gain some insight into how regularly measurements are taken.
Now let's preview the data frame with `head`:

head(anon_data)
## # A tibble: 6 x 12
##   sourceName     type        unit  creationDate        startDate          
##   <chr>          <chr>       <chr> <dttm>              <dttm>             
## 1 Anon’s Apple … ActiveEner… kcal  2016-11-06 06:55:53 2016-11-06 05:51:32
## 2 Anon’s Apple … ActiveEner… kcal  2016-11-06 06:55:53 2016-11-06 06:51:33
## 3 Anon’s Apple … ActiveEner… kcal  2016-11-06 06:55:53 2016-11-06 06:53:36
## 4 Anon’s Apple … ActiveEner… kcal  2016-11-06 06:55:53 2016-11-06 06:54:37
## 5 Anon’s Apple … ActiveEner… kcal  2016-11-06 06:59:19 2016-11-06 06:57:52
## 6 Anon’s Apple … ActiveEner… kcal  2016-11-06 07:00:47 2016-11-06 06:58:44
## # ... with 7 more variables: endDate <dttm>, value <dbl>, tz <chr>,
## #   months <ord>, year <dbl>, week_days <ord>, time_diff_mins <dbl>
This showcases the structure of the data frame; previewing the data like this is generally good practice, as it helps other users become familiar with the data.
Dive Into the Data
I flattened the data across years and months with the help of the `spread` function, using the year as the key and filling any missing month/year combinations with "None Noted". This gives a quick view of how the data are distributed over time.
anon_data %>%
  group_by(months, year) %>%
  summarize(total_Cal = sum(value)) %>%
  spread(key = year, value = total_Cal, fill = "None Noted")
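As an aside, `spread` has since been superseded in tidyr (1.0.0 and later) by `pivot_wider`. A sketch of the equivalent call is below; note that it leaves missing combinations as `NA` rather than the "None Noted" text label, which keeps the year columns numeric.

```r
# Modern tidyr equivalent of the spread() call above. Missing
# month/year combinations are left as NA so the columns stay numeric.
anon_data %>%
  group_by(months, year) %>%
  summarize(total_Cal = sum(value)) %>%
  pivot_wider(names_from = year, values_from = total_Cal)
```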
As the table shows, the person's data collection started in November 2016 and continued through March 2018. Next, I created a visual representation to drive the point home.
anon_data %>%
  group_by(months, year) %>%
  summarize(total_Cal = sum(value)) %>%
  ggplot(., aes(months, y = total_Cal, fill = as.factor(year))) +
  geom_bar(stat = "identity", colour = "#39a78e") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~ year) +
  labs(x = "Month", y = "Total",
       title = "Total Calories Burned by Month and Year") +
  scale_fill_hue(l = 40, name = "Year")
I decided to only analyze the data from 2017, since that's the year with the most complete dataset. Looking at incomplete data from other years would just skew our conclusions.
Create A Tibble for 2017
Here, we will use the `filter` function to keep observations whose value in the year column (generated from the `creationDate` column) equals 2017. We will call the result `anon_data17`:
anon_data17 <- anon_data %>%
  filter(year == 2017)
Although we have data for all of 2017 (by month), I wanted to explore in more detail how much data (if any) is missing from our data frame as it relates to the creation date of the recordings. I created two vectors; the first includes all entries as dates from our data frame and the second includes all dates from 2017. I then compared these two vectors to see which dates weren't recorded by the Apple Watch.
year_ad17 <- unique(as.Date(anon_data17$creationDate))
year17 <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "days")

missing_values <- year17[!year17 %in% year_ad17]
missing_values
##  "2017-04-01" "2017-07-24" "2017-07-25" "2017-07-26" "2017-07-27" ##  "2017-07-28" "2017-07-29" "2017-07-30" "2017-07-31" "2017-08-01" ##  "2017-08-02" "2017-08-03" "2017-08-04" "2017-12-08"
You can see that the missing entries include the interval 07/24/2017 to 08/04/2017 (12 consecutive days). Most likely, there is a human-related reason why this data is missing; perhaps a vacation or a lost watch. Because the data is not missing at random, we have to be wary of any conclusions we draw about this time frame.
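To confirm those runs programmatically rather than by eye, here is a small sketch (my addition, reusing the `missing_values` vector from above) that collapses the missing dates into consecutive stretches:

```r
# Group consecutive missing dates into runs: cumsum() starts a new run
# whenever the gap between successive dates exceeds one day.
runs <- split(missing_values, cumsum(c(TRUE, diff(missing_values) > 1)))
sapply(runs, function(r) paste(min(r), "to", max(r)))
```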
Now that we have an understanding of the missing data, we can go ahead and start exploring. I began by looking at the distribution of calories burned (called `value` in the data frame) using the `summary` function:

summary(anon_data17$value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0010  0.1180  0.2080  0.5017  0.3500 11.2990
This gives us some useful summary statistics. But visualizing data can provide insights that can't be deduced from summary statistics alone, so let's visualize the distribution with a histogram.
ggplot(anon_data17, aes(value)) +
  geom_histogram(fill = "#39a78e", bins = 200) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(0, max(anon_data17$value), by = 1)) +
  labs(x = "Calories Burned", y = "Total",
       title = "Distribution of Calories Burned")
We can see that the distribution is right-skewed and that most of the values fall in the interval [0, 1].
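Because the skew compresses most of the mass against zero, one quick option (my addition, not part of the original walkthrough) is to redraw the histogram on a log10 x-axis. The minimum value of 0.001 means every observation is strictly positive, so the log scale is safe:

```r
# Same distribution on a log10 x-axis to spread out the mass near zero.
ggplot(anon_data17, aes(value)) +
  geom_histogram(fill = "#39a78e", bins = 50) +
  scale_x_log10() +
  theme_minimal() +
  labs(x = "Calories Burned (log10 scale)", y = "Count")
```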
Data Collection By Day
I wanted to briefly look at data collection by day, so I grouped the data by creation date, counted the records per day, and computed summary statistics.
anon_data17 %>%
  group_by(creation_date = as.Date(creationDate)) %>%
  count(unit) %>%
  ungroup() %>%
  select(n) %>%
  summary()

# anon_data17 %>%
#   group_by(creation_date = as.Date(creationDate)) %>%
#   count(unit) %>%
#   filter(n > 900) %>%
#   arrange(desc(n))
##        n         
##  Min.   :   1.0  
##  1st Qu.: 569.0  
##  Median : 719.0  
##  Mean   : 701.5  
##  3rd Qu.: 855.0  
##  Max.   :1276.0
Here, we can see the average and median fall around 700 instances per day.
Inspired by this blog post, I created a calendar heatmap to showcase the total amount of calories burned across 2017 by day of the week.
anon_data17 %>%
  mutate(week_date = ceiling(day(creationDate) / 7)) %>%
  group_by(week_date, months, week_days) %>%
  summarise(total_cal = sum(value)) %>%
  ggplot(., aes(week_days, week_date, fill = total_cal)) +
  geom_tile(colour = "white") +
  facet_wrap(~months) +
  theme_bw() +
  scale_fill_gradient(name = "Total \nCalories",
                      low = "#39a78e", high = "#a73952") +
  labs(x = "Weekday", y = "Week of the Month") +
  scale_y_continuous(trans = "reverse")
Calories and Time
Next, I wanted to shift my focus to a column I created during the data cleaning process: the time difference. I decided to explore the relationship between time range and the amount of calories burned.
ggplot(anon_data17, aes(time_diff_mins, value)) +
  geom_point(alpha = 0.10, colour = "#39a78e") +
  theme_minimal() +
  labs(x = "Time Interval (in mins.)", y = "Calories Burned")
Since the data is heavily skewed, this visual doesn't give us much insight. I log-transformed both variables in hopes of understanding their relationship better.
ggplot(anon_data17, aes(time_diff_mins, value)) +
  geom_point(alpha = 0.10, colour = "#39a78e") +
  theme_minimal() +
  labs(x = "Time Interval in mins. (log transformation)",
       y = "Calories Burned (log transformation)") +
  scale_y_log10() +
  scale_x_log10()
Some key takeaways:
- Data capture mostly happens at uniform time intervals
- Calories burned increase with interval length up to roughly the 2.5 to 3 minute range, after which they begin to decrease
- At larger time intervals, calories burned become more sporadic (a rough correlation check follows this list)
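To put a rough number on that relationship, here is a one-line check (my addition): Spearman's rank correlation, chosen because both variables are heavily skewed.

```r
# Rank-based correlation between interval length and calories burned;
# Spearman avoids the distortion the heavy skew causes for Pearson.
with(anon_data17, cor(time_diff_mins, value, method = "spearman"))
```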
Time Frame Analysis
Next, I concentrated on the time intervals, since I found some interesting relationships between time and calories. I began by calling `summary` on the time difference column:

summary(anon_data17$time_diff_mins)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0167  1.0167  0.9878  1.0333 60.0167
You can see that most measurements span roughly one minute, while the maximum is a full hour. Let's dissect the data further by grouping it both by month and by day of the week to see if there are any trends.
Let's start with grouping by month.
anon_data17 %>%
  group_by(months) %>%
  summarise(avg_time_spent = mean(time_diff_mins))
## # A tibble: 12 x 2
##    months    avg_time_spent
##    <ord>              <dbl>
##  1 January            1.01 
##  2 February           1.01 
##  3 March              1.00 
##  4 April              1.01 
##  5 May                1.00 
##  6 June               0.996
##  7 July               1.01 
##  8 August             0.995
##  9 September          1.00 
## 10 October            0.940
## 11 November           0.945
## 12 December           0.961
The average for each month still gravitates around one minute. Now, let's try grouping by weekday and month. This time, since the output would be too large to read easily, I created a visual.
anon_data17 %>%
  group_by(week_days, months) %>%
  summarise(avg_time_spent = mean(time_diff_mins)) %>%
  ggplot(., aes(week_days, avg_time_spent, fill = week_days)) +
  geom_bar(stat = "identity") +
  facet_wrap(~months) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Weekday", y = "Average Time Spent",
       title = "Average Time by Month and Weekday") +
  scale_fill_hue(l = 40, name = "Weekday")
Again, most of our data is collected in one-minute intervals, and there aren't any months or weekdays that show irregularities in time intervals for data collection.
Outliers are difficult to understand without knowing the specs for the Apple Watch. I'd need to do a bit more research to understand the data capturing process.
Diving into the Time Difference Column
Let's look at time frames that are larger than 30 minutes.
anon_data17 %>%
  filter(time_diff_mins > 30) %>%
  select(value, time_diff_mins) %>%
  arrange(desc(value))
## # A tibble: 11 x 2
##    value time_diff_mins
##    <dbl>          <dbl>
##  1 0.766           38.2
##  2 0.75            49.7
##  3 0.562           32.0
##  4 0.228           46.6
##  5 0.053           54.8
##  6 0.041           39.1
##  7 0.018           36  
##  8 0.016           54.8
##  9 0.01            60.0
## 10 0.002           55.3
## 11 0.002           40.8
None of these long intervals shows a value larger than one calorie, which is unusual. Upon reading this article, it appears there could have been problems with how the watch performs. However, it might simply be that the user wasn't very active during these large time intervals.
Data Capture By Hour
I wanted to conclude my dissection of time by looking at what hours have the highest instances of data capture. I did this by converting creation date to hour using lubridate.
ggplot(anon_data17, aes(lubridate::hour(creationDate))) +
  geom_bar(fill = "#39a78e") +
  theme_minimal() +
  labs(x = "Hour", y = "Total Count", title = "Data Capture by Hour") +
  scale_x_continuous(
    breaks = seq(0, max(lubridate::hour(anon_data17$creationDate)), by = 1)
  )
You can see that data capture follows a fairly regular schedule, with a significant decrease starting at hour eight. This could be when the user is starting to go to sleep (peak hour most likely between 11 and 15).
I also wanted to visualize total calories burned by day, so I grouped the data by dates and days of the week to create a visual showing a linear view for 2017.
anon_data17 %>%
  group_by(creation_date = as.Date(creationDate), week_days) %>%
  summarise(burn_per_day = sum(value)) %>%
  ggplot(., aes(creation_date, burn_per_day, fill = week_days)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Creation Date", y = "Calories burned by day",
       title = "Calories by Creation Date") +
  scale_fill_hue(l = 40, name = "Weekday")
As you can see, there aren't any obvious trends or seasonality, but the time frame that has missing data is easy to spot.
Calories Burned by Weekday and Month
Next, I went a little deeper by facet wrapping by month.
anon_data17 %>%
  group_by(creation_date = as.Date(creationDate), week_days, months) %>%
  summarise(burn_per_day = sum(value)) %>%
  ggplot(., aes(week_days, burn_per_day, fill = as.factor(week_days))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ months) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Weekday", y = "Total Calories",
       title = "Calories Burned by Weekday and Month") +
  scale_fill_hue(l = 40, name = "Weekday")
You can start to see some patterns, like which days of the week the user burns the most calories each month, but juxtaposing the months with the days of the week would give us a better comparison.
This leads me to my next step: looking at the data from a time-series perspective. The way I did this was by averaging the values by date then plotting the results.
Average Energy Burned for 2017
Before visualizing the data by day of the week, I wanted to view it as a single series by averaging energy burned by date. I accomplished this by grouping the data by creation date and then summarizing the calories burned with a mean. I used `geom_smooth` to see if there's any trend in our data set.
anon_data17 %>%
  group_by(creation_date = as.Date(creationDate)) %>%
  summarise(total_Cal = mean(value)) %>%
  # filter(total_Cal == max(total_Cal)) # Used to view the date in which there was a spike in October
  ggplot(., aes(creation_date, total_Cal)) +
  geom_line(colour = "#39a78e") +
  geom_smooth(se = FALSE) +
  theme_minimal() +
  labs(x = "Date of Activity", y = "Average Calories",
       title = "Average Energy Burned Time Series")
We can see a slow decreasing trend with a spike in average calories on October 21, 2017. There are no obvious trends across each month.
Average Calories by Weekday
Let's create a visual that showcases the average calories burned during the year by day of the week. This can help us understand any weekly trends that might not have been obvious in the previous plot.
anon_data17 %>%
  group_by(week_days, year, months) %>%
  summarise(total_Cal = mean(value)) %>%
  ggplot(., aes(months, total_Cal, group = week_days, colour = week_days)) +
  geom_line(aes(colour = week_days)) +
  theme_minimal() +
  labs(x = "Month", y = "Average Calories",
       title = "Average Energy Burned by Weekday for 2017") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_color_hue(l = 40, name = "Weekday")
You might notice some seasonality across the days of the week, as well as a decreasing trend as the year goes on (a quick check on that trend follows below). This tells us that the user's activity declined over the course of the year, which might be useful feedback: with this information, we can show the user when activity decreased and help motivate them to maintain a steady level of activity.
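To quantify that downward drift, here is a minimal sketch (my addition, reusing the daily-average framing from the time-series plot) that fits a straight line to the daily means:

```r
# Fit a linear trend to daily average calories; the slope estimates the
# average change in mean calories burned per day over 2017.
daily <- anon_data17 %>%
  group_by(creation_date = as.Date(creationDate)) %>%
  summarise(avg_cal = mean(value))

fit <- lm(avg_cal ~ creation_date, data = daily)
coef(fit)
```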
I wanted to conclude this exploratory analysis by understanding how the user burned energy throughout 2017. Most of my analysis can be applied to any data science project — hopefully it will help you extract more from your data, even if that data is not from an Apple Watch. You saw that data can be extracted from existing columns to create slices on your data frame and that visualizations provide useful information you otherwise might not glean from a table.
This project is still in its early iterations. I hope to expand it to other categories of data that the Apple Watch collects and examine the relationships across these covariates. Stay tuned!