Processing Pebble accelerometer data: A guide
Our data science intern, Heng Li, obliged our begging and wrote a blog post about what he’s been working on for us this winter. Here is what he told us:
This winter I have had the pleasure of working on a data science internship project for Strap along with Qiyao Wang. Simply put, the goal of our internship project is to explore customers’ data. As data scientists, our job is to answer this question: could we design a pattern recognition system to classify and predict customers’ movements by activity group, such as running, walking, sitting and so forth, to further refine the way we classify Strap’s data?
The first thing we did was extract and clean the messy data from our platform database. We adopted the “visit” as the unit for storing the information about each single activity. That way we made sure all the data was clean and meaningful before we fit it into our pattern recognition system. Visits that are missing features the machine learning work needs are not included in our dataset.
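To make the cleaning rule concrete, here is a minimal sketch of dropping incomplete visits. The field names (`visit_id`, `timestamp`, `x`, `y`, `z`) and the sample records are assumptions made up for illustration; the post does not describe the real visit schema.

```python
# Hypothetical field names: the real visit schema is not described in the post.
REQUIRED_FIELDS = ("visit_id", "timestamp", "x", "y", "z")


def is_complete(visit):
    """Return True when every required field is present and non-empty."""
    for field in REQUIRED_FIELDS:
        value = visit.get(field)
        if value is None:
            return False
        # An axis with no samples is as unusable as a missing axis.
        if isinstance(value, (list, tuple)) and len(value) == 0:
            return False
    return True


def clean_visits(visits):
    """Keep only the visits that carry every required field."""
    return [visit for visit in visits if is_complete(visit)]


# Made-up records: visit 2 has an empty y axis, visit 3 has no timestamp.
raw_visits = [
    {"visit_id": 1, "timestamp": 0.0, "x": [0.1, 0.2], "y": [1.0, 1.0], "z": [0.0, 0.0]},
    {"visit_id": 2, "timestamp": 5.0, "x": [0.3, 0.1], "y": [], "z": [0.0, 0.1]},
    {"visit_id": 3, "timestamp": None, "x": [0.2], "y": [1.0], "z": [0.0]},
]
cleaned = clean_visits(raw_visits)
```

In this sketch only visit 1 survives, since it is the only record with every field filled in.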
After obtaining the clean data, we still needed to go further to make the later analysis easier. Given our project goal (classifying customers’ activity events), we needed to create labels to distinguish events in the whole dataset. For each visit, our platform collected hundreds of accelerometer measurements within a short time period, and several successive visits were collected before a relatively big time gap, so it made sense to consolidate these successive visits into a single activity event. This is the rule on which we based our labels.
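The labeling rule can be sketched as a single pass over the visit timestamps: start a new event whenever the gap since the previous visit is large. The function name `label_events`, the 60-second threshold and the sample timestamps are all assumptions for illustration; the post does not say how big a "relatively big" gap is.

```python
def label_events(timestamps, max_gap_seconds=60.0):
    """Assign an event label to each visit timestamp.

    timestamps must be sorted in ascending order. A new event starts whenever
    the gap from the previous visit exceeds max_gap_seconds (an assumed value).
    """
    labels = []
    current_event = 0
    previous = None
    for t in timestamps:
        if previous is not None and t - previous > max_gap_seconds:
            current_event += 1
        labels.append(current_event)
        previous = t
    return labels


# Made-up visit start times, in seconds.
visit_times = [0.0, 10.0, 20.0, 500.0, 510.0, 2000.0]
event_labels = label_events(visit_times)
```

Here the first three visits share one label, the next two share another, and the last visit stands alone as its own event.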
The final step is data analysis, and this is what we are doing currently. We are trying several different unsupervised clustering methods. The first, and most straightforward, is to ignore time. We do this by clustering on summaries of each event’s accelerometer measurements taken over time, instead of on the original time-ordered measurements.
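One way to read "ignoring time" is to collapse each event into a few numbers that do not depend on the order of the samples. The sketch below uses the mean and standard deviation of each axis. Which summaries we would actually settle on is not stated here, so treat the choice of statistics, the function name `summarize_event` and the sample values as assumptions.

```python
from statistics import mean, pstdev


def summarize_event(x, y, z):
    """Collapse one event's three axis series into a fixed-length feature vector.

    The result is [mean_x, std_x, mean_y, std_y, mean_z, std_z]. Sample order
    has no effect on these values, which is what "ignoring time" means here.
    """
    features = []
    for axis in (x, y, z):
        features.append(mean(axis))
        features.append(pstdev(axis))
    return features


# Made-up samples for a single event.
features = summarize_event(
    x=[1.0, -1.0, 1.0, -1.0],
    y=[2.0, 2.0, 2.0, 2.0],
    z=[0.0, 0.0, 0.0, 0.0],
)

# Shuffling the x samples in time leaves the summary unchanged.
shuffled_features = summarize_event(
    x=[1.0, 1.0, -1.0, -1.0],
    y=[2.0, 2.0, 2.0, 2.0],
    z=[0.0, 0.0, 0.0, 0.0],
)
```

Every event ends up as a vector of the same length, which is the input shape that ordinary clustering algorithms expect.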
Once time is ignored, most traditional machine learning methods, such as K-means and hierarchical clustering, work for our problem. However, this approach did not give very accurate results, because it is the shape of the accelerometer curve over time that characterizes human activities. So we did more research and found a more advanced clustering method, which not only takes the time effect into account but is also powerful enough to handle multivariate clustering at the same time (as we simultaneously have accelerometer measurements along the ‘x’, ‘y’ and ‘z’ axes). This is the second method we tried.
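The advanced method is not named here, so the sketch below is not that method. It uses dynamic time warping (DTW), a well-known distance between time series, purely to illustrate what it means for a comparison to respect the order of the samples across all three axes at once. The function names and the sample series are made up for illustration.

```python
import math


def point_distance(a, b):
    """Euclidean distance between two (x, y, z) samples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw_distance(series_a, series_b):
    """Dynamic time warping distance between two sequences of (x, y, z) samples."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of the first i samples of series_a
    # with the first j samples of series_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = point_distance(series_a[i - 1], series_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


pos, neg = (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)
swing = [pos, neg, pos, neg]                           # fast back-and-forth motion
slow_swing = [pos, pos, neg, neg, pos, pos, neg, neg]  # same motion at half speed
reordered = [pos, pos, neg, neg]                       # same samples as swing, new order

same_shape = dtw_distance(swing, slow_swing)
different_shape = dtw_distance(swing, reordered)
```

In this toy case `same_shape` is zero, because the slower series is just a stretched copy of the first, while `different_shape` is positive even though `swing` and `reordered` contain exactly the same samples. A summary that ignores time could not tell those two apart.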
After all the procedures above, we can produce clustering results. A sample result might look like this: there seem to be four clusters of activities, with 30% of the movements belonging to the first activity group, 20% to the second, 25% to the third and the remaining 25% to the fourth. To interpret the meaning of each activity group, we also need to compare the features of the data in each group with the features of some possible activities. For example, the curve for running should show acceleration along the ‘x’ axis that is periodically positive and then negative, acceleration along the ‘y’ axis that is roughly constant, and acceleration along the ‘z’ axis that stays very close to zero. Any of the four groups that has these features will then be interpreted as the running group.
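The two interpretation steps, measuring how big each group is and checking a group against an expected signature, can be sketched as follows. The cluster labels, the sample curves and every threshold are made up for illustration, and counting sign changes is only a crude stand-in for "periodically positive and then negative".

```python
from collections import Counter
from statistics import pstdev


def cluster_shares(labels):
    """Return the fraction of events assigned to each cluster label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def looks_like_running(x, y, z, min_sign_changes=2, y_spread=0.1, z_limit=0.1):
    """Heuristic check of the running signature (all thresholds are assumptions)."""
    # x should swing between positive and negative values.
    sign_changes = sum(1 for a, b in zip(x, x[1:]) if a * b < 0)
    # y should stay roughly constant.
    y_is_steady = pstdev(y) <= y_spread
    # z should stay close to zero.
    z_is_flat = max(abs(v) for v in z) <= z_limit
    return sign_changes >= min_sign_changes and y_is_steady and z_is_flat


# Made-up cluster assignments for ten events.
labels = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
shares = cluster_shares(labels)

# Made-up representative curve for one cluster.
running_like = looks_like_running(
    x=[0.8, -0.7, 0.9, -0.8, 0.7],
    y=[1.0, 1.02, 0.98, 1.0, 1.01],
    z=[0.01, -0.02, 0.0, 0.03, -0.01],
)
```

A check like this only suggests a name for a cluster. The signature for each activity still has to come from knowing what that activity looks like on the sensor.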
In this way, we can statistically answer the question of how to classify people’s movements into several common activity groups.
I’ve really enjoyed working with Strap to analyze their data. Data analysis will be the key to turning the corner with wearables and sensor data. Monitoring activity is one thing; using it to help us get healthy is another. I’m happy to have the opportunity to start solving some of these problems on the ground with Strap. It’s been fun helping start the wearable revolution.