Healthcare (Accelerometer data) case study

Human activity monitoring is a growing field within data science. It has practical uses in the healthcare industry, particularly for monitoring elderly people so that they avoid activities likely to cause injury. Governments are also interested in it so that they can detect unusual crowd activity and perimeter breaches, or identify specific activities such as loitering, littering, or fighting. Fitness apps also use activity monitoring to better estimate the calories the body burns over a period of time.

In this project, we will train a random forest on a public-domain human activity dataset titled Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. It contains 165,633 samples, one of which is invalid. Within the dataset, there are five target activities:

Sitting
Sitting Down
Standing
Standing Up
Walking

These activities were captured from four people wearing accelerometers mounted on their waist, left thigh, right arm, and right ankle. To get started:

Acquire the DLA HAR Dataset from their webpage. Be sure to get the dataset-har-PUC-Rio-ugulino.zip file and not the weight lifting one.

Based on the collected sensor readings, we'll predict which activity the user was performing (one of the five activities listed above).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import time


# Grab the DLA HAR dataset from:
# http://groupware.les.inf.puc-rio.br/har
# http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip

#
#  Load up the dataset into dataframe 'X'
#

# ';' is a plain single-character separator; the escaped form '\;' is treated
# as a regex. engine='python' is stated explicitly to keep the same parser.
X = pd.read_csv('dataset-har-PUC-Rio-ugulino.csv', sep=';', engine='python')
X.head()

	user	gender	age	how_tall_in_meters	weight	body_mass_index	x1	y1	z1	x2	y2	z2	x3	y3	z3	x4	y4	z4	class
0	debora	Woman	46	1,62	75	28,6	-3	92	-63	-23	18	-19	5	104	-92	-150	-103	-147	sitting
1	debora	Woman	46	1,62	75	28,6	-3	94	-64	-21	18	-18	-14	104	-90	-149	-104	-145	sitting
2	debora	Woman	46	1,62	75	28,6	-1	97	-61	-12	20	-15	-13	104	-90	-151	-104	-144	sitting
3	debora	Woman	46	1,62	75	28,6	-2	96	-57	-15	21	-16	-13	104	-89	-153	-103	-142	sitting
4	debora	Woman	46	1,62	75	28,6	-1	96	-61	-13	20	-15	-13	104	-89	-153	-104	-143	sitting
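The height and BMI columns use a comma as the decimal mark, which is why they load as strings. One alternative to cleaning them afterwards is the `decimal` parameter of `read_csv`. The sketch below runs on a made-up two-row sample that only mimics the file's layout; it is not the approach used in the rest of this notebook.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the layout of the real file
sample = io.StringIO(
    "user;how_tall_in_meters;body_mass_index;x1\n"
    "debora;1,62;28,6;-3\n"
    "debora;1,62;28,6;-1\n"
)

# sep=';' splits on the semicolon; decimal=',' tells pandas to read
# the comma as the decimal mark, so 1,62 becomes the float 1.62
df = pd.read_csv(sample, sep=';', decimal=',')
print(df.dtypes)
```

With this option the two columns arrive as floats, so the later string replacement and numeric conversion steps would not be needed for them.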

# Get the list of columns and data types
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165633 entries, 0 to 165632
Data columns (total 19 columns):
user                  165633 non-null object
gender                165633 non-null object
age                   165633 non-null int64
how_tall_in_meters    165633 non-null object
weight                165633 non-null int64
body_mass_index       165633 non-null object
x1                    165633 non-null int64
y1                    165633 non-null int64
z1                    165633 non-null int64
x2                    165633 non-null int64
y2                    165633 non-null int64
z2                    165633 non-null int64
x3                    165633 non-null int64
y3                    165633 non-null int64
z3                    165633 non-null int64
x4                    165633 non-null int64
y4                    165633 non-null int64
z4                    165633 non-null object
class                 165633 non-null object
dtypes: int64(13), object(6)
memory usage: 24.0+ MB

# Get the unique counts by gender

X.gender.value_counts()

Woman    101374
Man       64259
Name: gender, dtype: int64

#
# Encode the gender column, 1 as male, 0 as female
#
X.gender = X.gender.map({'Man':1, 'Woman':0})

X.gender.value_counts()

0    101374
1     64259
Name: gender, dtype: int64

X['class'].value_counts()

sitting        50631
standing       47370
walking        43390
standingup     12415
sittingdown    11827
Name: class, dtype: int64

# Encode class to integer values
X['class'] = X['class'].map({'sitting':0, 'standing':1, 'walking':2,'standingup':3, 'sittingdown':4})

X['class'].value_counts()

0    50631
1    47370
2    43390
3    12415
4    11827
Name: class, dtype: int64
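Once the classes are integers, the classification report and confusion matrix later on show only the codes 0 to 4. Keeping an inverse mapping makes it easy to translate them back. A minimal sketch, where the names `class_to_int` and `int_to_class` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Same label-to-integer mapping as used above
class_to_int = {'sitting': 0, 'standing': 1, 'walking': 2,
                'standingup': 3, 'sittingdown': 4}

# Invert it so integer codes can be turned back into readable labels
int_to_class = {code: label for label, code in class_to_int.items()}

labels = pd.Series(['sitting', 'walking', 'standingup'])
encoded = labels.map(class_to_int)
decoded = encoded.map(int_to_class)

print(encoded.tolist())
print(decoded.tolist())
```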

#
# Clean up any column with commas in it
# so that they're properly represented as decimals instead
#
X['how_tall_in_meters'] = X['how_tall_in_meters'].str.replace(',', '.')

X['body_mass_index'] = X['body_mass_index'].str.replace(',', '.')

X.head()

	user	age	how_tall_in_meters	weight	body_mass_index	x1	y1	z1	x2	y2	z2	x3	y3	z3	x4	y4	z4
0	debora	46	1.62	75	28.6	-3	92	-63	-23	18	-19	5	104	-92	-150	-103	-147
1	debora	46	1.62	75	28.6	-3	94	-64	-21	18	-18	-14	104	-90	-149	-104	-145
2	debora	46	1.62	75	28.6	-1	97	-61	-12	20	-15	-13	104	-90	-151	-104	-144
3	debora	46	1.62	75	28.6	-2	96	-57	-15	21	-16	-13	104	-89	-153	-103	-142
4	debora	46	1.62	75	28.6	-1	96	-61	-13	20	-15	-13	104	-89	-153	-104	-143

X['class'].value_counts()

0    50631
1    47370
2    43390
3    12415
4    11827
Name: class, dtype: int64

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165633 entries, 0 to 165632
Data columns (total 19 columns):
user                  165633 non-null object
gender                165633 non-null int64
age                   165633 non-null int64
how_tall_in_meters    165633 non-null object
weight                165633 non-null int64
body_mass_index       165633 non-null object
x1                    165633 non-null int64
y1                    165633 non-null int64
z1                    165633 non-null int64
x2                    165633 non-null int64
y2                    165633 non-null int64
z2                    165633 non-null int64
x3                    165633 non-null int64
y3                    165633 non-null int64
z3                    165633 non-null int64
x4                    165633 non-null int64
y4                    165633 non-null int64
z4                    165633 non-null object
class                 165633 non-null int64
dtypes: int64(15), object(4)
memory usage: 24.0+ MB

# Convert how_tall_in_meters and body_mass_index to numeric (z4 is handled separately below)
X['how_tall_in_meters'] = pd.to_numeric(X['how_tall_in_meters'], errors='raise')
X['body_mass_index'] = pd.to_numeric(X['body_mass_index'], errors='raise')

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165633 entries, 0 to 165632
Data columns (total 19 columns):
user                  165633 non-null object
gender                165633 non-null int64
age                   165633 non-null int64
how_tall_in_meters    165633 non-null float64
weight                165633 non-null int64
body_mass_index       165633 non-null float64
x1                    165633 non-null int64
y1                    165633 non-null int64
z1                    165633 non-null int64
x2                    165633 non-null int64
y2                    165633 non-null int64
z2                    165633 non-null int64
x3                    165633 non-null int64
y3                    165633 non-null int64
z3                    165633 non-null int64
x4                    165633 non-null int64
y4                    165633 non-null int64
z4                    165633 non-null object
class                 165633 non-null int64
dtypes: float64(2), int64(15), object(2)
memory usage: 24.0+ MB

# Note: this call raises a ValueError because z4 holds one malformed entry, so z4 stays as strings
X['z4'] = pd.to_numeric(X['z4'], errors='raise')

X.z4.value_counts().head()

-162    6859
-158    6770
-163    6762
-159    6641
-161    6402
Name: z4, dtype: int64

# NaN never compares equal to anything (not even itself), so test with isnull()
X[X['z4'].isnull()]

	gender	age	how_tall_in_meters	weight	body_mass_index	x1	y1	z1	x2	y2	z2	x3	y3	z3	x4	y4	z4

X['z4'].unique()

array(['-147', '-145', '-144', '-142', '-143', '-146', '-138', '-139',
       '-141', '-133', '-134', '-135', '-140', '-137', '-148', '-151',
       '-149', '-150', '-152', '-156', '-157', '-155', '-154', '-153',
       '-158', '-159', '-162', '-163', '-161', '-160', '-164', '-165',
       '-136', '-126', '-125', '-122', '-132', '-167', '-170', '-169',
       '-175', '-173', '-166', '-168', '-171', '-172', '-174', '-179',
       '-177', '-176', '-127', '-131', '-184', '-181', '-178', '-180',
       '-106', '-114', '-116', '-115', '-89', '-79', '-93', '-78', '-128',
       '-109', '-113', '-119', '-102', '-111', '-103', '-130', '-129',
       '-182', '-121', '-186', '-183', '-187', '-190', '-188', '-189',
       '-185', '-192', '-191', '-193', '-194', '-197', '-196', '-201',
       '-251', '-195', '-199', '-198', '-123', '-124', '-112', '-118',
       '-120', '-99', '-95', '-68', '-110', '-117', '-96', '-100', '-105',
       '-92', '-88', '-107', '-213', '-108', '-104', '-98', '-94', '-91',
       '-97', '-69', '-101', '-86', '-82', '-66', '-77', '-90', '-56',
       '-74', '-83', '-81', '-219', '-200', '-221', '-80', '-231', '-210',
       '-71', '-207', '-209', '-202', '-205', '-208', '-211', '-214',
       '-203', '-204', '-212', '-206', '-216', '-73',
       '-14420-11-2011 04:50:23.713', '-43', '-537', '-218', '-215', '-259'], dtype=object)
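The list of unique values shows one entry where a timestamp has been fused onto the reading. Rather than scanning the output by eye, the offending rows can be located by coercing the column and checking which values failed to parse. A minimal sketch on a made-up four-value column (the series `s` stands in for `X['z4']`):

```python
import pandas as pd

# Hypothetical short column containing one malformed entry like the one in z4
s = pd.Series(['-147', '-145', '-14420-11-2011 04:50:23.713', '-43'])

# errors='coerce' turns anything unparseable into NaN instead of raising
converted = pd.to_numeric(s, errors='coerce')

# Rows that had a value originally but failed to parse as a number
bad = s[converted.isna() & s.notna()]
print(bad)
```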

X['z4'] = pd.to_numeric(X['z4'], errors='coerce')

X.z4.value_counts().head()

-162.0    6859
-158.0    6770
-163.0    6762
-159.0    6641
-161.0    6402
Name: z4, dtype: int64

X['z4'].unique()

array([-147., -145., -144., -142., -143., -146., -138., -139., -141.,
       -133., -134., -135., -140., -137., -148., -151., -149., -150.,
       -152., -156., -157., -155., -154., -153., -158., -159., -162.,
       -163., -161., -160., -164., -165., -136., -126., -125., -122.,
       -132., -167., -170., -169., -175., -173., -166., -168., -171.,
       -172., -174., -179., -177., -176., -127., -131., -184., -181.,
       -178., -180., -106., -114., -116., -115.,  -89.,  -79.,  -93.,
        -78., -128., -109., -113., -119., -102., -111., -103., -130.,
       -129., -182., -121., -186., -183., -187., -190., -188., -189.,
       -185., -192., -191., -193., -194., -197., -196., -201., -251.,
       -195., -199., -198., -123., -124., -112., -118., -120.,  -99.,
        -95.,  -68., -110., -117.,  -96., -100., -105.,  -92.,  -88.,
       -107., -213., -108., -104.,  -98.,  -94.,  -91.,  -97.,  -69.,
       -101.,  -86.,  -82.,  -66.,  -77.,  -90.,  -56.,  -74.,  -83.,
        -81., -219., -200., -221.,  -80., -231., -210.,  -71., -207.,
       -209., -202., -205., -208., -211., -214., -203., -204., -212.,
       -206., -216.,  -73.,   nan,  -43., -537., -218., -215., -259.])

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165633 entries, 0 to 165632
Data columns (total 19 columns):
user                  165633 non-null object
gender                165633 non-null int64
age                   165633 non-null int64
how_tall_in_meters    165633 non-null float64
weight                165633 non-null int64
body_mass_index       165633 non-null float64
x1                    165633 non-null int64
y1                    165633 non-null int64
z1                    165633 non-null int64
x2                    165633 non-null int64
y2                    165633 non-null int64
z2                    165633 non-null int64
x3                    165633 non-null int64
y3                    165633 non-null int64
z3                    165633 non-null int64
x4                    165633 non-null int64
y4                    165633 non-null int64
z4                    165632 non-null float64
class                 165633 non-null int64
dtypes: float64(3), int64(15), object(1)
memory usage: 24.0+ MB

# Drop rows with NaNs
X = X.dropna()

# Make sure there are no rows with NaNs
print (X[pd.isnull(X).any(axis=1)])

Empty DataFrame
Columns: [user, gender, age, how_tall_in_meters, weight, body_mass_index, x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4, class]
Index: []

X['z4'].unique()

array([-147., -145., -144., -142., -143., -146., -138., -139., -141.,
       -133., -134., -135., -140., -137., -148., -151., -149., -150.,
       -152., -156., -157., -155., -154., -153., -158., -159., -162.,
       -163., -161., -160., -164., -165., -136., -126., -125., -122.,
       -132., -167., -170., -169., -175., -173., -166., -168., -171.,
       -172., -174., -179., -177., -176., -127., -131., -184., -181.,
       -178., -180., -106., -114., -116., -115.,  -89.,  -79.,  -93.,
        -78., -128., -109., -113., -119., -102., -111., -103., -130.,
       -129., -182., -121., -186., -183., -187., -190., -188., -189.,
       -185., -192., -191., -193., -194., -197., -196., -201., -251.,
       -195., -199., -198., -123., -124., -112., -118., -120.,  -99.,
        -95.,  -68., -110., -117.,  -96., -100., -105.,  -92.,  -88.,
       -107., -213., -108., -104.,  -98.,  -94.,  -91.,  -97.,  -69.,
       -101.,  -86.,  -82.,  -66.,  -77.,  -90.,  -56.,  -74.,  -83.,
        -81., -219., -200., -221.,  -80., -231., -210.,  -71., -207.,
       -209., -202., -205., -208., -211., -214., -203., -204., -212.,
       -206., -216.,  -73.,  -43., -537., -218., -215., -259.])

# Convert Z4 to int
X['z4'] = X['z4'].astype(int)

# Store the target variable (the activity class) in y
y = X[['class']]

# Drop the target column from X so only the predictors remain
X = X.drop(labels=['class'], axis=1)
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165632 entries, 0 to 165632
Data columns (total 18 columns):
user                  165632 non-null object
gender                165632 non-null int64
age                   165632 non-null int64
how_tall_in_meters    165632 non-null float64
weight                165632 non-null int64
body_mass_index       165632 non-null float64
x1                    165632 non-null int64
y1                    165632 non-null int64
z1                    165632 non-null int64
x2                    165632 non-null int64
y2                    165632 non-null int64
z2                    165632 non-null int64
x3                    165632 non-null int64
y3                    165632 non-null int64
z3                    165632 non-null int64
x4                    165632 non-null int64
y4                    165632 non-null int64
z4                    165632 non-null int64
dtypes: float64(2), int64(15), object(1)
memory usage: 24.0+ MB

# Drop the user name: it identifies the person and is the only non-numeric column left
X = X.drop(labels=['user'], axis=1)
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165632 entries, 0 to 165632
Data columns (total 17 columns):
gender                165632 non-null int64
age                   165632 non-null int64
how_tall_in_meters    165632 non-null float64
weight                165632 non-null int64
body_mass_index       165632 non-null float64
x1                    165632 non-null int64
y1                    165632 non-null int64
z1                    165632 non-null int64
x2                    165632 non-null int64
y2                    165632 non-null int64
z2                    165632 non-null int64
x3                    165632 non-null int64
y3                    165632 non-null int64
z3                    165632 non-null int64
x4                    165632 non-null int64
y4                    165632 non-null int64
z4                    165632 non-null int64
dtypes: float64(2), int64(15)
memory usage: 22.7 MB

print (X[pd.isnull(X).any(axis=1)])

Empty DataFrame
Columns: [gender, age, how_tall_in_meters, weight, body_mass_index, x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4]
Index: []

# Split your data into test / train sets
# Your test size can be 30% with random_state 7
# Use variable names: X_train, X_test, y_train, y_test
#
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(y), test_size=0.30, random_state=7)
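The class counts shown earlier are uneven: the two transition activities have roughly a quarter as many samples as the others. A plain random split usually preserves those proportions reasonably well at this sample size, but `train_test_split` also accepts a `stratify` argument that enforces them. The sketch below uses synthetic stand-in data, and stratification is an optional variation rather than what this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 80 of class 0 and 20 of class 1
features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 80 + [1] * 20)

# stratify=labels keeps the 80/20 class ratio in both partitions
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.30, random_state=7, stratify=labels)

print(np.bincount(l_train), np.bincount(l_test))
```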

#
# Create an RForest classifier 'model' and set n_estimators=30,
# the max_depth to 10, and oob_score=True, and random_state=0
#
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=30, max_depth=10, oob_score=True, random_state=0)

print ("Fitting...")
s = time.time()
#
# train your model on your training set
#
model.fit(X_train, y_train)

print ("Fitting completed in: ", time.time() - s)

Fitting...
Fitting completed in:  4.249861001968384

#
# INFO: Display the OOB Score of your data
oob_score = model.oob_score_
print ("OOB Score: ", round((oob_score*100), 3))

OOB Score:  97.012
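The out-of-bag (OOB) score is an accuracy estimate that comes from the training process itself: each tree is fit on a bootstrap sample, and the samples left out of that tree's bootstrap are used to evaluate it. The sketch below reproduces the idea on synthetic data generated with `make_classification`; the data and the names `X_demo`, `y_demo`, and `forest` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the HAR dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, random_state=0)

# oob_score=True relies on bootstrap sampling, which is the default
forest = RandomForestClassifier(n_estimators=30, max_depth=10,
                                oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

# Accuracy measured on samples that each tree did not see during training
print("OOB score:", round(forest.oob_score_ * 100, 3))
```

Because it needs no held-out data, the OOB score is a convenient sanity check to compare against the test-set score computed next.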

print ("Scoring...")
s = time.time()
#
# score your model on your test set
#
score = model.score(X_test, y_test)

print ("Score: ", round((score*100), 3))
print ("Scoring completed in: ", time.time() - s)

Scoring...
Score:  97.094
Scoring completed in:  0.21677207946777344

# Predict on the test set using the trained model

predictions = model.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     15209
          1       0.93      0.99      0.96     14077
          2       0.99      0.97      0.98     13189
          3       0.97      0.83      0.90      3725
          4       0.95      0.90      0.93      3490

avg / total       0.97      0.97      0.97     49690

print(confusion_matrix(y_test,predictions))

[[15188     0     0    19     2]
 [    0 13992    82     3     0]
 [    0   348 12825     6    10]
 [   22   421    39  3101   142]
 [   26   235    31    58  3140]]
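The figures in the classification report can be recovered directly from this confusion matrix, which is a useful check on how to read it. In scikit-learn's layout, rows are the true classes and columns are the predicted classes. A short sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([
    [15188,     0,     0,   19,    2],
    [    0, 13992,    82,    3,    0],
    [    0,   348, 12825,    6,   10],
    [   22,   421,    39, 3101,  142],
    [   26,   235,    31,   58, 3140],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)      # divide by row totals (true counts)
precision = correct / cm.sum(axis=0)   # divide by column totals (predicted counts)
accuracy = correct.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy * 100, 3))
```

The second column explains the lower precision for class 1 (standing): most misclassified samples from classes 2, 3, and 4 were predicted as standing.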

Accuracy

The percentage of cases correctly classified
(TP+TN)/(P+N)

Recall or True Positive Rate or Sensitivity

The proportion of positive instances that are correctly classified as positive
TP/P

Precision or Positive Predictive Value

Proportion of instances classified as positive that are really positive
TP/(TP+FP)

True negative rate or Specificity

The proportion of negative instances that are correctly classified as negative
TN/N

F-1 Score

A measure that combines Precision and Recall
(2 * Precision * Recall) / (Precision + Recall)
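These definitions are stated for a two-class problem. For a multi-class model they are applied one class at a time, treating that class as positive and all others as negative. The sketch below applies the formulas to class 3 (standingup), with counts read off the confusion matrix above; the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return the five metrics defined above from raw confusion counts."""
    p = tp + fn            # all actual positives
    n = tn + fp            # all actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    return {
        'accuracy': (tp + tn) / (p + n),
        'recall': recall,
        'precision': precision,
        'specificity': tn / n,
        'f1': 2 * precision * recall / (precision + recall),
    }

# One-vs-rest counts for class 3: TP is the diagonal cell, FN the rest of
# its row, FP the rest of its column, TN everything else
metrics = binary_metrics(tp=3101, fp=86, fn=624, tn=45879)
for name, value in metrics.items():
    print(name, round(value, 2))
```

The precision, recall, and F1 values for this class agree with the corresponding row of the classification report.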