Human activity monitoring is a growing field within data science. It has practical uses in the healthcare industry, particularly in monitoring the elderly so that they avoid activities that might cause them injury. Governments are also interested in it for detecting unusual crowd activity and perimeter breaches, and for identifying specific behaviors such as loitering, littering, or fighting. Fitness apps also use activity monitoring to better estimate the number of calories the body burns over a period of time.
In this project, we will train a random forest on a public-domain human activity dataset titled Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements, which contains 165,633 samples, one of which is invalid. The dataset has five target activities:
- Sitting
- Sitting Down
- Standing
- Standing Up
- Walking
These activities were captured from four people wearing accelerometers mounted on their waist, left thigh, right arm, and right ankle. To get started:
Acquire the DLA HAR Dataset from its webpage. Be sure to get the dataset-har-PUC-Rio-ugulino.zip file and not the weight lifting one.
Based on the collected sensor data, we'll predict which of the five activities listed above the user was performing.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import time
# Grab the DLA HAR dataset from:
# http://groupware.les.inf.puc-rio.br/har
# http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip
#
# Load up the dataset into dataframe 'X'
#
X = pd.read_csv('dataset-har-PUC-Rio-ugulino.csv', sep=';', engine='python')
X.head()
| | user | gender | age | how_tall_in_meters | weight | body_mass_index | x1 | y1 | z1 | x2 | y2 | z2 | x3 | y3 | z3 | x4 | y4 | z4 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | debora | Woman | 46 | 1,62 | 75 | 28,6 | -3 | 92 | -63 | -23 | 18 | -19 | 5 | 104 | -92 | -150 | -103 | -147 | sitting |
| 1 | debora | Woman | 46 | 1,62 | 75 | 28,6 | -3 | 94 | -64 | -21 | 18 | -18 | -14 | 104 | -90 | -149 | -104 | -145 | sitting |
| 2 | debora | Woman | 46 | 1,62 | 75 | 28,6 | -1 | 97 | -61 | -12 | 20 | -15 | -13 | 104 | -90 | -151 | -104 | -144 | sitting |
| 3 | debora | Woman | 46 | 1,62 | 75 | 28,6 | -2 | 96 | -57 | -15 | 21 | -16 | -13 | 104 | -89 | -153 | -103 | -142 | sitting |
| 4 | debora | Woman | 46 | 1,62 | 75 | 28,6 | -1 | 96 | -61 | -13 | 20 | -15 | -13 | 104 | -89 | -153 | -104 | -143 | sitting |
# Get the list of columns and data types
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165633 entries, 0 to 165632
Data columns (total 19 columns):
user 165633 non-null object
gender 165633 non-null object
age 165633 non-null int64
how_tall_in_meters 165633 non-null object
weight 165633 non-null int64
body_mass_index 165633 non-null object
x1 165633 non-null int64
y1 165633 non-null int64
z1 165633 non-null int64
x2 165633 non-null int64
y2 165633 non-null int64
z2 165633 non-null int64
x3 165633 non-null int64
y3 165633 non-null int64
z3 165633 non-null int64
x4 165633 non-null int64
y4 165633 non-null int64
z4 165633 non-null object
class 165633 non-null object
dtypes: int64(13), object(6)
memory usage: 24.0+ MB
# Get the unique counts by gender
X.gender.value_counts()
Woman 101374
Man 64259
Name: gender, dtype: int64
#
# Encode the gender column, 1 as male, 0 as female
#
X.gender = X.gender.map({'Man':1, 'Woman':0})
X.gender.value_counts()
0 101374
1 64259
Name: gender, dtype: int64
X['class'].value_counts()
sitting 50631
standing 47370
walking 43390
standingup 12415
sittingdown 11827
Name: class, dtype: int64
# Encode class to integer values
X['class'] = X['class'].map({'sitting':0, 'standing':1, 'walking':2,'standingup':3, 'sittingdown':4})
X['class'].value_counts()
0 50631
1 47370
2 43390
3 12415
4 11827
Name: class, dtype: int64
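`Series.map` with a dictionary turns any value missing from the dictionary into NaN without raising an error, so a quick check after encoding can catch a misspelled or unexpected label. A minimal sketch on a toy Series follows; the label 'jumping' is a hypothetical unmapped value and does not occur in the dataset.

```python
import pandas as pd

class_codes = {'sitting': 0, 'standing': 1, 'walking': 2,
               'standingup': 3, 'sittingdown': 4}

# Toy labels; 'jumping' is deliberately absent from the mapping (hypothetical value)
labels = pd.Series(['sitting', 'walking', 'jumping'])
encoded = labels.map(class_codes)

# Unmapped labels become NaN, so list them before trusting the encoding
unmapped = labels[encoded.isnull()]
print(unmapped.tolist())
```

On the real data, an empty list here would confirm that every label was covered by the mapping.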
#
# Clean up any column with commas in it
# so that they're properly represented as decimals instead
#
X['how_tall_in_meters'] = X['how_tall_in_meters'].str.replace(',', '.')
X['body_mass_index'] = X['body_mass_index'].str.replace(',', '.')
X.head()
| | user | gender | age | how_tall_in_meters | weight | body_mass_index | x1 | y1 | z1 | x2 | y2 | z2 | x3 | y3 | z3 | x4 | y4 | z4 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | debora | 0 | 46 | 1.62 | 75 | 28.6 | -3 | 92 | -63 | -23 | 18 | -19 | 5 | 104 | -92 | -150 | -103 | -147 | 0 |
| 1 | debora | 0 | 46 | 1.62 | 75 | 28.6 | -3 | 94 | -64 | -21 | 18 | -18 | -14 | 104 | -90 | -149 | -104 | -145 | 0 |
| 2 | debora | 0 | 46 | 1.62 | 75 | 28.6 | -1 | 97 | -61 | -12 | 20 | -15 | -13 | 104 | -90 | -151 | -104 | -144 | 0 |
| 3 | debora | 0 | 46 | 1.62 | 75 | 28.6 | -2 | 96 | -57 | -15 | 21 | -16 | -13 | 104 | -89 | -153 | -103 | -142 | 0 |
| 4 | debora | 0 | 46 | 1.62 | 75 | 28.6 | -1 | 96 | -61 | -13 | 20 | -15 | -13 | 104 | -89 | -153 | -104 | -143 | 0 |
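As an alternative to replacing commas after loading, `pandas.read_csv` accepts a `decimal` argument that parses decimal commas while reading. The sketch below is one possible shortcut, not the approach used in this notebook; it runs on a small inline sample built from the first rows shown above, so it does not need the downloaded file.

```python
import io
import pandas as pd

# Inline sample mimicking the file's layout: ';' separated, ',' as the decimal mark
sample = io.StringIO(
    "user;how_tall_in_meters;weight;body_mass_index\n"
    "debora;1,62;75;28,6\n"
    "debora;1,62;75;28,6\n"
)

# decimal=',' tells the parser to read '1,62' as the float 1.62
df = pd.read_csv(sample, sep=';', decimal=',')
print(df.dtypes)
```

With this option the two columns arrive as floats, which would make the later string replacement and `pd.to_numeric` calls for them unnecessary.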
X['class'].value_counts()
0 50631
1 47370
2 43390
3 12415
4 11827
Name: class, dtype: int64
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165633 entries, 0 to 165632
Data columns (total 19 columns):
user 165633 non-null object
gender 165633 non-null int64
age 165633 non-null int64
how_tall_in_meters 165633 non-null object
weight 165633 non-null int64
body_mass_index 165633 non-null object
x1 165633 non-null int64
y1 165633 non-null int64
z1 165633 non-null int64
x2 165633 non-null int64
y2 165633 non-null int64
z2 165633 non-null int64
x3 165633 non-null int64
y3 165633 non-null int64
z3 165633 non-null int64
x4 165633 non-null int64
y4 165633 non-null int64
z4 165633 non-null object
class 165633 non-null int64
dtypes: int64(15), object(4)
memory usage: 24.0+ MB
# Convert how_tall_in_meters and body_mass_index to numeric (z4 is handled separately below)
X['how_tall_in_meters'] = pd.to_numeric(X['how_tall_in_meters'], errors='raise')
X['body_mass_index'] = pd.to_numeric(X['body_mass_index'], errors='raise')
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165633 entries, 0 to 165632
Data columns (total 19 columns):
user 165633 non-null object
gender 165633 non-null int64
age 165633 non-null int64
how_tall_in_meters 165633 non-null float64
weight 165633 non-null int64
body_mass_index 165633 non-null float64
x1 165633 non-null int64
y1 165633 non-null int64
z1 165633 non-null int64
x2 165633 non-null int64
y2 165633 non-null int64
z2 165633 non-null int64
x3 165633 non-null int64
y3 165633 non-null int64
z3 165633 non-null int64
x4 165633 non-null int64
y4 165633 non-null int64
z4 165633 non-null object
class 165633 non-null int64
dtypes: float64(2), int64(15), object(2)
memory usage: 24.0+ MB
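# A strict conversion of z4 raises a ValueError here because one value is malformed
# (it shows up in the unique values below), so the column remains of type object.
X['z4'] = pd.to_numeric(X['z4'], errors='raise')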
X.z4.value_counts().head()
-162 6859
-158 6770
-163 6762
-159 6641
-161 6402
Name: z4, dtype: int64
# Look for missing values; a comparison with == np.NaN never matches, so use isnull()
X[X['z4'].isnull()]
gender | age | how_tall_in_meters | weight | body_mass_index | x1 | y1 | z1 | x2 | y2 | z2 | x3 | y3 | z3 | x4 | y4 | z4 |
---|
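An equality test against NaN returns no rows even when missing values exist, because NaN compares unequal to everything, including itself; `isnull()` is the reliable test. A minimal sketch on a toy Series (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy column with exactly one missing value
s = pd.Series([-147.0, np.nan, -145.0])

# NaN is not equal to anything, including itself, so == finds no rows
matches_by_equality = int((s == np.nan).sum())

# isnull() is the dedicated missing-value test
matches_by_isnull = int(s.isnull().sum())

print(matches_by_equality, matches_by_isnull)
```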
X['z4'].unique()
array(['-147', '-145', '-144', '-142', '-143', '-146', '-138', '-139',
'-141', '-133', '-134', '-135', '-140', '-137', '-148', '-151',
'-149', '-150', '-152', '-156', '-157', '-155', '-154', '-153',
'-158', '-159', '-162', '-163', '-161', '-160', '-164', '-165',
'-136', '-126', '-125', '-122', '-132', '-167', '-170', '-169',
'-175', '-173', '-166', '-168', '-171', '-172', '-174', '-179',
'-177', '-176', '-127', '-131', '-184', '-181', '-178', '-180',
'-106', '-114', '-116', '-115', '-89', '-79', '-93', '-78', '-128',
'-109', '-113', '-119', '-102', '-111', '-103', '-130', '-129',
'-182', '-121', '-186', '-183', '-187', '-190', '-188', '-189',
'-185', '-192', '-191', '-193', '-194', '-197', '-196', '-201',
'-251', '-195', '-199', '-198', '-123', '-124', '-112', '-118',
'-120', '-99', '-95', '-68', '-110', '-117', '-96', '-100', '-105',
'-92', '-88', '-107', '-213', '-108', '-104', '-98', '-94', '-91',
'-97', '-69', '-101', '-86', '-82', '-66', '-77', '-90', '-56',
'-74', '-83', '-81', '-219', '-200', '-221', '-80', '-231', '-210',
'-71', '-207', '-209', '-202', '-205', '-208', '-211', '-214',
'-203', '-204', '-212', '-206', '-216', '-73',
'-14420-11-2011 04:50:23.713', '-43', '-537', '-218', '-215', '-259'], dtype=object)
X['z4'] = pd.to_numeric(X['z4'], errors='coerce')
X.z4.value_counts().head()
-162.0 6859
-158.0 6770
-163.0 6762
-159.0 6641
-161.0 6402
Name: z4, dtype: int64
X['z4'].unique()
array([-147., -145., -144., -142., -143., -146., -138., -139., -141.,
-133., -134., -135., -140., -137., -148., -151., -149., -150.,
-152., -156., -157., -155., -154., -153., -158., -159., -162.,
-163., -161., -160., -164., -165., -136., -126., -125., -122.,
-132., -167., -170., -169., -175., -173., -166., -168., -171.,
-172., -174., -179., -177., -176., -127., -131., -184., -181.,
-178., -180., -106., -114., -116., -115., -89., -79., -93.,
-78., -128., -109., -113., -119., -102., -111., -103., -130.,
-129., -182., -121., -186., -183., -187., -190., -188., -189.,
-185., -192., -191., -193., -194., -197., -196., -201., -251.,
-195., -199., -198., -123., -124., -112., -118., -120., -99.,
-95., -68., -110., -117., -96., -100., -105., -92., -88.,
-107., -213., -108., -104., -98., -94., -91., -97., -69.,
-101., -86., -82., -66., -77., -90., -56., -74., -83.,
-81., -219., -200., -221., -80., -231., -210., -71., -207.,
-209., -202., -205., -208., -211., -214., -203., -204., -212.,
-206., -216., -73., nan, -43., -537., -218., -215., -259.])
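Rather than scanning the full list of unique values by eye, the malformed entry can be located directly: coerce a copy of the column and select the rows that became NaN even though the raw text was present. A minimal sketch on a toy Series that includes the malformed string from the dataset:

```python
import pandas as pd

# Toy column of strings, including the malformed value found in z4
raw = pd.Series(['-147', '-145', '-14420-11-2011 04:50:23.713', '-43'])

# errors='coerce' turns anything unparseable into NaN instead of raising
converted = pd.to_numeric(raw, errors='coerce')

# Rows that were present as text but failed to parse
bad = raw[converted.isnull() & raw.notnull()]
print(bad)
```

The index of `bad` identifies the offending row, which helps when deciding whether to drop it or repair it.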
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165633 entries, 0 to 165632
Data columns (total 19 columns):
user 165633 non-null object
gender 165633 non-null int64
age 165633 non-null int64
how_tall_in_meters 165633 non-null float64
weight 165633 non-null int64
body_mass_index 165633 non-null float64
x1 165633 non-null int64
y1 165633 non-null int64
z1 165633 non-null int64
x2 165633 non-null int64
y2 165633 non-null int64
z2 165633 non-null int64
x3 165633 non-null int64
y3 165633 non-null int64
z3 165633 non-null int64
x4 165633 non-null int64
y4 165633 non-null int64
z4 165632 non-null float64
class 165633 non-null int64
dtypes: float64(3), int64(15), object(1)
memory usage: 24.0+ MB
# Drop rows with NaNs
X = X.dropna()
# Make sure there are no rows with NaNs
print (X[pd.isnull(X).any(axis=1)])
Empty DataFrame
Columns: [user, gender, age, how_tall_in_meters, weight, body_mass_index, x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4, class]
Index: []
X['z4'].unique()
array([-147., -145., -144., -142., -143., -146., -138., -139., -141.,
-133., -134., -135., -140., -137., -148., -151., -149., -150.,
-152., -156., -157., -155., -154., -153., -158., -159., -162.,
-163., -161., -160., -164., -165., -136., -126., -125., -122.,
-132., -167., -170., -169., -175., -173., -166., -168., -171.,
-172., -174., -179., -177., -176., -127., -131., -184., -181.,
-178., -180., -106., -114., -116., -115., -89., -79., -93.,
-78., -128., -109., -113., -119., -102., -111., -103., -130.,
-129., -182., -121., -186., -183., -187., -190., -188., -189.,
-185., -192., -191., -193., -194., -197., -196., -201., -251.,
-195., -199., -198., -123., -124., -112., -118., -120., -99.,
-95., -68., -110., -117., -96., -100., -105., -92., -88.,
-107., -213., -108., -104., -98., -94., -91., -97., -69.,
-101., -86., -82., -66., -77., -90., -56., -74., -83.,
-81., -219., -200., -221., -80., -231., -210., -71., -207.,
-209., -202., -205., -208., -211., -214., -203., -204., -212.,
-206., -216., -73., -43., -537., -218., -215., -259.])
# Convert Z4 to int
X['z4'] = X['z4'].astype(int)
# Store the target variable in y
y = X[['class']]
# Drop the target column from X
X = X.drop(labels=['class'], axis=1)
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 165632 entries, 0 to 165632
Data columns (total 18 columns):
user 165632 non-null object
gender 165632 non-null int64
age 165632 non-null int64
how_tall_in_meters 165632 non-null float64
weight 165632 non-null int64
body_mass_index 165632 non-null float64
x1 165632 non-null int64
y1 165632 non-null int64
z1 165632 non-null int64
x2 165632 non-null int64
y2 165632 non-null int64
z2 165632 non-null int64
x3 165632 non-null int64
y3 165632 non-null int64
z3 165632 non-null int64
x4 165632 non-null int64
y4 165632 non-null int64
z4 165632 non-null int64
dtypes: float64(2), int64(15), object(1)
memory usage: 24.0+ MB
X = X.drop(labels=['user'], axis=1)
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 165632 entries, 0 to 165632
Data columns (total 17 columns):
gender 165632 non-null int64
age 165632 non-null int64
how_tall_in_meters 165632 non-null float64
weight 165632 non-null int64
body_mass_index 165632 non-null float64
x1 165632 non-null int64
y1 165632 non-null int64
z1 165632 non-null int64
x2 165632 non-null int64
y2 165632 non-null int64
z2 165632 non-null int64
x3 165632 non-null int64
y3 165632 non-null int64
z3 165632 non-null int64
x4 165632 non-null int64
y4 165632 non-null int64
z4 165632 non-null int64
dtypes: float64(2), int64(15)
memory usage: 22.7 MB
print (X[pd.isnull(X).any(axis=1)])
Empty DataFrame
Columns: [gender, age, how_tall_in_meters, weight, body_mass_index, x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4]
Index: []
# Split your data into test / train sets
# Your test size can be 30% with random_state 7
# Use variable names: X_train, X_test, y_train, y_test
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(y), test_size=0.30, random_state=7)
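Because the five classes are imbalanced (sitting has roughly four times as many rows as sittingdown), one option is to pass `stratify` to `train_test_split` so that both partitions keep the same class proportions. This is an optional variation and not how the results in this notebook were produced. A minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 samples of class 0 and 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.30, random_state=7, stratify=labels)

# With stratify, the test set preserves the 80/20 class ratio
print(np.bincount(y_te))
```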
#
# Create an RForest classifier 'model' and set n_estimators=30,
# the max_depth to 10, and oob_score=True, and random_state=0
#
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30, max_depth=10, oob_score=True, random_state=0)
print ("Fitting...")
s = time.time()
#
# train your model on your training set
#
model.fit(X_train, y_train)
print ("Fitting completed in: ", time.time() - s)
Fitting...
Fitting completed in: 4.249861001968384
#
# INFO: Display the OOB Score of your data
oob_score = model.oob_score_
print ("OOB Score: ", round((oob_score*100), 3))
OOB Score: 97.012
print ("Scoring...")
s = time.time()
#
# score your model on your test set
#
score = model.score(X_test, y_test)
print ("Score: ", round((score*100), 3))
print ("Scoring completed in: ", time.time() - s)
Scoring...
Score: 97.094
Scoring completed in: 0.21677207946777344
# Predict on the test set using the trained model
predictions = model.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support
          0       1.00      1.00      1.00     15209
          1       0.93      0.99      0.96     14077
          2       0.99      0.97      0.98     13189
          3       0.97      0.83      0.90      3725
          4       0.95      0.90      0.93      3490
avg / total       0.97      0.97      0.97     49690
print(confusion_matrix(y_test,predictions))
[[15188 0 0 19 2]
[ 0 13992 82 3 0]
[ 0 348 12825 6 10]
[ 22 421 39 3101 142]
[ 26 235 31 58 3140]]
Accuracy
- The percentage of cases correctly classified
- (TP+TN)/(P+N)
Recall or True Positive Rate or Sensitivity
- The proportion of positive instances that are correctly classified as positive
- TP/P
Precision or Positive Predictive Value
- Proportion of instances classified as positive that are really positive
- TP/(TP+FP)
True negative rate or Specificity
- The proportion of negative instances that are correctly classified as negative
- TN/N
F-1 Score
- A measure that combines Precision and Recall
- (2 × Precision × Recall)/(Precision + Recall)
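In a multiclass problem these definitions are applied one class at a time, treating that class as "positive" and all the others as "negative". As a worked example, the sketch below recomputes the figures from the confusion matrix printed above in plain Python. It relies on scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are chosen for illustration.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = [
    [15188,     0,     0,   19,    2],
    [    0, 13992,    82,    3,    0],
    [    0,   348, 12825,    6,   10],
    [   22,   421,    39, 3101,  142],
    [   26,   235,    31,   58, 3140],
]

n = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy = sum(cm[i][i] for i in range(n)) / total
print("accuracy:", round(accuracy * 100, 3))

metrics = {}
for k in range(n):
    tp = cm[k][k]
    actual = sum(cm[k])                          # P: all true members of class k
    predicted = sum(cm[i][k] for i in range(n))  # TP + FP: everything predicted as k
    tn = total - actual - predicted + tp         # neither truly k nor predicted as k
    recall = tp / actual                         # TP / P
    precision = tp / predicted                   # TP / (TP + FP)
    specificity = tn / (total - actual)          # TN / N
    f1 = 2 * precision * recall / (precision + recall)
    metrics[k] = (precision, recall, specificity, f1)
    print(k, round(precision, 2), round(recall, 2),
          round(specificity, 2), round(f1, 2))
```

The per-class precision, recall, and F-1 values match the classification report once rounded to two decimals, and the accuracy matches the test score of 97.094.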