Parkinson’s disease predictor

Parkinson's disease is a long-term disorder of the nervous system that progressively affects many aspects of a person's mobility. It is characterized by shaking, slowed movement, rigidity, dementia, and depression. In 2013, some 53 million people were diagnosed with it, mostly men. Well-known people affected by it include actor Michael J. Fox and Olympic cyclist Davis Phinney.

In this notebook we apply PCA and SVC to the Parkinson's Data Set, provided courtesy of UCI's Machine Learning Repository. The dataset was created at the University of Oxford in collaboration with 10 medical centers around the US, along with Intel, who developed the device used to record the primary features of the dataset: speech signals. Your goals for this assignment are, first, to see whether SciKit-Learn's support vector classifier can differentiate between people who have Parkinson's and people who don't, and then to take a first stab at a naive way of fine-tuning your parameters to maximize accuracy on your testing set.

Steps to follow

  1. Load up the /Module6/Datasets/parkinsons.data data set into a variable X, being sure to drop the name column.

  2. Splice out the status column into a variable y and delete it from X.

  3. Perform a train/test split with a 30% test group size and a random_state equal to 7.

  4. Apply PCA to reduce the number of dimensions and then create an SVC classifier.

  5. Fit it against your training data and then score your testing data.
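
The five steps can be sketched end to end as follows. This is a minimal sketch, not the assignment's reference solution: a small synthetic DataFrame (hypothetical feature columns f0 to f5 plus name and status) stands in for parkinsons.data so the block runs on its own, and n_components=3 with default SVC parameters is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for parkinsons.data (assumption: made-up values and column names)
rng = np.random.RandomState(0)
n = 100
status = rng.randint(0, 2, size=n)
features = rng.normal(size=(n, 6)) + status[:, None]  # shift class 1 so the classes differ
df = pd.DataFrame(features, columns=['f%d' % i for i in range(6)])
df.insert(0, 'name', ['rec_%d' % i for i in range(n)])
df['status'] = status

# Steps 1-2: drop the identifier, then separate the labels from the features
X = df.drop(columns=['name'])
y = X['status']
X = X.drop(columns=['status'])

# Step 3: 30% test split with a fixed random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=7)

# Step 4: fit PCA on the training rows only, and create the classifier
pca = PCA(n_components=3).fit(X_train)
svc = SVC()

# Step 5: fit on the reduced training data, score on the reduced test data
svc.fit(pca.transform(X_train), y_train)
score = svc.score(pca.transform(X_test), y_test)
print(score)
```

Both the training and the testing features pass through the same fitted PCA object before they reach the classifier.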

Matrix column entries (attributes):

  • name - ASCII subject name and recording number
  • MDVP:Fo(Hz) - Average vocal fundamental frequency
  • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
  • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
  • MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
  • MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
  • NHR,HNR - Two measures of the ratio of noise to tonal components in the voice
  • status - Health status of the subject: 1 = Parkinson's, 0 = healthy
  • RPDE,D2 - Two nonlinear dynamical complexity measures
  • DFA - Signal fractal scaling exponent
  • spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

X = pd.read_csv('/Users/Pandu/Downloads/DAT210x/code/Module6/Datasets/parkinsons.data')
X.head()
name MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer ... Shimmer:DDA NHR HNR status RPDE DFA spread1 spread2 D2 PPE
0 phon_R01_S01_1 119.992 157.302 74.997 0.00784 0.00007 0.00370 0.00554 0.01109 0.04374 ... 0.06545 0.02211 21.033 1 0.414783 0.815285 -4.813031 0.266482 2.301442 0.284654
1 phon_R01_S01_2 122.400 148.650 113.819 0.00968 0.00008 0.00465 0.00696 0.01394 0.06134 ... 0.09403 0.01929 19.085 1 0.458359 0.819521 -4.075192 0.335590 2.486855 0.368674
2 phon_R01_S01_3 116.682 131.111 111.555 0.01050 0.00009 0.00544 0.00781 0.01633 0.05233 ... 0.08270 0.01309 20.651 1 0.429895 0.825288 -4.443179 0.311173 2.342259 0.332634
3 phon_R01_S01_4 116.676 137.871 111.366 0.00997 0.00009 0.00502 0.00698 0.01505 0.05492 ... 0.08771 0.01353 20.644 1 0.434969 0.819235 -4.117501 0.334147 2.405554 0.368975
4 phon_R01_S01_5 116.014 141.781 110.655 0.01284 0.00011 0.00655 0.00908 0.01966 0.06425 ... 0.10470 0.01767 19.649 1 0.417356 0.823484 -3.747787 0.234513 2.332180 0.410335

5 rows × 24 columns

# Drop the name column (an identifier, not a feature)
X = X.drop(labels=['name'], axis=1)
# Store the target labels (status) in y
y = X[['status']]
# Drop the status column (the target) from the feature data
X = X.drop(labels=['status'], axis=1)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_data = scaler.transform(X)
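
Note that the scaler above is fit on every row, including rows that later land in the test split, so test-set statistics influence the transform. One common alternative, sketched here on the assumption that a strictly held-out test set is wanted, is to split first and fit the scaler on the training rows only. The random array and the `_demo` names are stand-ins for illustration, not the Parkinson's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix and labels (assumption: random values, not the real dataset)
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
y_demo = rng.randint(0, 2, size=100)

# Split before scaling so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=7)

# Learn mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # reuses the training statistics

print(X_tr_scaled.mean(axis=0))  # close to zero by construction
print(X_te_scaled.mean(axis=0))  # generally not exactly zero
```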
# Run PCA repeatedly for n_components 4 to 14, then grid-search an SVC on the result

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.svm import SVC

C_range = np.arange(0.05, 2, 0.05)
gamma_range = np.arange(0.001, 0.1, 0.001)
param_grid = {'gamma': gamma_range.tolist(), 'C': C_range.tolist(), 'kernel': ['rbf']}

X_train, X_test, y_train, y_test = train_test_split(scaled_data, np.ravel(y), test_size=0.30, random_state=7)

for nc in range(4, 15):
    ## Perform PCA
    pca = PCA(n_components=nc)
    pca.fit(X_train)

    X_t_train = pca.transform(X_train)
    X_t_test = pca.transform(X_test)

    ## Grid-search the SVC parameters on the PCA-reduced features.
    ## (The run recorded below fit on the unreduced X_train, so rerunning may give different numbers.)
    grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
    grid.fit(X_t_train, y_train)
    score = grid.score(X_t_test, y_test)
    print('n_components:  ', nc, 'grid.best_params_:  ', 'score:  ', grid.best_params_, score)
print('grid.best_params_:  ', 'score:  ', grid.best_params_, score)
grid.best_params_:   score:   {'C': 0.7000000000000001, 'gamma': 0.056, 'kernel': 'rbf'} 0.898305084746
# grid and X_t_test hold the last loop iteration (n_components=14)
grid_predictions = grid.predict(X_t_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,grid_predictions))
             precision    recall  f1-score   support

          0       1.00      0.50      0.67        12
          1       0.89      1.00      0.94        47

avg / total       0.91      0.90      0.88        59
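
The report can be read back into a confusion matrix. Recall of 0.50 on 12 healthy subjects means 6 were classified correctly, and precision of 1.00 for class 0 means no Parkinson's subject was labeled healthy, so all 47 were classified correctly. The sketch below rebuilds label arrays from those counts (the arrays are reconstructed for illustration and are not the actual grid_predictions) and checks that they reproduce the reported figures.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Reconstructed from the report: 12 healthy (0) and 47 Parkinson's (1) test subjects
y_true = np.array([0] * 12 + [1] * 47)
# 6 healthy predicted healthy, 6 healthy predicted Parkinson's, all 47 Parkinson's predicted correctly
y_pred = np.array([0] * 6 + [1] * 6 + [1] * 47)

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)         # 53 correct out of 59
prec_1 = precision_score(y_true, y_pred)     # precision for class 1: 47 / 53
rec_0 = recall_score(y_true, y_pred, pos_label=0)  # recall for class 0: 6 / 12

print(cm)
print(acc, prec_1, rec_0)
```

Half of the healthy subjects in the test set are flagged as having Parkinson's, which the overall accuracy of about 0.90 hides because the classes are imbalanced (47 versus 12).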