Parkinson’s disease predictor

Parkinson's disease is a long-term disorder of the nervous system that affects many aspects of a person's mobility over time. It is characterized by shaking, slowed movement, rigidity, dementia, and depression. In 2013, some 53 million people were diagnosed with it, mostly men. Well-known people affected by it include actor Michael J. Fox and Olympic cyclist Davis Phinney.

In this notebook we apply PCA and SVC to the Parkinson's Data Set, provided courtesy of UCI's Machine Learning Repository. The dataset was created at the University of Oxford in collaboration with 10 medical centers around the US, along with Intel, who developed the device used to record the primary features of the dataset: speech signals. The goals for this assignment are first to see whether SciKit-Learn's support vector classifier can differentiate between people who have Parkinson's and those who don't, and then to take a first stab at a naive way of fine-tuning the parameters in an attempt to maximize accuracy on the testing set.

Steps to follow

Load the /Module6/Datasets/parkinsons.data data set into a variable X, being sure to drop the name column.
Copy the status column into a variable y, then delete it from X.
Perform a train/test split with a 30% test group size and a random_state equal to 7.
Apply PCA to reduce the number of dimensions, then create an SVC classifier.
Fit the classifier against the training data, then score it on the testing data.
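
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses a synthetic stand-in from make_classification instead of parkinsons.data so that it runs without the file, the 195 rows and 22 features are chosen to resemble the real data, and n_components=7 is an arbitrary pick. The scaling step is not in the list above but mirrors what the notebook does later.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for parkinsons.data (an assumption, not the real data):
# 195 rows, 22 numeric features, binary label.
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=8, random_state=0
)

# 30% test split with random_state=7, as in the steps.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7
)

# Standardize using statistics from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Reduce dimensions with PCA (n_components=7 is an arbitrary choice here),
# then fit an RBF-kernel SVC on the projected training data.
pca = PCA(n_components=7).fit(X_train_s)
clf = SVC(kernel='rbf')
clf.fit(pca.transform(X_train_s), y_train)

# Score on the projected test data.
demo_score = clf.score(pca.transform(X_test_s), y_test)
print('test accuracy on synthetic data:', demo_score)
```

The accuracy printed here says nothing about the Parkinson's data; the point is only the order of operations, in particular that the classifier is fit and scored on the PCA projections.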

Matrix column entries (attributes):

name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
status - Health status of the subject: one (1) for Parkinson's, zero (0) for healthy
RPDE,D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

X = pd.read_csv('/Users/Pandu/Downloads/DAT210x/code/Module6/Datasets/parkinsons.data')
X.head()

	name	MDVP:Fo(Hz)	MDVP:Fhi(Hz)	MDVP:Flo(Hz)	MDVP:Jitter(%)	MDVP:Jitter(Abs)	MDVP:RAP	MDVP:PPQ	Jitter:DDP	MDVP:Shimmer	...	Shimmer:DDA	NHR	HNR	status	RPDE	DFA	spread1	spread2	D2	PPE
0	phon_R01_S01_1	119.992	157.302	74.997	0.00784	0.00007	0.00370	0.00554	0.01109	0.04374	...	0.06545	0.02211	21.033	1	0.414783	0.815285	-4.813031	0.266482	2.301442	0.284654
1	phon_R01_S01_2	122.400	148.650	113.819	0.00968	0.00008	0.00465	0.00696	0.01394	0.06134	...	0.09403	0.01929	19.085	1	0.458359	0.819521	-4.075192	0.335590	2.486855	0.368674
2	phon_R01_S01_3	116.682	131.111	111.555	0.01050	0.00009	0.00544	0.00781	0.01633	0.05233	...	0.08270	0.01309	20.651	1	0.429895	0.825288	-4.443179	0.311173	2.342259	0.332634
3	phon_R01_S01_4	116.676	137.871	111.366	0.00997	0.00009	0.00502	0.00698	0.01505	0.05492	...	0.08771	0.01353	20.644	1	0.434969	0.819235	-4.117501	0.334147	2.405554	0.368975
4	phon_R01_S01_5	116.014	141.781	110.655	0.01284	0.00011	0.00655	0.00908	0.01966	0.06425	...	0.10470	0.01767	19.649	1	0.417356	0.823484	-3.747787	0.234513	2.332180	0.410335

5 rows × 24 columns

# Drop the name column (a subject/recording identifier, not a feature)
X = X.drop(labels=['name'], axis=1)

# Store the target class (status) in y
y = X[['status']]

# Drop the status column (target class) from the feature data
X = X.drop(labels=['status'], axis=1)

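from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance;
# PCA and the RBF-kernel SVC are both sensitive to feature scale.
# Note: the scaler is fit on the full data set here, before the train/test split.
scaler = StandardScaler()
scaler.fit(X)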

StandardScaler(copy=True, with_mean=True, with_std=True)

scaled_data = scaler.transform(X)
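
As a small worked example of what the scaler computes, the sketch below standardizes the five MDVP:Fo(Hz) values shown in the X.head() output by hand. It uses only those five rows for illustration, whereas the notebook's scaler uses the mean and standard deviation of the whole column. StandardScaler divides by the population standard deviation, which is what pstdev returns.

```python
from statistics import fmean, pstdev

# First five MDVP:Fo(Hz) values from the X.head() output.
fo = [119.992, 122.400, 116.682, 116.676, 116.014]

mu = fmean(fo)        # column mean
sigma = pstdev(fo)    # population standard deviation (divides by n)

# z-score: subtract the mean, divide by the standard deviation.
z = [(v - mu) / sigma for v in fo]
print([round(v, 3) for v in z])
```

After this transformation the values have mean 0 and standard deviation 1, so no single feature dominates purely because of its units.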

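# The loop below runs PCA repeatedly for n_components 4 to 14,
# with a grid search over the SVC parameters on each pass

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC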

# Candidate values for the grid search: evenly spaced steps for C and gamma
C_range = np.arange(0.05, 2, 0.05)
gamma_range = np.arange(0.001, 0.1, 0.001)

# Hold out 30% of the rows for testing; np.ravel flattens y to the 1-D array SVC expects
X_train, X_test, y_train, y_test = train_test_split(scaled_data, np.ravel(y), test_size=0.30, random_state=7)

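for nc in range(4,15):
    ## Perform PCA (fit on the training split only, then project both splits)
    pca = PCA(n_components=nc)
    pca.fit(X_train)

    X_t_train = pca.transform(X_train)
    X_t_test = pca.transform(X_test)

    ## Perform SVC grid search over gamma and C with an RBF kernel
    ## NOTE: as written, the search is fit and scored on X_train / X_test (all
    ## scaled features), not on the projections X_t_train / X_t_test, so the
    ## result should not change with nc. The recorded outputs come from this form.
    param_grid = {"gamma": gamma_range.tolist(), "C": C_range.tolist(), 'kernel': ['rbf']}
    grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
    grid.fit(X_train,y_train)
    score = grid.score(X_test, y_test)
    print('n_components:', nc, 'grid.best_params_:', grid.best_params_, 'score:', score)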

print('grid.best_params_:  ', 'score:  ', grid.best_params_, score)

grid.best_params_:   score:   {'C': 0.7000000000000001, 'gamma': 0.056, 'kernel': 'rbf'} 0.898305084746
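
The loop above computes PCA projections but fits the grid search on the unprojected training data. One possible way to make the search actually use the projection is to put the scaler, PCA, and SVC into a single Pipeline and let GridSearchCV tune n_components together with C and gamma. The sketch below is a hedged illustration, not the notebook's method: it runs on a synthetic stand-in for the data, the names pipe, demo_grid, and search are hypothetical, and the grid is deliberately coarse to keep the runtime short.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled Parkinson's features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7
)

# Each step is re-fit inside every cross-validation fold, so the scaler and
# PCA never see the held-out fold or the test split.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('svc', SVC(kernel='rbf')),
])

# Coarse illustrative grid; '<step>__<param>' addresses a pipeline step.
demo_grid = {
    'pca__n_components': [4, 9, 14],
    'svc__C': [0.1, 0.5, 1.0, 2.0],
    'svc__gamma': [0.001, 0.01, 0.1],
}

search = GridSearchCV(pipe, demo_grid, cv=3)
search.fit(X_tr, y_tr)

print('best params:', search.best_params_)
print('test accuracy on synthetic data:', search.score(X_te, y_te))
```

With this arrangement the fitted search object accepts raw feature rows for predict and score, because the projection is applied inside the pipeline.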

grid_predictions = grid.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,grid_predictions))

             precision    recall  f1-score   support

          0       1.00      0.50      0.67        12
          1       0.89      1.00      0.94        47

avg / total       0.91      0.90      0.88        59
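
To see where the numbers in the report come from, the sketch below recomputes them from confusion-matrix counts. The counts are inferred from the report itself (support 12 with recall 0.50 for class 0, support 47 with recall 1.00 for class 1); they are not printed by the notebook, so treat them as an assumption that happens to be consistent with every figure shown.

```python
# Counts inferred from the classification report (an inference, not notebook output).
tn, fp = 6, 6     # true class 0: 6 predicted as 0, 6 predicted as 1
fn, tp = 0, 47    # true class 1: 0 predicted as 0, 47 predicted as 1

def prf(hits, false_alarms, misses):
    """Precision, recall and F1 for one class from its raw counts."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For class 0 the "hits" are the true negatives of the usual binary layout.
p0, r0, f0 = prf(hits=tn, false_alarms=fn, misses=fp)
p1, r1, f1_1 = prf(hits=tp, false_alarms=fp, misses=fn)

# Overall accuracy: correct predictions over all test rows.
accuracy = (tn + tp) / (tn + fp + fn + tp)

print('class 0:', round(p0, 2), round(r0, 2), round(f0, 2))
print('class 1:', round(p1, 2), round(r1, 2), round(f1_1, 2))
print('accuracy:', accuracy)
```

Read this way, the 0.90 accuracy hides that half of the healthy subjects in the test split are classified as having Parkinson's, which is worth keeping in mind given that the classes are imbalanced (12 versus 47).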