Note
This tutorial part is also available for download as an IPython notebook: [ipynb]
This is already the second time we engage in a classification analysis, so let’s first recap what we did in the first tutorial part:
>>> from mvpa2.tutorial_suite import *
>>> ds = get_haxby2001_data()
>>> clf = kNN(k=1, dfx=one_minus_correlation, voting='majority')
>>> cvte = CrossValidation(clf, HalfPartitioner(attr='runtype'))
>>> cv_results = cvte(ds)
>>> np.mean(cv_results)
0.0625
Looking at this little code snippet we can nicely see the logical parts of a cross-validated classification analysis.
Our previous choice of the classifier was guided by the intention to replicate Haxby et al. (2001), but what if we want to try a different algorithm? In this case a nice feature of PyMVPA comes into play. All classifiers implement a common interface that makes them easily exchangeable without the need to adapt any other part of the analysis code. If, for example, we want to try the popular support vector machine (SVM) on our example dataset it looks like this:
>>> clf = LinearCSVMC()
>>> cvte = CrossValidation(clf, HalfPartitioner(attr='runtype'))
>>> cv_results = cvte(ds)
>>> np.mean(cv_results)
0.1875
Instead of k-nearest-neighbor, we create a linear SVM classifier, internally using the popular LIBSVM library (note that PyMVPA provides additional SVM implementations). The rest of the code remains identical. SVM with its default settings seems to perform slightly worse than the simple kNN-classifier. We’ll get back to the classifiers shortly. Let’s first look at the remaining part of this analysis.
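Because every classifier exposes this interface, comparing several candidates requires nothing more than a small loop. A minimal sketch, reusing only the classifiers we have already seen:
>>> # everything but the classifier instance stays the same
>>> for candidate in (kNN(k=1, dfx=one_minus_correlation, voting='majority'),
...                   LinearCSVMC()):
...     res = CrossValidation(candidate, HalfPartitioner(attr='runtype'))(ds)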
We already know that CrossValidation can be used to compute errors. So far we have used only the mean mismatch between actual targets and classifier predictions as the error function (which is the default). However, PyMVPA offers a number of alternative functions in the mvpa2.misc.errorfx module, and it is also trivial to specify custom ones. For example, if we want accuracy reported instead of error, we can do that:
>>> cvte = CrossValidation(clf, HalfPartitioner(attr='runtype'),
... errorfx=lambda p, t: np.mean(p == t))
>>> cv_results = cvte(ds)
>>> np.mean(cv_results)
0.8125
This example reuses the SVM classifier we have created before, and yields exactly what we expect from the previous result.
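The mvpa2.misc.errorfx module mentioned above already ships an equivalent ready-made function. As a sketch, assuming mean_match_accuracy is among the tutorial suite imports, the lambda can simply be replaced:
>>> # mean_match_accuracy computes the fraction of correct predictions,
>>> # just like our lambda above
>>> cvte = CrossValidation(clf, HalfPartitioner(attr='runtype'),
...                        errorfx=mean_match_accuracy)
>>> np.mean(cvte(ds))
0.8125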
The details of the cross-validation procedure are also heavily customizable. We have seen that a Partitioner is used to generate training and testing datasets for each cross-validation fold. So far we have only used HalfPartitioner to divide the dataset into odd and even runs (based on our custom sample attribute runtype). However, in general it is more common to perform so-called leave-one-out cross-validation, where one independent part of a dataset is selected as the testing dataset, while the other parts constitute the training dataset. This procedure is repeated until all parts have served as the testing dataset once. In the case of our dataset we could consider each of the 12 runs as an independent measurement (fMRI data doesn’t allow us to treat temporally adjacent volumes as independent).
To run such an analysis we first need to redo our dataset preprocessing, since in the current one we only have one sample per stimulus category for both odd and even runs. To get a dataset with one sample per stimulus category for each run, we need to modify the averaging step. Using what we have learned from the last tutorial part the following code snippet should be plausible:
>>> # directory that contains the data files
>>> datapath = os.path.join(tutorial_data_path, 'data')
>>> # load the raw data
>>> attr = SampleAttributes(os.path.join(datapath, 'attributes.txt'))
>>> ds = fmri_dataset(samples=os.path.join(datapath, 'bold.nii.gz'),
... targets=attr.targets, chunks=attr.chunks,
... mask=os.path.join(datapath, 'mask_vt.nii.gz'))
>>> # pre-process
>>> poly_detrend(ds, polyord=1, chunks_attr='chunks')
>>> zscore(ds, param_est=('targets', ['rest']))
>>> ds = ds[ds.sa.targets != 'rest']
>>> # average
>>> run_averager = mean_group_sample(['targets', 'chunks'])
>>> ds = ds.get_mapped(run_averager)
>>> ds.shape
(96, 577)
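A quick sanity check confirms the expected layout; this sketch relies on the unique values of the two sample attributes:
>>> # 8 stimulus categories times 12 runs account for the 96 samples
>>> len(ds.sa['targets'].unique) * len(ds.sa['chunks'].unique)
96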
Instead of two samples per category, we now have one sample per category for each experiment run, hence 96 samples in the whole dataset. To set up a 12-fold leave-one-run-out cross-validation, we can make use of NFoldPartitioner. By default it is going to select samples from one chunk at a time:
>>> cvte = CrossValidation(clf, NFoldPartitioner(),
... errorfx=lambda p, t: np.mean(p == t))
>>> cv_results = cvte(ds)
>>> np.mean(cv_results)
0.78125
We get almost the same prediction accuracy (reusing the SVM classifier and our custom error function). Note that this time we performed the analysis on a lot more samples, each of which was computed from just a few fMRI volumes (about nine each).
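To see what the partitioner actually does, we can run it on the dataset ourselves. The sketch below assumes the default partitioner behavior of attaching a partitions sample attribute that marks training (1) and testing (2) samples:
>>> for fold in NFoldPartitioner().generate(ds):
...     # each fold is the full dataset plus the 'partitions' attribute
...     assert len(fold.sa['partitions'].unique) == 2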
So far we have just looked at the mean accuracy or error. Let’s investigate the results of the cross-validation analysis a bit further.
>>> type(cv_results)
<class 'mvpa2.datasets.base.Dataset'>
>>> print cv_results.samples
[[ 0.75 ]
[ 0.875]
[ 1. ]
[ 0.75 ]
[ 0.75 ]
[ 0.875]
[ 0.75 ]
[ 0.875]
[ 0.75 ]
[ 0.375]
[ 1. ]
[ 0.625]]
The returned value is actually a Dataset with the results for all cross-validation folds. Since our error function computes only a single scalar value for each fold, the dataset contains only a single feature (in this case the accuracy), and one sample per fold.
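Since this is a regular dataset, the fold-wise accuracies can be inspected with ordinary NumPy operations, for example to gauge their spread across folds (a small sketch):
>>> accs = cv_results.samples[:, 0]
>>> # standard deviation across folds, a crude stability indicator
>>> bool(accs.std() < accs.mean())
True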
By now we have already done a few cross-validation analyses using two different classifiers and different pre-processing strategies. In all these cases we have just looked at the generalization performance or error. However, error rates hide a lot of interesting information that is very important for an interpretation of results. In our case we analyze a dataset with eight different categories. An average misclassification rate doesn’t tell us much about the contribution of each category to the prediction error. It could be that half of the samples of each category get misclassified, but the same average error might be due to all samples from half of the categories being completely misclassified, while prediction accuracy for samples from the remaining categories is perfect. These two results would have to be interpreted in totally different ways, despite the same average error rate.
In psychological research this type of result is usually presented as a contingency table or cross-tabulation of expected vs. empirical results. Signal detection theory offers a whole range of techniques to characterize a classifier’s performance based on such tables. From this angle a classification analysis is hardly any different from a psychological experiment in which a human observer performs a detection task, hence the same analysis procedures can be applied here as well.
PyMVPA provides convenient access to confusion matrices, i.e. contingency tables of targets vs. actual predictions. However, to prevent wasting CPU-time and memory they are not computed by default, but instead have to be enabled explicitly. Optional analysis results like this are available in a dedicated collection of conditional attributes – analogous to sa and fa in datasets, it is named ca. Let’s see how it works:
>>> cvte = CrossValidation(clf, NFoldPartitioner(),
... errorfx=lambda p, t: np.mean(p == t),
... enable_ca=['stats'])
>>> cv_results = cvte(ds)
Via the enable_ca argument we triggered computing confusion tables for all cross-validation folds, but otherwise there is no change in the code. Afterwards the aggregated confusion for the whole cross-validation procedure is available in the ca collection. Let’s take a look (note that in the printed manual the output is truncated due to page width constraints – please refer to the HTML-based version for the full matrix).
>>> print cvte.ca.stats.as_string(description=True)
----------.
predictions\targets bottle cat chair face house scissors scrambledpix shoe
`------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ P' N' FP FN PPV NPV TPR SPC FDR MCC F1
bottle 6 0 3 0 0 5 0 1 15 75 9 6 0.4 0.92 0.5 0.88 0.6 0.34 0.44
cat 0 10 0 0 0 0 0 0 10 67 0 2 1 0.97 0.83 1 0 0.79 0.91
chair 0 0 7 0 0 0 0 0 7 73 0 5 1 0.93 0.58 1 0 0.66 0.74
face 0 2 0 12 0 0 0 0 14 63 2 0 0.86 1 1 0.97 0.14 0.8 0.92
house 0 0 0 0 12 0 0 0 12 63 0 0 1 1 1 1 0 0.87 1
scissors 2 0 1 0 0 6 0 0 9 75 3 6 0.67 0.92 0.5 0.96 0.33 0.48 0.57
scrambledpix 2 0 1 0 0 0 12 1 16 63 4 0 0.75 1 1 0.94 0.25 0.75 0.86
shoe 2 0 0 0 0 1 0 10 13 67 3 2 0.77 0.97 0.83 0.96 0.23 0.69 0.8
Per target: ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
P 12 12 12 12 12 12 12 12
N 84 84 84 84 84 84 84 84
TP 6 10 7 12 12 6 12 10
TN 69 65 68 63 63 69 63 65
Summary \ Means: ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ 12 68.25 2.62 2.62 0.81 0.96 0.78 0.96 0.19 0.67 0.78
CHI^2 442.67 p=2e-58
ACC 0.78
ACC% 78.12
# of sets 12 ACC(i) = 0.87-0.015*i p=0.3 r=-0.33 r^2=0.11
Statistics computed in 1-vs-rest fashion per each target.
Abbreviations (for details see http://en.wikipedia.org/wiki/ROC_curve):
TP : true positive (AKA hit)
TN : true negative (AKA correct rejection)
FP : false positive (AKA false alarm, Type I error)
FN : false negative (AKA miss, Type II error)
TPR: true positive rate (AKA hit rate, recall, sensitivity)
TPR = TP / P = TP / (TP + FN)
FPR: false positive rate (AKA false alarm rate, fall-out)
FPR = FP / N = FP / (FP + TN)
ACC: accuracy
ACC = (TP + TN) / (P + N)
SPC: specificity
SPC = TN / (FP + TN) = 1 - FPR
PPV: positive predictive value (AKA precision)
PPV = TP / (TP + FP)
NPV: negative predictive value
NPV = TN / (TN + FN)
FDR: false discovery rate
FDR = FP / (FP + TP)
MCC: Matthews Correlation Coefficient
MCC = (TP*TN - FP*FN)/sqrt(P N P' N')
F1 : F1 score
F1 = 2TP / (P + P') = 2TP / (2TP + FP + FN)
AUC: Area under (AUC) curve
CHI^2: Chi-square of confusion matrix
LOE(ACC): Linear Order Effect in ACC across sets
# of sets: number of target/prediction sets which were provided
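To connect these definitions with the table, take the bottle category as a worked example: of P = 12 bottle samples, TP = 6 were identified correctly, hence TPR = 6/12 = 0.5. At the same time FP = 9 samples of other categories were labeled as bottle, so PPV = TP / (TP + FP) = 6/15 = 0.4, just as listed in the corresponding row above.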
This output is a comprehensive summary of the performed analysis. We can see that the confusion matrix has a strong diagonal, and confusion happens mostly among small objects. In addition to the plain contingency table there are also a number of useful summary statistics readily available – including average accuracy.
Especially for datasets with many classes the matrix quickly becomes hard to comprehend. For these cases the confusion matrix can also be plotted via its plot() method. If the confusion values are to be used as input for further processing, they can also be accessed in pure matrix format:
>>> print cvte.ca.stats.matrix
[[ 6 0 3 0 0 5 0 1]
[ 0 10 0 0 0 0 0 0]
[ 0 0 7 0 0 0 0 0]
[ 0 2 0 12 0 0 0 0]
[ 0 0 0 0 12 0 0 0]
[ 2 0 1 0 0 6 0 0]
[ 2 0 1 0 0 0 12 1]
[ 2 0 0 0 0 1 0 10]]
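For the graphical rendering via plot() mentioned above, here is a sketch, assuming a working matplotlib setup (the suite imports pyplot as pl):
>>> plot_handles = cvte.ca.stats.plot(numbers=True)
>>> # pl.show() would render the figure in an interactive session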
The classifier confusions are just one example of the general mechanism of conditional attributes that is supported by many objects in PyMVPA.
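As a sketch of that generic mechanism (assuming the standard collection methods), conditional attributes can also be queried and toggled after an object has been created, not just via the enable_ca constructor argument:
>>> cvte.ca.is_enabled('stats')
True
>>> cvte.ca.enable(['stats'])  # a no-op here, but works at any time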
We just saw that it is possible to encapsulate a whole cross-validation analysis into a single object that can be called with any dataset to produce the desired results. We also saw that despite this encapsulation we can still get a fair amount of information about the performed analysis. However, what happens if we want to do some further processing of the data within the cross-validation analysis? That seems difficult, since we feed a whole dataset into the analysis, and it only gets split into the respective pieces internally.
Of course there is a solution to this problem – a meta-classifier. This is a classifier that doesn’t implement a classification algorithm on its own, but uses another classifier to do the actual work. In addition, the meta-classifier adds another processing step that is performed before the actual base-classifier sees the data.
An example of such a meta-classifier is MappedClassifier. Its purpose is simple: apply a mapper to both training and testing data before they are passed on to the internal base-classifier. With this technique it is possible to implement arbitrary pre-processing within a cross-validation analysis. If, for example, we want to perform the classification not on the voxel intensities themselves, but on the same samples in the space spanned by the singular vectors of the training data, it would look like this:
>>> baseclf = LinearCSVMC()
>>> metaclf = MappedClassifier(baseclf, SVDMapper())
>>> cvte = CrossValidation(metaclf, NFoldPartitioner())
>>> cv_results = cvte(ds)
>>> print np.mean(cv_results)
0.15625
First we notice that little has changed in the code, and the results remain comparable – the error is even slightly reduced. The critical line is the second one, where we create the MappedClassifier from the SVM classifier instance and an SVDMapper that implements singular value decomposition as a mapper.
Exercise
What might be the reasons for the error decrease in comparison to the results on the dataset with voxel intensities?
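As a starting point for this exercise, the decomposition itself can be inspected. A sketch, assuming SVDMapper exposes its singular values via the sv attribute:
>>> svdm = SVDMapper()
>>> svdm.train(ds)
>>> # fraction of the total variance captured by each SVD component
>>> var_frac = svdm.sv ** 2 / np.sum(svdm.sv ** 2)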
We know that mappers can be combined into complex processing pipelines, and since MappedClassifier takes any mapper as an argument, we can implement arbitrary preprocessing steps within the cross-validation procedure. Let’s say we have heard rumors that only the first two dimensions of the space spanned by the SVD vectors cover the “interesting” variance and the rest is noise. We can easily check that with an appropriate mapper:
>>> mapper = ChainMapper([SVDMapper(), StaticFeatureSelection(slice(None, 2))])
>>> metaclf = MappedClassifier(baseclf, mapper)
>>> cvte = CrossValidation(metaclf, NFoldPartitioner())
>>> cv_results = cvte(ds)
>>> svm_err = np.mean(cv_results)
>>> print round(svm_err, 2)
0.57
Well, obviously the discarded components cannot be mere noise, since the error increased substantially. But maybe it is the classifier that cannot deal with the transformed data. Since nothing in this code is specific to the actual classification algorithm, we can easily go back to the kNN classifier that has served us well in the past.
>>> baseclf = kNN(k=1, dfx=one_minus_correlation, voting='majority')
>>> mapper = ChainMapper([SVDMapper(), StaticFeatureSelection(slice(None, 2))])
>>> metaclf = MappedClassifier(baseclf, mapper)
>>> cvte = CrossValidation(metaclf, NFoldPartitioner())
>>> cv_results = cvte(ds)
>>> np.mean(cv_results) < svm_err
False
Oh, that was even worse. We would have to take a closer look at the data to figure out what is happening here.
Exercise
Inspect the confusion matrix of this analysis for both classifiers. What information is represented in the first two SVD components and what is not? Plot the samples of the full dataset after they have been mapped onto the first two SVD components. Why does the kNN classifier perform so badly in comparison to the SVM (hint: think about the distance function)?
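A possible starting point for the plotting part of this exercise, sketched with a separate mapper instance trained on the full dataset:
>>> svd2 = ChainMapper([SVDMapper(), StaticFeatureSelection(slice(None, 2))])
>>> svd2.train(ds)
>>> ds2 = svd2.forward(ds)
>>> ds2.shape
(96, 2)
>>> # pl.scatter(ds2.samples[:, 0], ds2.samples[:, 1]) would plot them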
In this tutorial part we took a look at classifiers. We have seen that regardless of the actual algorithm, all classifiers implement the same interface. Because of that, one classifier can be replaced by another without having to change any other part of the analysis code. Moreover, we have seen that it is possible to enable and access optional information that is offered by particular parts of the processing pipeline.
However, we have still done little to address one of the major questions in neuroscience research: Where does the information come from? One possible approach to this question is the topic of the next tutorial part.