Generic storage class for datasets with multiple attributes.
A dataset consists of four pieces. The core is a two-dimensional array that has variables (so-called features) in its columns and the associated observations (so-called samples) in its rows. In addition, a dataset may have any number of attributes for features and samples. Unsurprisingly, these are called ‘feature attributes’ and ‘sample attributes’. Each attribute is a vector of arbitrary datatype that contains one value per item (feature or sample). Both types of attributes are organized in their respective collections – accessible via the sa (sample attribute) and fa (feature attribute) attributes. Finally, a dataset itself may have any number of additional attributes (e.g. a mapper) that are stored in their own collection, accessible via the a attribute (see the examples below).
Notes
Any dataset might have a mapper attached that is stored as a dataset attribute called mapper.
Examples
The simplest way to create a dataset is from a 2D array.
>>> import numpy as np
>>> from mvpa2.datasets import *
>>> samples = np.arange(12).reshape((4,3))
>>> ds = AttrDataset(samples)
>>> ds.nsamples
4
>>> ds.nfeatures
3
>>> ds.samples
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
The above dataset can only be used for unsupervised machine-learning algorithms, since it doesn’t have any targets associated with its samples. However, creating a labeled dataset is equally simple.
>>> ds_labeled = dataset_wizard(samples, targets=range(4))
Both the labeled and the unlabeled dataset share the same samples array. No copying is performed.
>>> ds.samples is ds_labeled.samples
True
If the data should not be shared, the samples array has to be copied beforehand.
The targets are available from the sample attributes collection, but also via the convenience property targets.
>>> ds_labeled.sa.targets is ds_labeled.targets
True
If desired, it is possible to add an arbitrary amount of additional attributes. Regardless of their original sequence type, they will be converted into an array.
>>> ds_labeled.sa['lovesme'] = [0,0,1,0]
>>> ds_labeled.sa.lovesme
array([0, 0, 1, 0])
An alternative method to create datasets with arbitrary attributes is to provide the attribute collections to the constructor itself – which would also test for an appropriate size of the given attributes:
>>> fancyds = AttrDataset(samples, sa={'targets': range(4),
... 'lovesme': [0,0,1,0]})
>>> fancyds.sa.lovesme
array([0, 0, 1, 0])
Exactly the same logic applies to feature attributes as well.
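For example (a minimal sketch; the attribute name is made up for illustration), a feature attribute with one value per feature can be assigned in the same fashion:
>>> ds_labeled.fa['lucky'] = [1, 0, 1]
>>> ds_labeled.fa.lucky
array([1, 0, 1])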
Datasets can be sliced (selecting a subset of samples and/or features) similar to arrays. Selection is possible using boolean selection masks, index sequences, or slicing arguments. The following sample selection calls all result in the same dataset (note that a boolean mask needs one element per sample, i.e. four here):
>>> sel1 = ds[np.array([False, True, True, False])]
>>> sel2 = ds[[1,2]]
>>> sel3 = ds[1:3]
>>> np.all(sel1.samples == sel2.samples)
True
>>> np.all(sel2.samples == sel3.samples)
True
During selection, data is only copied if necessary. If the slicing syntax is used, the resulting dataset will share the samples with the original dataset.
>>> sel1.samples.base is ds.samples
False
>>> sel2.samples.base is ds.samples
False
>>> sel3.samples.base is ds.samples
True
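Assuming a NumPy recent enough to provide np.shares_memory, the view-versus-copy distinction can also be verified directly on the underlying arrays:
>>> np.shares_memory(sel3.samples, ds.samples)
True
>>> np.shares_memory(sel2.samples, ds.samples)
False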
For feature selection the syntax is very similar; features are simply represented on the second axis of the samples array. Plain feature selection is achieved by keeping all samples and selecting a subset of features (all syntax variants for sample selection are also supported for feature selection).
>>> fsel = ds[:, 1:3]
>>> fsel.samples
array([[ 1,  2],
       [ 4,  5],
       [ 7,  8],
       [10, 11]])
It is also possible to simultaneously select a subset of samples and features. If the slicing syntax is used, no copying will be performed.
>>> fsel = ds[:3, 1:3]
>>> fsel.samples
array([[1, 2],
       [4, 5],
       [7, 8]])
>>> fsel.samples.base is ds.samples
True
Please note that simultaneous selection of samples and features is not always congruent to array slicing.
>>> ds[[0,1,2], [1,2]].samples
array([[1, 2],
       [4, 5],
       [7, 8]])
In contrast, the analogous array call ds.samples[[0,1,2], [1,2]] would not be possible: NumPy would try to broadcast the two index sequences against each other and fail due to their mismatching lengths. In AttrDatasets, selection of samples and features is always applied individually and independently to each axis.
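NumPy's np.ix_ helper reproduces the dataset's independent per-axis selection on the raw array, by building an open mesh from the two index sequences:
>>> np.all(ds.samples[np.ix_([0, 1, 2], [1, 2])] == ds[[0, 1, 2], [1, 2]].samples)
True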
A Dataset might have an arbitrary number of attributes for samples, features, or the dataset as a whole. However, only the data samples themselves are required.
Parameters
    samples : ndarray
    sa : SampleAttributesCollection
    fa : FeatureAttributesCollection
    a : DatasetAttributesCollection
Methods
aggregate_features(dataset[, fx])
    Apply a function to each row of the samples matrix of a dataset.
append(other)
    Append the content of a Dataset.
coarsen_chunks(source[, nchunks])
    Change the chunking of the dataset.
copy([deep, sa, fa, a, memo])
    Create a copy of a dataset.
from_hdf5(source[, name])
    Load a Dataset from an HDF5 file.
get_nsamples_per_attr(dataset, attr)
    Returns the number of samples per unique value of a sample attribute.
get_samples_by_attr(dataset, attr, values[, ...])
    Return indices of samples given a list of attribute values.
get_samples_per_chunk_target(dataset[, ...])
    Returns an array with the number of samples per target in each chunk.
init_origids(which[, attr, mode])
    Initialize the dataset's 'origids' attribute.
random_samples(dataset, npertarget[, ...])
    Create a dataset with a random subset of samples.
remove_invariant_features(dataset)
    Returns a new dataset with all invariant features removed.
save(dataset, destination[, name, compression])
    Save a Dataset into an HDF5 file.
summary(dataset[, stats, lstats, sstats, ...])
    String summary over the object.
summary_targets(dataset[, targets_attr, ...])
    Provide summary statistics over the targets and chunks.
Apply a function to each row of the samples matrix of a dataset.
The function given as fx has to honour an axis keyword argument in the way that NumPy uses it (e.g. np.mean, np.var).

Returns
    A new `Dataset` object with the aggregated feature(s).
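As a usage sketch (assuming aggregate_features is importable from mvpa2.datasets.miscfx, where PyMVPA collects these dataset helpers), averaging across the feature axis collapses the dataset to a single feature:
>>> from mvpa2.datasets.miscfx import aggregate_features
>>> aggregate_features(ds, np.mean).nfeatures
1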
Append the content of a Dataset.
Parameters
    other : AttrDataset
Notes
No dataset attributes or feature attributes will be merged! These respective properties of the other dataset are neither checked for compatibility nor copied over to this dataset. However, all sample attributes will be concatenated with the existing ones.
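A minimal sketch of appending (note that append operates in place on the dataset it is called on):
>>> ds_a = dataset_wizard(np.zeros((2, 3)), targets=[0, 1])
>>> ds_b = dataset_wizard(np.ones((2, 3)), targets=[0, 1])
>>> ds_a.append(ds_b)
>>> ds_a.nsamples
4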
Change the chunking of the dataset.
Group chunks into larger groups to match the desired number of chunks. This makes sense if originally there was no strong grouping into chunks, or if each sample was independent and thus belonged to its own chunk.
Parameters
    source : Dataset or list of chunk ids
    nchunks : int
Create a copy of a dataset.
By default this returns a deep copy of the dataset, hence no data is shared between the original dataset and its copy.
Parameters
    deep : boolean, optional
    sa : list or None
    fa : list or None
    a : list or None
    memo : dict
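For example, the default deep copy does not share the samples array with the original:
>>> ds_copy = ds.copy()
>>> ds_copy.samples is ds.samples
False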
Load a Dataset from an HDF5 file.
Parameters
    source : string or h5py.highlevel.File
    name : string, optional

Returns
    AttrDataset

Raises
    ValueError
Returns the number of samples per unique value of a sample attribute.
Parameters
    attr : str

Returns
    dict with the number of samples (value) per unique attribute value (key).
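A usage sketch (assuming the helper is importable from mvpa2.datasets.miscfx): with the labeled dataset from above, each of the four targets occurs exactly once:
>>> from mvpa2.datasets.miscfx import get_nsamples_per_attr
>>> get_nsamples_per_attr(ds_labeled, 'targets')[0]
1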
Return indices of samples given a list of attribute values.
Returns an array with the number of samples per target in each chunk.
Array shape is (chunks x targets).
Parameters
    dataset : Dataset
Initialize the dataset’s ‘origids’ attribute.
The purpose of origids is to allow tracking the identity of a feature or a sample throughout the lifetime of a dataset (i.e. across subsequent feature selections).
Calling this method will overwrite any potentially existing IDs.
Parameters
    which : {'features', 'samples', 'both'}
    attr : str
    mode : {'existing', 'new', 'raise'}, optional

Raises
    RuntimeError
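A minimal sketch (assuming the default attr of 'origids'): after initialization, both item axes carry one ID per item:
>>> ds.init_origids('both')
>>> len(ds.sa.origids) == ds.nsamples
True
>>> len(ds.fa.origids) == ds.nfeatures
True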
len(object) -> integer
Return the number of items of a sequence or mapping.
Create a dataset with a random subset of samples.
Parameters
    dataset : Dataset
    npertarget : int or list
    targets_attr : str, optional

Returns
    Dataset
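A usage sketch (assuming random_samples is importable from mvpa2.datasets.miscfx): drawing one sample per target from the labeled dataset above yields four samples, since each target occurs once:
>>> from mvpa2.datasets.miscfx import random_samples
>>> random_samples(ds_labeled, npertarget=1).nsamples
4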
Returns a new dataset with all invariant features removed.
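A usage sketch (assuming remove_invariant_features is importable from mvpa2.datasets.miscfx): of the three features below, only the middle one varies across samples, so only it survives:
>>> from mvpa2.datasets.miscfx import remove_invariant_features
>>> noisy = dataset_wizard(np.array([[1, 0, 2], [1, 5, 2]]), targets=[0, 1])
>>> remove_invariant_features(noisy).nfeatures
1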
Save a Dataset into an HDF5 file.
Parameters
    dataset : Dataset
    destination : h5py.highlevel.File or str
    name : str, optional
    compression : None or int or {'gzip', 'szip', 'lzf'}, optional
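A hypothetical save/load roundtrip (skipped here, since it touches the filesystem and requires h5py; the file path is made up):
>>> ds.save('/tmp/ds.hdf5')  # doctest: +SKIP
>>> ds_restored = AttrDataset.from_hdf5('/tmp/ds.hdf5')  # doctest: +SKIP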
String summary over the object
Parameters
    stats : bool
    lstats : 'auto' or bool
    sstats : 'auto' or bool
    idhash : bool
    targets_attr : str, optional
    chunks_attr : str, optional
    maxt : int
    maxc : int
Provide summary statistics over the targets and chunks
Parameters
    dataset : Dataset
    targets_attr : str, optional
    chunks_attr : str, optional
    maxc : int
    maxt : int
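Both summaries are typically inspected interactively; the exact output depends on the dataset at hand, so it is skipped here:
>>> print(ds_labeled.summary())  # doctest: +SKIP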