Neurobiological Data In OpenBiomind - GSoC 2009
OpenBiomind is already a powerful tool for geneticists. However, it need not be limited to analyzing gene sequences, microarray data, and related genetic datasets. Neurobiologists are also generating large datasets. I propose to extend OpenBiomind to analyze these neurobiological data.
- Code: http://code.google.com/p/openbiomind
- List: http://groups.google.com/group/openbiomind
- License: GPLv2
- Language: Java
Ensure that OpenBiomind's dependencies are available. OpenBiomind.jar should be in the same directory as a folder called 'lib', which should contain 'commons-math-2.0.jar' and 'niftijlib.jar'. Alternatively, it is possible to add each dependency to the classpath, using the environment variable CLASSPATH or the java command-line option '-cp'.
'OpenBiomind.jar' and 'pipeline.parameters' must also be in the classpath.
Assuming that 'OpenBiomind.jar', 'pipeline.parameters', and 'lib' are all in the current working directory:
java -cp OpenBiomind.jar:. task.<OpenBiomind command> <command-line options> [optional command-line options]
fMRI data format
OpenBiomind supports the NIfTI-1 data format, which is also backwards-compatible with the Analyze 7.5 format.
NIfTI-1 data files support dual storage: a dataset and its header can be stored together in a single file (.nii) or separately (.hdr and .img). The header of a NIfTI-1 data set describes the meaning of the actual data. OpenBiomind assumes that this information shows that the data is drawn from a typical whole-brain fMRI time series. For instance, the following header fields indicate that the brain has been divided into a 3D array of 64x64x20 voxels, with 120 time steps:
- dim[0] = 4 (i.e. scalar data; OpenBiomind does not support vector data)
- dim[1] = 64
- dim[2] = 64
- dim[3] = 20
- dim[4] = 120
Another set of parameters of interest is the units of the spatial and temporal dimensions. To train a classifier on one dataset and then apply it to another, both datasets must have the same voxel and time course dimensions. For instance, this dataset has voxels of size 3.75x3.75x5 mm, with each volume collected at 2-second intervals:
- pixdim[1] = 3.75
- pixdim[2] = 3.75
- pixdim[3] = 5.0
- pixdim[4] = 2.0
- xyzt_units = NIFTI_UNITS_MM | NIFTI_UNITS_SEC
Official NIfTI-1 specification and documentation: http://nifti.nimh.nih.gov/nifti-1/
OpenBiomind uses the niftijlib library for reading NIfTI-1 files: http://niftilib.sourceforge.net/
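The header fields above sit at fixed byte offsets defined by the NIfTI-1 specification (dim is eight signed shorts at offset 40, pixdim is eight floats at offset 76, the magic string is at offset 344). A minimal Python sketch, independent of niftijlib and assuming little-endian byte order, shows how the 64x64x20x120 example header can be built and parsed:

```python
import struct

def parse_nifti1_header(raw):
    """Parse the dim and pixdim arrays from a 348-byte NIfTI-1 header.

    Offsets follow the NIfTI-1 specification: dim is 8 signed shorts at
    byte 40, pixdim is 8 floats at byte 76, the magic string is at 344.
    Little-endian byte order is assumed here; real files may be either.
    """
    dim = struct.unpack_from("<8h", raw, 40)
    pixdim = struct.unpack_from("<8f", raw, 76)
    magic = raw[344:348]
    return dim, pixdim, magic

# Build a minimal header for the 64x64x20x120 example dataset.
hdr = bytearray(348)
struct.pack_into("<i", hdr, 0, 348)                                   # sizeof_hdr
struct.pack_into("<8h", hdr, 40, 4, 64, 64, 20, 120, 1, 1, 1)         # dim
struct.pack_into("<8f", hdr, 76, 0.0, 3.75, 3.75, 5.0, 2.0, 0, 0, 0)  # pixdim
hdr[344:348] = b"n+1\x00"                                             # single-file (.nii) magic

dim, pixdim, magic = parse_nifti1_header(bytes(hdr))
print(dim[0], dim[1:5])      # 4 (64, 64, 20, 120)
print(pixdim[1:5])           # (3.75, 3.75, 5.0, 2.0)
```

In practice niftijlib does this work for OpenBiomind; the sketch only makes the header layout concrete.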
Before it is ready for classification, fMRI data needs to undergo a series of preprocessing steps. One of the biggest problems with collecting fMRI data is noise from head movement during data collection. Various methods are used to minimize head movement, but they cannot eliminate it entirely. The most important preprocessing step is detecting and correcting for this movement.
People's brains are different shapes and sizes. To compare data taken from different subjects, it is necessary to transform the images into a standard coordinate system. The most common procedure is to map the data to the Talairach stereotactic coordinate system, but other standards (such as the MNI template) exist.
Other tasks during preprocessing can include steps such as spatial and temporal smoothing and detrending.
The order and type of preprocessing steps necessary to prepare a dataset for analysis are beyond the scope of this document. There are many software packages capable of performing these tasks, such as FSL and SPM.
Further reading:
- Functional Magnetic Resonance Imaging (fMRI), by Robert L. Savoy, Ph.D.
- Overview of fMRI analysis, by S. M. Smith, MA, DPhil
The rest of this documentation will use a single-patient dataset from the study Haxby et al. (2001), available here: http://www.pymvpa.org/examples.html
This dataset has the advantage of already being preprocessed. It contains the following files (from the PyMVPA documentation at http://www.pymvpa.org/examples.html):
- bold.nii.gz - The motion-corrected and skull-stripped 4D timeseries (1452 volumes with 40 x 64 x 64 voxels, corresponding to a voxel size of 3.5 x 3.75 x 3.75 mm and a volume repetition time of 2.5 seconds). The timeseries contains all 12 runs of the original experiment, concatenated in a single file. Please note, that the timeseries signal is not detrended.
- bold_mc.par - The motion correction parameter output. This is a 6-column textfile with three rotation and three translation parameters respectively. This information can be used e.g. as additional regressors for motion-aware timeseries detrending.
- mask.nii.gz - A binary mask with a conservative brain outline estimate, i.e. including some non-brain voxels to prevent the exclusion of brain tissue.
- attributes_literal.txt - A two-column text file with the stimulation condition and the corresponding experimental run for each volume in the timeseries image. The labels are given in literal form (e.g. ‘face’).
- attributes.txt - Similar to attributes_literal.txt, but with the condition labels encoded as integers. This file is only provided for earlier PyMVPA version, that could not handle literal labels.
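As noted above, the timeseries signal in this dataset is not detrended. Temporal detrending, one of the preprocessing steps mentioned earlier, can be done per voxel; a minimal least-squares linear detrend sketch (not OpenBiomind code) looks like this:

```python
def detrend(series):
    """Remove a least-squares linear trend from one voxel's timeseries."""
    n = len(series)
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    # Slope = cov(t, y) / var(t) for time indices t = 0..n-1.
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(series)]

# A pure linear ramp detrends to all zeros.
print(detrend([1.0, 2.0, 3.0, 4.0]))  # [0.0, 0.0, 0.0, 0.0]
```

Real pipelines (FSL, SPM) offer more sophisticated detrending, e.g. using the motion parameters from bold_mc.par as regressors.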
Converting NIfTI-1 to OpenBiomind data format
Use the ConvertNifti task to convert a NIfTI-1 dataset to a vertical OpenBiomind dataset. In addition to the NIfTI-1 dataset itself, this feature also requires a text file containing the category labels and run number for each time slice, one per line.
For example, if a dataset contains four hundred time slices, the category label file would contain four hundred lines. The first few lines of "attributes_full.txt" look like this:
rest 0
rest 0
rest 0
rest 0
rest 0
rest 0
scissors 0
scissors 0
scissors 0
scissors 0
scissors 0
scissors 0
scissors 0
scissors 0
scissors 0
This category labels file shows that in the first run (run "0"), the patient was resting for the first six volumes, then shown a picture of scissors for the next nine volumes.
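The label/run layout above is simple to work with directly; this sketch (not part of OpenBiomind) parses such a file and tallies how many volumes each condition received:

```python
from collections import Counter

def read_attributes(lines):
    """Parse 'label run' pairs, one per volume, as in attributes_full.txt."""
    return [(label, int(run))
            for label, run in (ln.split() for ln in lines if ln.strip())]

sample = """rest 0
rest 0
rest 0
scissors 0
scissors 0
face 1
"""
attrs = read_attributes(sample.splitlines())
print(attrs[0])                                 # ('rest', 0)
print(Counter(label for label, _ in attrs))     # counts per condition
```

Such a tally is a quick sanity check that the attributes file really has one line per time slice before running task.ConvertNifti.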
The command to convert a dataset is task.ConvertNifti.
java -cp OpenBiomind.jar:. task.ConvertNifti -i bold -a attributes_full.txt -o bold.txt [-m mask -i ignore.txt]
This command converts a NIfTI-1 dataset named "bold" to a vertical OpenBiomind dataset, bold.txt.
When specifying a dataset for OpenBiomind to convert, provide its name without the extension. OpenBiomind will automatically discover whether the header is separate or not. (For instance, if converting a dataset with header "bold.hdr" and data file "bold.img", just use "bold"). OpenBiomind can also handle gzipped datafiles. (In our example, "bold.nii.gz" is also provided to task.ConvertNifti as "bold").
In addition to the required arguments, ConvertNifti can also take a mask file and an ignore file. Both of these are for restricting the size of the OpenBiomind dataset.
The mask file is for defining a region of interest. It is a NIfTI-1 dataset with the same 3D dimensions as the data to be converted. If a voxel in the mask has a value greater than 0, that voxel will be included. All other voxels are ignored. For instance, a mask file with "1" for the voxels in the anterior half of the brain, and "0" elsewhere, would result in an OpenBiomind dataset that only includes the anterior voxels.
The ignore file is a text file with one category label per line. All samples with these category labels are omitted from the OpenBiomind dataset.
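The mask selection rule described above — keep a voxel only where the mask value is greater than 0 — can be sketched in a few lines (using flat lists for brevity; a real dataset is a 3D array):

```python
def apply_mask(volume, mask):
    """Keep only the voxels where the mask value is > 0.

    volume and mask are flat lists of equal length; the flattened view
    is enough to show the selection rule a NIfTI-1 mask file encodes.
    """
    return [v for v, m in zip(volume, mask) if m > 0]

volume = [10.0, 11.0, 12.0, 13.0]
mask   = [1,    0,    1,    0]      # e.g. "1" for anterior voxels, "0" elsewhere
print(apply_mask(volume, mask))     # [10.0, 12.0]
```

The ignore file works the same way along the other axis: it drops whole samples (time slices) by category label instead of dropping voxels.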
Normalizing the data
Before training classifiers, it can be useful to normalize data, to reduce the effect of large variations in signal strength. The command is task.Normalize:
java -cp OpenBiomind.jar:. task.Normalize -i bold.txt -t row -o bold-normalized-row.txt [-d working directory]
This command takes the dataset "bold.txt", normalizes it so that each row has mean 0 and standard deviation 1, and outputs the result as bold-normalized-row.txt. The "-t" parameter also takes "column", in which case each column of the output has mean 0 and standard deviation 1. Both types of normalization should be tried, as one may give better results than the other.
This command may need to output some temporary files during the normalization process. The optional parameter "-d" tells OpenBiomind where to store those files.
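The row normalization that task.Normalize performs can be sketched as follows. Whether OpenBiomind divides by n or n-1 when computing the standard deviation is not documented here, so this sketch assumes the population (1/n) form:

```python
import math

def normalize_row(row):
    """Scale one feature row to mean 0 and standard deviation 1."""
    n = len(row)
    mean = sum(row) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in row) / n)  # population std assumed
    return [(x - mean) / std for x in row]

row = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = normalize_row(row)
# After normalization the row has mean 0 and variance 1.
print(sum(normalized), sum(v * v for v in normalized) / len(row))
```

Column normalization is the same computation applied down each column instead of across each row.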
Further reading: Machine learning classifiers and fMRI: A tutorial overview by Francisco Pereira, Tom Mitchell, Matthew Botvinick
Switching between vertical and horizontal data formats
The default OpenBiomind dataset is a horizontal dataset, which means that columns represent samples, and rows represent features. In the case of fMRI data, each column represents one sample volume (labeled with its stimulus condition), and each row represents one voxel.
OpenBiomind also supports vertical datasets, in which the columns and rows have been switched: columns represent features, and rows represent samples.
Although these different types of datasets contain the exact same information, certain OpenBiomind commands require horizontal datasets, and others require vertical. These constraints are caused by the extremely large size of fMRI datasets.
It is simple to convert from a vertical dataset to horizontal, and vice versa:
java -cp OpenBiomind.jar:. task.Swap -i vertical-data.txt -o horizontal-data.txt [-d working directory]
java -cp OpenBiomind.jar:. task.Swap -i horizontal-data.txt -o vertical-data.txt [-d working directory]
The task.ConvertNifti command outputs a vertical dataset. Running task.Swap on this output rotates it to a horizontal dataset.
Performing column normalization on a dataset is the same as rotating it, performing row normalization, and then rotating it back. In fact, this is exactly what happens behind the scenes.
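The swap operation and the column/row normalization equivalence are easy to demonstrate. This toy sketch (not OpenBiomind's implementation, which streams files line by line) transposes an in-memory dataset and shows that swap, row-normalize, swap-back yields columns with mean 0 and standard deviation 1:

```python
import math

def swap(dataset):
    """Transpose rows and columns (what task.Swap does to a dataset)."""
    return [list(col) for col in zip(*dataset)]

def normalize_rows(dataset):
    """Row normalization: each row gets mean 0 and (population) std 1."""
    out = []
    for row in dataset:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
        out.append([(x - mean) / std for x in row])
    return out

data = [[1.0, 2.0],
        [3.0, 6.0]]
# Column normalization == swap, row-normalize, swap back.
print(swap(normalize_rows(swap(data))))  # [[-1.0, -1.0], [1.0, 1.0]]
```

Each column of the result has mean 0 and standard deviation 1, exactly as if it had been column-normalized directly.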
Dealing with large datasets
fMRI datasets can be several orders of magnitude larger than gene expression data. Often they are too large to fit into memory at once, especially if the data has been split into multiple folds. There are techniques available to deal with such large datasets in OpenBiomind.
Certain functionality that already existed in OpenBiomind has been reimplemented to handle these large datasets. For instance, OpenBiomind already supported vertical datasets and normalization, but these features required reading the entire dataset into memory. The Swap and Normalize tasks read datasets line by line, so they can handle much larger files.
OpenBiomind has two classes related to holding datasets: the Dataset class, which refers to a single dataset, and the FoldHolder class, which holds multiple datasets. By default, they try to hold everything in memory. However, for large datasets this is impossible. In this case, these classes have a "lazy" version that reads the data from disk whenever it is needed.
Use the following guidelines to select the appropriate memory configuration:
- Datasets are small enough that all folds fit into memory: use "-foldholderType full" and "-datasetType full". This is the default. Requires horizontal datasets as input.
- Datasets are small enough to fit in memory, but folds of multiple datasets are too large: use "-foldholderType lazy" and "-datasetType full". Again, this requires horizontal datasets.
- Datasets are too large to fit into memory: use "-foldholderType full" and "-datasetType lazy". Note that this configuration is incredibly slow, as the datasets must be continually streamed from the hard disk. To use the LazyDataset class, the input must be a vertical dataset.
- Datasets are so large that even folds of lazy datasets are too big to fit into memory: use "-foldholderType lazy" and "-datasetType lazy". Again, this requires vertical datasets as input. This setting is so slow that it is basically unusable. If you find yourself considering these settings, consider some other options, such as reducing the size of your dataset, buying more RAM, or contributing optimization code to OpenBiomind.
Again, note that "-datasetType full" requires horizontal datasets, and "-datasetType lazy" requires vertical datasets.
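The idea behind the lazy dataset type — hold only the current row in memory rather than the whole file — can be sketched with a generator. The two-column line layout used here is purely illustrative, not OpenBiomind's actual on-disk format:

```python
import io

def lazy_rows(fileobj):
    """Yield one sample row at a time instead of loading the whole dataset.

    Only the current line lives in memory, so arbitrarily large files can
    be traversed; the trade-off is re-reading the file on every pass,
    which is why the lazy configurations are slow.
    """
    for line in fileobj:
        fields = line.split()
        yield fields[0], [float(v) for v in fields[1:]]

# A tiny vertical dataset: one sample per line (label, then voxel values).
text = "rest 0.1 0.2\nface 0.3 0.4\n"
for label, values in lazy_rows(io.StringIO(text)):
    print(label, values)
```

This also makes clear why the lazy classes want vertical input: one line must correspond to one sample, so samples can be consumed without assembling whole columns in memory.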
OpenBiomind is written in Java, which means that even these memory optimizations are not guaranteed to work; they depend on the performance of the garbage collector and the size of the heap. It is worthwhile to try tuning the garbage collector and heap behaviour.
The following JVM command-line options set the size of the heap. If you know how much memory OpenBiomind will need to process your datasets, set the initial and maximum heap size appropriately.
-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size
-Xss<size> set java thread stack size
There are many command-line options for tuning the garbage collector. For multi-core machines, it may be useful to enable the multithreaded (parallel) garbage collector:
-XX:+UseParallelGC   enable the parallel garbage collector
There are many other options for fine-tuning the garbage collector, depending on your specific needs. For more information, visit http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html
Reducing dataset size
See again the section Converting NIfTI-1 to OpenBiomind data format for information on using a mask file to select regions of interest, and an ignore file to exclude certain samples from the dataset. Both of these are crude ways to reduce the dataset size.
In addition to these methods, the various ways to reduce the size of datasets can be classified broadly into two categories: feature selection and dimensionality reduction.
Feature selection restricts a dataset to the features (voxels) that are likely to be the most useful for classification. All other features are discarded. OpenBiomind supports differentiation and SAM feature selection, using the DatasetTransformer task.
Dimensionality reduction attempts to transform the features of a dataset into a new, smaller set of features that still encapsulates most of the information from the original. It is often a useful precursor to fMRI classification. Currently, OpenBiomind does not have any dimensionality reduction algorithms; they are slated for inclusion in a future version. In the meantime, there are third-party libraries that provide these algorithms.
One very useful algorithm for dimensionality reduction is Principal Component Analysis, or PCA.
Folding and running a MetaTask
After the dataset has been converted to an OpenBiomind dataset and normalized, it can be folded as normal, with or without feature selection. These folds can then be used in a MetaTask, using any of the classification methods available in OpenBiomind.
java -cp OpenBiomind.jar:. task.DatasetTransformer -d dataset -o folds_directory [-targetCategory] [-numberOfFolds] [-testDataset] [-numberOfSelectedFeatures] [-featureSelectionMethod] [-foldholderType full|lazy] [-datasetType full|lazy]
java -cp OpenBiomind.jar:. task.MetaTask -d folds_directory -o result_directory [-targetCategory] [-numberOfTasks] [-classificationMethod] [-metataskShuffling] [-foldholderType full|lazy] [-datasetType full|lazy]
SVM, or Support Vector Machine, is another type of classifier often useful in fMRI classification. OpenBiomind can convert datasets to the format used by popular SVM libraries, such as SVM-Light and libsvm.
java -cp OpenBiomind.jar:. task.ConvertToSvm -i input_dataset -o output_svm_dataset [-targetCategory] [-skipZeros yes|no]
The optional command-line options are useful depending on which SVM library is to be used. The '-targetCategory' option should be used for SVM-Light, which requires that the target be labeled with a '1', and all other categories labeled with '-1'. SVM-Light datasets may also skip any features with value zero, so use '-skipZeros yes' in this case.
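The conversion rules just described — target category mapped to +1, everything else to -1, 1-based feature indices, zero-valued features optionally skipped — produce the standard SVM-Light line format. A sketch of one sample's conversion (illustrative, not task.ConvertToSvm's code):

```python
def to_svmlight(label, features, target_category, skip_zeros=True):
    """Render one sample as an SVM-Light data line.

    The target category becomes +1, everything else -1; feature indices
    are 1-based, and zero-valued features may be omitted entirely.
    """
    y = "+1" if label == target_category else "-1"
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1)
             if not (skip_zeros and v == 0.0)]
    return " ".join([y] + pairs)

print(to_svmlight("face", [0.0, 1.5, 0.0, -2.0], target_category="face"))
# +1 2:1.5 4:-2.0
```

libsvm accepts the same sparse index:value layout, which is why one converter can serve both libraries.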
Putting it all together
One recommended workflow is to generate pairwise datasets of only two category labels each, then compare the results of classifying each dataset.
From Pereira, Mitchell, and Botvinick 2008: "Hence, it might make more sense to consider all pairs of classes, train and test a classifier for each pairwise distinction and produce... a confusion matrix... Often this reveals groups of classes that are hard to distinguish from each other but that are distinguishable from another group (e.g. several small object stimuli versus animal and human faces)."
Creating this kind of pairwise dataset requires all of the concepts covered in this document.
1. Use ConvertNifti to create an OpenBiomind dataset from a NIfTI-1 dataset. Use the ignore file to exclude all but two category labels.
2. Reduce the dataset using feature selection or dimensionality reduction.
3. Perform row and/or column normalization.
4. Optionally swap the dataset, depending on the '-datasetType' needed.
5. Run a MetaTask.
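Enumerating the pairwise datasets is straightforward: one ignore file per unordered pair of labels, listing every other label. A sketch of that enumeration (the label set here is illustrative):

```python
from itertools import combinations

labels = ["face", "house", "cat", "scissors"]

# One dataset (and one classifier) per unordered pair of category labels;
# the ignore file for each pair lists every *other* label.
for a, b in combinations(labels, 2):
    ignore = [lab for lab in labels if lab not in (a, b)]
    print(f"{a} vs {b}: ignore {ignore}")
```

With k labels this yields k*(k-1)/2 pairwise runs — 6 for the four labels above — whose accuracies fill the off-diagonal cells of a confusion-style matrix.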
Community Bonding Period
Week of May 3 (2009, May 3 - May 9)
- Set up OpenBiomind
- Built from source
- Worked through tutorial
- Set up project in Netbeans
- Set up this wiki page
- Downloaded publications on MEG and fMRI analysis
- Collected textbooks on machine learning
- Downloaded other software tools for comparison
Week of May 10 (2009, May 10 - May 16)
- I decided to focus on fMRI for the time being. I have had an easier time finding data and resources for analyzing fMRI data than for EEG/MEG data.
- Installed Matlab and SPM
- Studied fMRI formats, especially NIfTI-1 and Analyze
- NIfTI-1 seems a good choice for OpenBiomind support
- Analyze 7.5 is very popular; NIfTI-1 is backwards-compatible with Analyze
- Many tools, such as SPM, FmribSoftwareLibrary, and AFNI, use NIfTI-1 as their default format.
- Tools exist to convert between other popular formats
- Downloaded the NIfTI reference library
- It would be easiest to simply include this library in OpenBiomind. This is permitted by the license: it is in the public domain.
- Preprocessed SPM sample data, following the documentation
- Bought reference textbooks
Week of May 17 (2009, May 17 - May 23)
- Studied the NIfTI-1 specification, header file, and publications.
- Downloaded niftijlib, a Java library for working with NIfTI-1 datasets.
- Integrated niftijlib into OpenBiomind.
- Read papers specifically on classification of fMRI data using machine learning.
Week of May 24 (2009, May 24 - May 30)
- Switched to different dataset for introductory work. I made this decision for a couple reasons:
- Ease of working with epoch-based data.
- Already preprocessed.
- Already used in classification studies.
- Successfully imported data into OpenBiomind using niftijlib. Of course, it was not in any format OpenBiomind could use. I spent the rest of the week writing new classes, such as linear classifiers, to allow OpenBiomind's genetic algorithms to run on the imported datasets.
Week of May 31 (2009, May 31 - June 6)
- Spent more time trying to understand OpenBiomind's internals and finesse the imported data into a usable form.
- Halfway through the week, Lucio persuaded me to first try converting the data to an external OpenBiomind-compatible text file.
- Successfully wrote the code for this conversion, but it was very inefficient. I spent a few days optimizing and refactoring the code. Despite improvements, it was still too slow.
- Lucio de Souza Coelho and Ben Goertzel recommended adding a data compression step, such as principal component analysis (PCA) to the preprocessing pipeline before converting to OpenBiomind format. This will be my primary goal next week.
Week of June 7 (2009, June 7 - June 13)
- Successfully optimized the conversion code to run in a reasonable amount of time. However, data reduction will still be necessary, because of the size of the dataset.
- Optimization required output of "vertical" data file. Will need to transform this into a "horizontal" data file by switching rows and columns.
- Studied means of data reduction with Principal Component Analysis
- Attempted with both PyMVPA and MVPA in Matlab, but both have memory issues
Week of June 14 (2009, June 14 - June 20)
- Investigated data reduction algorithms, in addition to PCA:
- Independent Component Analysis
- Support Vector Decomposition Machine (SVDM)
- GLM-based algorithms
- "Harel and Koren" (suggested by Ben Goertzel)
- Modified the NIfTI dataset code to allow specifying regions of interest using a binary mask. This allows a crude form of data reduction by ignoring certain regions. It is also useful for excluding noise that is outside of the brain volume. Using this with a mask for the sample dataset I've been working with cuts the disk size by a third.
- Wrote a simple function to reverse columns and rows in OpenBiomind datasets.
- First tried using vertical clustering (recommended by Lucio), but it was not possible to read datasets as a vertical file due to some assumptions about dataset formats.
Week of June 21 (2009, June 21 - June 27)
- Finished optimizations; successfully ran a pipeline all the way through the folding step on a full-sized dataset
- Still failing with memory errors during the metatask step.
- Began refactoring FoldHolder to optimize for memory usage when using large datasets
- Instead of reading entire set of folds, only read one fold at a time, when needed
- Additional overhead for runs with small datasets, but avoids memory errors for large datasets.
Week of June 28 (2009, June 28 - July 4)
- Completed refactoring of FoldHolder. Newly optimized code successfully completes one run of metatask, but still runs out of memory on subsequent runs.
- Leaking memory is one possible explanation. I am currently running profiles of memory usage and collecting heap dumps, to find opportunities for optimization.
- Garbage collection tuning is another possible solution. Currently experimenting with different configurations.
Week of July 5 (2009, July 5 - July 11)
- Revised goals for the rest of Summer of Code:
- Essential goals:
- Finish memory optimizations. I'd like to be able to stream data from the disk, so that OB can deal with arbitrarily large datasets.
- Implement a couple of the most promising data reduction algorithms, such as PCA, ICA, ROI, etc.
- Write or integrate some new classifiers (especially SVM, but I'm also thinking of GNB, LDA, or even some nonlinear classifiers if they seem worth it)
- Code to normalize fMRI data (Pereira, Mitchell, Botvinick, NeuroImage 2009)
- Ability to deal with epoch-based data
- Also, here are a few things that I don't think are vital, but would be nice to work on if I have the time:
- visualizations (such as accuracy maps)
- integrate with GUI from past GSOC
- support for GLM to predict voxels for classification
- Began investigating incorporating libsvm, to allow SVM classification
- Wrote code for column and row normalization of data.
- Wrote a simple class to subset datasets, ignoring some classes. This is a crude but effective way to reduce the size of the dataset, since OpenBiomind works primarily with binary classifiers.
Week of July 12 (2009, July 12 - July 18)
- Continued work on SVM classification.
- Began cleaning up the code for potential release.
- Wrote javadocs for my new classes
- Fixed many small "todo's" throughout the code, especially where I hardcoded values in the interest of time.
- Improved error handling, especially for dealing with file streams.
- Wrote test suites for the Swap, Normalize, and ConvertNifti classes.
- Began writing a tutorial for using OpenBiomind with fMRI data, and compiling a list of useful resources.
Week of July 19 (2009, July 19 - July 25)
- Finished integration of libsvm library, to allow SVM classification. It consists of two parts:
- Converter for OpenBiomind-formatted data that outputs the result to libsvm-formatted data.
- Wrapper functions for libsvm command-line binaries.
- These need to be more integrated into the rest of the OpenBiomind code.
- Continued code cleanup
- More renaming
- Reducing code duplication by moving common operations into utility functions
- Rearranged classes to make them more readable.
Week of July 26 (2009, July 26 - August 1)
- Returned to memory issues. I'm working on a class that can stream datasets from the disk, so that OpenBiomind can analyze any size datasets.
- Went back to working on SVM code
- Found some bugs in the dataset converter
- Removed the wrapper functions for libsvm - it is possible to just run it from the command line anyway. Running it from within OpenBiomind was too verbose.
- Continued code cleanup.
- Combed through my code, modifying variable names and other considerations to comply with standard Java conventions.
Week of August 2 (2009, August 2 - August 8)
- Finally finished writing my implementation of a streaming dataset. Theoretically this should allow OpenBiomind to deal with arbitrarily large datasets. In practice, so far it is far too slow to be usable, especially when running a MetaTask with many runs.
- These classes still need to be integrated into the old parts of OpenBiomind.
- Completed all features in the swap and normalize tasks, including command-line selection of a working directory for the temporary files.
Week of August 9 (2009, August 9 - August 17)
- Final integration of the new memory-related classes into the entire OpenBiomind codebase. It is now possible to specify from the command line or the parameters file exactly how OpenBiomind should deal with datasets.
- Added some features to the SVM code:
- Now supports SVM-light formatted data, as well as libsvm.
- Wrote a nice little bash wrapper that greatly reduces the size of an OpenBiomind command on the command line. This also allows OpenBiomind to be run from anywhere, more like a common Unix-style binary.
- Code cleanup, debugging, and documentation. There's always more of this!
Still to be done
Of my major goals, I achieved all but the data reduction algorithms. Of course, there is always more to be done. Here are some of the goals that I couldn't finish this summer, or that I thought of but did not have time to implement:
- Optimization: My solutions to the memory problem work, but are slow. Ideally, I would like to use a distributed solution that can scale up to multiple machines. Datasets could be stored in a distributed database, and many tasks in OpenBiomind are amenable to parallelization.
- Data reduction and feature selection: OpenBiomind could benefit from algorithms that are known to work well with fMRI data, such as PCA. This would also help with the large dataset problem.
- Project workflow: One common workflow for dealing with fMRI data is to break a dataset up into all pairs of classes, train classifiers on each pairwise set, then compare the results in a confusion matrix. While this is technically possible with OpenBiomind already, the process could be greatly simplified.
- Statistical significance: Automatic calculation of various measures of the statistical significance of classification results would be nice.