Please use this identifier to cite or link to this item: http://hdl.handle.net/1946/31585
The goal of this thesis is to improve the I/O performance of PiSVM, a suite of parallel and scalable tools used for machine learning on HPC platforms. This is achieved by analyzing the current state of I/O and then designing an I/O framework that enables PiSVM programs to read and write data in parallel using the HDF5 library. HDF5 is a highly scalable library that is used in HPC, but its use is not widespread. The thesis implements the design in the PiSVM toolset as a proof of concept. A parser is also added to the PiSVM suite that converts data from the currently used SVMLight format into the HDF5 format. On the Indian Pines Raw dataset, a 3.45% overall reduction in execution time was achieved in PiSVM-Train and a 4.88% overall reduction in PiSVM-Predict. Read and write times improved by a larger margin, with reductions of up to 98%. This can be attributed to the design of the I/O framework and the use of advanced data storage features that HDF5 offers. A further significant result on the Indian Pines Raw dataset is a 72% reduction in data file size and a 24% reduction in model file size. In practice, any work with PiSVM stands to benefit from the work done in this thesis. Whole research groups tend to keep multiple copies of the data while working with different feature engineering techniques. As PiSVM is used in many supercomputing centers and by multiple research groups, the gains are significant.
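As a rough illustration of the conversion step mentioned in the abstract, the following sketch parses SVMLight-formatted text (`<label> <index>:<value> ...`) into flat, CSR-style arrays of the kind that could be stored as HDF5 datasets. The function name `parse_svmlight` and the four-array layout are assumptions made for this example; the abstract does not describe the parser or the HDF5 layout actually used in the thesis.

```python
def parse_svmlight(lines):
    """Parse SVMLight-formatted lines into CSR-style flat lists.

    Hypothetical helper for illustration only; not the thesis's parser.
    Returns (labels, indices, values, indptr), where the features of
    sample i are values[indptr[i]:indptr[i + 1]].
    """
    labels, indices, values, indptr = [], [], [], [0]
    for raw in lines:
        # Drop trailing "# ..." comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank or comment-only lines
        tokens = line.split()
        labels.append(float(tokens[0]))
        for token in tokens[1:]:
            index, value = token.split(":", 1)
            indices.append(int(index) - 1)  # SVMLight indices are 1-based
            values.append(float(value))
        indptr.append(len(indices))
    return labels, indices, values, indptr


# Small made-up sample, not taken from the Indian Pines dataset.
sample = [
    "1 1:0.5 3:1.25 # first sample",
    "-1 2:2.0",
    "",
    "1 1:0.1 2:0.2 3:0.3",
]
labels, indices, values, indptr = parse_svmlight(sample)
```

Under the same assumptions, each of the four arrays could then be written as its own HDF5 dataset, for example with `h5py`'s `create_dataset`, optionally chunked and compressed. Storing numbers in binary rather than as text is one plausible source of file size savings, though the abstract does not attribute the reported reductions to any single feature.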
|Thesis-Final.pdf||4.66 MB||Open||Complete Text||View/Open|