Somehow, while surfing at random, I came across this Master’s dissertation dated Feb 2008, from the University of Pretoria, by Christiaan Maarten van der Walt. It is titled “Data Measures that characterize classification problems”. Well, I am a total dumbo as far as the mathematics of classification is concerned. Once or twice I tried to decipher the mystery, but each time I was intelligent enough to realize that it’s pointless to write with a broken pencil. (Now, if you don’t have enough of a math foundation, do you really expect to understand the internals of this stuff?)
Reading through this thesis, I felt it’s very clearly written, and it was both educational and informative, at least for me. I am not going to write an elaborate description of it. Let me just mention in brief what it covers. Whether to read it or not is left to the enthusiasts. I only read parts of it, in no particular order, and found that a lot of investigation went into it. Here you go:
Essentially, this thesis aims at understanding the relationship between the nature of the data and the choice of a particular classifier. It goes about this in the following steps:
1. Identifying the properties of data that affect a classifier’s performance
2. Proposing measures to quantify these properties
3. Validating the efficiency of these measures
4. Using these measures to build a meta classifier, and explaining its predictions about classification in terms of them
5. Explaining the results of the experiments, their interpretation, the contributions to research, the shortcomings of this work and future directions
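To make steps 1, 2 and 4 a little more concrete, here is a rough Python sketch of the general idea. Please treat everything in it as an assumption on my part: the measures (sample count, class balance and so on), the function names `data_measures` and `nearest_recommendation`, and the toy data are all made up for illustration, and none of it is taken from the thesis. The “meta classifier” here is just a nearest-neighbour lookup over measure vectors, which is probably far simpler than what the thesis actually builds.

```python
import math
from collections import Counter


def data_measures(X, y):
    """Compute a few simple dataset descriptors (illustrative choices only)."""
    n = len(X)          # number of samples
    d = len(X[0])       # number of features
    counts = Counter(y)
    k = len(counts)     # number of classes
    # Normalised class entropy: 1.0 means perfectly balanced classes.
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    balance = entropy / math.log(k) if k > 1 else 1.0
    return {
        "samples": n,
        "features": d,
        "classes": k,
        "samples_per_feature": n / d,
        "class_balance": balance,
    }


def nearest_recommendation(measures, history):
    """Return the classifier name attached to the most similar past dataset.

    `history` is a list of (measures_dict, classifier_name) pairs.
    Distances are plain Euclidean on unscaled measures, which is crude;
    a serious version would at least normalise each measure first.
    """
    keys = sorted(measures)

    def dist(a, b):
        return math.sqrt(sum((a[key] - b[key]) ** 2 for key in keys))

    return min(history, key=lambda item: dist(measures, item[0]))[1]


# Toy usage with made-up data and made-up "best classifier" labels.
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0, 0, 1, 1]
m = data_measures(X, y)
history = [
    (m, "k-nearest-neighbour"),                   # hypothetical past result
    (dict(m, samples=4000, samples_per_feature=2000.0), "naive Bayes"),
]
print(nearest_recommendation(m, history))
```

The point of the sketch is only the shape of the pipeline: describe a data set by a handful of numbers, then let those numbers, rather than the raw data, drive the choice of classifier.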
There is a great deal of background material provided for the enthusiasts, which I personally liked for the depth of its bibliography 🙂
I haven’t read the thesis from top to bottom, but I am fairly sure that it gives a good idea of how to choose a classifier for a specific kind of data set, and of the issues involved in making that choice. However, I did wonder whether it should at least have mentioned the impact of the data set’s domain on the choice of classifier, or how the statistical nature of the data set and its domain, taken together, affect that choice.
Anyway, the thesis can be accessed here. I liked it for its clarity. I’d go back to it when I actually do some work that needs this background, and I’m quite confident about its content.