For the uninitiated:
“MALLET” stands for MAchine Learning for LanguagE Toolkit. It is a collection of Java classes that help with natural language processing and information extraction tasks using machine learning. It can be used for experimentation and evaluation as well as for application development.
It’s open source software, released under the “Common Public License”.
Home page here.
I am not a “technical blogger”. But I thought this would help someone new to “Mallet” who is working on simpler tasks. I am not going to describe anything I “invented”, for there is nothing like that. I will just mention some links here that give a good idea of how to write code that uses Mallet’s classes for your tasks.
And yeah, there’s nothing here to “display”. It’s just a log that I can revisit when I have to work with Mallet again.
Coming to the point: after installing, compiling and building Mallet, let us take a simple document classification example to work with.
You have a set of documents, each tagged with a category, as your training data.
Using this data, our task is to predict the category of any new document that comes in as input.
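To make the training data concrete, here is a small Java sketch (plain Java, no Mallet classes) that turns labeled documents into one line each, in the form "name label text". As far as I understand, this is the default layout that bin/mallet import-file reads, so treat that as an assumption and check the import documentation. The class name MalletImportLines, the method name and the sample documents are my own illustrations, not part of Mallet.

```java
import java.util.ArrayList;
import java.util.List;

public class MalletImportLines {

    /**
     * Formats one labeled document as "name label text" on a single line.
     * Name and label must not contain whitespace, so spaces become underscores;
     * the document text is squeezed onto one line.
     */
    static String toImportLine(String name, String label, String text) {
        String cleanName = name.trim().replaceAll("\\s+", "_");
        String cleanLabel = label.trim().replaceAll("\\s+", "_");
        String cleanText = text.replaceAll("\\s+", " ").trim();
        return cleanName + " " + cleanLabel + " " + cleanText;
    }

    public static void main(String[] args) {
        // Hypothetical training documents: {name, category, text}.
        String[][] docs = {
            {"doc1", "sports", "The team won\nthe final match."},
            {"doc2", "politics", "Parliament passed the new bill."}
        };
        List<String> lines = new ArrayList<>();
        for (String[] doc : docs) {
            lines.add(toImportLine(doc[0], doc[1], doc[2]));
        }
        // Each printed line is one training instance.
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

Redirecting the output to a file would give a single training file with one document per line, which is an alternative to keeping one directory per category.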
“Mallet” comes with a set of command line tools to perform these tasks. However, if we want to go beyond that, we can use Mallet’s API to achieve our goals. While browsing through their site, I came across some code samples that help in doing that.
Here are the steps:
-First, convert your labeled document collection into “mallet” format. How to arrange the documents is explained here. The approaches you can use to train on the labeled collection are also explained on the same page.
On the command line, this can be achieved with bin/mallet import-file or bin/mallet import-dir, along with their options, depending on how your training data is organized.
This step gives us a .mallet file, which holds our data in a form that Mallet understands.
More details can also be found on the “Professionalization of Mallet” (aka “pallet”) page on Google Code, here.
-Next, use a classifier that will learn “how to classify a new document” from this data. On the command line, it’s simple.
Run: bin/mallet train-classifier --input training.mallet --output-classifier my.classifier (where training.mallet is the file you got from the previous step, and my.classifier is the file that will contain your trained classifier).
[Here we can choose the classifier we want to try out, e.g. NaiveBayes or MaxEnt, with the --trainer option.]
We can also split the training data, train on one part and evaluate the classifier on the rest as test data. We can also compare the performance of various classifiers.
-Step 3: How to use this model now?
The links mentioned in the earlier steps give a lot of information on this.
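The three steps can be strung together roughly as below. This is only a sketch that builds and prints the command lines, it does not call Mallet’s API or run anything. The class MalletCommands, its method names and the directory names are hypothetical, and classify-dir is the command line tool I am assuming for step 3, so check the help output of bin/mallet for the options on your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MalletCommands {

    // Step 1: import-dir takes one directory per category (one file per document).
    static List<String> importDir(List<String> categoryDirs, String malletFile) {
        List<String> cmd = new ArrayList<>(Arrays.asList("bin/mallet", "import-dir", "--input"));
        cmd.addAll(categoryDirs);
        cmd.addAll(Arrays.asList("--output", malletFile));
        return cmd;
    }

    // Step 2: train a classifier (e.g. NaiveBayes or MaxEnt) on the imported data.
    static List<String> trainClassifier(String malletFile, String classifierFile, String trainer) {
        return Arrays.asList("bin/mallet", "train-classifier",
                "--input", malletFile,
                "--output-classifier", classifierFile,
                "--trainer", trainer);
    }

    // Step 3 (assumed): label new documents with the saved classifier;
    // "--output -" asks for the predictions on standard output.
    static List<String> classifyDir(String newDocsDir, String classifierFile) {
        return Arrays.asList("bin/mallet", "classify-dir",
                "--input", newDocsDir,
                "--output", "-",
                "--classifier", classifierFile);
    }

    public static void main(String[] args) {
        List<List<String>> steps = Arrays.asList(
                importDir(Arrays.asList("data/sports", "data/politics"), "training.mallet"),
                trainClassifier("training.mallet", "my.classifier", "NaiveBayes"),
                classifyDir("new-docs", "my.classifier"));
        // Print the commands so they can be reviewed and pasted into a shell.
        for (List<String> step : steps) {
            System.out.println(String.join(" ", step));
        }
    }
}
```

Keeping the file names in one place like this makes it harder to train on one .mallet file and then accidentally point the classification step at a different classifier.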
End user comments:
Oh, I am very satisfied with my first experience with Mallet. This is not exactly my first experience: I do remember playing with it long back, in 2006 I guess, though only for a few minutes.
Maybe because this is a purely academic endeavour, there are few discussion boards or support groups for “mallet” online.
Anyway, maybe as I begin doing more complex tasks, I’ll find more interesting stuff in Mallet, which I can blog about soon.