First things first. What is Solr?
“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.”
-Says the Apache Solr website. I can’t explain it better than this, anyway.
I first used Solr in Jan 2010 and have been using it at various stages of my professional as well as student life ever since. To say the least, it’s amazing! 🙂 I was just browsing through the book “Solr 1.4 Enterprise Search Server” (by David Smiley and Eric Pugh) today. Oh well, ideally I should have read it back in 2010. But, I found it now. There was one more which I found a couple of weeks ago, “Apache Solr 3.1 Cookbook” by Rafal Kuc. I thought it would be good to put together the set of related links that I found useful over a period of time, so that others like me, who want to begin with Solr, might find them useful too. There might be N such posts on the web. But, this is partly for my own log too!
1) I remember starting with this presentation on “What is Solr?” by Tom Hill (courtesy: my google bookmarks). It might be a bit dated now, though.
2) This article by Grant Ingersoll, on IBM developerWorks, was another useful one, back then.
3) The Solr Wiki is the best source for everything, of course. (http://wiki.apache.org/solr)
4) “Seven Deadly Sins of Solr” on the LucidImagination pages was another one which helped me a lot during the initial days. In general, LucidImagination has some really good articles on Solr, for users at all levels of Solr experience.
5) In my view, setting up Solr in Eclipse with Tomcat is a really irritating issue. Well, it’s not difficult. It’s straightforward. But I don’t (perhaps won’t) understand Tomcat’s behavior in Eclipse, personally. This blogpost gives an overview of what to do to make a test Solr server work in your Eclipse.
6) I was wondering about the crawling part of Solr. I always used external crawlers (Heritrix, Wget and even 80legs etc) to get the required data. But there seem to be a decent number of options, which I have not tried out yet. Some of them –
a) Using Nutch with Solr – a LucidImagination article (I did play around with Nutch a while ago, but I did not know about Solr back then!)
b) Youseer, which uses Heritrix to crawl and Solr to index.
c) Crawl-Anywhere, another thing that I read about recently, which uses its own crawler with a Solr indexer.
7) Here is a good example of reading Solr results with SolrJ, a Java library for Solr. I cite this in particular because I had a tough time finding an example of using SolrJ to read Solr results :). Mats Lindh’s post was also a very useful one on using SolrJ.
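To give an idea of what reading results with SolrJ looks like, here is a minimal sketch. The server URL, core path, and the “title” field are my own assumptions for illustration, and the client class names have changed across SolrJ versions (this uses the CommonsHttpSolrServer class from the Solr 1.4/3.x era; newer releases renamed it), so treat it as a sketch rather than copy-paste code.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrJReadExample {
    public static void main(String[] args) throws Exception {
        // Assumes Solr is running at this URL -- change it to match your setup.
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Build a query; "title" is a hypothetical field from my schema.
        SolrQuery query = new SolrQuery("title:solr");
        query.setRows(10);

        // Run it and walk through the matching documents.
        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id")
                    + " : " + doc.getFieldValue("title"));
        }
    }
}
```

The loop over `response.getResults()` was the part I struggled to find examples for back then — each `SolrDocument` is basically a map from field names to values.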
The best part of this is that I need not read Solr results in Java alone. I can develop my webapp in any language I want and Solr supports it! (Well, I know that people use Java, Ruby and Python. Since Solr can return JSON, I guess it can be used from other languages too.)
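For instance, talking to Solr without SolrJ just means hitting its HTTP API and asking for JSON output with the `wt=json` parameter — any language with an HTTP client and a JSON parser can then read the results. A small sketch in plain Java (the base URL and the query field are assumptions for illustration):

```java
import java.net.URLEncoder;

public class SolrJsonQuery {
    // Builds a Solr /select URL that asks for JSON output (wt=json).
    // The base URL and field names are assumptions; adjust for your core.
    public static String buildQueryUrl(String baseUrl, String query, int rows)
            throws Exception {
        return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) throws Exception {
        String url = buildQueryUrl("http://localhost:8983/solr", "title:solr", 5);
        // Fetching this URL (e.g. with java.net.HttpURLConnection, curl, or any
        // language's HTTP library) returns a JSON response -- no SolrJ required.
        System.out.println(url);
    }
}
```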
8) I tried using the Solr-UIMA integration, but it did not work with my UIMA annotators. So, I left it there and managed to do the UIMA stuff before sending the doc to Solr :). It might look like “running away from the problem”, but it works.
9) Time and again, I am amazed at the new stuff I discover about Solr. The books I mentioned above are really good sources of information. You can perhaps manage to find pirated pdfs online. I don’t want to give links myself, here. 🙂
10) Oh, I did mention links on setting up Solr, knowing more about Solr, and accessing Solr’s search results using SolrJ. But, I did not mention indexing a set of documents. I started with some example code by Grant Ingersoll. I can’t place that right now, but found this Solr-user group mail, which can serve as an example. Here is one more from Lucid Imagination, which uses the Tika parser.
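The indexing side with SolrJ is just as short as the reading side. A minimal sketch, with the same caveats as before: the server URL and the “id”/“title” fields are my assumptions (they must exist in your schema.xml), and the client class name varies by SolrJ version.

```java
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrJIndexExample {
    public static void main(String[] args) throws Exception {
        // Assumes a local Solr instance; change the URL for your setup.
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        // A SolrInputDocument is the unit of indexing: a bag of field values.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Getting started with Solr");

        server.add(doc);   // send the document to Solr
        server.commit();   // commit, so it becomes searchable
    }
}
```

Until you call `commit()`, the added documents are not visible to searches — that tripped me up early on.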
(Hoping that someday, someone like me, will find this post useful! Perhaps, this is an attempt to give back to the community! :P)