Automated grading and Counter arguments-1

Take 1: I read this NYT article about EdX’s announcement that it will release its automatic grading software “free on the web, to any institution that wants to use it”. (Article can be read here)

I also particularly liked this part of the statement:

“The EdX assessment tool requires human teachers, or graders, to first grade 100 essays or essay questions. The system then uses a variety of machine-learning techniques to train itself to be able to grade any number of essays or answers automatically and almost instantaneously.”

Take 2: There is this organization called humanreaders.org. According to the news report, “The group, which calls itself Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment, has collected nearly 2,000 signatures, including some from luminaries like Noam Chomsky.”

Given my recent interest in “evaluation of evaluation”, this particular statement from one of the people in this group caught my attention:

“My first and greatest objection to the research is that they did not have any valid statistical test comparing the software directly to human graders,” said Mr. Perelman, a retired director of writing and a current researcher at M.I.T.”

Take 3: I ended up navigating through the pages of humanreaders.org and reading their reports and conclusions, purely because of the above statement.

Take 4: This post comes out. … and seems to become a couple of posts soon.
****
So, humanreaders.org’s major claim is:
“We call for schools, colleges, and educational assessment programs to stop using computer scoring of student essays written during high-stakes tests.”

As someone not doing anything directly with automated scoring but having academic interest in it owing to its proximity to what I do, I was naturally curious after seeing such a strong statement.

At this point, I have to state what I think about it. I think automated scoring is a nice complementary system to have, along with human evaluators. This is also why I like the GRE/GMAT-AWA section style scoring model. For example, here is what they say on the ETS website, about GRE essay scoring:

“For the Analytical Writing section, each essay receives a score from at least one trained reader, using a six-point holistic scale. In holistic scoring, readers are trained to assign scores on the basis of the overall quality of an essay in response to the assigned task. The essay score is then reviewed by e-rater, a computerized program developed by ETS, which is used to monitor the human reader. If the e-rater evaluation and the human score agree, the human score is used as the final score. If they disagree by a certain amount, a second human score is obtained, and the final score is the average of the two human scores.”
(Link with more detailed explanation here)
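
To make that scoring rule concrete for myself, here is a minimal Python sketch of it. This is only my reading of the quoted paragraph, not ETS’s actual implementation: the function name, the `max_gap` threshold (ETS only says “a certain amount”) and the callback for fetching a second human score are all assumptions made up for illustration.

```python
from typing import Callable


def final_essay_score(
    human_score: float,
    e_rater_score: float,
    get_second_human_score: Callable[[], float],
    max_gap: float = 1.0,  # hypothetical threshold; the quote only says "a certain amount"
) -> float:
    """Combine a human score with an automated check, as described in the quote."""
    # If the machine and the human roughly agree, the human score stands.
    if abs(human_score - e_rater_score) < max_gap:
        return human_score
    # Otherwise a second human is consulted and the two human scores are averaged.
    # Note that the machine score itself never enters the final score.
    second_human_score = get_second_human_score()
    return (human_score + second_human_score) / 2


# Example: agreement keeps the human score, disagreement triggers a second reader.
agreed = final_essay_score(4.0, 4.5, lambda: 5.0)
disputed = final_essay_score(4.0, 2.0, lambda: 5.0)
```

What I like about this design is visible in the code: the automated scorer only decides whether a second human is needed, it never sets the score.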

Also, in the context of MOOCs and the sheer number of students that enrol in them, perhaps it’s a worthwhile idea to explore ways of evaluating them better. Surely, when you are offering courses for free or for minimal charges and you have thousands of students, you cannot afford to manually grade each and every student test script. I do like the idea of peer-reviewed essay grading too, though.

Coming back to the topic, the humanreaders.org homepage continued:

Independent and industry studies show that by its nature computerized essay rating is

* trivial, rating essays only on surface features such as word size, topic vocabulary, and essay length
* reductive, handling extended prose written only at a grade-school level
* inaccurate, missing much error in student writing and finding much error where it does not exist
* undiagnostic, correlating hardly at all with subsequent writing performance
* unfair, discriminating against minority groups and second-language writers
* secretive, with testing companies blocking independent research into their products

-It is here that I began feeling … “not true… not true… something is missing”. The reasons? While some of these points would need a more detailed reading, I was surprised to see some of the others:

trivial: I did spend some time in the past few months reading published (and peer-reviewed) research on these kinds of systems (e-rater, for example), and at least I feel that it’s not really “trivial”. We can always argue –
a) “this is not how a human mind does it”
b) “there is much more than what you do now”.
(I am reminded of the Norvig-Chomsky debate as I write these above two lines!)
But, IMHO, we still cannot call the current state-of-the-art “trivial”. If it is so trivial, why is it that so many researchers still spend a major part of their working hours on this problem?

unfair: Even if I start believing that it’s true, I don’t understand how we can be so sure that a human evaluator won’t do this too.

secretive: On this part, I partly agree. But, these days, there are so many competitions on automated assessment (eg: 2012 automated essay scoring competition by the Hewlett Foundation, the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge at SemEval 2013, Question Answering for Machine Reading Evaluation at CLEF 2013) and people from the industry also participate in these competitions, as far as I noticed. So, although one might still not be able to see the actual working of the respective company products or their actual student texts (hello, they are companies and like many other companies, they too have some proprietary stuff!), these competitions actually provide a scope for open research, fair play and also a scope to explore various dimensions of automated essay scoring. After just browsing through some of these things, I really can’t call things trivial… secretive, they may be… stupid, they certainly are not!

So… I ended up reading their “research findings” as well. As I started reading some of the references, I understood the power of selective reporting, again! By selectively reporting what we choose to report, we can always turn everything in our favor… and this realization perhaps is making me write these posts. 🙂

(continued)

PS 1: What qualification do you have? someone might ask me. I have the necessary background to understand research documents on this topic. I did some hobby experiments with this sort of stuff, with a freely accessible exam dataset and have an idea of what works and why it works when it works. I never worked on any real automated scoring system. I have no more interest than this on this topic, at least as of now.

Published in: on April 6, 2013 at 8:38 pm  Comments (8)  

Machine Learning that Matters – Some thoughts.

It’s almost a year since Praneeth sent this paper and I read it… and began blogging about it. I began re-reading it today, as a part of my “evaluating the evaluation” readings, and thought I still have something to say (largely to myself) on some of the points mentioned in this paper.

Machine Learning that Matters
by Kiri L. Wagstaff
Published in proceedings of ICML 2012.

This is how it begins:


“Much of current machine learning (ML) research has lost its connection to problems of import to the larger world of science and society”

-I guess the tone and intention of this paper are pretty clear in this first sentence.

I don’t have any issues with the tone as such – but I thought there are so many real-world applications of machine learning these days! That doesn’t mean that every machine learning research problem leads to solving a real-world problem though, which holds good for any research. So, the above statement in my view can apply to any research in general.

I was fascinated by these statistics on the hyper-focus on benchmark datasets.

A survey of the 152 non-cross-conference papers published at ICML 2011 reveals:
148/152 (93%) include experiments of some sort
57/148 (39%) use synthetic data
55/148 (37%) use UCI data
34/148 (23%) use ONLY UCI and/or synthetic data
1/148 (1%) interpret results in domain context

-Since I am not into machine learning research but only use ML for computational linguistics problems, I found this to be very interesting… and a very valid point.

Then, the discussion moves on to evaluation metrics:

“These metrics are abstract in that they explicitly ignore or remove problem-specific details, usually so that numbers can be compared across domains. Does this seemingly obvious strategy provide us with useful information?”

-In the discussion that followed, there were some interesting points on what various evaluation metrics fail to capture etc. I have been reading on this topic of evaluation metrics for supervised machine learning in the recent past… and like with those, I am left with the same question even here – what is the best evaluation, then? Of course, “real world”. But how do you quantify that? How can there be some kind of evaluation metric that’s truly comparable with other peer research groups?

I got my answer in the later part of the paper:

Yet (as noted earlier) the common approach of using the same metric for all domains relies on an unstated, and usually unfounded, assumption that it is possible to equate an x% improvement in one domain with that in another. Instead, if the same method can yield profit improvements of $10,000 per year for an auto-tire business as well as the avoidance of 300 unnecessary surgical interventions per year, then it will have demonstrated a powerful, wide-ranging utility.

Next part of the discussion is on identifying where machine learning matters:

“It is very hard to identify a problem for which machine learning may offer a solution, determine what data should be collected, select or extract relevant features, choose an appropriate learning method, select an evaluation method, interpret the results, involve domain experts, publicize the results to the relevant scientific community, persuade users to adopt the technique, and (only then) to truly have made a difference”

-Now, I like that. 🙂 🙂

I also liked this point on the involvement of the world outside ML.

“We could also solicit short “Comment” papers, to accompany the publication of a new ML advance, that are authored by researchers with relevant domain expertise but who were uninvolved with the ML research. They could provide an independent assessment of the performance, utility, and impact of the work. As an additional benefit, this informs new communities about how, and how well, ML methods work.”

“Finally, we should consider potential impact when selecting which research problems to tackle, not merely how interesting or challenging they are from the ML perspective. How many people, species, countries, or square meters would be impacted by a solution to the problem? What level of performance would constitute a meaningful improvement over the status quo?”

-Well, I personally share the sentiments expressed here. I like and I want to work on problems whose solutions can possibly have a real life impact. However, I consider it my personal choice. But I don’t understand what is wrong in doing something because it’s challenging! What’s wrong in researching for fact finding? There will be practical implications to certain research problems. There might not be an immediate impact for some. There might not be a direct impact for some. There might not really be a practical impact for some. But should that be the only deciding factor? (Well, of course, when the researchers are funded from public taxes, perhaps it’s expected to be thus. But should it be thus, always??)

I found the six old and new Machine learning impact challenges really interesting.
Here are the new ones from the paper:

1. A law passed or legal decision made that relies on the result of an ML analysis.
2. $100M saved through improved decision making provided by an ML system.
3. A conflict between nations averted through high-quality translation provided by an ML system.
4. A 50% reduction in cybersecurity break-ins through ML defenses.
5. A human life saved through a diagnosis or intervention recommended by an ML system.
6. Improvement of 10% in one country’s Human Development Index (HDI) (Anand & Sen,1994) attributable to an ML system.

And finally, I found the last discussion on obstacles to ML impact also to be very true. I don’t know why there is so little work on making machine learning output comprehensible to its users (e.g., doctors using a classifier to identify certain traits in a patient might not really want to see an SVM output and take a decision without understanding the output!) (at least, I did not find too much work on Human Comprehensible Machine Learning).

As I read it again and again, this paper seems to me like a Theory vs Practice debate (generally speaking) and can possibly be worth reading for anyone outside the machine learning community too (like it was useful for me!).

******
End disclaimer: All those thoughts expressed are my individual feelings and are not related to my employer.:-)

Published in: on March 26, 2013 at 12:35 pm  Comments (22)  

Jurafsky’s Introduction to NLP

Well, it’s been a considerable break from blogging for a compulsive blogger like me 😉 … not that I don’t have anything interesting… it’s just that I lost interest in online activities. What prompted me to write today was Prof Jurafsky’s 15-minute talk introducing Natural Language Processing.

I have been having a feast of sorts since October 2011, starting with Prof Andrew Ng’s Machine Learning course. Now, I am spoilt for choice, with the advent of Udacity and Coursera. Amidst all these choices, I wasn’t really keen on attending the “Introduction to NLP” class, despite the fact that it’s being taught by people like Prof Jurafsky and Prof Manning!

However, a friend’s mail made me visit the course website today and I wanted to see the introductory lecture. I am just fascinated by the way Prof Jurafsky introduced NLP to the world (…..of the site visitors!). From the moment I started the video till now – I must have said – “Wow! What an intro!” several times, with several people, along with a “please watch this video” request. If someone introduced something in this manner, I will perhaps be tempted to take that course, whatever it is! In Jurafsky’s magic, I, who decided not to take this course, listened to all the uploaded lectures on the site so far!!

Irrespective of your background, I would strongly recommend this video to people who are curious about –
1) What does Natural Language Processing mean?
2) Where is it used?
3) (For Telugu readers) What is it that I tried to explain last year, in this ?? 😉 (Yeah, shameless inter-linking!)

e-education rox!

(Okay, since it’s officially downloadable, I’ve uploaded this video that floored me to Dropbox, so that anyone who doesn’t have a login on the course website can also view it. Here it is. If this uploading is a crime, let me know… I will un-crime myself!)

Published in: on March 9, 2012 at 9:06 pm  Comments (1)  

Online Readings-4

(Disclaimer: This is dedicated to some Technology stuff I read and found interesting over the past two weeks..)

A couple of weeks back, I saw this blog post called “The evolution of search in six minutes”, with a video attached. To me, it is a very well-made, down to earth post on, well, the evolution of search as we use it now… and how it might be in the future. I feel it’s a must watch for everyone, irrespective of their search-technology awareness.

And yes, although I cannot understand a bit about “Quantum Computing”, this article on “Machine Learning with quantum algorithms”, on the Google Research blog, made an interesting read. Maybe, at some point of time, if I can manage to understand the science behind it, I might try to put it forth in a more common-manish way :P. Thanks to Praneeth for sharing this.

For a couple of days in the past fortnight, I was addicted to the idea of “Forensic Linguistics”. I was searching for some applications of linguistics (rather… Applied Linguistics) and this video, although not “extraordinarily” shot, made me more curious about the subject. I even managed to get a book introducing the concepts of Forensic Linguistics (online)… but it’s so easy to drift when you don’t know so many things 😉 😛

Although I was fascinated by “Wolfram Alpha” when it was first announced, I soon drifted…well…I just told you why. But again, this particular blog post on Wolfram Alpha site, on handling/solving permutations related queries, fascinated me a lot 🙂

And then, I work in applying Language Technologies for Language Learning. I found something similar, but not related to “Language”, on the Communications of the ACM pages today. It is titled “Massive scale data mining for Education”. It talks about the use of massive math-query logs of students to predict clusters of difficult math problems, ways to track the progress of students etc. Yes, like they say, if it works, it will really be a massive scale student modeling. I loved the way the article ended. “We experiment, try different students on different problems, discover which exercises cause similar difficulties, and which help students break out of those difficulties. We learn paths in the data and models of the students. We learn to teach.” – It’d be interesting to see how it goes! 🙂

Back to CACM again, this piece on the duality of defining computing as an “instrument for human mind” (of a human mind and by a human mind too?? :P)… and connecting it to Steve Jobs and Dennis Ritchie ….was interesting too.

Lastly, again on CACM, there was this extremely insightful and well-written article on “Natural” Search User Interfaces by Marti Hearst. I just loved reading it and would recommend it to everyone interested in “Search”. It’s a bit long… but surely worth it, IMHO.

Jai Hind, for now!!

Published in: on December 20, 2011 at 1:43 am  Comments (2)  

On Solr

First things first. What is Solr?

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.”
-Says the Apache Solr website. I can’t explain it better than this, anyway.

I first used Solr in Jan 2010 and have been using it at various stages of my professional as well as student life ever since. To say the least, it’s amazing! 🙂 I was just browsing through this book – “Solr 1.4 Enterprise Search Server” (by David Smiley and Eric Pugh) today. Oh, well, I should ideally have read it in 2010. But I found it now. There was one more which I found a couple of weeks ago, “Apache Solr 3.1 Cookbook” by Rafal Kuc. I thought it would be good to put together the set of related links that I found useful over a period of time, so that others like me, who want to begin with Solr, might find them useful too. There might be N such posts on the web. But this is partly for my own log too!

1) I remember starting at this presentation on “What is Solr?” by Tom Hill (courtesy: my google bookmarks). It might be a bit dated now, though.

2) This article by Grant Ingersoll, at IBM developerWorks, was another useful one, back then.

3) The Solr Wiki is the best source for everything, of course. (http://wiki.apache.org/solr)

4) “Seven Deadly Sins of Solr” on LucidImagination pages, was another one, which helped me a lot during the initial days. In general, LucidImagination has some really good articles on Solr, for users of all stages of Solr experience.

5) In my view, setting up Solr in Eclipse with Tomcat is a really irritating issue. Well, it’s not difficult. It’s straightforward. But I don’t (perhaps won’t) understand Tomcat’s behavior in Eclipse, personally. This blogpost gives an overview of what to do to make a test Solr server work in your Eclipse.

6) I was wondering about the crawling part of Solr. I always used external crawlers (Heritrix, Wget and even 80legs etc) to get the required data. But there seem to be a decent number of options, which I have not tried out yet. Some of them –
a) Using Nutch with Solr – a LucidImagination article (I did play around with Nutch a while ago, but I did not know about Solr back then!)
b) Youseer, which uses Heritrix to crawl and Solr to index.
c) Crawl-Anywhere, another thing that I read about recently, which uses their own crawler, with Solr Indexer.

7) Here is a good example of reading Solr results from SolrJ, a Java library for Solr. I particularly cite this because I had a tough time finding an example of using SolrJ to read Solr results :). Mats Lindh’s post was also very useful for learning to use SolrJ.

The best part of this is, I need not read Solr results in java alone. I can develop my webapp in any language I want and Solr supports it! (Well, I know that people use Java, Ruby and Python. Since Solr supports JSON, I guess it can be used in other languages too, perhaps).
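
For instance, here is a small Python sketch of what reading Solr’s JSON output could look like, without SolrJ. The base URL (`http://localhost:8983/solr`, a default single-core setup), the query and the sample response document are my assumptions for illustration; only the `/select` handler with the `q`, `rows` and `wt=json` parameters and the `response` → `docs` layout are standard Solr.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a Solr /select URL that asks for JSON output via wt=json."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url}/select?{params}"


def extract_docs(response_text: str) -> list:
    """Pull the list of matching documents out of a Solr JSON response."""
    payload = json.loads(response_text)
    return payload["response"]["docs"]


# Hypothetical local Solr instance and query.
url = build_select_url("http://localhost:8983/solr", "title:solr", rows=5)

# A made-up response in the shape Solr returns for wt=json.
sample_response = (
    '{"responseHeader": {"status": 0}, '
    '"response": {"numFound": 1, "start": 0, '
    '"docs": [{"id": "doc1", "title": "Hello Solr"}]}}'
)
docs = extract_docs(sample_response)
```

In a real webapp, you would fetch `url` with whatever HTTP client your language offers and pass the body to something like `extract_docs`.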

8) The best place to go for any query about Solr is the Solr users group and its archives 🙂 It’s a very friendly group, where you can always be assured that you will get a response.

9) I tried using the Solr-UIMA integration, but it did not work with my UIMA annotators. So, I left it there and managed to do the UIMA stuff before sending the doc to Solr :). It might look like “running away from the problem”, but it works.

10) Time and again, I am amazed at the new stuff I discover about Solr. The books I mentioned above, are really good sources of information. You can perhaps manage to find pirated pdfs online. I don’t want to give links myself, here. 🙂

11) Oh, I did mention links on setting up Solr to work, Knowing more about Solr and Accessing the search results of Solr using SolrJ. But, I did not mention about indexing a set of documents. I started with an example code by Grant Ingersoll. I can’t place that right now, but found this Solr-user group mail, which can serve as an example. Here is one more from Lucid Imagination, which uses TIKA parser.

Happy Solr-ing.
(Hoping that someday, someone like me, will find this post useful! Perhaps, this is an attempt to give back to the community! :P)

Published in: on August 31, 2011 at 6:54 pm  Comments (2)  

Readings: Amazon Mechanical Turk – Gold mine or Coal mine?

First things first:
Title of the article: Amazon Mechanical Turk: Gold Mine or Coal Mine?
Authors: Karen Fort, Gilles Adda, K.Bretonnel Cohen
In: Computational Linguistics Journal, June 2011, Vol. 37, No. 2, Pages 413-420

In the past 2 months or so, I guess I read this article 2-3 times. I still don’t understand why it was written 😦 Oh, the incomprehensibility was not the reason why I read it again, though. I found it multiple times, lying on my table..for various reasons (which include piling up of stuff you never read!)

Coming to the point:
1) I have never used Amazon Mechanical Turk – neither as a requester nor as a turker. But I don’t feel that there is something wrong with the approach. It’s up to the turkers to do or not do a given task. So, I don’t think there should be a question of ethics here. If there is anything, there should be a question about the quality.

2) If some turkers use it to meet their basic needs, (IMHO) it’s not the wrong-doing of Amazon or the requesters.

So, if the concern in the article was about the quality of the linguistic resources developed, perhaps it might not have sounded so abnormal to me. But the issue was the “working conditions” of the AMT “workers”. Whatever way people are using it (as a hobby, timepass, for some pocket money, to meet living expenses etc), it’s people who do that… and neither Amazon nor the task givers on AMT promise employment, right?

And hence the confusion… 🙂
Of course, now, I won’t do a re-reading of the article! 😛

Published in: on July 1, 2011 at 1:58 pm  Leave a Comment  

Automatic Adaptation of Dynamic Second language reading texts

Title: Automatic Adaptation of Dynamic second language reading texts
Author: Michael Walmsley

The tool can be accessed here.

Let me tell you, I am not reviewing the paper here. There are two reasons I started posting this.
1. I liked the FlaxReader tool, and began believing since yesterday that it will help in my German lessons.
2. Some of the statements and conclusions in this paper – attracted, amused, surprised (not all these emotions are for one statement!) me. So, thought I’d post them here.

Firstly, this paper begins with the idea that “Extensive Reading” (ER) is an effective Language Acquisition strategy. So, in this paper, there is a discussion on providing reading material, automatically adapting it to computer adapted ER. The “This research aims to make improvements in how people can learn languages” part attracted me the most, though I read this paper for professional reasons, and not as a beginner German learner.

As they say in Telugu – Swami Kaaryam, Swakaaryam…. 😛

So, what are those statements that caught my attention?

* “Nation[3] states that students need to read close to 500,000 words per year, which is equivalent to one and a half substantial first year text books or six novels.”
(Nation[3] here is… Nation I.S.P, 2009, Teaching ESL/EFL reading and writing.)
– I did not go to that source to see how they reached those conclusions (yet!)… but it was interesting to note that. I wonder if my vocab, especially in Telugu, shows any improvement, despite the fact that I read many Telugu books during a year! 😛

* “Dictionaries should only be used for frequently occurring unknown words that prevent overall understanding. Literature suggests that comprehension is difficult when more than 5% of the words in a text are unfamiliar to a learner.”
(Again from Nation[3])
-Well, I am wondering how the stats were arrived at. Nevertheless, it’s also an interesting fact to know.
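
Just to see what that 5% rule would mean in practice, here is a rough Python sketch. It is not from the paper: the tokenizer is deliberately naive, and the function names and the idea of representing a learner as a plain set of known words are my own assumptions.

```python
import re


def unfamiliar_ratio(text: str, known_words: set) -> float:
    """Fraction of word tokens in the text that the learner does not know."""
    # Naive tokenizer: runs of letters only, lowercased (so "don't" becomes "don" + "t").
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in known_words)
    return unknown / len(tokens)


def is_comfortable(text: str, known_words: set, threshold: float = 0.05) -> bool:
    """Apply the 'no more than 5% unfamiliar words' rule of thumb quoted above."""
    return unfamiliar_ratio(text, known_words) <= threshold


# Hypothetical learner vocabulary and text.
learner_vocab = {"the", "cat", "sat", "on"}
ratio = unfamiliar_ratio("The cat sat on the mat.", learner_vocab)
```

Here one of the six tokens (“mat”) is unknown, so the text is well above the 5% line for this toy learner.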

There was this statement, which is so true…but at the same time, its structure made me ROFL:
“Elaboration is a text simplification technique”

“Vocabulary researchers have proposed that English learners must minimally know the most frequent 3000 to 5000 word families in order to effectively read authentic English text.”[24]
(The [24] is Cobb T, 2007, Computing the vocabulary demands of L2 reading)

“Learners can acquire new L2 words and strengthen existing vocabulary knowledge, by reading L1 texts with some words replaced by L2 equivalents” (L2 here is the language you are learning and L1 is the one you are good at).
-This particular statement summarizes what the flax reader tool does. I don’t know how the learnability is evaluated…but I guess this will be a good exercise for learners (like me!)
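
The basic idea can be sketched in a few lines of Python. This is only my toy illustration of the quoted sentence, not how the FlaxReader tool is implemented: the English-to-German glossary is made up, and real word replacement would have to deal with inflection, word sense and grammar, which this ignores.

```python
import re


def mix_in_l2(text: str, glossary: dict) -> str:
    """Replace L1 words that have a glossary entry with their L2 equivalents."""

    def swap(match):
        word = match.group(0)
        # Look words up case-insensitively; leave unknown words untouched.
        return glossary.get(word.lower(), word)

    # Only runs of letters are candidates, so punctuation and spacing survive.
    return re.sub(r"[^\W\d_]+", swap, text)


# Hypothetical English (L1) to German (L2) glossary.
glossary = {"house": "Haus", "dog": "Hund"}
mixed = mix_in_l2("The dog sleeps in the house.", glossary)
```

The interesting part, which the rest of the paper is about, is deciding what goes into `glossary` for a given learner and text.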

The rest of the paper discusses which L1 words should be replaced by L2 equivalents in a chosen text, how to replace them, what aids can be used, etc.

And then, there are others in the references that attracted me. Maybe I will read some of them sometime soon….

Published in: on May 25, 2011 at 2:10 pm  Leave a Comment  

Readability and Indian Language texts

I got interested in computing the readability of texts recently. I have been suffering from a reader’s block for the past three weeks or so. So, I was not able to read much on this subject, though I skimmed through quite a bit of material. Well, I did find some work on creating language models to estimate text difficulty. But most of what I found was based on conventional readability measures, which must have had their golden jubilee in the past decade.

Then, I came across this decade old paper:
“Living off the land: The web as a source of practice texts for learners of less prevalent languages”

This was on finding out right texts from the web, to provide learning material for second language learners of Nordic Languages. I was fascinated by the official title of the project described in the paper: “Corpus based language technology for computer assisted learning of Nordic languages”.

So, what is this about?
To summarize briefly, here is what they do:
1. The user supplies example text, in the form of a URL.
2. The text at this URL is evaluated for readability and other language factors.
3. Along with these statistics, the user is also presented with ten possible query terms, from that document.
4. The user can choose what query terms can be sent to the search engine (Evreka)
5. Again, each of those result documents is evaluated for the statistics of (2), and the user is presented with the results, along with a brief summary of each result document.
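
Steps 2 and 3 above can be sketched roughly as follows in Python. I am not claiming this is what the paper does: I picked LIX (average sentence length plus the percentage of words longer than six letters) simply as one conventional readability measure, and picking query terms by raw frequency is my own guess. Both function names are hypothetical.

```python
import re
from collections import Counter


def lix(text: str) -> float:
    """LIX readability: words per sentence + percentage of long words (> 6 letters)."""
    words = re.findall(r"[^\W\d_]+", text)
    # Crude sentence splitter on ., !, ? and : (good enough for a sketch).
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


def candidate_query_terms(text: str, stopwords: set, limit: int = 10) -> list:
    """Suggest up to `limit` query terms: the most frequent non-stopword tokens."""
    words = [word.lower() for word in re.findall(r"[^\W\d_]+", text)]
    counts = Counter(word for word in words if word not in stopwords)
    return [term for term, _ in counts.most_common(limit)]


sample = "Crawlers fetch pages. Crawlers respect servers."
score = lix(sample)
terms = candidate_query_terms(sample, stopwords=set(), limit=1)
```

For the sample text there are 6 words in 2 sentences and 4 long words, so the score is 3 + 400/6. Whether such letter-counting measures make any sense for Indian scripts is exactly the kind of question I come to below.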

Well, I don’t understand the motive behind asking the user to give a URL and then doing all this. However, that’s not why I am writing this post.

I was left wondering (sitting on the banks of the Neckar, with pigeons playing in front of me… in big numbers) about the readability of Indian language texts. Has anyone attempted experimenting with that? Will these traditional readability measures that are used for English work well for Indian languages? Is it necessary to think about Computer Assisted Language Learning for, say, Telugu? Can Telugu be called an LPL (Less Prevalent Language)?

Well, “What is readability?” “How can someone measure general readability, isn’t there a personalized angle there?” – are my perpetual questions, though. They are not specific to Indian Languages.

However, at least if you let the imagination run high, there is a very interesting scope to use this practically, in the language teaching domain (IMHO). Perhaps I’d write better if I could get over my reader’s block 🙂

Published in: on April 19, 2011 at 7:00 am  Leave a Comment  

On Practical Web Crawling issues

Practical Web Crawling Issues by Carlos Castillo.

I don’t remember how I came to this document. I think I was looking for some website-copier-like crawlers, which I could use to run crawls to get a few hundred or a few thousand webpages from some seed websites, and Google showed me this.

As I browsed through this, it appeared to me as well written and comprehensive despite the fact that the issues are kind of obvious when you think of a crawler. I was never involved in designing a crawler, though I used a couple of crawlers over a period of time. To me, it provided a good overview. Once I began to write about this, I tried to find out who the author is, and here he is 🙂

“Carlos Castillo is a research scientist at Yahoo! Research, Barcelona. He received his Ph.D from the University of Chile in 2004, and was a visiting scientist at Universitat Pompeu Fabra (2005) and Universita di Roma La Sapienza (2006). At age 33 (2010), he is an active researcher with over 30 publications in international conferences and journals in the areas of Web search and mining. He has served in the program committee of most major conferences in his area (WWW, WSDM, SIGIR, CIKM, etc.), and co-organized the Adversarial Information Retrieval Workshop and Web Spam Challenge in 2007 and 2008, and the ECML/PKDD Discovery Challenge 2010. His current research interest is the mining of content, links and usage data from the Web.”
-That bio says it all, about the credentials of the document 😛

Coming to the point, this is a brief ‘executive’ summary of the document.

Practical web crawling issues can be broadly classified into 6 categories:
1) Networking issues (variable network quality, server admins’ concerns, inconsistent firewall settings)
2) Massive DNS resolving (crashing of local DNS servers, temporary DNS failures, malformed DNS records, wrong DNS records, use of the WWW prefix)
3) HTTP implementations (Accept headers not honored, range errors, responses lacking headers, “Found” where you mean “error”, wrong dates in headers)
4) HTML coding (wrong markup, physical over logical content representation)
5) Web content characteristics (blogging, mailing lists and forums, duplicate detection)
6) Server application programming (embedded session IDs, repeated path components, slower/erroneous pages)
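
To give a flavour of category 6, here is a small Python sketch of how a crawler might defend itself against embedded session IDs and repeated path components. This is my own illustration, not code from the document: the list of session parameter names and the repeat limit are assumptions, and real sites will need more rules than this.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that usually carry session IDs.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_ids(url: str) -> str:
    """Remove session IDs so the same page is not crawled under many URLs."""
    parts = urlsplit(url)
    # Drop session parameters from the query string.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Drop the ";jsessionid=..." form that some servers embed in the path.
    path = re.sub(r";jsessionid=[^/?#]*", "", parts.path, flags=re.IGNORECASE)
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment))


def has_repeated_path_components(url: str, max_repeats: int = 2) -> bool:
    """Flag URLs like /a/b/a/b/a/b/, a typical sign of a crawler trap."""
    segments = [segment for segment in urlsplit(url).path.split("/") if segment]
    return any(count > max_repeats for count in Counter(segments).values())


clean = strip_session_ids("http://example.com/a/b;jsessionid=ABC123?x=1&PHPSESSID=xyz")
trap = has_repeated_path_components("http://example.com/a/b/a/b/a/b/")
```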

The best part about this document is that it is not just an enumeration; it also suggests solutions wherever possible. Most of us need not write a crawler at all, even when we need one. However, if we use a crawler, I think it helps to understand some of these issues. Taking an extreme stance, I would say that every search engine user ought to know more about crawling than they do now. I would not say that this is the only document on the subject, but it is certainly a good one to learn more from.

I think I will read more from this thesis soon, whenever I feel compelled to do so 😉

Published in: on November 12, 2010 at 8:00 am  Comments (2)  

Finding replicated web collections (1999)

I have gotten into reading papers that are more than a decade old. I do get a feeling of ‘oldness’ as I read them, but these papers provide good overviews of the problems in web information retrieval and of possible application scenarios for their solutions.

(By ‘reading’ I mean a careful reading to understand the overall intentions of the paper, not analysing and validating its math etc. I have only a general interest in the subject.)

“Finding replicated web collections” is a 25-page paper (:O) from Stanford-Google, by Junghoo Cho, Narayanan Shivakumar and Hector Garcia-Molina.

In brief: This paper, as the name suggests, deals with finding replicated documents on the web. The approaches proposed scale to tens of millions of pages of data.

The reasons for finding replicated documents are manifold. As the ‘Introduction’ section summarizes, some web search engine tasks can be performed more effectively once replicas have been identified.

1) A crawler’s job can be finished more quickly by making it skip replicated content. This also saves a lot of bandwidth.

2) A page that has many copies is perhaps more important than others. So, search results can be ranked by this factor.

3) For archiving, pages with more duplicates can be given priority in case there are storage constraints, since they are ‘important’ and hence of more archival value.

4) I did not quite get this point, but here it is in their own words: “cache holds web pages that are frequently accessed by some organization. Again knowledge about collection replication can be used to save space. Furthermore, caching collections as a whole may help improve hit ratios (e.g., if a user is accessing a few pages in the LDP, chances are he will also access others in the collection).”

The 3 main issues involved in this are:
1) Defining a notion of ‘relativeness’

2) Finding methods to identify such related documents

3) Exploiting this replication

The major difficulties in finding replicated documents are:
1) Sometimes, the replicas are updated at different points in time, so at crawl time they might not look similar.

2) Sometimes, mirror collections are only partial copies.

3) The data formats of the mirrors and the original may be different.

4) The snapshots compared may be of partial crawls.

What is similarity anyway? How should we define it?
A major contribution of this paper is that it defines a similarity measure and then uses the resulting similarity information to improve crawl quality and efficiency.
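
To give a feel for what a page similarity measure can look like, here is a small Python sketch that compares two texts by the overlap of their word triples (Jaccard similarity over shingles). This is a generic, commonly used measure that I am using purely for illustration. It is not the measure defined in the paper, and the function names and the shingle size `k=3` are my own assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(text_a, text_b, k=3):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(text_a, k), shingles(text_b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    original = "the quick brown fox jumps"
    mirror = "the quick brown fox sleeps"
    print(jaccard(original, mirror))
```

Two pages would then be called ‘similar’ when the score crosses some chosen threshold, which is one way to handle mirrors that are slightly out of date.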

A major part of the rest of the paper goes into defining, calculating and evaluating various similarity measures. Though this was what I wanted to read when I began this paper, other things mentioned here began to interest me and I skipped this part! After this section, they describe how they applied this to crawling and searching, and it was the application to crawling that I got interested in over the course of reading this paper.

Well, Google’s crawler back then had only 25 million pages, it seems (!! seems like a prehistoric age !!). They analyzed this crawled data for the amount of duplicate content, and found that almost 48% of the crawl had duplicate content (‘duplication’ here does not mean 100% duplication). They then precoded information on widely replicated collections and machines with multiple names into Google’s crawler and repeated the crawl. The duplication now was only 13%, despite the fact that the crawl size now was 35 million pages!
[Interesting…. and I can see a sci-fi fantasy where duplicates, which are creating confusion and havoc in the bit world, are being checked by the geeks. A war ensues and finally, for now, the geeks have won, and the duplicate bits vow revenge.. part-2 continues! :P]
-Likewise, each crawl builds up a collection of information on replicated content, which then need not be crawled in future.
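
The idea of feeding known replicas back into the crawler can be sketched roughly as below. This is only my guess at the general shape, not how Google’s crawler did it: the `KNOWN_REPLICAS` table, the host names in it and the function names are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical table learnt from an earlier crawl:
# mirror or alias host -> the host we treat as canonical.
KNOWN_REPLICAS = {
    "mirror.example.org": "docs.example.com",
    "www2.example.com": "docs.example.com",
}

def canonical_key(url):
    """Map a URL to (canonical host, path) using the replica table."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = KNOWN_REPLICAS.get(host, host)
    return (host, parts.path or "/")

def filter_frontier(urls):
    """Keep only the first URL seen for each (canonical host, path) pair."""
    seen = set()
    kept = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept

if __name__ == "__main__":
    frontier = [
        "http://docs.example.com/guide.html",
        "http://mirror.example.org/guide.html",  # known mirror, skipped
        "http://www2.example.com/faq.html",      # alias, but a new path
        "http://other.example.net/guide.html",
    ]
    for url in filter_frontier(frontier):
        print(url)
```

Taking the figures above at face value, 48% of 25 million is about 12 million pages with duplicate content, while 13% of 35 million is roughly 4.6 million, so the saving is substantial even though the crawl grew.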

Apart from the above mentioned paragraph, I found the whole of the introductory sections extremely readable.

Reading ‘History’ is good, anyways! 🙂

Published in: on November 10, 2010 at 8:31 am  Comments (1)