Automatic question generation for measuring comprehension – some thoughts

In our weekly/fortnightly reading group here, we spent most of the past two months discussing “automatic question generation”. We discussed primarily NLP papers but included a couple of educational research papers as well. NLP papers usually focus on the engineering aspects of the system and are usually computation-heavy. Educational research papers primarily focus on running user studies with some approach to question creation and then correlating user performance on these questions with text comprehension, so they are usually light on the computational part. That is roughly the difference between the two kinds of articles we chose to read and discuss.

As we progressed with these discussions, and as more people (with diverse backgrounds) joined the group for a couple of sessions, I realized that I was learning to see things from different perspectives. I am writing this post to summarize what I thought about these articles, what I learnt through these discussions, and what I think about the whole idea of automatic question generation at this point in time. I will give pointers to the relevant articles. Most of them are freely accessible. Leave a comment if you want something mentioned here and can’t get access. The questions here were of two kinds – factual questions from the text (“who did what to whom” kinds of things) and fill-in-the-blank questions, where one of the key words goes missing.

Let me start with a summary of the stuff we discussed in the past few weeks:
a) We first started with the generation of factual questions from any text, i.e., the purpose of the system here is to generate questions like “When was Gandhi born?” or “Where was Gandhi born?” from a biography page on Mahatma Gandhi. Here, we primarily discussed the approach followed by Michael Heilman. More details about the related articles and the released code can be seen here. The primary focus of the approach has been to generate grammatically correct questions.

b) We then moved to more recent work from Microsoft Research, published in 2015, where the task of “generating the questions” is transformed by using crowdsourcing to create question templates. So, the primary problem here is to replicate the human judgements of relevant question templates for a given text, by using machine learning to infer the category of the content in a particular section of text. (I am trying to summarize in one sentence; if you want to know more, please read the article.) The resource created and the features used to infer category/section will eventually be released here.

c) At this point, after a slight digression into the cognitive and psycholinguistic aspects of gap filling probabilities, we got into an article which manually designed a fill-in-the-blank kind of test which allegedly measures reading comprehension. They concluded that such tests are quick to create, take less time to administer, and still do what you want out of such a test (i.e., show how much the readers understood).

d) Naturally, the next question for us was: “How can we generate the best gaps automatically?”. Among a couple of articles we explored, we again picked an older article from Microsoft Research for discussion. This is about deciding which gaps in a sentence are the best to test the “key concepts” in texts. Again, the approach relies on crowdsourcing to get these judgements from human raters first, and then develops a machine learning approach to replicate them. The data thus created, and some details about the implementation of the machine learning approach, can be found here.
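
To make the gap-selection setup in (d) a bit more concrete, here is a minimal sketch of the two steps before any machine learning happens: listing candidate gaps for a sentence, and turning several raters' judgements into one label per gap. This is not the method from the article. The stopword list, the function names and the ratings are all made up for illustration, and the majority-vote rule (with ties counted as "bad") is an assumption.

```python
# Illustrative sketch only: the names, the stopword list and the ratings are assumptions.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "was", "is", "and", "to", "at"}
PUNCTUATION = ".,;:!?"


def gap_candidates(sentence):
    """Return (gapped_sentence, answer) pairs, one per non-stopword word."""
    words = sentence.split()
    candidates = []
    for i, word in enumerate(words):
        core = word.strip(PUNCTUATION)
        if not core or core.lower() in STOPWORDS:
            continue
        # Replace only the word itself, keeping any attached punctuation.
        gapped_word = word.replace(core, "_____", 1)
        gapped = words[:i] + [gapped_word] + words[i + 1:]
        candidates.append((" ".join(gapped), core))
    return candidates


def majority_label(votes):
    """Collapse rater votes ("good"/"bad") into one label; ties count as "bad"."""
    good = sum(1 for vote in votes if vote == "good")
    return "good" if good > len(votes) / 2 else "bad"


if __name__ == "__main__":
    sentence = "Gandhi was born in Porbandar in 1869."
    # Made-up ratings from three hypothetical raters per candidate gap.
    ratings = {
        "Gandhi": ["good", "bad", "bad"],
        "born": ["bad", "bad", "bad"],
        "Porbandar": ["good", "good", "good"],
        "1869": ["good", "good", "bad"],
    }
    for gapped, answer in gap_candidates(sentence):
        print(gapped, "|", answer, "|", majority_label(ratings[answer]))
```

A classifier would then be trained on such labels. Note that a unanimous "good" and a two-to-one "good" end up with the same label, so the disagreement among raters is no longer visible after aggregation.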

Now, my thoughts on the topic in general:
a) To be able to generate questions that really test “comprehension” from any possible text, we should make sure that we do not mistakenly end up testing the reader’s ability to remember the text. I did not get a clear picture of how fill-in-the-blank questions avoid this pitfall. Generating who/what kinds of questions instead of fill-in-the-blanks perhaps addresses this to some extent. Yet, if these questions only require you to know that one sentence, how are they really measuring comprehension of the whole piece of text, when comprehension can include drawing inferences from multiple parts of the text?

b) One dissatisfying aspect of all these readings has been this: people who do user studies don’t talk about the scalability of their method beyond a laboratory setup, and people who engineer technological solutions don’t discuss whether these approaches really work with real users in testing their comprehension. I was surprised that several NLP papers I read on the topic in the past weeks (apart from those mentioned above) describe question generation approaches and evaluate the correctness or relevance of the generated “questions” (be it gap-filling or questions with a question mark) on some dataset, but I haven’t seen anyone do an evaluation with the possible consumers of such an application. The only exception in my readings has been Michael Heilman’s PhD thesis, which evaluated the question generation approach as a possible assisting tool for teachers to prepare questions.
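
As a rough idea of what the missing user evaluation could compute, here is a minimal sketch that correlates readers' scores on automatically generated questions with an independent measure of their comprehension, in the spirit of the educational research papers. The function name, the variable names and the numbers are all invented for illustration; a real study would need far more readers and a carefully chosen reference measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Invented toy scores for five hypothetical readers (not real data).
    generated_question_scores = [3, 5, 2, 4, 4]    # score on generated questions
    reference_comprehension = [10, 17, 8, 15, 12]  # score on an independent test
    r = pearson(generated_question_scores, reference_comprehension)
    print(f"correlation: {r:.2f}")
```

A high correlation would suggest that the generated questions track whatever the reference test measures; it would not by itself show that either one measures comprehension rather than memory.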

On one hand, I think this is a very interesting topic to work on, with all the possible commercial and not-so-commercial real-life impact it can have in these days of massive online education and non-conventional ways of learning. Clearly, there is a lot of work going on on various ways to generate questions automatically, which is a very useful capability to have in such massive learning scenarios. We know which approaches “kind of” work and which don’t, in generating the questions as such. However, I wonder what exactly we are trying to achieve by not doing the final step of user evaluation with these computational approaches. If we do not know whether all the fancy approaches are really doing what they are supposed to do (testing the comprehension of the readers), what is the point? To use a tennis term, the missing “follow through” is why much of this work remains unusable for its actual consumers – teachers, learners and other such people in a learning environment. I am not a dreamer, so I know the difficulties in working across groups and I can guess the reasons for the missing “follow through” (especially as someone currently in academia!).

The only way I see the “follow through” being possible is in an ed-tech company, since they have to do the user evaluation to get going 🙂 Perhaps I should wait and see if new ed-tech startups working on learning analytics and measuring learning outcomes can come up with effective solutions. On that optimistic note, I should perhaps end my post for now.

Acknowledgements: I have benefited a lot from the comments on these papers by Magdalena Wolska, Maria Chinkina, Martí Quixal, Xiaobin Chen and Simón Ruiz, who attended some or all of these meetings in the past few months. Long live discussions! 😉

Published on October 22, 2015 at 4:06 pm

5 Comments

  1. Could not get a bit of it. It reads like legal jargon. Do you really think this kind of research is of any use to anyone? Or maybe it is not for common folks.

    • Dear Anon,
      Thanks for the comment. In my opinion, it is not about you or me thinking that something is important or not important. University groups and technology companies have been working on exactly these problems for some time now, and some have successfully made both money and fame with these kinds of ideas. So, I personally think it is of use at least for both people who are curious about language and people who want to make money – I don’t know about hermits and the like, and this need not be of interest to them. Also, something does not automatically become useless just because we don’t understand it. As of today, I do not understand the process of sending a space expedition. But I do not say it is totally useless and not for common folks 🙂 Hope you got the point.

  2. Pretty interesting. NLP is finally seeing some light. What kind of reasoning approach(es) are proposed in the work you discussed? How fast are they? And how do you propose to handle any possible fuzziness?

    Somebody asked me a similar question when I was a research student, way back in the mid-90s: “What’s the use of all this research pertaining to natural language understanding? Do you think it would be useful in the next 100 years?”

    I said, “Wait for 25 years.”

    I don’t know where he currently is, but if he owns an iPhone, I am sure he would appreciate Siri 🙂

    • Bhardwaj garu: The learning-based approaches we saw in these articles primarily relied on human judgements as training data and trained classification systems – they used the classifier confidence probabilities to gain some kind of knowledge about the fuzziness of the prediction. But since the human judgements are eventually aggregated into good or bad before training a classifier, the notion of fuzziness does not exist in that part. That is why we spoke about the missing connection with the actual application scenario. Now that people are seeing the commercial value of these things, especially after the MOOC boom (I am talking about startups like Aspiring Minds in India, LearnLaunch accelerator companies in Boston, and several other such new companies), I hope something will move in a direction that actually demonstrates success from the perspective of the end user.

  3. Excellent post, thanks for sharing.

    I recently met a guy who is working on detecting cognition models from social media posts. He said his company sees huge commercial prospects from this work. Cognition is of course the first (and crucial) step of the AIDA cycle.
