In our weekly/fortnightly reading group here, we spent most of the past two months discussing “automatic question generation”. We discussed primarily NLP papers, but included a couple of educational research papers as well. NLP papers usually focus on the engineering aspects of the system and tend to be heavy on the computational side. Educational research papers primarily focus on performing user studies with some approach to question creation and then correlating user performance on these questions with text comprehension, so they are usually light on the computational side. That is roughly the difference between the two kinds of articles we chose to read and discuss.
Now, as we progressed with these discussions, and as more people (with diverse backgrounds) joined the group for a couple of sessions, I realized that I am learning to see things from different perspectives. I am now writing this post to summarize what I thought about these articles, what I learnt through these discussions, and what I think about the whole idea of automatic question generation at this point in time. I will give pointers to the relevant articles; most of them are freely accessible. Leave a comment if you want something mentioned here and can’t get access to it. The questions here were of two kinds – factual questions from the text (who did what to whom kind of things) and fill-in-the-blank questions where one of the key words goes missing.
Let me start with a summary of the stuff we discussed in the past few weeks:
a) We started with the generation of factual questions from any text, i.e., the purpose of the system here is to generate questions like “When was Gandhi born? Where was Gandhi born?” etc. from a biography page on Mahatma Gandhi. Here, we primarily discussed the approach followed by Michael Heilman; more details about the related articles and the released code can be seen here. The primary focus of this approach has been to generate grammatically correct questions. (A toy sketch of this kind of rule-based generation follows after this list.)
b) We then moved to more recent work from Microsoft Research, published in 2015, where the task of “generating the questions” is reframed: crowdsourcing is used to create question templates, and the primary problem becomes replicating human judgements of which templates are relevant for a given text, by inferring the category of the content in a particular section of text through machine learning. (I am trying to summarize in one sentence; anyone wanting to know more should read the article. A toy sketch of this classify-then-retrieve-templates idea also follows after this list.) The resource created and the features used to infer category/section will eventually be released here.
c) At this point, after a slight digression into the cognitive and psycholinguistic aspects of gap-filling probabilities, we got into an article whose authors manually designed a fill-in-the-blank test which purportedly measures reading comprehension. They concluded that such tests are quick to create, take less time to administer, and still do what you want out of such a test (i.e., show how much the readers understood).
d) Naturally, the next question for us was: “How can we generate the best gaps automatically?” Among the couple of articles we explored, we again picked an older article from Microsoft Research for discussion. This one is about deciding which gaps in a sentence are the best for testing the “key concepts” in a text. Again, the approach relies on crowdsourcing to get these judgements from human raters first, and then develops a machine learning approach to replicate them (a toy sketch of this gap-scoring idea also follows below). The data thus created, and some details about the machine learning implementation, can be found here.
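To make the computational approaches above a bit more concrete, here are a few toy sketches in Python. First, for (a): this is nowhere near Heilman’s actual system, which relies on full syntactic parsing and an overgenerate-and-rank strategy; it only illustrates the flavour of turning a declarative sentence into factual questions with hand-written rules. The function name and the single pattern are mine, purely for illustration.

```python
import re

def naive_factual_questions(sentence: str):
    """Toy rule: turn an 'X was born in <year> in <place>.' statement
    into when/where questions. Purely illustrative, not Heilman's method."""
    questions = []
    m = re.match(
        r"(?P<subj>[A-Z][\w\s]*?) was born in (?P<year>\d{4}) in (?P<place>[A-Z][\w\s]*)\.",
        sentence,
    )
    if m:
        subj = m.group("subj")
        questions.append(f"When was {subj} born?")
        questions.append(f"Where was {subj} born?")
    return questions

print(naive_factual_questions("Gandhi was born in 1869 in Porbandar."))
# -> ['When was Gandhi born?', 'Where was Gandhi born?']
```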
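For (b), a minimal sketch of the overall pipeline shape, with a made-up label set and made-up templates: classify a text segment into a category/section, then retrieve the crowdsourced question templates associated with that label. The real paper’s ontology, features and data are far richer; this only mirrors the classify-then-retrieve structure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: text segments labelled with a category:section tag.
train_texts = [
    "He was born in 1869 in the coastal town of Porbandar.",
    "The battle began at dawn and lasted three days.",
]
train_labels = ["Person:EarlyLife", "Event:Course"]

# Hypothetical crowd-sourced question templates, keyed by the same tags.
templates = {
    "Person:EarlyLife": ["Where did this person grow up?",
                         "What shaped this person's early life?"],
    "Event:Course": ["What triggered this event?",
                     "How did this event unfold?"],
}

# Predict the category of an unseen segment, then look up its templates.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

segment = "She spent her childhood in a small village near the sea."
predicted = clf.predict([segment])[0]
print(predicted, "->", templates[predicted])
```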
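And for (d), a minimal sketch of gap selection as supervised learning, again with invented features and invented “crowd” labels: score every candidate gap in a sentence with a classifier trained on human ratings of gap quality, and blank out the highest-scoring token. The actual paper uses a much richer feature set and real crowdsourced judgements.

```python
from sklearn.linear_model import LogisticRegression

def gap_features(tokens, idx):
    """Placeholder features for a candidate gap at position idx."""
    word = tokens[idx]
    return [
        len(word),                        # longer words tend to be content words
        idx / max(len(tokens) - 1, 1),    # relative position in the sentence
        float(word[0].isupper()),         # crude proper-noun signal
    ]

# Invented ratings: 1 = raters judged this a good gap, 0 = a poor one.
sent = "Gandhi led the Salt March in 1930".split()
train_X = [gap_features(sent, i) for i in range(len(sent))]
train_y = [1, 0, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(train_X, train_y)

# Score every candidate gap in a new sentence and blank out the best one.
new_sent = "Gandhi was assassinated in Delhi in 1948".split()
scores = model.predict_proba(
    [gap_features(new_sent, i) for i in range(len(new_sent))]
)[:, 1]
best = max(range(len(new_sent)), key=lambda i: scores[i])
print(" ".join("____" if i == best else w for i, w in enumerate(new_sent)))
```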
Now, my thoughts on the topic in general:
a) To be able to generate questions that really test “comprehension” from any possible text, we should make sure that we are not inadvertently ending up testing the reader’s ability to remember the text. I did not get a clear picture of how fill-in-the-blank questions avoid this pitfall. Generating who/what kind of questions instead of fill-in-the-blanks perhaps addresses this to some extent. Yet, if these questions only require you to know that one sentence, how are they really measuring comprehension of the whole piece of text, when comprehension can include drawing inferences from multiple parts of the text?
b) One dissatisfying aspect of all these readings has been this: people who do user studies don’t talk about the scalability of their method beyond a laboratory setup, and people who engineer technological solutions don’t discuss whether these approaches really work with real users in testing their comprehension. I was surprised that several NLP papers I read on the topic in the past weeks (apart from those mentioned above) describe question generation approaches and evaluate the correctness or relevance of the generated “questions” (be it gap-filling or questions with a question mark) on some dataset, but I haven’t seen anyone do an evaluation with the possible consumers of such an application. The only exception in my readings has been Michael Heilman’s PhD thesis, where the question generation approach was evaluated as a possible assisting tool for teachers preparing questions.
On one hand, I think this is a very interesting topic to work on, with all the possible commercial and not-so-commercial real-life impact it can have in these days of massive online education and non-conventional ways of learning. Clearly, there is a lot of ongoing work on various ways to generate questions automatically, which would be a very useful capability in such massive learning scenarios. We know what approaches “kind of” work and what don’t for generating the questions as such. However, I wonder what exactly we are trying to achieve by skipping the final step of user evaluation with these computational approaches. If we do not know whether all these fancy approaches are really doing what they are supposed to do (testing the comprehension of readers), what is the point? To use a tennis term, this missing “follow through” is why much of this work remains unusable for the actual consumers of such work – teachers, learners and other such people in a learning environment. I am not a dreamer, so I know the difficulties of working across groups, and I can guess the reasons for the missing “follow through” (especially as someone currently in academia!).
The only way I see the “follow through” being possible is in an ed-tech company, since they have to do the user evaluation to get going.
Perhaps I should wait and see if new ed-tech startups working on learning analytics and measuring learning outcomes can come up with effective solutions. On that optimistic note, I should perhaps end my post for now.
Acknowledgements: I have benefited a lot from the comments on these papers by Magdalena Wolska, Maria Chinkina, Martí Quixal, Xiaobin Chen and Simón Ruiz, who attended some or all of these meetings in the past few months. Long live discussions!😉