On the Effectiveness of Query Variation in Technology-Assisted Review Systems

Giorgio Maria Di Nunzio

Abstract:

High-recall Information Retrieval systems tackle challenging tasks that require finding (nearly) all the relevant documents in a collection. Electronic discovery (eDiscovery) and systematic reviews are probably the most important applications of such systems, since in both the search for relevant information must be carried out with limited resources, such as time and money.

In this field, Technology-Assisted Review (TAR) systems use a human-in-the-loop approach: starting from the user's initial query, ranking algorithms are continuously retrained on the user's relevance feedback until a substantial number of the relevant documents have been identified. This approach, named Continuous Active Learning (CAL), is more effective and more efficient than traditional eDiscovery and systematic review practices, which typically combine keyword search with manual review of the search results.
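As an illustration only, the feedback loop described above can be sketched in a few lines of Python. This is a toy sketch and not the system studied in this work: the function name cal_review, the term-weight ranker, the batch size, and the "stop after one batch with no relevant documents" rule are all assumptions made for the example, and the oracle function stands in for the human reviewer.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def cal_review(docs, query, oracle, batch_size=2, patience=1):
    """Toy Continuous Active Learning loop (hypothetical helper).

    docs:    dict mapping doc id -> text
    query:   initial query string from the user
    oracle:  callable doc id -> bool, simulating the reviewer's judgment
    Stops when every document is reviewed or when `patience` consecutive
    batches contain no relevant document (an assumed stopping rule).
    """
    weights = Counter(tokenize(query))  # term weights seeded by the query
    reviewed, relevant = [], set()
    empty_batches = 0
    while len(reviewed) < len(docs) and empty_batches < patience:
        seen = set(reviewed)
        candidates = [d for d in docs if d not in seen]
        # Rank unreviewed documents by the summed weight of their distinct
        # terms; ties are broken by doc id so the order is deterministic.
        candidates.sort(
            key=lambda d: (-sum(weights[t] for t in set(tokenize(docs[d]))), d)
        )
        found = 0
        for d in candidates[:batch_size]:
            reviewed.append(d)
            terms = set(tokenize(docs[d]))
            if oracle(d):
                relevant.add(d)
                found += 1
                weights.update(terms)    # positive feedback raises weights
            else:
                weights.subtract(terms)  # negative feedback lowers weights
        empty_batches = 0 if found else empty_batches + 1
    return reviewed, relevant


# Made-up collection and relevance judgments, for illustration only.
docs = {
    "d1": "systematic review of clinical trials",
    "d2": "clinical trials for diabetes treatment",
    "d3": "football match results",
    "d4": "diabetes treatment outcomes in trials",
    "d5": "recipe for tomato soup",
    "d6": "stock market news",
    "d7": "weather forecast today",
    "d8": "travel guide to rome",
}
truth = {"d1", "d2", "d4"}
reviewed, relevant = cal_review(docs, "clinical trials", lambda d: d in truth)
```

On this made-up collection the loop finds all three relevant documents after reviewing six of the eight. The stopping rule used here is deliberately naive; deciding when to stop in a principled way is the harder question.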

In this work, we study the effectiveness of query variation approaches that run in parallel with the user's explicit relevance feedback during a search session. In particular, we want to predict when to stop searching for relevant documents by weighing the cost of the information still missing against the effort of continuing the search. We evaluate the approaches on standard Information Retrieval test collections provided by the Conference and Labs of the Evaluation Forum (CLEF) and the Text REtrieval Conference (TREC).