Conditional Information Retrieval Research

Undergraduate Research at UC Berkeley

Python | Natural Language Processing | Machine Learning

Link: GitHub Repository

Introduction

As an undergraduate researcher in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, I collaborated with Alexander J. Spangher under the guidance of Professor Costas J. Spanos. Our research focused on Conditional Information Retrieval (CIR), aiming to enhance the relevance and accuracy of information retrieval systems by incorporating contextual conditions into query processing.

Solution

We studied how people retrieve information and reviewed prior work to build a comprehensive understanding of CIR. A significant part of the research was producing a large silver-standard dataset to serve as ground truth for training and evaluating retrieval models. We also curated a query-context relationship dataset from 18,000 articles and used it to train a Haystack Dense Passage Retriever, leveraging data point embeddings to improve retrieval performance.

Silver Standard Dataset Creation

Developed a substantial silver standard dataset by studying human information retrieval behaviors and analyzing existing literature. This dataset serves as a ground truth benchmark for evaluating CIR models.
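As an illustration of how a silver-standard dataset can act as a benchmark, the sketch below scores ranked retrieval results against silver labels using recall@k and mean reciprocal rank. The metric choice, the function names, and the dictionary layout are assumptions made for this example; the project's actual evaluation code may differ.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of silver-labelled relevant items found in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]],
             silver: Dict[str, Set[str]],
             k: int = 5) -> Dict[str, float]:
    """Average both metrics over every query that has silver labels.

    `runs` maps a query id to its ranked list of retrieved document ids;
    `silver` maps a query id to the set of ids labelled relevant.
    Both layouts are hypothetical.
    """
    queries = [q for q in runs if q in silver]
    if not queries:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    recall = sum(recall_at_k(runs[q], silver[q], k) for q in queries) / len(queries)
    mrr = sum(reciprocal_rank(runs[q], silver[q]) for q in queries) / len(queries)
    return {"recall_at_k": recall, "mrr": mrr}
```

Calling `evaluate(runs, silver, k=5)` returns a dictionary with the two averaged scores, which makes it easy to compare a context-aware retriever against a baseline on the same silver labels.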

Query-Context Relationship Dataset

Curated a dataset encompassing query-context relationships from 18,000 articles, facilitating the training of retrieval models to understand and leverage contextual information effectively.
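To make the idea of a query-context dataset concrete, here is a minimal sketch that converts query-context pairs into training records. The record layout follows the JSON format published with the original Dense Passage Retrieval work as I understand it (`question`, `answers`, `positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`). The input keys (`query`, `title`, `context`) and the random negative sampling are assumptions for illustration, not the project's documented pipeline.

```python
import random
from typing import Dict, List


def to_dpr_records(pairs: List[Dict[str, str]],
                   num_negatives: int = 1,
                   seed: int = 0) -> List[dict]:
    """Turn (query, title, context) pairs into DPR-style training records.

    Negatives are sampled at random from contexts that belong to other pairs.
    This is an illustrative choice, not the project's confirmed method.
    """
    rng = random.Random(seed)
    records = []
    for i, pair in enumerate(pairs):
        # Candidate negatives: every other pair whose context text differs
        # from this pair's positive context.
        others = [p for j, p in enumerate(pairs)
                  if j != i and p["context"] != pair["context"]]
        negatives = rng.sample(others, min(num_negatives, len(others)))
        records.append({
            "question": pair["query"],
            "answers": [],  # left empty: retrieval pairs carry no answer spans here
            "positive_ctxs": [{"title": pair["title"], "text": pair["context"]}],
            "negative_ctxs": [{"title": n["title"], "text": n["context"]}
                              for n in negatives],
            "hard_negative_ctxs": [],
        })
    return records
```

The resulting list can be written out with `json.dump` to produce a training file. In practice, harder negatives (for example, passages that share keywords with the query but do not answer it) usually train a stronger retriever than random ones.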

Training a Haystack Dense Passage Retriever

Used the curated dataset to train a Haystack Dense Passage Retriever, improving the model's ability to retrieve relevant information by encoding the context of each query in data point embeddings.
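For intuition about what this training optimizes, the sketch below computes the in-batch negative loss described in the Dense Passage Retrieval literature: each query embedding is scored against every passage embedding in the batch by dot product, and the loss is the negative log-likelihood of its own positive passage. It is written in plain Python with hand-made vectors and is not the Haystack implementation or the project's configuration; the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_negative_loss(queries: List[Vector], passages: List[Vector]) -> float:
    """Mean negative log-likelihood over a batch.

    passages[i] is the positive for queries[i]; every other passage in the
    batch acts as a negative for that query.
    """
    if not queries or len(queries) != len(passages):
        raise ValueError("queries and passages must be non-empty and equal in length")
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, p) for p in passages]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(queries)
```

The loss falls when each query's embedding sits closer to its own passage than to the other passages in the batch, which is the behavior the trained retriever relies on at query time.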

This research project provided valuable insights into the complexities of information retrieval and the importance of context in enhancing retrieval performance. Key learnings include:

Understanding Human Retrieval Behaviors: Gained insights into how users interact with information retrieval systems and the significance of context in shaping retrieval needs.

Dataset Curation: Developed skills in creating large-scale datasets that serve as benchmarks for training and evaluating machine learning models in information retrieval.

Model Training and Evaluation: Acquired experience in training advanced retrieval models like the Haystack Dense Passage Retriever and assessing their performance in context-sensitive scenarios.

Overall, this experience deepened my understanding of natural language processing and machine learning techniques applied to information retrieval, highlighting the critical role of context in developing more effective and user-centric retrieval systems.

Brandon Wu
