    Conditional Information Retrieval Research

    Undergraduate Research at UC Berkeley

    Python | Natural Language Processing | Machine Learning

    Link: GitHub Repository

    Overview

    Introduction

    As an undergraduate researcher in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, I collaborated with Alexander J. Spangher under the guidance of Professor Costas J. Spanos. Our research focused on Conditional Information Retrieval (CIR), aiming to enhance the relevance and accuracy of information retrieval systems by incorporating contextual conditions into query processing.

    Solution

    We investigated human information retrieval behaviors and analyzed prior work to establish a comprehensive understanding of CIR. A significant part of the research involved producing a large silver standard dataset to serve as ground truth for training and evaluating retrieval models. We also curated a query-context relationship dataset from 18,000 articles and used it to train a Haystack Dense Passage Retriever, which ranks passages by comparing learned embeddings of queries and passages.

    Technical Features

    Silver Standard Dataset Creation

    Developed a substantial silver standard dataset by studying human information retrieval behaviors and analyzing existing literature. This dataset serves as a ground truth benchmark for evaluating CIR models.
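
    As an illustration of how a silver standard dataset can act as a benchmark, the sketch below scores a retriever's ranked output against silver labels using recall@k. The choice of metric, the function names, and the toy query and document IDs are all assumptions made for this example; the project's actual evaluation protocol is not described here.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the silver-labelled relevant documents found in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: Dict[str, List[str]], silver: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over every query that has silver labels."""
    scores = [recall_at_k(runs.get(query_id, []), relevant, k)
              for query_id, relevant in silver.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: IDs and labels are invented for illustration only.
silver_labels = {"q1": {"d1", "d4"}, "q2": {"d2"}}
model_rankings = {"q1": ["d1", "d3", "d4"], "q2": ["d5", "d2", "d6"]}

print(mean_recall_at_k(model_rankings, silver_labels, k=2))
```

    Because silver labels are only approximate, scores computed this way are best read as relative comparisons between models rather than absolute measures of quality.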

    Query-Context Relationship Dataset

    Curated a dataset encompassing query-context relationships from 18,000 articles, facilitating the training of retrieval models to understand and leverage contextual information effectively.
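
    A minimal sketch of what one query-context training example might look like, using the JSON layout commonly used for Dense Passage Retrieval training (a question plus positive and hard-negative passages). How the condition is combined with the query (here, simple prefixing), the helper name make_dpr_record, and the sample text are assumptions for illustration, not the project's actual schema.

```python
import json
from typing import Dict, List


def make_dpr_record(query: str, condition: str, positive: Dict[str, str],
                    hard_negatives: List[Dict[str, str]]) -> dict:
    """Build one training example in a DPR-style JSON layout."""
    # Assumption: the contextual condition is simply prefixed to the query text.
    question = f"{condition}: {query}"
    return {
        "question": question,
        "answers": [],                    # unused when only retrieval is trained
        "positive_ctxs": [positive],      # passages that should be retrieved
        "negative_ctxs": [],
        "hard_negative_ctxs": hard_negatives,  # plausible but wrong passages
    }


# Hypothetical example content, invented for illustration only.
record = make_dpr_record(
    query="Who commented on the budget vote?",
    condition="local government reporting",
    positive={"title": "Council passes budget", "text": "The mayor said the vote was overdue."},
    hard_negatives=[{"title": "Federal budget talks", "text": "Senators debated spending caps."}],
)

# A training file in this layout is a JSON list of such records.
print(json.dumps([record], indent=2))
```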

    Training a Haystack Dense Passage Retriever

    Used the curated dataset to train a Haystack Dense Passage Retriever, improving the model's ability to retrieve relevant information by encoding queries, together with their context, and passages as dense embeddings.
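
    To make the retrieval step concrete: a Dense Passage Retriever encodes queries and passages with separate encoders and ranks passages by the inner product of the two embeddings. The sketch below shows only that ranking step, with tiny hand-written vectors standing in for real encoder output; the vectors, passage IDs, and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec: List[float],
                  passage_vecs: Dict[str, List[float]],
                  top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top_k passages sorted by descending inner-product score."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional embeddings; real encoders produce hundreds of dimensions.
query_embedding = [0.9, 0.1, 0.0]
passage_embeddings = {
    "p1": [0.8, 0.2, 0.1],
    "p2": [0.0, 0.1, 0.9],
    "p3": [0.5, 0.5, 0.0],
}

for passage_id, score in rank_passages(query_embedding, passage_embeddings):
    print(passage_id, round(score, 2))
```

    Training adjusts the encoders so that a query's embedding scores higher against its positive passages than against negatives, which is why the quality of the query-context dataset matters.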

    Takeaways

    This research project provided valuable insights into the complexities of information retrieval and the importance of context in enhancing retrieval performance. Key learnings include:

  • Understanding Human Retrieval Behaviors: Gained insights into how users interact with information retrieval systems and the significance of context in shaping retrieval needs.
  • Dataset Curation: Developed skills in creating large-scale datasets that serve as benchmarks for training and evaluating machine learning models in information retrieval.
  • Model Training and Evaluation: Acquired experience in training advanced retrieval models such as the Haystack Dense Passage Retriever and assessing their performance in context-sensitive scenarios.

    Overall, this experience deepened my understanding of natural language processing and machine learning techniques applied to information retrieval, highlighting the critical role of context in developing more effective and user-centric retrieval systems.