    Benford's Analysis of COVID-19 Data

    Wharton Data Science Academy Research

    R | Exploratory Data Analysis | Statistical Modeling

    Link: Research Paper

    Overview

    Introduction

    During my time at the Wharton Data Science Academy from May to July 2021, I conducted a research project applying Benford's Law to analyze COVID-19 data. Benford's Law predicts the frequency distribution of leading digits in many real-life datasets, and deviations from this distribution can indicate anomalies or potential data manipulation. This analysis aimed to assess the integrity of reported COVID-19 case numbers.

    Solution

    Using R and Exploratory Data Analysis (EDA) techniques, I examined over 28,000 COVID-19 case entries. I computed the logarithm of the p-values for the case and death data and fit a linear regression, plotting p-values in the 0.05 to 0.1 range to show how closely the data conform to Benford's Law.

    Technical Features

    Data Collection and Preparation

    Collected a dataset of over 28,000 COVID-19 case entries and prepared it for analysis by handling missing values and standardizing formats.
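
    The preparation step can be sketched as follows. The project itself was done in R; this Python version is only an illustration. The column names (date, cases, deaths), the sample rows, and the choice to drop incomplete rows are all assumptions, since the write-up does not say how missing values were handled.

```python
import csv
import io
from datetime import datetime

# Made-up rows for illustration only (not the project's data).
RAW = """date,cases,deaths
2021-01-05,"1,204",31
01/06/2021,987,
2021-01-07,,12
"""

def parse_date(text):
    """Standardize two assumed date formats into a date object."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def parse_count(text):
    """Strip thousands separators; return None for missing or non-numeric values."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None

def clean_rows(handle):
    """Keep only rows where every field parsed (assumed policy: drop incomplete rows)."""
    rows = []
    for row in csv.DictReader(handle):
        record = {
            "date": parse_date(row["date"]),
            "cases": parse_count(row["cases"]),
            "deaths": parse_count(row["deaths"]),
        }
        if None not in record.values():
            rows.append(record)
    return rows

cleaned = clean_rows(io.StringIO(RAW))
print(len(cleaned), "complete row(s) kept")
```

    Dropping incomplete rows is the simplest policy; imputing or flagging them would be equally plausible, depending on how much data is missing.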

    Application of Benford's Law

    Applied Benford's Law to the dataset to analyze the frequency distribution of leading digits in reported COVID-19 cases and deaths, identifying deviations that could indicate anomalies or inconsistencies.
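
    To make the leading-digit comparison concrete, here is a minimal sketch in Python (the project used R). Benford's Law gives the expected share of leading digit d as log10(1 + 1/d). The helper names, the sample counts, and the use of a Pearson chi-square statistic are assumptions for illustration; the write-up does not name the specific conformity test used.

```python
import math
from collections import Counter

def benford_expected():
    """Expected proportion of each leading digit 1-9: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(n):
    """First significant digit of a positive integer count."""
    return int(str(abs(int(n)))[0])

def chi_square_vs_benford(counts):
    """Pearson chi-square statistic of observed leading digits against Benford."""
    # Zero counts have no leading digit, so they are skipped.
    digits = [leading_digit(c) for c in counts if int(c) > 0]
    observed = Counter(digits)
    total = len(digits)
    return sum(
        (observed.get(d, 0) - total * p) ** 2 / (total * p)
        for d, p in benford_expected().items()
    )

# Made-up counts for illustration only: powers of two are known to track Benford closely.
sample = [2 ** k for k in range(1, 60)]
stat = chi_square_vs_benford(sample)

# With 8 degrees of freedom, a statistic above about 15.507 rejects
# conformity at the 5% significance level.
print("conforms at 5% level:", stat < 15.507)
```

    A large statistic only flags a deviation worth investigating; it does not by itself show that the data were manipulated.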

    Statistical Modeling

    Calculated the logarithm of the p-values for the case and death data and fit a linear regression, plotting p-values in the 0.05 to 0.1 range to visualize how closely the data conform to Benford's Law.
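
    The log-transform and regression step might look roughly like the Python sketch below (the project used R). The p-values are invented, the predictor is assumed to be a simple time index, and the natural logarithm is assumed; the write-up specifies none of these.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical p-values from Benford conformity tests, one per period (made up).
p_values = [0.10, 0.09, 0.08, 0.07, 0.06, 0.05]
periods = list(range(len(p_values)))  # assumed predictor: a time index

# Regress log(p) on the period, then map the fitted line back to the p-value scale.
log_p = [math.log(p) for p in p_values]
slope, intercept = fit_line(periods, log_p)
fitted_p = [math.exp(intercept + slope * x) for x in periods]

print("p-values trend downward:", slope < 0)
```

    Working on the log scale keeps the fitted p-values positive after transforming back, which a regression on the raw p-values would not guarantee.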

    Takeaways

    This research project provided valuable insights into the application of statistical methods for data validation. Key learnings include:

  • Understanding Benford's Law: Gained a deep understanding of Benford's Law and its applicability in detecting anomalies within large datasets.
  • Data Integrity Assessment: Developed skills in assessing the integrity of publicly reported data, crucial for informing policy decisions during a global health crisis.
  • Recognition: The project received the Outstanding Project Award at the 4th Wharton Data Science Live Conference, where more than 100 teams competed.

    Overall, this experience enhanced my proficiency in statistical analysis and data science methodologies, emphasizing the importance of data integrity in public health reporting.