• Overview
  • Technical Features
  • Takeaways
  • Facebook General Matrix Multiplication

    Optimizations in C++

    C++ | OpenMP

    Link: GitHub

    Overview

    Introduction

    This project focused on optimizing the FBGEMM (Facebook General Matrix Multiplication) library, a high-performance kernel library designed for machine learning workloads. By extending auto-vectorization capabilities, I aimed to maximize data parallelism and enhance computational efficiency for table batched embedding operations, which are vital in recommendation systems and other ML applications.

    Solution

    The work involved low-level programming in C/C++ and parallel processing frameworks, and resulted in significant runtime improvements while maintaining high accuracy.

    Deliverable

    Our deliverable was presented to the applied scientists and engineers at Meta's Sunnyvale headquarters.

    Technical Features

    Auto-Vectorization Enhancements

    Extended compiler support to auto-vectorize operations for a range of data types, including 32-bit and 8-bit floats and 4-bit and 8-bit integers, unlocking substantial performance gains.
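The pattern below is a hypothetical sketch, not FBGEMM's actual code: it shows the kind of unit-stride, dependence-free fp32 loop that compilers can typically auto-vectorize once the right flags and code shape are in place. The function name `sum_pool_fp32` and its signature are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative sketch (not FBGEMM code): sum-pool a set of embedding rows
// into one output vector. The inner loop is a unit-stride fp32 accumulation
// with no cross-iteration dependence, a shape compilers can typically
// auto-vectorize into SIMD instructions at -O3.
void sum_pool_fp32(const float* table, const int64_t* indices,
                   size_t num_indices, size_t dim, float* out) {
  for (size_t d = 0; d < dim; ++d) {
    out[d] = 0.0f;
  }
  for (size_t i = 0; i < num_indices; ++i) {
    const float* row = table + indices[i] * dim;
    // Contiguous loads and stores over `dim` elements: a prime
    // candidate for the compiler's auto-vectorizer.
    for (size_t d = 0; d < dim; ++d) {
      out[d] += row[d];
    }
  }
}
```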

    Unit Testing for Accuracy Validation

    Designed and implemented 10 unit tests comparing the outputs of parallelized embedding operations against ground-truth results, validating 98% accuracy.
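A minimal sketch of the validation pattern described above, under the assumption that the tests compared optimized and reference outputs element-wise within a floating-point tolerance (the helper name and tolerance value are illustrative, not taken from the project):

```cpp
#include <cmath>
#include <cstddef>

// Illustrative sketch: check an optimized kernel's output against a
// ground-truth reference, element by element, within a tolerance that
// absorbs floating-point reordering from vectorized/parallel reductions.
bool outputs_match(const float* actual, const float* expected,
                   size_t n, float tol = 1e-5f) {
  for (size_t i = 0; i < n; ++i) {
    if (std::fabs(actual[i] - expected[i]) > tol) {
      return false;
    }
  }
  return true;
}
```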

    Compiler Optimizations

    Composed and fine-tuned compiler flags for Meta's internal compiler system to optimize runtime performance, achieving up to a 16x speedup in certain scenarios.
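Meta's internal compiler flags are not public, so the line below is only a generic GCC/Clang analogue of the kind of tuning described, not the actual configuration used:

```shell
# Illustrative only (generic GCC/Clang, not Meta's internal compiler):
# -O3            enables aggressive optimization, including auto-vectorization
# -march=native  targets the host's SIMD ISA (e.g. AVX2/AVX-512)
# -ffast-math    relaxes FP semantics so reductions can be vectorized
# -fopenmp       enables OpenMP thread-level parallelism
g++ -O3 -march=native -ffast-math -fopenmp kernel.cpp -o kernel
```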

    Thread-Level Parallelism with OpenMP

    Leveraged the OpenMP library to distribute workloads effectively across threads, enhancing scalability and computational efficiency for multi-core systems.
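The idea above can be sketched as follows; this is a hypothetical example (the `gather_rows` function is an assumption, not project code) showing how `#pragma omp parallel for` splits independent batch items across threads. Each iteration writes a disjoint output slice, so no synchronization is needed, and without `-fopenmp` the pragma is ignored and the loop runs serially with identical results.

```cpp
#include <vector>
#include <cstddef>

// Illustrative sketch: copy each requested embedding row into a dense
// output batch. Batch items are independent, so OpenMP can distribute
// them across threads without locks or atomics.
void gather_rows(const std::vector<float>& table, size_t dim,
                 const std::vector<size_t>& indices,
                 std::vector<float>& out) {
  #pragma omp parallel for
  for (long b = 0; b < static_cast<long>(indices.size()); ++b) {
    const float* row = table.data() + indices[b] * dim;
    for (size_t d = 0; d < dim; ++d) {
      out[b * dim + d] = row[d];  // disjoint slice per iteration
    }
  }
}
```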

    Takeaways

    This project underscored the importance of combining algorithmic optimization with hardware-aware programming to achieve impactful results. I learned:

  • The power of auto-vectorization: How modern compilers can automatically parallelize data operations with the right configurations and code patterns.
  • The balance of performance and accuracy: Achieving near-perfect accuracy while pushing runtime improvements demands rigorous testing and validation.
  • The utility of OpenMP: Thread-level parallelism is a crucial tool for scaling workloads, particularly in memory-intensive applications like embeddings.
  • Compiler flags as a performance lever: Fine-tuning these settings can significantly influence runtime performance in large-scale systems.
    This experience enriched my understanding of parallel computing and low-level optimizations, skills that continue to influence my approach to solving performance-critical problems.