• Overview
  • Technical Features
  • Takeaways
  • Facebook General Matrix Multiplication

    Optimizations in C++

    C++ | OpenMP

    Link: GitHub

    Overview

    Introduction

    This project focused on optimizing the FBGEMM (Facebook General Matrix Multiplication) library, a high-performance kernel library designed for machine learning workloads. By extending auto-vectorization capabilities, I aimed to maximize data parallelism and enhance computational efficiency for table batched embedding operations, which are vital in recommendation systems and other ML applications.

    Solution

    The work involved low-level programming in C/C++ and parallel processing frameworks, and resulted in significant runtime improvements while maintaining high accuracy.

    Deliverable

    Our deliverable was presented to the applied scientists and engineers at Meta's Sunnyvale headquarters.

    Technical Features

    Auto-Vectorization Enhancements

    Extended compiler support to auto-vectorize operations for a range of data types, including 32-bit and 8-bit floats and 4-bit and 8-bit integers, unlocking substantial performance gains.
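The pattern below is a hypothetical sketch, not FBGEMM's actual code: it shows the kind of unit-stride, dependence-free fp32 loop that compilers can typically auto-vectorize once the right flags and code shape are in place. The function name `sum_pool_fp32` and its signature are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative sketch (not FBGEMM code): sum-pool a set of embedding rows
// into one output vector. The inner loop is a unit-stride fp32 accumulation
// with no cross-iteration dependence, a shape compilers can typically
// auto-vectorize into SIMD instructions at -O3.
void sum_pool_fp32(const float* table, const int64_t* indices,
                   size_t num_indices, size_t dim, float* out) {
  for (size_t d = 0; d < dim; ++d) {
    out[d] = 0.0f;
  }
  for (size_t i = 0; i < num_indices; ++i) {
    const float* row = table + indices[i] * dim;
    // Contiguous loads and stores over `dim` elements: a prime
    // candidate for the compiler's auto-vectorizer.
    for (size_t d = 0; d < dim; ++d) {
      out[d] += row[d];
    }
  }
}
```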

    Unit Testing for Accuracy Validation

    Designed and implemented 10 unit tests comparing the outputs of parallelized embedding operations against ground-truth results, validating 98% accuracy.
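A minimal sketch of the validation pattern described above, under the assumption that the tests compared optimized and reference outputs element-wise within a floating-point tolerance (the helper name and tolerance value are illustrative, not taken from the project):

```cpp
#include <cmath>
#include <cstddef>

// Illustrative sketch: check an optimized kernel's output against a
// ground-truth reference, element by element, within a tolerance that
// absorbs floating-point reordering from vectorized/parallel reductions.
bool outputs_match(const float* actual, const float* expected,
                   size_t n, float tol = 1e-5f) {
  for (size_t i = 0; i < n; ++i) {
    if (std::fabs(actual[i] - expected[i]) > tol) {
      return false;
    }
  }
  return true;
}
```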

    Compiler Optimizations

    Composed and fine-tuned compiler flags for Meta's internal compiler system to optimize runtime performance, achieving up to a 16x speedup in certain scenarios.
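Meta's internal compiler flags are not public, so the line below is only a generic GCC/Clang analogue of the kind of tuning described, not the actual configuration used:

```shell
# Illustrative only (generic GCC/Clang, not Meta's internal compiler):
# -O3            enables aggressive optimization, including auto-vectorization
# -march=native  targets the host's SIMD ISA (e.g. AVX2/AVX-512)
# -ffast-math    relaxes FP semantics so reductions can be vectorized
# -fopenmp       enables OpenMP thread-level parallelism
g++ -O3 -march=native -ffast-math -fopenmp kernel.cpp -o kernel
```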

    Thread-Level Parallelism with OpenMP

    Leveraged the OpenMP library to distribute workloads effectively across threads, enhancing scalability and computational efficiency for multi-core systems.
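The idea above can be sketched as follows; this is a hypothetical example (the `gather_rows` function is an assumption, not project code) showing how `#pragma omp parallel for` splits independent batch items across threads. Each iteration writes a disjoint output slice, so no synchronization is needed, and without `-fopenmp` the pragma is ignored and the loop runs serially with identical results.

```cpp
#include <vector>
#include <cstddef>

// Illustrative sketch: copy each requested embedding row into a dense
// output batch. Batch items are independent, so OpenMP can distribute
// them across threads without locks or atomics.
void gather_rows(const std::vector<float>& table, size_t dim,
                 const std::vector<size_t>& indices,
                 std::vector<float>& out) {
  #pragma omp parallel for
  for (long b = 0; b < static_cast<long>(indices.size()); ++b) {
    const float* row = table.data() + indices[b] * dim;
    for (size_t d = 0; d < dim; ++d) {
      out[b * dim + d] = row[d];  // disjoint slice per iteration
    }
  }
}
```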

    Takeaways

    This project underscored the importance of combining algorithmic optimization with hardware-aware programming to achieve impactful results. I learned:

  • The power of auto-vectorization: How modern compilers can automatically parallelize data operations with the right configurations and code patterns.
  • The balance of performance and accuracy: Achieving near-perfect accuracy while pushing runtime improvements demands rigorous testing and validation.
  • The utility of OpenMP: Thread-level parallelism is a crucial tool for scaling workloads, particularly in memory-intensive applications like embeddings.
  • Compiler flags as a performance lever: Fine-tuning these settings can significantly influence runtime performance in large-scale systems.
    This experience enriched my understanding of parallel computing and low-level optimizations, skills that continue to influence my approach to solving performance-critical problems.