This project focused on optimizing the FBGEMM (Facebook General Matrix Multiplication) library, a high-performance kernel library designed for machine learning workloads. By extending auto-vectorization capabilities, I aimed to maximize data parallelism and enhance computational efficiency for table batched embedding operations, which are vital in recommendation systems and other ML applications.
The work involved extensive low-level programming in C/C++ and hands-on use of parallel processing frameworks, yielding significant runtime improvements while preserving numerical accuracy.
Our deliverable was presented to the applied scientists and engineers at Meta's Sunnyvale headquarters.
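For context, the sketch below shows the general shape of a pooled, table batched embedding lookup: each "bag" gathers a variable number of rows from an embedding table and sums them. The function name, signature, and memory layout here are illustrative stand-ins of mine, not FBGEMM's actual API.

```cpp
#include <cstdint>

// Illustrative sketch of a pooled embedding lookup (the core of a table
// batched embedding op). Names and layout are hypothetical, not FBGEMM's API.
void embedding_bag_sum(
    const float* table,      // [num_rows x dim] embedding table
    const int64_t* indices,  // row ids for all bags, concatenated
    const int64_t* offsets,  // offsets[b]..offsets[b+1] delimit bag b
    int64_t num_bags,
    int64_t dim,
    float* out) {            // [num_bags x dim] pooled output
  for (int64_t b = 0; b < num_bags; ++b) {
    float* dst = out + b * dim;
    for (int64_t d = 0; d < dim; ++d) dst[d] = 0.0f;
    for (int64_t i = offsets[b]; i < offsets[b + 1]; ++i) {
      const float* row = table + indices[i] * dim;
      // The inner loop over the embedding dimension is the hot spot the
      // auto-vectorizer targets: unit-stride loads, independent adds.
      for (int64_t d = 0; d < dim; ++d) dst[d] += row[d];
    }
  }
}
```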
Extended compiler support to auto-vectorize operations over a range of data types, including 32-bit and 8-bit floating-point values and 8-bit and 4-bit integers, unlocking substantial performance gains.
Designed and implemented 10 unit tests comparing the parallelized embedding operations against ground-truth outputs, verifying a 98% accuracy rate against the reference results (see the sketch below).
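As one hedged illustration of both points, the sketch below pairs a unit-stride 8-bit dequantize-accumulate loop (the kind of simple loop auto-vectorizers handle well) with a scalar reference comparison in the spirit of those tests. All names, the per-row scale/bias quantization scheme, and the tolerance are assumptions of mine, not FBGEMM code.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical dequantize-accumulate kernel over an 8-bit quantized row with
// a per-row scale and bias. Keeping it a plain unit-stride loop over raw
// pointers is what lets the compiler auto-vectorize it; the simd pragma
// (compile with -fopenmp or -fopenmp-simd) makes the intent explicit.
void accumulate_int8_row(const uint8_t* __restrict__ row, float scale,
                         float bias, int64_t dim, float* __restrict__ acc) {
#pragma omp simd
  for (int64_t d = 0; d < dim; ++d) {
    acc[d] += scale * static_cast<float>(row[d]) + bias;
  }
}

// Ground-truth check in the spirit of the unit tests described above: run
// the vectorized path and a scalar reference, then compare elementwise
// within a tolerance (the tolerance value here is illustrative).
bool matches_reference(const uint8_t* row, float scale, float bias,
                       int64_t dim, float tol = 1e-5f) {
  std::vector<float> fast(dim, 0.0f), ref(dim, 0.0f);
  accumulate_int8_row(row, scale, bias, dim, fast.data());
  for (int64_t d = 0; d < dim; ++d) {
    ref[d] += scale * static_cast<float>(row[d]) + bias;
  }
  for (int64_t d = 0; d < dim; ++d) {
    if (std::fabs(fast[d] - ref[d]) > tol) return false;
  }
  return true;
}
```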

Selected and fine-tuned compiler flags for Meta's internal compiler toolchain to optimize runtime performance, achieving up to a 16x speedup in some scenarios.
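The internal toolchain and its exact flags are not public, so the snippet below substitutes an open-source clang invocation that exercises the same ideas; the flags shown and the file name are illustrative stand-ins, not the configuration actually used.

```cpp
// Illustrative build line (open-source clang analogue of the workflow):
//
//   clang++ -O3 -march=native -fopenmp -ffast-math \
//           -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
//           embedding_kernels.cpp
//
// -Rpass=loop-vectorize reports which loops vectorized and at what width;
// -Rpass-missed=loop-vectorize explains why a loop did not, which drives
// the tuning loop of adjusting flags and restructuring loops.
// -ffast-math relaxes floating-point semantics, permitting the compiler to
// reassociate reductions so they can be vectorized.
```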

Leveraged OpenMP to distribute workloads effectively across threads, enhancing scalability and computational efficiency on multi-core systems; the sketch below shows the pattern.
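A minimal sketch of that threading pattern, reusing the hypothetical bag loop from the earlier sketch: bags are independent, so they are distributed across threads, while the inner dimension loop stays the SIMD part. The chunked dynamic schedule is an illustrative choice for skewed bag lengths, not necessarily what the production kernels use.

```cpp
#include <cstdint>

// Illustrative OpenMP parallelization of the bag loop (compile with
// -fopenmp). Outer loop: thread-level parallelism across independent bags.
// Inner loop: data-level parallelism via auto-vectorization.
void embedding_bag_sum_parallel(
    const float* table, const int64_t* indices, const int64_t* offsets,
    int64_t num_bags, int64_t dim, float* out) {
  // Dynamic scheduling in small chunks balances load when bag lengths are
  // skewed, as they often are in recommendation workloads.
#pragma omp parallel for schedule(dynamic, 16)
  for (int64_t b = 0; b < num_bags; ++b) {
    float* dst = out + b * dim;
    for (int64_t d = 0; d < dim; ++d) dst[d] = 0.0f;
    for (int64_t i = offsets[b]; i < offsets[b + 1]; ++i) {
      const float* row = table + indices[i] * dim;
#pragma omp simd
      for (int64_t d = 0; d < dim; ++d) dst[d] += row[d];
    }
  }
}
```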

This project underscored the importance of combining algorithmic optimization with hardware-aware programming to achieve impactful results. The experience deepened my understanding of parallel computing and low-level optimization, skills that continue to shape my approach to performance-critical problems.