
Rohit Patel

I find joy in math and coding. At the moment, I work at MSL (Meta Superintelligence Labs) on our AI models. Before Meta, I worked with C-suite executives of private equity-owned businesses to drive impact. I enjoy sharing what I learn; here is a plug for an article I wrote, Understanding LLMs from Scratch, that people seemed to like (it became one of Medium's most shared stories of 2024). I hope you enjoy it. I also enjoy speaking about topics I care about.

Research and Publications

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios, with MSL and Meta Reality Labs, NeurIPS
The first benchmark designed to evaluate visual question answering capabilities of multi-modal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks with high-quality third-person imagery, WearVQA reflects the challenges of egocentric interaction with occluded, poorly lit, or blurry visual inputs. The benchmark comprises 2,500 curated image-question-answer triplets spanning 7 image domains, 10 cognitive task types, and 6 wearables-specific quality issues. Open-source and proprietary multi-modal LLMs achieved only 24–52% accuracy on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks.

Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
This paper provides a self-contained, from-scratch exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
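To give a flavor of the kind of algorithm the paper develops, the group-relative advantage at the heart of GRPO can be sketched in a few lines. This is a simplified illustration of the general idea, not code from the paper; the reward values are made up.

```python
import numpy as np

# Rewards for a group of sampled completions for the same prompt.
rewards = np.array([0.2, 0.9, 0.5, 0.4])

# GRPO-style group-relative advantage: standardize each completion's
# reward against the group's mean and standard deviation.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Completions above the group mean get a positive advantage and are
# reinforced; those below the mean get a negative advantage.
```

The standardization means no separate value network is needed to estimate a baseline; the group itself serves as the baseline.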

The Llama 3 Herd of Models, with Meta GenAI team
This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Meta Llama 3.2, with Meta GenAI team
Meta AI's Llama 3.2 release introduces small and medium-sized vision LLMs and lightweight text models optimized for edge and mobile devices. These models support context lengths of 128K tokens and excel in on-device tasks like summarization and instruction following. Llama Stack distributions simplify deployment across environments with integrated safety features. This update enhances Llama's capabilities, modifiability, and cost efficiency, driving innovation in generative AI applications.

Meta Llama 3 large language model, with Meta GenAI team
Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks. Further, in developing these models, we took great care to optimize helpfulness and safety.

Costly Inspection and Money Burning, with Can Urgun
A principal designs a mechanism to allocate an indivisible, productivity-increasing good to one of many agents. Monetary transfers are not allowed. Instead, we consider the interplay between two instruments previously studied only in isolation: “costly verification of the agent’s type” and “money burning”. We use a graph-theoretic approach and characterize the optimal mechanism completely.

Costly Verification by an Intermediary in a Two Sided Market
This paper studies the players in a two-sided market when the good sold is of uncertain value, network effects are in play, and there is a cost to verifying the value of the good. It describes the optimal mechanism to maximize the revenue for the intermediary.

Indices for Dynamic Pricing in the Event Ticketing Industry
This paper introduces new price indices and measures to facilitate dynamic pricing in the sports ticketing industry.

A Seller's Problem and Costly Verification, with Can Urgun
This paper outlines the optimal strategy to maximize the revenue of a seller when the buyer's willingness to pay is unknown, but can be discovered at a cost (e.g., through market research).

Differentiating Reputations in Dynamic Duopolies, with R. Andrew Butters
This paper outlines how reputation in a marketplace can be used to boost prices and revenue in dynamic duopolies, and how firms end up with different product qualities and prices in a dynamic equilibrium.

Advisory

BayPine | Digital Advisory Board
I serve on BayPine's Digital Advisory Board, where I advise on the use and implications of AI across private equity sourcing and investments, as well as portfolio companies' value creation plans.

Speaking

I regularly speak on topics including AI, evaluations, and building effective AI agents. I've spoken at conferences such as CES, MIT Emtech, and TechCrunch Disrupt:

  • eMerge Americas | Miami, FL | April 22-24, 2026
  • Fintech Americas | Miami, FL | March 24-26, 2026
  • MIT Emtech | Athens, GR | March 19-20, 2026
  • CES 2026 | Las Vegas, NV | January 6-9, 2026
  • AUTONOMOUS | Virtual | December 3-4, 2025
  • Tech Basel Miami AI Summit | Miami, FL | December 3, 2025
  • ISG AI Impact Summit | New York, NY | November 17-18, 2025
  • TechCrunch Disrupt | San Francisco, CA | October 27-29, 2025
  • NDSML Summit | Stockholm, SE | October 22-23, 2025
  • Ai4 | Las Vegas, NV | August 11-13, 2025
  • JP Morgan Quantitative Conference | New York, NY | June 2025
  • SINFO 32 | Lisbon, PT | February 2025
  • The AI Summit New York | New York, NY | December 2024

Articles and Essays

Some of my writing lives outside of research papers. Here are a few essays and explainers I have published online.

The Kind Of Intelligence We Are Building With AI, And What It Means To Be Human | Forbes
A reflection on the kind of intelligence current AI systems embody, and what that means for how we think about being human. It contrasts creative fluency in modern models with their continuing weaknesses in rigorous logic and reasoning.

What does the future of AI look like if we hit the LLM scaling wall? | Medium
An essay on what comes next if frontier LLM scaling slows down. The argument is that small models, scaled inference, and AI agents may become the more important path forward.

Understanding reinforcement learning for model training from scratch | Medium
A first-principles walkthrough of how pre-trained models become instruction-tuned models, covering supervised fine-tuning, rejection sampling, and reinforcement learning methods for LLM training.

An intuitive treatment of Negative log-likelihood, Cross entropy, KL divergence, and Importance sampling | Medium
A plain-English treatment of core ideas behind modern model training, including negative log-likelihood, cross entropy, KL divergence, and importance sampling.
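The four quantities that essay covers can be illustrated in a few lines of NumPy. This is my own minimal sketch with toy distributions, not code from the article:

```python
import numpy as np

# Two toy discrete distributions over three outcomes.
p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

# Negative log-likelihood of one observed outcome (index 0) under q.
nll = -np.log(q[0])

# Cross entropy H(p, q) = -sum_i p_i log q_i.
cross_entropy = -np.sum(p * np.log(q))

# KL divergence KL(p || q) = sum_i p_i log(p_i / q_i),
# which equals H(p, q) minus the entropy of p.
kl = np.sum(p * np.log(p / q))

# Importance sampling: estimate E_p[f(x)] using samples drawn from q,
# reweighting each sample by the ratio p(x) / q(x).
rng = np.random.default_rng(0)
f = np.array([1.0, 2.0, 3.0])          # arbitrary function values f(x)
xs = rng.choice(3, size=100_000, p=q)  # samples from q, not p
estimate = np.mean(f[xs] * p[xs] / q[xs])
exact = np.sum(p * f)                  # E_p[f] computed directly
```

With enough samples, the importance-sampling estimate converges to the exact expectation under p even though every sample came from q.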

Understanding LLMs from Scratch Using Middle School Math | Medium
A self-contained explanation of how LLMs work, starting from simple arithmetic and building up to transformers. It is written to make the core mechanics accessible without assuming prior machine learning background.

How to do the Price-Volume-Mix waterfall right | Medium
An explanation of how to correctly decompose revenue changes into price, volume, and mix effects, and why common PVM waterfall analyses often get the attribution wrong.
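As a toy illustration of the idea (my own sketch with made-up numbers and one common convention, not the article's method), a two-product price-volume-mix decomposition might look like:

```python
# Toy price-volume-mix decomposition for a two-product portfolio.
prices_old = {"A": 10.0, "B": 20.0}
prices_new = {"A": 11.0, "B": 19.0}
units_old = {"A": 100, "B": 50}
units_new = {"A": 80, "B": 70}

rev_old = sum(prices_old[k] * units_old[k] for k in prices_old)
rev_new = sum(prices_new[k] * units_new[k] for k in prices_new)

# Price effect: price changes applied at the new volumes.
price_effect = sum(
    (prices_new[k] - prices_old[k]) * units_new[k] for k in prices_old
)

# Volume effect: change in total units at the old average price.
total_old = sum(units_old.values())
total_new = sum(units_new.values())
volume_effect = (total_new - total_old) * (rev_old / total_old)

# Mix effect: the remainder, i.e. the shift between products
# at old prices once price and volume effects are removed.
mix_effect = (rev_new - rev_old) - price_effect - volume_effect
```

In this example total units are unchanged, so the volume effect is zero and almost all of the revenue change comes from mix, the shift toward the higher-priced product B. Attributing that shift to price or volume is exactly the kind of mistake the article discusses.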

Resume