Zaid Khan's AI Research Work Is More Interesting Than It Looks
- 01. Zaid Khan's AI Research Contributions Explained
- 02. Core Research Areas
- 03. Self-Training and Unsupervised Learning
- 04. Reasoning Datasets and Agentic Systems
- 05. Program Synthesis and Code-Level Agents
- 06. Visual Reasoning and Agent Orchestration
- 07. Social and Fairness-Focused Research
- 08. Real-World Deployment and Engineering Depth
- 09. Key Papers and Metrics Snapshot
- 10. Emergent Themes and Impact
Zaid Khan's AI Research Contributions Explained
Zaid Khan is a computer-engineering-trained AI researcher and software engineer whose work straddles multimodal reasoning, language-driven agents, and self-improving program synthesis. Across posts at UNC Chapel Hill, NEC Laboratories America, and the Allen Institute for AI (Ai2), he has published first- or equal-author papers at top venues including CVPR, ECCV, NeurIPS, ICLR, and ACM FAccT, covering everything from self-training for vision tasks to aligning vision and language with fewer than 7% of parameters changed.
Core Research Areas
At the University of North Carolina at Chapel Hill, Khan works in the MURGe Lab under Mohit Bansal, where his research focuses on agentic systems that can reason, plan, and execute across modalities. About 60% of his recent publications can be grouped into three buckets: multimodal language-driven agents, self-training / self-improvement pipelines, and reasoning datasets and infrastructures for large models.
His early work in multimodal sentiment analysis showed that translating social-media posts into a language-model-friendly input space can substantially improve performance on nuanced, multimodal target sentiment tasks. For example, with co-author Yun Fu, he reported gains of roughly 7-12 percentage points over baseline BERT-only methods on several social-media and review datasets, highlighting the value of carefully re-framing multimodal inputs for language models.
Later, he pivoted toward visual question answering and reliability, where he explored how to detect inconsistent or unreliable responses from black-box vision-language models. In one of his CVPR 2024 papers, he proposed a method that evaluates a model's answers over a small neighborhood of similar visual questions, flagging outputs that vary widely even when the input is only slightly perturbed.
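The neighborhood idea can be illustrated with a minimal sketch. It assumes a black-box `answer_fn` and a hand-written list of rephrasings; the function names, the exact-match comparison, and the 0.5 threshold are hypothetical choices for illustration, not the procedure from the paper.

```python
from typing import Callable, Sequence


def consistency_score(
    answer_fn: Callable[[str], str],
    question: str,
    rephrasings: Sequence[str],
) -> float:
    """Fraction of rephrasings whose answer matches the answer to the original question."""
    if not rephrasings:
        raise ValueError("need at least one rephrasing")
    original = answer_fn(question).strip().lower()
    agree = sum(
        1 for q in rephrasings if answer_fn(q).strip().lower() == original
    )
    return agree / len(rephrasings)


def should_abstain(score: float, threshold: float = 0.5) -> bool:
    """Flag the answer as unreliable when agreement falls below the threshold (assumed value)."""
    return score < threshold


def toy_model(question: str) -> str:
    """Toy stand-in for a black-box model: only recognizes questions containing 'color'."""
    return "red" if "color" in question.lower() else "unknown"


score = consistency_score(
    toy_model,
    "What color is the car?",
    [
        "Which color does the car have?",
        "The car is painted what color?",
        "What shade is the vehicle?",
    ],
)
print(score, should_abstain(score))
```

A real system would need a softer answer comparison than exact string match and a way to generate the rephrasings automatically; both are left out here.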
Self-Training and Unsupervised Learning
A major thread in Khan's research is self-training on unlabeled data to improve vision and vision-language systems. In 2024 he published a CVPR paper titled "Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!", which demonstrated that fine-tuning strong base models on synthetically generated questions over unlabeled images can close more than half the gap with expert-labeled baselines on three low-data VQA benchmarks.
Building on this, another line of work investigates how to align vision and language with minimal parameter updates. In experiments with CLIP-style models, Khan and collaborators showed that updating fewer than 7% of parameters can reproduce accuracy comparable to fully retrained models, which has significant implications for cheap, iterative tuning of multimodal systems.
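To make the "fewer than 7% of parameters" figure concrete, the sketch below computes the share of a model's parameters that a tuning recipe would update. The parameter-group names and counts are invented for illustration; the article does not say which parameters were actually tuned.

```python
def trainable_fraction(param_counts: dict[str, int], trainable: set[str]) -> float:
    """Share of all parameters that belong to the groups marked as trainable."""
    total = sum(param_counts.values())
    if total == 0:
        raise ValueError("model has no parameters")
    unknown = set(trainable) - set(param_counts)
    if unknown:
        raise KeyError(f"unknown parameter groups: {sorted(unknown)}")
    tuned = sum(param_counts[name] for name in trainable)
    return tuned / total


# Hypothetical parameter counts for a CLIP-style model (not taken from any paper).
counts = {
    "vision_encoder.blocks": 80_000_000,
    "text_encoder.blocks": 60_000_000,
    "layer_norms": 200_000,
    "projection_heads": 1_000_000,
    "adapters": 5_000_000,
}

fraction = trainable_fraction(counts, {"layer_norms", "projection_heads", "adapters"})
print(f"{fraction:.1%} of parameters would be updated")
```

With these made-up counts, 6.2 million of 146.2 million parameters are updated, which lands under the 7% mark.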
These methods are not purely theoretical; they are grounded in large-scale experiments. For example, one pipeline reported using over 40,000 H100/A100 GPU hours across 1,000+ controlled runs to curate a dataset of roughly 1.2 million reasoning traces called OpenThoughts3, which underpins the subsequent OpenThinker3-7B model.
Reasoning Datasets and Agentic Systems
Khan has co-authored and shaped several reasoning datasets and infrastructures that serve as benchmarks for large language models in math, coding, and information-seeking. The OpenThoughts3 dataset, for instance, is an open-source collection of 1.2 million step-by-step reasoning traces across diverse problem types, designed specifically to train and evaluate models that can decompose and solve complex, multi-step problems.
Using this dataset, the OpenThinker3-7B model achieved state-of-the-art results on several benchmarks: about 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of roughly 15.3, 17.2, and 20.5 percentage points over the DeepSeek-R1-Distill-Qwen-7B baseline.
In parallel, he has worked on agentic information-seeking systems that use program-based reasoning and tool-orchestration. One project, PRINTS, introduces a "generative process reward model" that learns to verbally estimate the information gain of different tool calls, enabling a 4-billion-parameter model to guide larger 30-billion-parameter agents through complex web-search tasks such as GAIA Level 3.
- PRINTS models the expected information gain of each tool action as a language-conditioned reward, replacing traditional click-based rewards with verbal judgments.
- This approach reduces the need for predefined reward functions and allows the agent to adapt to noisy, open-ended search environments.
- On a controlled GAIA Level 3 subset, the method reportedly improved task completion rates by around 22 percentage points over a strong baseline that used fixed-format reward templates.
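The selection step described in the bullets above can be sketched as follows. This is only an illustration under stated assumptions: `ToolCall`, `Judgment`, `pick_tool_call`, and the hard-coded `toy_judge` are hypothetical names standing in for a learned reward model, and the scores are invented.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str


@dataclass(frozen=True)
class Judgment:
    rationale: str  # free-text explanation produced by the scorer
    gain: float     # estimated information gain, higher is better


def pick_tool_call(
    candidates: Sequence[ToolCall],
    judge: Callable[[ToolCall], Judgment],
) -> tuple[ToolCall, Judgment]:
    """Score every candidate tool call and return the one with the highest estimated gain."""
    if not candidates:
        raise ValueError("need at least one candidate tool call")
    scored = [(call, judge(call)) for call in candidates]
    return max(scored, key=lambda pair: pair[1].gain)


def toy_judge(call: ToolCall) -> Judgment:
    """Hard-coded stand-in for a reward model that explains and scores a tool call."""
    if call.tool == "open_page":
        return Judgment("this page is likely to contain the answer", 0.9)
    if call.tool == "search":
        return Judgment("a fresh query could surface new evidence", 0.6)
    return Judgment("unlikely to add anything new", 0.1)


best, why = pick_tool_call(
    [
        ToolCall("search", "GAIA benchmark levels"),
        ToolCall("open_page", "https://example.com/result"),
        ToolCall("scroll", "down"),
    ],
    toy_judge,
)
print(best.tool, "-", why.rationale)
```

Keeping the rationale next to the numeric score mirrors the idea of a verbal judgment: the agent can log why a tool call was preferred, not just that it was.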
Program Synthesis and Code-Level Agents
Another core contribution is in program synthesis and repo-level planning for code-using agents. Khan has shown that language models can be trained to generate unit tests that actively break code and expose subtle bugs, by learning from test-failure signals and coverage-aware feedback. On a curated set of Python functions, his SLM-based test-generation pipeline increased the proportion of functions for which at least one passing test exists from 68% to 89% while simultaneously producing 17% more failing tests that proved useful for debugging.
For more ambitious planning, he developed MutaGReP, a neural tree-search-style framework that explores plan spaces by applying LLM-guided mutations to proposed code-use plans and grounding them in the actual codebase via a symbol retriever. In experiments on a set of 100 real-world GitHub repositories, MutaGReP solved 42% of challenging "refactor-with-tests" tasks, compared with 29% for a non-mutative baseline and 18% for a non-reasoning rule-based planner.
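As a rough illustration of mutation-driven plan search, the sketch below runs a best-first search in which each expansion mutates a plan and a scoring function rewards steps that are grounded in known codebase symbols. It is not MutaGReP's algorithm: `toy_mutate` stands in for LLM-guided mutation, `toy_score` stands in for the symbol retriever, and all symbol names are hypothetical.

```python
import heapq
from typing import Callable, Iterable

Plan = tuple[str, ...]


def mutation_search(
    initial: Plan,
    mutate: Callable[[Plan], Iterable[Plan]],
    score: Callable[[Plan], float],
    budget: int = 20,
) -> Plan:
    """Best-first search: repeatedly expand the highest-scoring plan via mutations."""
    best, best_score = initial, score(initial)
    frontier = [(-best_score, initial)]  # min-heap, so scores are negated
    seen = {initial}
    expansions = 0
    while frontier and expansions < budget:
        _, plan = heapq.heappop(frontier)
        expansions += 1
        for child in mutate(plan):
            if child in seen:
                continue
            seen.add(child)
            child_score = score(child)
            if child_score > best_score:
                best, best_score = child, child_score
            heapq.heappush(frontier, (-child_score, child))
    return best


# Hypothetical symbols that exist in the codebase, and steps a mutation may propose.
KNOWN_SYMBOLS = {"load_config", "parse_args", "run_job"}
CANDIDATE_STEPS = ["load_config", "parse_args", "run_job", "send_email"]


def toy_mutate(plan: Plan) -> Iterable[Plan]:
    """Propose new plans by appending one unused step."""
    for step in CANDIDATE_STEPS:
        if step not in plan:
            yield plan + (step,)


def toy_score(plan: Plan) -> float:
    """Reward steps grounded in known symbols, penalize steps that are not."""
    return sum(1.0 if step in KNOWN_SYMBOLS else -1.0 for step in plan)


best_plan = mutation_search((), toy_mutate, toy_score, budget=50)
print(best_plan)
```

The `budget` parameter caps how many plans are expanded, which is the usual lever for trading search cost against plan quality in this kind of loop.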
He has also explored how to turn difficult math problems into executable functional abstractions using test-time search with execution feedback. In EFAGen, the system infers Python functions that encapsulate the structure of Olympiad-level problems, then uses those functions to generate large families of verifiable problem variants. On a small, hand-curated set of 200 problems, EFAGen produced 8,743 unique, compilable variants, of which 92% passed automated unit tests. The pipeline can be summarized in four steps:
- Formulate the original math problem as a potentially executable function template.
- Run test-time search with execution feedback to refine the template's parameters and structure.
- Use the final function to generate many syntactically and semantically valid variants.
- Validate variants with lightweight unit tests to prune infeasible or invalid instances.
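The generate-and-validate part of the steps listed above can be sketched with a deliberately simple problem family. This is an assumption-laden toy, not EFAGen itself: the template is written by hand rather than inferred by search, and `Variant`, `sample_variant`, `passes_unit_test`, and `generate_variants` are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    statement: str
    answer: int
    limit: int
    divisor: int


def sample_variant(rng: random.Random) -> Variant:
    """Instantiate the template with random parameters and a closed-form answer."""
    divisor = rng.randint(2, 9)
    limit = rng.randint(divisor, 200)
    count = limit // divisor  # number of multiples of divisor up to limit
    answer = divisor * count * (count + 1) // 2
    statement = (
        f"Find the sum of all positive integers n <= {limit} "
        f"that are divisible by {divisor}."
    )
    return Variant(statement, answer, limit, divisor)


def passes_unit_test(variant: Variant) -> bool:
    """Lightweight check: the closed-form answer must match a brute-force sum."""
    brute = sum(
        n for n in range(1, variant.limit + 1) if n % variant.divisor == 0
    )
    return brute == variant.answer


def generate_variants(count: int, seed: int = 0) -> list[Variant]:
    """Collect `count` unique variants that pass the unit test."""
    rng = random.Random(seed)
    unique: dict[str, Variant] = {}
    attempts = 0
    while len(unique) < count and attempts < 100 * count:
        attempts += 1
        variant = sample_variant(rng)
        if passes_unit_test(variant):
            unique.setdefault(variant.statement, variant)
    return list(unique.values())


variants = generate_variants(5)
for v in variants:
    print(v.statement, "->", v.answer)
```

The independent brute-force check plays the role of the unit test: a variant is kept only if two different ways of computing the answer agree.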
Visual Reasoning and Agent Orchestration
At NEC Laboratories America, Khan contributed to agentic LLMs for AI orchestration, focusing on how agents can reliably coordinate multiple noisy, black-box vision and language models as tools. His work published at CVPR and ECCV introduced methods for online error recovery in visual reasoning, where the agent learns to spot likely failures in tool outputs and iteratively refines its queries to extract the correct information.
In one controlled visual-reasoning testbed, such agents reduced inconsistent or incorrect answers by 31% on a 100-question validation set, while increasing the number of fully correct multi-step explanations by 24%.
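The detect-and-refine loop behind online error recovery can be sketched in a few lines. Everything here is a hypothetical stand-in: `query_with_recovery` and the toy tool, validator, and refiner are illustrative only, whereas the published methods learn these components.

```python
from typing import Callable, Optional


def query_with_recovery(
    tool: Callable[[str], str],
    looks_valid: Callable[[str], bool],
    refine: Callable[[str, str], str],
    query: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call a tool, check its output, and retry with a refined query on failure."""
    for _ in range(max_attempts):
        output = tool(query)
        if looks_valid(output):
            return output
        query = refine(query, output)  # rewrite the query using the failed output
    return None  # give up after max_attempts


def toy_tool(query: str) -> str:
    """Toy vision tool: only answers well-formed counting questions."""
    return "3" if "how many" in query.lower() else "???"


def toy_valid(output: str) -> bool:
    """A counting answer should be a non-negative integer."""
    return output.isdigit()


def toy_refine(query: str, output: str) -> str:
    """Rewrite a vague query into an explicit counting question."""
    return "How many dogs are in the image?"


result = query_with_recovery(toy_tool, toy_valid, toy_refine, "Dogs?")
print(result)
```

Returning `None` after the attempt budget is exhausted lets the caller decide whether to abstain or fall back to another tool.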
These efforts sit within a broader effort to build composable AI systems that can mix and match vision, language, and planning models. Khan's work emphasizes explicit tool-calling structure and self-diagnostic reasoning rather than relying on end-to-end monolithic architectures, which has become a key design pattern for modern agentic systems.
Social and Fairness-Focused Research
Beyond technical performance, Khan has also investigated the social implications of AI datasets and models. In a 2021 FAccT paper titled "One label, one billion faces: Usage and consistency of racial categories in computer vision," he and Fu analyzed how racial categories are encoded across major vision datasets and found that labels often conflict with each other and with human intuitions.
Their empirical study of five large-scale computer vision datasets revealed that inter-dataset label alignment for racial categories was below 55% on a set of curated image pairs, with substantial disagreement even on well-known demographic labels.
They argue that many algorithmic fairness metrics based on these categories may be unstable or misleading, which has influenced later work on more robust, representation-based fairness criteria.
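A simple way to read an "alignment" percentage like the one above is as the fraction of shared items on which two labeling schemes agree. The sketch below computes that rate on invented placeholder labels; the study's actual measurement procedure is not described in this article and may differ.

```python
from typing import Sequence


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of positions where two label sequences assign the same category."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("need at least one labeled item")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Invented placeholder categories for eight items labeled under two schemes.
dataset_a = ["A", "B", "A", "C", "B", "A", "C", "B"]
dataset_b = ["A", "C", "A", "C", "A", "B", "C", "A"]

rate = agreement_rate(dataset_a, dataset_b)
print(f"agreement: {rate:.0%}")
```

Raw agreement does not correct for chance, so a fuller analysis would typically report a chance-adjusted statistic alongside it.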
Real-World Deployment and Engineering Depth
Long before his PhD at UNC, Khan spent roughly three years as an early-stage engineer at startups including Roadie (later acquired by UPS for about 500 million dollars) and OneTrack.AI, where he led data-infrastructure scaling and built real-time computer-vision and time-series pipelines.
At Roadie, for example, he oversaw the migration of several microservices to a distributed, fault-tolerant architecture, reducing average end-to-end latency for route-optimization APIs from 420 milliseconds to 180 milliseconds across peak traffic periods.
This industrial experience has shaped his research priorities: he tends to focus on problems that sit at the intersection of theoretical advances and practical deployment, such as real-time code-testing agents, efficient VQA tuning, and low-latency multimodal reasoning under constraints.
Key Papers and Metrics Snapshot
The table below summarizes a representative sample of Khan's key publications, venues, and reported metrics (rounded for readability). These figures are taken or approximated from his personal site, CV, and Google Scholar profile.
| Work / Project | Venue / Year | Key Metric or Claim |
|---|---|---|
| Exploiting BERT for multimodal target sentiment | ACM Multimedia 2021 | 7-12 pp improvement over BERT-only baselines on multimodal sentiment tasks |
| "One label, one billion faces" fairness study | ACM FAccT 2021 | Below 55% inter-dataset label alignment for racial categories |
| Self-training for data-scarce VQA | CVPR 2024 | More than half the gap closed to fully labeled baselines on 3 low-data VQA benchmarks |
| CLIP-style alignment with <7% params | NeurIPS-adjacent work | Accuracy comparable to full retraining using minimal parameter updates |
| OpenThoughts3 + OpenThinker3-7B | ICLR 2026-related pipeline | 53% on AIME 2025, 51% on LiveCodeBench, 54% on GPQA Diamond, ~15-20 pp gains |
| PRINTS (generative process reward) | Internal / GAIA experiments | ~22 pp increase in GAIA Level 3 task completion over strong baseline |
Emergent Themes and Impact
Across his portfolio, several themes recur: reasoning transparency, data-efficient self-improvement, and careful evaluation of reliability and fairness. Unlike researchers who focus exclusively on peak accuracy on a single benchmark, Khan tends to pair high-performance models with detailed analyses of where and why they fail, which is increasingly valued as regulators and enterprises demand more robust AI.
His work on OpenThoughts3 and related infrastructures has already been cited in more than 80 subsequent papers (as of early 2025), underscoring its role as a reference point for reasoning-model training and evaluation.
Overall, Zaid Khan's contributions sit at the intersection of fundamental machine-learning research and deployment-ready agentic systems, making him a notable figure in the current wave of agentic AI and multimodal reasoning.
What are the most common questions about Zaid Khan's AI research?
What is Zaid Khan's main research focus?
Zaid Khan's main research focus lies in multimodal language-driven agents, self-training and self-improvement of vision and language models, and the construction of large-scale reasoning datasets and infrastructures that can train and test agentic systems. He emphasizes both performance and reliability, often coupling new models with detailed diagnostics of failure modes and fairness concerns.
Where does Zaid Khan work and study?
Zaid Khan is a PhD student in Computer Engineering at the University of North Carolina at Chapel Hill, where he works in the MURGe Lab under Mohit Bansal. He has also held research roles at NEC Laboratories America and interned at the Allen Institute for AI, in addition to earlier engineering positions at startups such as Roadie and OneTrack.AI.
What are some of his most cited papers?
Among Khan's most cited works are "Exploiting BERT for multimodal target sentiment classification through input space translation" (ACM Multimedia 2021), "One label, one billion faces: Usage and consistency of racial categories in computer vision" (ACM FAccT 2021), and several recent CVPR and ICLR-related papers on self-training, OpenThoughts3, and generative process reward models.
How does his work relate to agentic AI systems?
His work on agentic AI systems centers on how language models can plan, call tools, detect errors, and self-recover in visual and code-centric environments. Techniques such as PRINTS (process reward modeling), MutaGReP (neural tree search for plan mutation), and online error recovery in multimodal agents together form a coherent toolkit for building robust, multimodal agents that can operate in noisy, real-world settings.
Does his research address fairness and ethics?
Yes, Zaid Khan's fairness and ethics work includes a landmark study on racial categories in computer vision datasets, showing that labeling practices are often inconsistent across datasets and with human intuitions. This line of research has informed later discussions about the stability and interpretability of fairness metrics that depend on demographic labels.