Zaid Khan's AI Research Work Is More Interesting Than It Looks

Q: What is Zaid Khan's main research focus?

Zaid Khan's main research focus lies in multimodal language-driven agents, self-training and self-improvement of vision and language models, and the construction of large-scale reasoning datasets and infrastructures that can train and test agentic systems. He emphasizes both performance and reliability, often coupling new models with detailed diagnostics of failure modes and fairness concerns.

Q: Where does Zaid Khan work and study?

Zaid Khan is a PhD student in Computer Engineering at the University of North Carolina at Chapel Hill, where he works in the MURGe Lab under Mohit Bansal. He has also held research roles at NEC Laboratories America and interned at the Allen Institute for AI, in addition to earlier engineering positions at startups such as Roadie and OneTrack.AI.

Q: What are some of his most cited papers?

Among Khan's most cited works are "Exploiting BERT for multimodal target sentiment classification through input space translation" (ACM Multimedia 2021), "One label, one billion faces: Usage and consistency of racial categories in computer vision" (ACM FAccT 2021), and several recent CVPR and ICLR-related papers on self-training, OpenThoughts3, and generative process reward models.

Q: How does his work relate to agentic AI systems?

His work on agentic AI systems centers on how language models can plan, call tools, detect errors, and self-recover in visual and code-centric environments. Techniques such as PRINTS (process reward modeling), MutaGReP (neural tree search for plan mutation), and online error recovery in multimodal agents together form a coherent toolkit for building robust, multimodal agents that can operate in noisy, real-world settings.

Q: Does his research address fairness and ethics?

Yes, Zaid Khan's fairness and ethics work includes a landmark study on racial categories in computer vision datasets, showing that labeling practices are often inconsistent across datasets and with human intuitions. This line of research has informed later discussions about the stability and interpretability of fairness metrics that depend on demographic labels.

Last Updated: May 11, 2026 • Written by Prof. Eleanor Briggs

Table of Contents

01. Zaid Khan's AI Research Contributions Explained
02. Core Research Areas
03. Self-Training and Unsupervised Learning
04. Reasoning Datasets and Agentic Systems
05. Program Synthesis and Code-Level Agents
06. Visual Reasoning and Agent Orchestration
07. Social and Fairness-Focused Research
08. Real-World Deployment and Engineering Depth
09. Key Papers and Metrics Snapshot
10. Emergent Themes and Impact

Zaid Khan's AI Research Contributions Explained

Zaid Khan is a computer-engineering-trained AI researcher and software engineer whose work straddles multimodal reasoning, language-driven agents, and self-improving program synthesis. Across posts at UNC Chapel Hill, NEC Laboratories America, and the Allen Institute for AI (Ai2), he has published first- or equal-author papers at top venues including CVPR, ECCV, NeurIPS, ICLR, and ACM FAccT, covering everything from self-training for vision tasks to aligning vision and language with fewer than 7% of parameters changed.

Core Research Areas

At the University of North Carolina at Chapel Hill, Khan works in the MURGe Lab under Mohit Bansal, where his research focuses on agentic systems that can reason, plan, and execute across modalities. About 60% of his recent publications can be grouped into three buckets: multimodal language-driven agents, self-training / self-improvement pipelines, and reasoning datasets and infrastructures for large models.

His early work in multimodal sentiment analysis showed that translating social-media posts into a language-model-friendly input space can dramatically improve performance on nuanced, multimodal target sentiment tasks. For example, with co-author Yun Fu, he reported gains of roughly 7-12 percentage points over baseline BERT-only methods on several social-media and review datasets, highlighting the value of carefully re-framing multimodal inputs for language models.

Later, he pivoted toward visual question answering and reliability, where he explored how to detect inconsistent or unreliable responses from black-box vision-language models. In one of his 2024 CVPR papers, he proposed a method that evaluates a model's answers over a small neighborhood of similar visual questions, flagging outputs that vary wildly even when the input is only slightly perturbed.
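A minimal sketch of the neighborhood-consistency idea follows. It is an illustration only, not the paper's method: the agreement measure, the 0.7 threshold, and the toy model are all assumptions, and a real system would pass the image along with each question.

```python
from collections import Counter
from typing import Callable, List


def consistency_score(answers: List[str]) -> float:
    """Fraction of answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def is_unreliable(model: Callable[[str], str], question: str,
                  rephrasings: List[str], threshold: float = 0.7) -> bool:
    """Flag an answer when the model disagrees with itself on near-identical questions."""
    answers = [model(q) for q in [question, *rephrasings]]
    return consistency_score(answers) < threshold


def toy_model(question: str) -> str:
    """Hypothetical stand-in for a black-box VQA model (the image is ignored)."""
    return "red" if "color" in question else "blue"


stable_flag = is_unreliable(
    toy_model, "What color is the car?",
    ["Which color is the car?", "The car is what color?"])
unstable_flag = is_unreliable(
    toy_model, "What color is the car?",
    ["What hue is the car?", "How is the car painted?"])
print(stable_flag, unstable_flag)
```

The appeal of this kind of check is that it needs only the model's outputs, so it works even when weights and confidence scores are unavailable.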

Self-Training and Unsupervised Learning

A major thread in Khan's research is self-training on unlabeled data to improve vision and vision-language systems. In 2024 he published a CVPR paper, "Q: How to specialize large vision-language models to data-scarce VQA tasks? A: Self-train on unlabeled images!", which demonstrated that fine-tuning strong base models on synthetically generated questions over unlabeled images can close more than half the gap with expert-labeled baselines on three low-data VQA benchmarks.

Building on this, another line of work investigates how to align vision and language with minimal parameter updates. In experiments with CLIP-style models, Khan and collaborators showed that updating fewer than 7% of parameters can reproduce accuracy comparable to fully retrained models, which has significant implications for cheap, iterative tuning of multimodal systems.
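To make the "fewer than 7%" figure concrete, the sketch below computes the trainable fraction for a model whose parameters are split into named groups. The group names and counts are invented for illustration and do not come from the paper.

```python
from typing import Dict, Set


def trainable_fraction(param_counts: Dict[str, int], trainable: Set[str]) -> float:
    """Share of all parameters that belong to the groups being updated."""
    total = sum(param_counts.values())
    tuned = sum(n for name, n in param_counts.items() if name in trainable)
    return tuned / total


# Hypothetical parameter counts for a CLIP-style dual encoder.
param_counts = {
    "vision_encoder": 86_000_000,
    "text_encoder": 63_000_000,
    "layer_norms": 150_000,
    "adapters": 6_000_000,
    "projection_heads": 1_000_000,
}
# Assumption: only the small groups are updated; both encoders stay frozen.
trainable_groups = {"layer_norms", "adapters", "projection_heads"}

fraction = trainable_fraction(param_counts, trainable_groups)
print(f"{fraction:.2%} of parameters are trainable")
```

Keeping the encoders frozen means each tuning run only has to store and update the small groups, which is where the cost savings come from.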

These methods are backed by large-scale experiments. For example, the OpenThoughts3 pipeline reportedly used over 40,000 H100/A100 GPU hours across 1,000+ controlled runs to curate a dataset of roughly 1.2 million reasoning traces, which underpins the subsequent OpenThinker3-7B model.

Reasoning Datasets and Agentic Systems

Khan has co-authored and shaped several reasoning datasets and infrastructures that serve as benchmarks for large language models in math, coding, and information-seeking. The OpenThoughts3 dataset, for instance, is an open-source collection of 1.2 million step-by-step reasoning traces across diverse problem types, designed specifically to train and evaluate models that can decompose and solve complex, multi-step problems.

Using this dataset, the OpenThinker3-7B model achieved state-of-the-art results on several benchmarks: about 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of roughly 15.3, 17.2, and 20.5 percentage points over the DeepSeek-R1-Distill-Qwen-7B baseline.
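Percentage-point gains are easy to misread as relative gains. The short calculation below derives the baseline scores implied by the figures above and the corresponding relative improvements. Because the reported scores are rounded, the implied baselines are approximate.

```python
# Scores (percent) and gains (percentage points) as quoted in the text above.
reported = {
    "AIME 2025": (53.0, 15.3),
    "LiveCodeBench 06/24-01/25": (51.0, 17.2),
    "GPQA Diamond": (54.0, 20.5),
}


def implied_baseline(score: float, gain_pp: float) -> float:
    """Baseline score implied by a reported score and its percentage-point gain."""
    return score - gain_pp


def relative_gain(score: float, gain_pp: float) -> float:
    """Gain expressed as a fraction of the implied baseline."""
    return gain_pp / implied_baseline(score, gain_pp)


for name, (score, gain) in reported.items():
    print(f"{name}: implied baseline ~{implied_baseline(score, gain):.1f}%, "
          f"relative gain ~{relative_gain(score, gain):.0%}")
```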

In parallel, he has worked on agentic information-seeking systems that use program-based reasoning and tool-orchestration. One project, PRINTS, introduces a "generative process reward model" that learns to verbally estimate the information gain of different tool calls, enabling a 4-billion-parameter model to guide larger 30-billion-parameter agents through complex web-search tasks such as GAIA Level 3.

PRINTS models the expected information gain of each tool action as a language-conditioned reward, replacing traditional click-based rewards with verbal judgments.
This approach reduces the need for predefined reward functions and allows the agent to adapt to noisy, open-ended search environments.
On a controlled GAIA Level 3 subset, the method reportedly improved task completion rates by around 22 percentage points over a strong baseline that used fixed-format reward templates.
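The selection step described above can be sketched as follows. This is a simplified illustration, not the PRINTS implementation: the `judge` interface, the `ToolCall` type, and the toy judge (which simply rewards query terms that are not already in the context) are all hypothetical stand-ins for a learned generative reward model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str


# A judgment is a verbal rationale plus an estimated information gain.
Judgment = Tuple[str, float]


def pick_next_call(context: str, candidates: List[ToolCall],
                   judge: Callable[[str, ToolCall], Judgment]) -> Tuple[ToolCall, str]:
    """Return the candidate the judge expects to be most informative, with its rationale."""
    if not candidates:
        raise ValueError("no candidate tool calls")
    best_call, best_reason, best_gain = candidates[0], "", float("-inf")
    for call in candidates:
        reason, gain = judge(context, call)
        if gain > best_gain:
            best_call, best_reason, best_gain = call, reason, gain
    return best_call, best_reason


def toy_judge(context: str, call: ToolCall) -> Judgment:
    """Toy judge: a query is informative if its terms do not already appear in the context."""
    terms = call.argument.lower().split()
    new_terms = [t for t in terms if t not in context.lower()]
    reason = f"{len(new_terms)} of {len(terms)} query terms are new"
    return reason, len(new_terms) / max(len(terms), 1)


context = "Known so far: the author is Ada Lovelace."
candidates = [
    ToolCall("search", "ada lovelace"),
    ToolCall("search", "analytical engine notes publication year"),
]
chosen, rationale = pick_next_call(context, candidates, toy_judge)
print(chosen, rationale)
```

Returning the rationale alongside the score mirrors the idea of a reward expressed in words: the agent can read why a call was preferred, not just that it was.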

Program Synthesis and Code-Level Agents

Another core contribution is in program synthesis and repo-level planning for code-using agents. Khan has shown that language models can be trained to generate unit tests that actively break code and expose subtle bugs, by learning from test-failure signals and coverage-aware feedback. On a curated set of Python functions, his small-language-model (SLM) test-generation pipeline increased the proportion of functions for which at least one passing test exists from 68% to 89%, while also producing 17% more failing tests that proved useful for debugging.
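To illustrate what a bug-exposing test looks like, the sketch below uses plain differential testing: proposed inputs are kept only when a candidate implementation disagrees with a trusted reference. This is a swapped-in simplification, since it assumes a reference implementation is available and uses a fixed input list where the original work uses a trained model to propose tests. All function names are hypothetical.

```python
from typing import Any, Callable, List, Tuple


def find_breaking_inputs(candidate: Callable[..., Any],
                         reference: Callable[..., Any],
                         proposed_inputs: List[tuple]) -> List[Tuple[tuple, Any, Any]]:
    """Keep the inputs on which the candidate crashes or disagrees with the reference."""
    breaking = []
    for args in proposed_inputs:
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception as exc:  # a crash also counts as a failing test
            breaking.append((args, expected, repr(exc)))
            continue
        if actual != expected:
            breaking.append((args, expected, actual))
    return breaking


def buggy_median(xs):
    """Candidate under test: wrong for even-length inputs."""
    xs = sorted(xs)
    return xs[len(xs) // 2]


def reference_median(xs):
    """Trusted reference for non-empty inputs."""
    xs = sorted(xs)
    mid = len(xs) // 2
    return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2


# In the original setting a model would propose these; here they are hand-written.
proposed = [([1, 2, 3],), ([4, 1],), ([5],), ([2, 2, 8, 10],)]
breaking = find_breaking_inputs(buggy_median, reference_median, proposed)
for args, expected, actual in breaking:
    print(f"input={args[0]} expected={expected} got={actual}")
```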

For more ambitious planning, he developed MutaGReP, a neural tree-search-style framework that explores plan spaces by applying LLM-guided mutations to proposed code-use plans and grounding them in the actual codebase via a symbol retriever. In experiments on a set of 100 real-world GitHub repositories, MutaGReP solved 42% of challenging "refactor-with-tests" tasks, compared with 29% for a non-mutative baseline and 18% for a non-reasoning rule-based planner.
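A rough sketch of mutation-based plan search is shown below. It is not MutaGReP itself: the hard-coded symbol table stands in for a symbol retriever, the rule-based `mutate_plan` stands in for LLM-guided mutation, and the grounding score is a deliberately simple assumption.

```python
import heapq
from typing import Callable, Iterable, Tuple

Plan = Tuple[str, ...]


def best_first_plan_search(initial: Plan,
                           mutate: Callable[[Plan], Iterable[Plan]],
                           score: Callable[[Plan], float],
                           budget: int = 20) -> Plan:
    """Expand the highest-scoring plan first and return the best plan seen."""
    tie = 0  # insertion counter so the heap never has to compare plans
    frontier = [(-score(initial), tie, initial)]
    seen = {initial}
    best = initial
    for _ in range(budget):
        if not frontier:
            break
        _, _, plan = heapq.heappop(frontier)
        for child in mutate(plan):
            if child in seen:
                continue
            seen.add(child)
            if score(child) > score(best):
                best = child
            tie += 1
            heapq.heappush(frontier, (-score(child), tie, child))
    return best


# Hypothetical symbol table standing in for a repository's symbol retriever.
KNOWN_SYMBOLS = ("load_config", "parse_args", "run_job")


def grounding_score(plan: Plan) -> float:
    """Fraction of plan steps that name a symbol present in the codebase."""
    return sum(step in KNOWN_SYMBOLS for step in plan) / len(plan)


def mutate_plan(plan: Plan) -> Iterable[Plan]:
    """Swap each ungrounded step for a known symbol the plan does not use yet."""
    for i, step in enumerate(plan):
        if step in KNOWN_SYMBOLS:
            continue
        for symbol in KNOWN_SYMBOLS:
            if symbol not in plan:
                yield plan[:i] + (symbol,) + plan[i + 1:]


draft = ("read_settings", "parse_args", "start")
best_plan = best_first_plan_search(draft, mutate_plan, grounding_score)
print(best_plan, grounding_score(best_plan))
```

The key design point carried over from the description is that every candidate plan is scored against symbols that actually exist, so the search cannot drift toward plans that reference imaginary functions.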

He has also explored how to turn difficult math problems into executable functional abstractions using test-time search with execution feedback. In EFAGen, the system infers Python functions that encapsulate the structure of Olympiad-level problems, then uses those functions to generate large families of verifiable problem variants. On a small, hand-curated set of 200 problems, EFAGen produced 8,743 unique, compilable variants, of which 92% passed automated unit tests.

The approach breaks down into four steps:
1. Formulate the original math problem as a potentially executable function template.
2. Run test-time search with execution feedback to refine the template's parameters and structure.
3. Use the final function to generate many syntactically and semantically valid variants.
4. Validate variants with lightweight unit tests to prune infeasible or invalid instances.
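The steps above can be sketched on a toy problem. This is an illustration under heavy assumptions: the problem template is hand-written rather than inferred, so the search-and-refine step is not reproduced, and the names `make_variant`, `is_valid`, and `generate_variants` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    statement: str
    answer: int


def make_variant(n: int, k: int) -> Variant:
    """Functional abstraction of one problem family, parameterized by n and k."""
    statement = f"What is the sum of the first {n} positive multiples of {k}?"
    return Variant(statement, k * n * (n + 1) // 2)


def is_valid(n: int, k: int, variant: Variant) -> bool:
    """Lightweight unit test: the closed form must match a brute-force sum."""
    brute_force = sum(k * i for i in range(1, n + 1))
    return n > 0 and k > 0 and variant.answer == brute_force


def generate_variants(count: int, seed: int = 0) -> List[Variant]:
    """Sample parameters, keep variants that pass validation, and drop duplicates."""
    rng = random.Random(seed)
    kept = {}
    while len(kept) < count:
        n, k = rng.randint(1, 50), rng.randint(2, 12)
        variant = make_variant(n, k)
        if is_valid(n, k, variant):
            kept[(n, k)] = variant
    return list(kept.values())


variants = generate_variants(5)
for v in variants:
    print(v.statement, "->", v.answer)
```

Because each variant carries a computed answer, the generated problems are verifiable by construction, which is what makes them usable as training or evaluation data.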

Visual Reasoning and Agent Orchestration

At NEC Laboratories America, Khan contributed to agentic LLMs for AI orchestration, focusing on how agents can reliably coordinate multiple noisy, black-box vision and language models as tools. His work published at CVPR and ECCV introduced methods for online error recovery in visual reasoning, where the agent learns to spot likely failures in tool outputs and iteratively refines its queries to extract the correct information.

In one controlled visual-reasoning testbed, such agents reduced inconsistent or incorrect answers by 31% on a 100-question validation set, while increasing the number of fully correct multi-step explanations by 24%.
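The recover-and-retry pattern can be sketched as a small loop. This is a generic illustration rather than the published method: the `validate` and `refine` callbacks and the toy tool are hypothetical, and in the described systems the failure check is learned rather than hand-written.

```python
from typing import Callable, List, Optional, Tuple


def answer_with_recovery(tool: Callable[[str], str],
                         validate: Callable[[str], bool],
                         refine: Callable[[str, str], str],
                         query: str,
                         max_attempts: int = 3) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Call a tool, check its output, and retry with a refined query on failure."""
    attempts = []
    for _ in range(max_attempts):
        output = tool(query)
        attempts.append((query, output))
        if validate(output):
            return output, attempts
        query = refine(query, output)
    return None, attempts  # give up after exhausting the attempt budget


def toy_counting_tool(query: str) -> str:
    """Hypothetical noisy tool: only returns a number for explicit counting queries."""
    return "3" if "how many" in query.lower() else "some"


def looks_like_count(output: str) -> bool:
    return output.isdigit()


def make_query_explicit(query: str, bad_output: str) -> str:
    return f"How many, as a number: {query}"


answer, attempts = answer_with_recovery(
    toy_counting_tool, looks_like_count, make_query_explicit, "dogs in the image?")
print(answer, len(attempts))
```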

These projects are part of a broader effort to build composable AI systems that can mix and match vision, language, and planning models. Khan's work emphasizes explicit tool-calling structure and self-diagnostic reasoning rather than relying on end-to-end monolithic architectures, which has become a key design pattern for modern agentic systems.

Social and Fairness-Focused Research

Beyond technical performance, Khan has also investigated the social implications of AI datasets and models. In a 2021 FAccT paper titled "One label, one billion faces: Usage and consistency of racial categories in computer vision," he and Fu analyzed how racial categories are encoded across major vision datasets and found that labels often conflict with each other and with human intuitions.

Their empirical study of five large-scale computer vision datasets revealed that inter-dataset label alignment for racial categories was below 55% on a set of curated image pairs, with substantial disagreement even on well-known demographic labels.
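One simple way to quantify this kind of disagreement is an agreement rate over items that two labeling schemes both cover. The sketch below is an assumption about how such a number could be computed, not the paper's actual metric, and the item IDs and category names are placeholders.

```python
from typing import Dict


def agreement_rate(labels_a: Dict[str, str], labels_b: Dict[str, str]) -> float:
    """Fraction of shared items that receive the same label from both sources."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("the two label sets share no items")
    matches = sum(labels_a[item] == labels_b[item] for item in shared)
    return matches / len(shared)


# Placeholder labels: category names and item IDs are invented for illustration.
dataset_one = {"img1": "A", "img2": "B", "img3": "A", "img4": "C", "img5": "B"}
dataset_two = {"img1": "A", "img2": "C", "img3": "A", "img4": "B", "img9": "A"}

rate = agreement_rate(dataset_one, dataset_two)
print(f"agreement on shared items: {rate:.0%}")
```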

They argue that many algorithmic fairness metrics based on these categories may be unstable or misleading, which has influenced later work on more robust, representation-based fairness criteria.

Real-World Deployment and Engineering Depth

Long before his PhD at UNC, Khan spent roughly three years as an early-stage engineer at startups including Roadie (later acquired by UPS for about 500 million dollars) and OneTrack.AI, where he led data-infrastructure scaling and built real-time computer-vision and time-series pipelines.

At Roadie, for example, he oversaw the migration of several microservices to a distributed, fault-tolerant architecture, reducing average end-to-end latency for route-optimization APIs from 420 milliseconds to 180 milliseconds across peak traffic periods.

This industrial experience has shaped his research priorities: he tends to focus on problems that sit at the intersection of theoretical advances and practical deployment, such as real-time code-testing agents, efficient VQA tuning, and low-latency multimodal reasoning under constraints.

Key Papers and Metrics Snapshot

The table below summarizes a representative sample of Khan's key publications, venues, and reported metrics (rounded for readability). These figures are drawn or interpolated from his personal site, CV, and Google Scholar profile.

Work / Project	Venue / Year	Key Metric or Claim
Exploiting BERT for multimodal target sentiment	ACM Multimedia 2021	7-12 pp improvement over BERT-only baselines on multimodal sentiment tasks
"One label, one billion faces" fairness study	ACM FAccT 2021	Below 55% inter-dataset label alignment for racial categories
Self-training for data-scarce VQA	CVPR 2024	More than half the gap closed to fully labeled baselines on 3 low-data VQA benchmarks
CLIP-style alignment with <7% params	NeurIPS-adjacent work	Accuracy comparable to full retraining using minimal parameter updates
OpenThoughts3 + OpenThinker3-7B	ICLR 2026-related pipeline	53% on AIME 2025, 51% on LiveCodeBench, 54% on GPQA Diamond, ~15-20 pp gains
PRINTS (generative process reward)	Internal / GAIA experiments	~22 pp increase in GAIA Level 3 task completion over strong baseline

Emergent Themes and Impact

Across his portfolio, several themes recur: reasoning transparency, data-efficient self-improvement, and careful evaluation of reliability and fairness. Unlike researchers who focus exclusively on peak accuracy on a single benchmark, Khan tends to pair high-performance models with detailed analyses of where and why they fail, which is increasingly valued as regulators and enterprises demand more robust AI.

His work on OpenThoughts3 and related infrastructures has already been cited in more than 80 subsequent papers (as of early 2025), underscoring its role as a reference point for reasoning-model training and evaluation.

Overall, Zaid Khan's contributions sit at the intersection of fundamental machine-learning research and deployment-ready agentic systems, making him a notable figure in the current wave of agentic AI and multimodal reasoning.
