Sahil Khose

PhD student in Computer Science at UC Irvine, advised by Prof. Judy Hoffman.

Open to research internships. Get in touch.

world models · VLMs · robustness

I work toward generalizable, physically-grounded embodied intelligence: systems that perceive across modalities and act reliably in the world. Two commitments run through my work: models should generalize under distribution shift, and we should be able to tell when to trust them.

Lately I focus on generative world models for robot learning: using them to learn physically-grounded, data-efficient policies [VAM], and building interpretable evaluation to know when to trust them [WFM-Eval], since a generated video can look photorealistic while teaching a robot to grasp an object that isn't there. This builds on a foundation in OOD robustness [LatentDR], controllable synthetic data [SkyScenes], and multimodal / 3D grounding [SPOT] [MLLM].

Affiliations

  • Manipal Institute of Technology · 2018–2022
  • Indian Institute of Science · 2021–2022
  • Georgia Tech · 2022–2026
  • UC Irvine · 2026–Present

News

  • May 2026: Looking for research internships in world models, VLMs & robot learning. If your team is hiring, let's talk.
  • May 2026: WFM-Eval accepted at two CVPR 2026 workshops: Video World Models and Foundation Models meet Embodied Agents (FMEA).
  • May 2026: SPOT accepted at two CVPR 2026 workshops (Multimodal Spatial Intelligence & Visual Concepts), led by my labmate Mengqi Zhang.
  • Sep 2024: SkyScenes featured in Georgia Tech News, College of Computing & Mirage News coverage.
  • Jul 2024: SkyScenes accepted to ECCV 2024.
  • Jan 2024: LatentDR accepted to WACV 2024.

Research

WFM-Eval

WFM-Eval: Interpretable Error Diagnostics for Video World Models in Robotics

CVPR 2026 Workshops · Video World Models & FMEA

Sahil Khose, Mengqi Zhang, Prithvijit Chattopadhyay, Judy Hoffman

TL;DR. A robotics-specific framework scoring task completion, object hallucination, and temporal consistency. We benchmark five video world models (Cosmos Predict2 & Predict2.5, Veo 3.1, HunyuanVideo 1.5, Wan2.2) across six manipulation datasets, and test VLM judges (Qwen2.5-VL, Qwen3-VL, InternVL3.5, Cosmos-Reason1, Kimi-K2.5) for task completion. No single VLM reliably judges success, model rankings reverse between datasets, and object hallucination (not photorealism) is the dominant failure mode for downstream policy learning.

VAM (video action model)

Generative Video Models for Robot Policy Learning

Under review

TL;DR. Decouples a video world-model expert (Cosmos-Predict2) from a flow-matching action expert via cross-attention bridge tokens, grounded with affordance and depth. Delivers data-efficient manipulation, outperforming prior video-action models on RoboCasa with far fewer demonstrations, and generalizes across LIBERO / LIBERO-Plus and real-world bimanual tasks.

SPOT

SPOT: Structured Prompting with Object-Centric Tokens for Open-World Scene Graphs

CVPR 2026 Workshops · MUSI & Visual Concepts

Mengqi Zhang, Sahil Khose, Fiona Ryan, Judy Hoffman

TL;DR. Open-world and 3D scene-graph generation on open-source VLMs via structured "fill-in-the-blank" prompts, object-centric SigLIP tokens, and distance-aware relation pruning. Competitive on Visual Genome (PredCLS R@100 = 61.9); outperforms GPT-4o-based methods on open-world cross-domain benchmarks (PSG, 3DSSG) and is far ahead of prior open-vocabulary methods in 3D.

MLLM (vision and audio)

Extending Multimodal Large Language Models Beyond a Single Modality (Vision + Audio)

Preprint

Sahil Khose, Manushree Vasu, Humphrey Shi, Judy Hoffman

TL;DR. A generalist vision+audio MLLM (CLIP + BEATs + Vicuna). Joint instruction tuning beats sequential fine-tuning, simple MLP projectors beat Q-Former in data-scarce regimes, and instruction-following transfers from data-rich (vision) to data-scarce (audio) modalities.

SkyScenes

SkyScenes: A Synthetic Dataset for Aerial Scene Understanding

ECCV 2024

Sahil Khose*, Anisha Pal*, Aayushi Agarwal*, Deepanshi*, Judy Hoffman, Prithvijit Chattopadhyay

TL;DR. 33.6K densely annotated, controllable synthetic UAV images (CARLA) across maps, weather/time, altitude, and pitch, with a HumanSpawn algorithm that increases tail-class pixels by ~10×. Strong synthetic→real transfer for aerial segmentation. * equal contribution.

Previous research

INDICON 2023: Explainable Classification of Macular Degeneration Using Deep Learning · IEEE | Paper
INDICON 2023: Fovea Segmentation Using Semi-Supervised Learning · IEEE | Paper
NeurIPS-W 2022: Continual VQA for Disaster Response Systems · GitHub | Paper
ICML-W 2022: An Efficient Modern Baseline for FloodNet VQA · Best Paper · GitHub | Paper
ACL-W 2022: Transformer based ensemble for emotion detection · Oral · GitHub | Paper
NeurIPS-W 2021: A Studious Approach to Semi-Supervised Learning · GitHub | Paper
NeurIPS-W 2021: XCI-Sketch: Colored Outlines & Sketches from Images · Oral · GitHub | Paper
NeurIPS-W 2021: Semi-Supervised Aerial Classification & Segmentation · Spotlight · GitHub | Paper
NAACL-W 2021: BERT for Health Information Extraction from Social Media · Top Performer · GitHub | Paper

Service

  • Conference reviewer: CVPR (2026, 2025) · NeurIPS (2026, 2025) · ECCV (2026, 2024) · CoRL 2026
  • Workshop reviewer: CVPR-W 2026 (FMEA) · NeurIPS-W 2025 (MATH-AI) · CVPR-W 2025 (EMACS) · NeurIPS-W 2023 (ICBINB, DGM4H) · ICCV-W 2023 (WiCV) · NAACL-W 2021 (SMM4H)
  • Volunteer: ICRA 2025 (Atlanta, GA) · NeurIPS 2022 (New Orleans, LA)