Sahil Khose

Research

WFM-Eval: Interpretable Error Diagnostics for Video World Models in Robotics

CVPR 2026 Workshops · Video World Models & FMEA

Sahil Khose, Mengqi Zhang, Prithvijit Chattopadhyay, Judy Hoffman

[Paper] [Website]

TL;DR. A robotics-specific framework scoring task completion, object hallucination, and temporal consistency. We benchmark five video world models (Cosmos Predict2 & Predict2.5, Veo 3.1, HunyuanVideo 1.5, Wan2.2) across six manipulation datasets, and test VLM judges (Qwen2.5-VL, Qwen3-VL, InternVL3.5, Cosmos-Reason1, Kimi-K2.5) for task completion. No single VLM reliably judges success, model rankings reverse between datasets, and object hallucination (not photorealism) is the dominant failure mode for downstream policy learning.

Generative Video Models for Robot Policy Learning

Under review

[Paper coming soon]

TL;DR. Decouples a video world-model expert (Cosmos-Predict2) from a flow-matching action expert via cross-attention bridge tokens, grounded with affordance and depth. Delivers strong, highly data-efficient manipulation, outperforming prior video-action models on RoboCasa with far fewer demonstrations, and generalizes across LIBERO / LIBERO-Plus and real-world bimanual tasks.

SPOT: Structured Prompting with Object-Centric Tokens for Open-World Scene Graphs

CVPR 2026 Workshops · MUSI & Visual Concepts

Mengqi Zhang, Sahil Khose, Fiona Ryan, Judy Hoffman

[Paper]

TL;DR. Open-world and 3D scene-graph generation on open-source VLMs via structured "fill-in-the-blank" prompts, object-centric SigLIP tokens, and distance-aware relation pruning. Competitive on Visual Genome (PredCLS R@100 = 61.9), outperforms GPT-4o-based methods on open-world cross-domain benchmarks (PSG, 3DSSG), and far ahead of prior open-vocabulary methods in 3D.

Extending Multimodal Large Language Models Beyond a Single Modality (Vision + Audio)

Preprint

Sahil Khose, Manushree Vasu, Humphrey Shi, Judy Hoffman

[Paper]

TL;DR. A generalist vision+audio MLLM (CLIP + BEATs + Vicuna). Joint instruction tuning beats sequential fine-tuning, simple MLP projectors beat Q-Former in data-scarce regimes, and instruction-following transfers from data-rich (vision) to data-scarce (audio) modalities.

SkyScenes: A Synthetic Dataset for Aerial Scene Understanding

ECCV 2024

Sahil Khose*, Anisha Pal*, Aayushi Agarwal*, Deepanshi*, Judy Hoffman, Prithvijit Chattopadhyay

[arXiv] [Dataset] [Code]

Press: Georgia Tech News · GT College of Computing · GT News Center · Mirage News

TL;DR. 33.6K densely annotated, controllable synthetic UAV images (CARLA) across maps, weather/time, altitude, and pitch, with a HumanSpawn algorithm that yields ~10× more tail-class pixels. Strong synthetic→real transfer for aerial segmentation. * equal contribution.

LatentDR: Improving Model Generalization with Sample-Aware Latent Degradation & Restoration

WACV 2024

Ran Liu, Sahil Khose, Jingyun Xiao, Lakshmi Sathidevi, Keerthan Ramnath, Zsolt Kira, Eva L. Dyer

[arXiv]

TL;DR. A latent-space augmentation that degrades a sample toward classifier confusion then restores its class via attention across minibatch samples: +2.7% avg over ERM on DomainBed and up to +9% on medical imaging.

Previous research

INDICON 2023: Explainable Classification of Macular Degeneration Using Deep Learning [IEEE] [Paper]

INDICON 2023: Fovea Segmentation Using Semi-Supervised Learning [IEEE] [Paper]

NeurIPS-W 2022: Continual VQA for Disaster Response Systems [GitHub] [Paper]

ICML-W 2022: An Efficient Modern Baseline for FloodNet VQA (Best Paper) [GitHub] [Paper]

ACL-W 2022: Transformer based ensemble for emotion detection (Oral) [GitHub] [Paper]

NeurIPS-W 2021: A Studious Approach to Semi-Supervised Learning [GitHub] [Paper]

NeurIPS-W 2021: XCI-Sketch: Colored Outlines & Sketches from Images (Oral) [GitHub] [Paper]

NeurIPS-W 2021: Semi-Supervised Aerial Classification & Segmentation (Spotlight) [GitHub] [Paper]

NAACL-W 2021: BERT for Health Information Extraction from Social Media (Top Performer) [GitHub] [Paper]

Service

Conference reviewer: CVPR (2026, 2025) · NeurIPS (2026, 2025) · ECCV (2026, 2024) · CoRL 2026
Workshop reviewer: CVPR-W 2026 (FMEA) · NeurIPS-W 2025 (MATH-AI) · CVPR-W 2025 (EMACS) · NeurIPS-W 2023 (ICBINB, DGM4H) · ICCV-W 2023 (WiCV) · NAACL-W 2021 (SMM4H)
Volunteer: ICRA 2025 (Atlanta, GA) · NeurIPS 2022 (New Orleans, LA)
