About me

I'm a Ph.D. student in Computer Science at Georgia Tech, where I'm fortunate to be advised by Prof. Judy Hoffman. My research focuses on developing multimodal vision-language models that integrate spatial, semantic, and temporal reasoning with minimal supervision.

Recent work includes:

  1. A 7B open-source VLM for open-vocabulary 3D scene graph generation, under review at NeurIPS 2025.

  2. SkyScenes, a synthetic aerial dataset for improving real-world segmentation, accepted at ECCV 2024.

  3. Generalist Multimodal LLM, where I designed a jointly trained vision-audio model that outperforms larger generalist systems by reducing cross-modal interference.

I bring prior experience in domain generalization, zero-shot learning, and synthetic-to-real adaptation, focusing on making models robust to diversity, correlation, and semantic shifts in real-world environments. My goal is to build generalizable systems that require minimal labeled data yet remain reliable under distribution shifts.

I also review for top conferences (NeurIPS, CVPR, ECCV) and have published in both the vision and language communities.

🌟 I'll be attending CVPR 2025. If you're around and working in similar areas, I'd love to connect!

💼 I'm currently looking for research internships for Summer 2026; feel free to reach out if you're hiring!



Recent Updates

[ 🌟: Important | 💡: Research Paper | 📆: Miscellaneous ]


Publications

2024

2022

2021