Hi! I am a principal researcher in the Deep Learning Group at Microsoft Research, Redmond, directed by Dr. Jianfeng Gao. Prior to joining Microsoft in March 2020, I earned my Ph.D. in Computer Science from the School of Interactive Computing at Georgia Tech with the thesis “Structured Visual Understanding, Generation and Reasoning”. I was fortunate to be supervised by Prof. Devi Parikh and to work closely with Prof. Dhruv Batra.

My current research is focused on building generalist multimodal agents. Our team was among the first in this line of research and has produced a series of works, including (a) bridging core vision tasks with language: UniCL, RegionCLIP, GLIP, K-LITE, and the foundation model Florence; (b) developing the generalist decoder X-Decoder; and (c) enabling more promptable, grounded, and interactive systems such as SEEM, Semantic-SAM, LLaVA, SoM Prompting for GPT-4V, and Phi-3-Vision. I believe that, pushing further forward, we can achieve performant yet interpretable, grounded, and robust multimodal intelligent agents.

If you are interested in working with me, please feel free to drop me an email at jianwei.yang at microsoft dot com.

Research News

[09/2024] Three papers accepted to NeurIPS 2024!
[09/2024] BiomedParse was accepted by Nature Methods and GigaPath by Nature! Big congratulations to the Health Futures team and cheers to the great collaborations!
[08/2024] Five papers accepted to ECCV 2024, two papers accepted to CVPR 2024, and one paper accepted to COLM 2024!
[05/2024] We announced Phi-3-Vision, a 4.2B parameter multimodal model with language and vision capabilities! [blog] [hugging face]
[10/2023] We released Set-of-Mark (SoM) prompting to unleash the extraordinary visual grounding power of GPT-4V. Come and try our SoM toolbox!
[09/2023] Segment Everything Everywhere All at Once (SEEM) has been accepted by NeurIPS 2023!
[09/2023] Check out our Survey Paper and Book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants!
[07/2023] We released Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity!
[04/2023] We introduced SEEM, which can Segment Everything Everywhere with Multi-modal prompts all at once!
[12/2022] We released X-Decoder, a generalist model for decoding pixel-level masks and token-level semantics seamlessly!

Academic Activities

[09/2024] Had a great panel discussion about next-generation multimodal models at Microsoft Research Forum Session 4 on Multimodality.
[07&08/2024] Served as an Area Chair for NeurIPS 2024 and ICLR 2025.
[06/2024] Gave a tutorial on “A Close Look at Vision in Large Multimodal Models” [slides] [youtube] at the CVPR 2024 Tutorial on Recent Advances in Vision Foundation Models.
[06/2024] Gave a keynote talk on “Promptable Vision Foundation in the Wild: From Head to Tail” at the CVPR 2024 Workshop on Computer Vision for Materials Science.
[06/2024] Organized the 3rd Computer Vision in the Wild (CVinW) Workshop at CVPR 2024.
[05&06/2024] Gave invited talks on “Towards General-Purpose Multimodal Agent” at the University of Washington and Together AI.
[07/2023] Joined a panel discussion on AI Frontier at WAIC and gave an invited talk on “Towards General-Purpose Multimodal Agent” at IDEA.
[06/2023] Gave a tutorial on “From Representation to Interface: The Evolution of Foundation for Vision Understanding” [slides] [youtube] at the CVPR 2023 Tutorial on Recent Advances in Vision Foundation Models.
[03/2023] Announced the 2nd Computer Vision in the Wild (CVinW) Workshop at CVPR 2023!
[12/2022] Served as an Area Chair for ICCV 2023.
[06/2022] Gave a tutorial on “Vision Language Pretraining for Image Classification” [slides] [youtube] at the CVPR 2022 Tutorial on Recent Advances in Vision-and-Language Pretraining.
[09/2021] Gave a guest lecture on “Learning Visual Curiosity for an Agent through Language and Embodiment” [youtube] at the NeurIPS 2021 IGLU Contest.

Selected Preprints

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V.

Jianwei Yang*☨, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao.
arXiv, 2023

Florence: A New Foundation Model for Computer Vision.

Lu Yuan*, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang. (Team members in alphabetical order)
arXiv, 2021

Selected Publications

Semantic-SAM: Segment and Recognize Anything at Any Granularity.

Feng Li*, Hao Zhang*, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang^, Chunyuan Li, Lei Zhang☨, Jianfeng Gao☨.
ECCV, 2024

Segment Everything Everywhere All at Once.

Xueyan Zou*, Jianwei Yang*^, Hao Zhang*, Feng Li*, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao☨, Yong Jae Lee☨.
NeurIPS, 2023

A Simple Framework for Open-Vocabulary Segmentation and Detection.

Hao Zhang*, Feng Li*, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang☨, Lei Zhang☨.
ICCV, 2023

Generalized Decoding for Pixel, Image, and Language.

Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee☨, Jianfeng Gao☨.
CVPR, 2023

Focal Modulation Networks.

Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
NeurIPS, 2022

K-LITE: Learning Transferable Visual Models with External Knowledge.

Sheng Shen*, Chunyuan Li*, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Jianfeng Gao.
NeurIPS, 2022. Oral

Grounded Language-Image Pre-training.

Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao.
CVPR, 2022. Best Paper Finalist

RegionCLIP: Region-Based Language-Image Pretraining.

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao.
CVPR, 2022

Unified Contrastive Learning in Image-Text-Label Space.

Jianwei Yang*, Chunyuan Li*, Pengchuan Zhang*, Bin Xiao*, Ce Liu, Lu Yuan, Jianfeng Gao.
CVPR, 2022

Efficient Self-Supervised Vision Transformers for Representation Learning.

Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao.
ICLR, 2022

Focal Attention for Long-Range Interactions in Vision Transformers.

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao.
NeurIPS, 2021. Spotlight

TACo: Token-Aware Cascade Contrastive Learning for Video-Text Alignment.

Jianwei Yang, Yonatan Bisk, Jianfeng Gao.
ICCV, 2021