Welcome to My Homepage!

Hi, I am a principal researcher in the Deep Learning Group at Microsoft Research, Redmond, directed by Dr. Jianfeng Gao. My research interests span computer vision, vision & language, and machine learning. More specifically, my primary research focuses on structured visual understanding at different levels, and on how to further leverage it for intelligent interaction with humans through language and with the environment through embodiment. I believe that by integrating fine-grained structured information, we can build multi-modal intelligent agents that are not only stronger but also more interpretable, grounded, and robust.

Prior to joining Microsoft in March 2020, I earned my Ph.D. in Computer Science from the School of Interactive Computing at Georgia Tech with the thesis “Structured Visual Understanding, Generation and Reasoning”. I was fortunate to be supervised by Prof. Devi Parikh and to work closely with Prof. Dhruv Batra.

My current research is particularly focused on building generalist multi-modal foundation models towards AGI. Our team is among the first in this line of research and has produced a series of works, including (a) bridging core vision tasks with language: UniCL, RegionCLIP, GLIP, K-LITE, and the foundation model Florence; (b) developing the generalist decoder X-Decoder; and (c) enabling more promptable and interactive systems such as SEEM, Semantic-SAM, LLaVA, and SoM for GPT-4V.

If you are interested in working with me to build next-generation multi-modal foundation models, please feel free to drop me an email at jianwei.yang at microsoft dot com.

News

[10/2023] We released Set-of-Mark (SoM) prompting to unleash the extraordinary visual grounding power in GPT-4V. Come and try our SoM toolbox!
[10/2023] We released BiomedJourney, a novel method for counterfactual medical image generation by instruction-learning from multimodal patient journeys! Check it out!
[09/2023] Segment Everything Everywhere All at Once (SEEM) has been accepted by NeurIPS 2023!
[09/2023] Check out our Survey Paper and Book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants!
[07/2023] We are releasing Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity!
[04/2023] We introduce SEEM that can Segment Everything Everywhere with Multi-modal prompts all at once!
[03/2023] We are releasing X-GPT to connect X-Decoder with GPT for conversational AI!
[03/2023] We are announcing the 2nd Computer Vision in the Wild (CVinW) Workshop at CVPR 2023! You are welcome to submit your work and/or participate in the ICinW, ODinW, and SGinW challenges!
[12/2022] We released X-Decoder, a generalist model for decoding pixel-level masks and token-level semantics seamlessly, achieving state-of-the-art open-vocabulary segmentation on 8 datasets, referring segmentation on RefCOCOg+, and instance/panoptic segmentation on ADE-20K!
[12/2022] Serving as an Area Chair for ICCV 2023.
[11/2022] We wrote a blog post explaining our FocalNet in plain terms and how it differs from attention in mechanism, interpretability, and performance.
[10/2022] We scaled our FocalNets up to huge size and achieved a new SoTA on COCO object detection: 64.2 on minival and 64.3 on test-dev! Check out our new version and code!
[09/2022] Two papers were accepted to NeurIPS 2022; see you in New Orleans, AGAIN!
[09/2022] We are organizing the Computer Vision in the Wild Workshop and Challenge; submit your paper and method!

Selected Preprints

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V.

Jianwei Yang*☨, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao.
arXiv, 2023

Semantic-SAM: Segment and Recognize Anything at Any Granularity.

Feng Li*, Hao Zhang*, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang^, Chunyuan Li, Lei Zhang☨, Jianfeng Gao☨.
arXiv, 2023

Florence: A new foundation model for computer vision.

Lu Yuan*, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang. (Team members in alphabetical order)
arXiv, 2021

Selected Publications

Segment Everything Everywhere All at Once.

Xueyan Zou*, Jianwei Yang*^, Hao Zhang*, Feng Li*, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao☨, Yong Jae Lee☨.
NeurIPS, 2023

A Simple Framework for Open-Vocabulary Segmentation and Detection.

Hao Zhang*, Feng Li*, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang☨, Lei Zhang☨.
ICCV, 2023

Generalized Decoding for Pixel, Image, and Language.

Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee☨, Jianfeng Gao☨.
CVPR, 2023

Focal Modulation Networks.

Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan and Jianfeng Gao.
NeurIPS, 2022

K-LITE: Learning transferable visual models with external knowledge.

Sheng Shen*, Chunyuan Li*, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Jianfeng Gao.
NeurIPS, 2022. Oral

Grounded language-image pre-training.

Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao.
CVPR, 2022. Best Paper Finalist

RegionCLIP: Region-based language-image pretraining.

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao.
CVPR, 2022

Unified contrastive learning in image-text-label space.

Jianwei Yang*, Chunyuan Li*, Pengchuan Zhang*, Bin Xiao*, Ce Liu, Lu Yuan, Jianfeng Gao.
CVPR, 2022

Efficient self-supervised vision transformers for representation learning.

Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao.
ICLR, 2022

Focal attention for long-range interactions in vision transformers.

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao.
NeurIPS, 2021. Spotlight

TACo: Token-aware cascade contrastive learning for video-text alignment.

Jianwei Yang, Yonatan Bisk, Jianfeng Gao.
ICCV, 2021