Hi, I am a senior researcher in the Deep Learning Group at Microsoft Research, Redmond, directed by Dr. Jianfeng Gao. My research interests span computer vision, vision & language, and machine learning. More specifically, my primary research focuses on structured visual understanding at different levels and on leveraging it for intelligent interaction with humans through language and with the environment through embodiment. I believe that by integrating fine-grained structured information, we can build multi-modal intelligent agents that are more interpretable, grounded, and robust.
Prior to joining Microsoft in March 2020, I earned my Ph.D. in Computer Science from the School of Interactive Computing at Georgia Tech with the thesis “Structured Visual Understanding, Generation and Reasoning”. I was honored to be advised by Prof. Devi Parikh and to work closely with Prof. Dhruv Batra. Here is my old GT homepage.
If you are interested in working with me as a research intern, please feel free to drop me an email at jianwei.yang at microsoft dot com.
[07/2021] Four papers are accepted to ICCV 2021! Congratulations to all authors!
[07/2021] We are releasing Focal Transformer, achieving new SoTA on COCO object detection and instance segmentation and on ADE20K semantic segmentation!
[06/2021] We are releasing EsViT, a much more efficient self-supervised learning pipeline reaching new SoTA on ImageNet-1k linear probe!
[04/2021] We are introducing Vision Longformer, an improved vision transformer that achieves significant gains on image classification and object detection compared with ResNet and PVT!
[01/2021] We show in our arXiv paper that visual features matter significantly for vision-language tasks!
[12/2020] We release our arXiv paper studying the visual reasoning capacity of visual question answering models!
[11/2020] We release our arXiv paper leveraging token relationships to learn neural-symbolic video captioning!