Xuhui Zhan
Vision-Language Models
Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
We’ve flipped the script on multimodal fusion: instead of forcing visual features into discrete text space, we map text embeddings into continuous visual space, eliminating costly alignment pre-training while achieving competitive performance.
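To make the direction of the mapping concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the single linear projection, the concatenation-based fusion, and every name in it (`make_projection`, `project_text_to_vision`, `fuse`, the dimensions) are assumptions chosen for illustration only. It shows text embeddings being projected into the visual feature dimension while the visual features themselves are left untouched.

```python
import random


def make_projection(text_dim, vision_dim, seed=0):
    """Hypothetical projection matrix of shape (text_dim, vision_dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(vision_dim)]
            for _ in range(text_dim)]


def project_text_to_vision(text_embeddings, weights):
    """Map each text embedding (length text_dim) to a vector of length vision_dim."""
    vision_dim = len(weights[0])
    return [
        [sum(e * weights[i][j] for i, e in enumerate(emb)) for j in range(vision_dim)]
        for emb in text_embeddings
    ]


def fuse(visual_features, text_embeddings, weights):
    """Assumed fusion step: keep visual features as-is and append the projected text.

    Concatenation is an illustrative choice, not something the source specifies.
    """
    return visual_features + project_text_to_vision(text_embeddings, weights)


# Toy dimensions; real models use far larger values.
text_dim, vision_dim = 4, 6
weights = make_projection(text_dim, vision_dim)

visual_features = [[0.1] * vision_dim for _ in range(3)]   # 3 visual tokens
text_embeddings = [[1.0, 0.0, 0.0, 0.0],                   # 2 text tokens
                   [0.0, 1.0, 0.0, 0.0]]

fused = fuse(visual_features, text_embeddings, weights)
print(len(fused), len(fused[0]))  # 5 6
```

In this sketch only the text side passes through a learned mapping; the visual features reach the fused sequence unchanged, which is the inversion the description refers to.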