Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
CVPR 2025
1 GenAI, Meta | 2 Johns Hopkins University
[Paper] [Code] [HuggingFace Demo] [BibTeX]
We introduce CrossFlow, a streamlined framework that directly maps between two modalities using standard flow matching, eliminating the need for a noise distribution or additional conditioning. It enables more flexible and effective utilization of powerful flow matching models.
Using a vanilla transformer without cross-attention, CrossFlow directly evolves text representations into images for text-to-image generation (see Fig. a). Furthermore, it achieves state-of-the-art performance across various tasks (see Fig. b), including image captioning, zero-shot depth estimation, and image super-resolution, by directly mapping between modalities without relying on task-specific architectures.
Abstract
Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching models is that, unlike diffusion models, they are not constrained to use noise as the source distribution. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla transformer without cross-attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic that results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state of the art on various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
Method
CrossFlow enables direct evolution between two different modalities. Taking text-to-image generation as an example, our T2I model consists of two main components: a Text Variational Encoder and a standard flow matching model. Starting from a text embedding x produced by any language model, the Text Variational Encoder predicts a mean and variance from which the text latent z0 is sampled. This text latent z0 is then evolved directly into the image space by the flow matching model to produce the image latent z1.
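To make the pipeline concrete, the snippet below is a minimal, hypothetical PyTorch sketch of one training step under these assumptions: `text_ve` returns a mean and log-variance for the text latent, `velocity_model` stands for the vanilla transformer predicting the flow velocity, and the straight-line flow matching path and KL weight are illustrative choices rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def crossflow_training_step(text_emb, image_latent, text_ve, velocity_model, kl_weight=1e-3):
    """One simplified CrossFlow training step for text-to-image generation (sketch)."""
    # 1) Text Variational Encoder: predict mean / log-variance and sample z0
    #    via the reparameterization trick.
    mu, logvar = text_ve(text_emb)
    z0 = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    # 2) Standard flow matching, except the source sample is the text latent z0
    #    rather than Gaussian noise, and the target is the image latent z1.
    z1 = image_latent
    t = torch.rand(z0.shape[0], device=z0.device).view(-1, *([1] * (z0.dim() - 1)))
    z_t = (1 - t) * z0 + t * z1            # point on the straight path from z0 to z1
    target_v = z1 - z0                     # constant velocity along that path

    pred_v = velocity_model(z_t, t)        # vanilla transformer, no cross-attention
    fm_loss = F.mse_loss(pred_v, target_v)

    # 3) VAE-style KL regularizer on the text latent; the weight is an assumed value.
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return fm_loss + kl_weight * kl_loss
```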
Text-to-image Generation
By directly mapping between text and image space using a vanilla transformer without cross-attention, our 1B model delivers performance comparable to state-of-the-art T2I models such as DALL·E 2 and Stable Diffusion 1.5.
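At inference time, generation amounts to numerically integrating the learned flow starting from the text latent z0; no noise source is sampled and no conditioning signal is injected into the model. Below is an assumed Euler-integration sketch; the helper name `integrate_flow`, the step count, and the time-conditioning convention are illustrative. The text latent z0 comes from the Text Variational Encoder exactly as in the training sketch above, and the result is decoded with the image VAE to obtain pixels.

```python
import torch

@torch.no_grad()
def integrate_flow(z0, velocity_model, num_steps=50):
    """Euler-integrate the learned ODE from the text latent z0 toward the image latent z1."""
    z, dt = z0, 1.0 / num_steps
    for i in range(num_steps):
        # broadcastable time tensor for the current step
        t = torch.full((z.shape[0], *([1] * (z.dim() - 1))), i * dt, device=z.device)
        z = z + dt * velocity_model(z, t)   # follow the predicted velocity field
    return z                                # approximately z1
```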
Scaling Characteristics
Our model exhibits superior scaling behavior compared to standard flow matching with cross-attention as training steps or model size increase. We plot performance against model parameters and training iterations. Left: larger models are able to exploit cross-modality connections better. Right: while CrossFlow needs more steps to converge, it ultimately achieves better final performance.
Linear Interpolation in Latent Space
Our model provides visually smooth interpolations in the latent space. We show images generated by linear interpolation between the first and the second text latents (i.e., interpolation between two text latents z0). CrossFlow enables visually smooth transitions in object orientation, composite colors, shapes, background scenes, and even object categories.
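A hypothetical sketch of how such an interpolation sequence can be produced, reusing the assumed `text_ve` and `integrate_flow` helpers from the sketches above; using the predicted means as deterministic endpoints is an illustrative choice.

```python
import torch

def interpolate_between_captions(emb_a, emb_b, text_ve, velocity_model, num_frames=8):
    """Generate image latents by linearly interpolating between two text latents z0."""
    mu_a, _ = text_ve(emb_a)   # deterministic endpoint for caption A
    mu_b, _ = text_ve(emb_b)   # deterministic endpoint for caption B

    frames = []
    for alpha in torch.linspace(0.0, 1.0, num_frames):
        z0 = (1 - alpha) * mu_a + alpha * mu_b        # linear interpolation in z0 space
        frames.append(integrate_flow(z0, velocity_model))
    return frames
```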
Arithmetic Operations in Latent Space
CrossFlow enables arithmetic operations in the text latent space. Using the Text Variational Encoder (VE), we first map the input text into the latent space z0. Arithmetic operations are then performed in this latent space, and the resulting latent representation is used to generate the corresponding image. The latent code z0 used to generate each image is provided at the bottom.
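For illustration, a rough sketch of this operation under the same assumptions as the sketches above: the arithmetic is applied directly to the text latents z0 (here their means) before the flow is integrated, e.g. z0(A) - z0(B) + z0(C).

```python
def latent_arithmetic(emb_a, emb_b, emb_c, text_ve, velocity_model):
    """Generate an image latent from an arithmetic combination of text latents,
    e.g. z0(A) - z0(B) + z0(C)."""
    mu_a, _ = text_ve(emb_a)
    mu_b, _ = text_ve(emb_b)
    mu_c, _ = text_ve(emb_c)

    z0 = mu_a - mu_b + mu_c                    # arithmetic in the text latent space
    return integrate_flow(z0, velocity_model)  # Euler integration from the earlier sketch
```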
CrossFlow for Various Tasks
CrossFlow can also be applied to other modalities for various tasks. For example, it directly maps images to text, achieving performance comparable to state-of-the-art models for image captioning on the COCO Karpathy split.
Image captioning
CrossFlow enables direct mapping from image to depth for zero-shot depth estimation. |
Zero-shot depth estimation
CrossFlow also achieves state-of-the-art performance on image super-resolution. |
Image super-resolution (64×64 → 256×256)
Acknowledgements
We sincerely thank Ricky Chen and Saketh Rambhatla for their valuable discussions.
BibTeX