Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
CVPR 2025
1 GenAI, Meta | 2 Johns Hopkins University
[Paper] [Code] [HuggingFace Demo] [BibTeX]
We introduce CrossFlow, a streamlined framework that directly maps between two modalities using standard flow matching, eliminating the need for a noise distribution or additional conditioning. It enables more flexible and effective utilization of powerful flow matching models.
Using a vanilla transformer without cross-attention, CrossFlow directly evolves text representations into images for text-to-image generation (see Fig. a). Furthermore, it achieves state-of-the-art performance across various tasks (see Fig. b), including image captioning, zero-shot depth estimation, and image super-resolution, by directly mapping between modalities without relying on task-specific architectures.
Abstract
Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching models is that, unlike diffusion models, they are not constrained to use noise as the source distribution. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla transformer without cross-attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic that results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state of the art on various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
Method
CrossFlow enables direct evolution between two different modalities. Taking text-to-image generation as an example, our T2I model consists of two main components: a Text Variational Encoder and a standard flow matching model. Starting from a text embedding x produced by any language model, the Text Variational Encoder predicts a mean and variance from which the text latent z0 is sampled. This text latent z0 is then evolved directly into the image space by the flow matching model to produce the image latent z1.
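To make the pipeline concrete, the snippet below is a minimal, hypothetical PyTorch sketch of one training step under these assumptions: `text_ve` returns a mean and log-variance for the text latent, `velocity_model` stands for the vanilla transformer predicting the flow velocity, and the straight-line flow matching path and KL weight are illustrative choices rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def crossflow_training_step(text_emb, image_latent, text_ve, velocity_model, kl_weight=1e-3):
    """One simplified CrossFlow training step for text-to-image generation (sketch)."""
    # 1) Text Variational Encoder: predict mean / log-variance and sample z0
    #    via the reparameterization trick.
    mu, logvar = text_ve(text_emb)
    z0 = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    # 2) Standard flow matching, except the source sample is the text latent z0
    #    rather than Gaussian noise, and the target is the image latent z1.
    z1 = image_latent
    t = torch.rand(z0.shape[0], device=z0.device).view(-1, *([1] * (z0.dim() - 1)))
    z_t = (1 - t) * z0 + t * z1            # point on the straight path from z0 to z1
    target_v = z1 - z0                     # constant velocity along that path

    pred_v = velocity_model(z_t, t)        # vanilla transformer, no cross-attention
    fm_loss = F.mse_loss(pred_v, target_v)

    # 3) VAE-style KL regularizer on the text latent; the weight is an assumed value.
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return fm_loss + kl_weight * kl_loss
```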
Text-to-image Generation
By directly mapping between text and image space using a vanilla transformer without cross-attention, our 1B model delivers performance comparable to state-of-the-art T2I models such as DALL·E 2 and Stable Diffusion 1.5.
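At inference time, generation amounts to numerically integrating the learned flow starting from the text latent z0; no noise source is sampled and no conditioning signal is injected into the model. Below is an assumed Euler-integration sketch; the helper name `integrate_flow`, the step count, and the time-conditioning convention are illustrative. The text latent z0 comes from the Text Variational Encoder exactly as in the training sketch above, and the result is decoded with the image VAE to obtain pixels.

```python
import torch

@torch.no_grad()
def integrate_flow(z0, velocity_model, num_steps=50):
    """Euler-integrate the learned ODE from the text latent z0 toward the image latent z1."""
    z, dt = z0, 1.0 / num_steps
    for i in range(num_steps):
        # broadcastable time tensor for the current step
        t = torch.full((z.shape[0], *([1] * (z.dim() - 1))), i * dt, device=z.device)
        z = z + dt * velocity_model(z, t)   # follow the predicted velocity field
    return z                                # approximately z1
```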
Scaling Characteristics
Our model exhibits superior scaling behavior compared to standard flow matching with cross-attention as training steps or model size increase. We plot performance against model parameters and training iterations. Left: larger models are able to exploit cross-modality connections better. Right: while CrossFlow needs more steps to converge, it ultimately achieves better final performance.
Linear Interpolation in Latent Space
Our model provides visually smooth interpolations in the latent space. We show images generated by linear interpolation between the first and the second text latents (i.e., interpolation between two text latents z0). CrossFlow enables visually smooth transitions in object orientation, composite colors, shapes, background scenes, and even object categories.
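A hypothetical sketch of how such an interpolation sequence can be produced, reusing the assumed `text_ve` and `integrate_flow` helpers from the sketches above; using the predicted means as deterministic endpoints is an illustrative choice.

```python
import torch

def interpolate_between_captions(emb_a, emb_b, text_ve, velocity_model, num_frames=8):
    """Generate image latents by linearly interpolating between two text latents z0."""
    mu_a, _ = text_ve(emb_a)   # deterministic endpoint for caption A
    mu_b, _ = text_ve(emb_b)   # deterministic endpoint for caption B

    frames = []
    for alpha in torch.linspace(0.0, 1.0, num_frames):
        z0 = (1 - alpha) * mu_a + alpha * mu_b        # linear interpolation in z0 space
        frames.append(integrate_flow(z0, velocity_model))
    return frames
```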
Arithmetic Operations in Latent Space
CrossFlow enables arithmetic operations in the text latent space. Using the Text Variational Encoder (VE), we first map the input text into the latent space z0. Arithmetic operations are then performed in this latent space, and the resulting latent representation is used to generate the corresponding image. The latent code z0 used to generate each image is provided at the bottom.
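For illustration, a rough sketch of this operation under the same assumptions as the sketches above: the arithmetic is applied directly to the text latents z0 (here their means) before the flow is integrated, e.g. z0(A) - z0(B) + z0(C).

```python
def latent_arithmetic(emb_a, emb_b, emb_c, text_ve, velocity_model):
    """Generate an image latent from an arithmetic combination of text latents,
    e.g. z0(A) - z0(B) + z0(C)."""
    mu_a, _ = text_ve(emb_a)
    mu_b, _ = text_ve(emb_b)
    mu_c, _ = text_ve(emb_c)

    z0 = mu_a - mu_b + mu_c                    # arithmetic in the text latent space
    return integrate_flow(z0, velocity_model)  # Euler integration from the earlier sketch
```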
CrossFlow for Various Tasks
CrossFlow can also be applied to other modalities for various tasks. For example, it directly maps images to text, achieving performance comparable to state-of-the-art models for image captioning on the COCO Karpathy split.
Image captioning
CrossFlow enables direct mapping from image to depth for zero-shot depth estimation. |
Zero-shot depth estimation
CrossFlow also achieves state-of-the-art performance on image super-resolution. |
Image super-resolution (64×64 → 256×256)
Acknowledgements
We sincerely thank Ricky Chen and Saketh Rambhatla for their valuable discussions.
BibTeX