CVPR 2026: The Universal Normal Embedding

C. Tasker, R. Betser, E. Gofer, M-Y Levi, G. Gilboa, CVPR 2026.

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation.
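To make the linear-direction idea concrete, here is a minimal NumPy/scikit-learn sketch of latent linear probing and orthogonalized editing. All shapes, data, and the smile/gender directions are illustrative placeholders, not the paper's actual pipeline or the NoiseZoo latents.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins: N images, each with a flattened DDIM-inverted
# noise latent and a binary attribute label (e.g., CelebA "smiling").
rng = np.random.default_rng(0)
N, d = 2000, 1024                      # hypothetical sizes
Z = rng.standard_normal((N, d))        # placeholder for DDIM-inverted latents
y = rng.integers(0, 2, size=N)         # placeholder attribute labels

# Linear probe: if noise encodes semantics along linear directions,
# a logistic regression on the raw latents predicts the attribute.
probe = LogisticRegression(max_iter=1000).fit(Z, y)

# The probe's weight vector doubles as an edit direction in latent space.
smile_dir = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Orthogonalize against a correlated attribute direction (placeholder here)
# to mitigate spurious entanglement, in the spirit of the abstract.
gender_dir = rng.standard_normal(d)
gender_dir /= np.linalg.norm(gender_dir)
smile_dir -= (smile_dir @ gender_dir) * gender_dir
smile_dir /= np.linalg.norm(smile_dir)

# Edit: shift a latent along the cleaned direction; a real pipeline would
# then decode z_edited with the diffusion model's deterministic DDIM pass.
alpha = 3.0                            # edit strength
z_edited = Z[0] + alpha * smile_dir
```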

Paper

Project page

Medium post

 

CVPR 2026: Training-free detection of generated videos via spatial-temporal likelihoods

O. Ben Hayun, R. Betser, M-Y Levi, L. Kassel, G. Gilboa, CVPR 2026.

Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences that transform creative workflows. Alongside this progress, these models raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection.
We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. Across two public benchmarks spanning 20 generative models, STALL consistently outperforms prior image- and video-based baselines. To further test generalization, we curate ComGenVid, a new benchmark featuring state-of-the-art models (Sora and Veo-3), on which STALL demonstrates consistent and robust results.
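The abstract does not spell out STALL's estimator, but the general shape of jointly scoring spatial and temporal evidence against real-data statistics can be sketched as follows. The Gaussian model over frame features and frame-to-frame differences is our simplifying assumption, and all features here are random placeholders.

```python
import numpy as np

def fit_gaussian(X, eps=1e-3):
    """Fit mean/covariance on real-data features; eps regularizes the covariance."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    return mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]

def gaussian_loglik(x, mu, cov_inv, logdet):
    """Multivariate normal log-density."""
    r = x - mu
    return -0.5 * (r @ cov_inv @ r + logdet + len(x) * np.log(2 * np.pi))

# Reference statistics from REAL videos only (placeholders here):
# spatial = per-frame features, temporal = frame-to-frame differences.
real_videos = np.random.randn(200, 32, 64)            # (videos, frames, dim)
spatial = fit_gaussian(real_videos.reshape(-1, 64))
temporal = fit_gaussian(np.diff(real_videos, axis=1).reshape(-1, 64))

def score_video(frames):
    """Joint spatial-temporal log-likelihood; low scores flag synthetic videos."""
    s = np.mean([gaussian_loglik(f, *spatial) for f in frames])
    t = np.mean([gaussian_loglik(d, *temporal) for d in np.diff(frames, axis=0)])
    return s + t

print(score_video(np.random.randn(32, 64)))           # stand-in test video
```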

Paper

Project page

Medium post

 

ICLR 2026 (oral): InfoNCE Induces Gaussian Distribution

R. Betser, E. Gofer, M-Y Levi, G. Gilboa, ICLR 2026 (oral presentation, top 1.18%).

Contrastive learning has been the bedrock of unsupervised learning in recent years, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this paper we show that the feature embeddings which emerge from InfoNCE training can be well approximated by a multivariate Gaussian distribution. We justify this claim by taking two approaches. First, we show that under certain alignment and concentration assumptions, finite projections of a high-dimensional representation approach a multivariate Gaussian distribution as the representation dimension approaches infinity.
Next, under less strict assumptions, we show that by adding a small regularization term (which vanishes asymptotically) that promotes low feature norm and high feature entropy, we reach similar asymptotic results. We demonstrate experimentally, in a synthetic setting, on CIFAR-10, and on pretrained foundation models, that the features indeed follow an almost precisely Gaussian distribution. One can use the Gaussian model to easily derive analytic expressions in the representation space and to obtain very useful measures, such as likelihood, data entropy, and mutual information. Hence, we expect such theoretical grounding to be very useful in various applications involving contrastive learning.
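As a concrete illustration of the measures the Gaussian model makes analytic, here is a short NumPy sketch that fits a multivariate Gaussian to placeholder InfoNCE features and evaluates per-sample log-likelihood and differential entropy in closed form; the data and dimensions are illustrative, not the paper's experiments.

```python
import numpy as np

# Placeholder for embeddings produced by an InfoNCE-trained encoder.
X = np.random.randn(4096, 128)

# Fit the multivariate Gaussian model the paper argues the features follow.
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False) + 1e-4 * np.eye(X.shape[1])  # small ridge for stability
cov_inv = np.linalg.inv(cov)
logdet = np.linalg.slogdet(cov)[1]
d = X.shape[1]

def loglik(x):
    """Closed-form per-sample log-likelihood under the fitted Gaussian."""
    r = x - mu
    return -0.5 * (r @ cov_inv @ r + logdet + d * np.log(2 * np.pi))

# Closed-form differential entropy: H = 0.5 * log((2*pi*e)^d * det(cov)).
entropy = 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

print(loglik(X[0]), entropy)
```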

Paper

Project page

Medium post

 

WACV 2026: General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood

Roy Betser, Omer Hofman, Roman Vainshtein, Guy Gilboa, accepted to the Winter Conference on Applications of Computer Vision (WACV) 2026.

The rapid advancement of generative models, particularly diffusion-based methods, has significantly improved the realism of synthetic images. As new generative models continuously emerge, detecting generated images remains a critical challenge. While fully supervised and few-shot methods have been proposed, maintaining an updated dataset is time-consuming and challenging. Consequently, zero-shot methods have gained increasing attention in recent years. We find that existing zero-shot methods often struggle to adapt to specific image domains, such as artistic images, limiting their real-world applicability. In this work, we introduce CLIDE, a novel zero-shot detection method based on conditional likelihood approximation. Our approach computes likelihoods conditioned on real images, enabling adaptation across diverse image domains. We extensively evaluate CLIDE, demonstrating state-of-the-art performance on a large-scale general dataset and significantly outperforming existing methods in domain-specific cases. These results demonstrate the robustness of our method and underscore the need for broad, domain-aware generalization in the AI-generated image detection task.
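The abstract leaves CLIDE's estimator unspecified; the sketch below only conveys the general shape of conditioning on real images: fit reference statistics to real images from the target domain and flag test images that are atypical under those statistics. The Gaussian/Mahalanobis proxy and the threshold are our assumptions, not the paper's method.

```python
import numpy as np

def fit_reference(real_embeddings, eps=1e-3):
    """Fit Gaussian statistics to real images from the TARGET domain
    (e.g., artistic images); conditioning on this reference set is what
    lets a detector adapt across domains."""
    mu = real_embeddings.mean(axis=0)
    cov = np.cov(real_embeddings, rowvar=False) + eps * np.eye(real_embeddings.shape[1])
    return mu, np.linalg.inv(cov)

def conditional_score(x, mu, cov_inv):
    """Negative squared Mahalanobis distance as a likelihood proxy:
    low scores mean the image is atypical of the real reference set."""
    r = x - mu
    return -float(r @ cov_inv @ r)

real_ref = np.random.randn(1000, 512)    # placeholder real-image embeddings
test_emb = np.random.randn(512)          # placeholder test-image embedding
mu, cov_inv = fit_reference(real_ref)
is_generated = conditional_score(test_emb, mu, cov_inv) < -600.0  # illustrative threshold
```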

Project page

Medium post

Data

Arxiv paper

Code

 

ICML 2025: Whitened CLIP as a Likelihood Surrogate of Images and Captions

Roy Betser, Meir-Yossef Levi, Guy Gilboa

Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the statistics of the whitened embeddings can be well approximated by a standard normal distribution; thus, the log-likelihood is estimated simply by the squared Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free, performed using a pre-computed whitening matrix, and hence very fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions.
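Since the abstract fully specifies the recipe (whiten once with a pre-computed matrix, then read the log-likelihood off the squared norm), a minimal NumPy sketch follows. The ZCA factorization is one valid choice of invertible whitening, and the embeddings here are random placeholders rather than real CLIP features.

```python
import numpy as np

def whitening_matrix(E, eps=1e-5):
    """Pre-compute a whitening transform from a corpus of embeddings E (N x d):
    after the transform, features have zero mean and identity covariance."""
    mu = E.mean(axis=0)
    cov = np.cov(E, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # ZCA whitening
    return mu, W

def log_likelihood(e, mu, W):
    """If whitened embeddings are approximately standard normal, the
    log-likelihood reduces (up to an additive constant) to the negative
    squared Euclidean norm in the whitened space."""
    z = W @ (e - mu)
    return -0.5 * float(z @ z)

E = np.random.randn(10000, 512)          # placeholder for CLIP embeddings
mu, W = whitening_matrix(E)              # training-free, computed once
score = log_likelihood(E[0], mu, W)      # higher = more likely image/caption
```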

Camera-ready version

Code

 

ICML 2025: The Double-Ellipsoid Geometry of CLIP

Meir Yossef Levi, Guy Gilboa

Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains.
We investigate the geometry of this embedding space, which is still not well understood, and show that text and image embeddings reside on linearly separable ellipsoid shells not centered at the origin. We explain the benefits of this structure, which allows instances to be embedded according to their uncertainty during contrastive training.
Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty.
A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We prove this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP’s modality gap optimizes the matching of the conformity distributions of image and text.
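A short NumPy sketch of the conformity idea, on random unit vectors standing in for one modality's CLIP embeddings: the exact average-cosine definition versus the cheap estimate via the modality mean. The exact normalization of the estimate is our simplification of the paper's result.

```python
import numpy as np

def normalize(X):
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

# Placeholder for one modality's CLIP embeddings (unit-normalized).
E = normalize(np.random.randn(5000, 512))

def conformity_exact(i):
    """Average cosine similarity of instance i to every other instance."""
    sims = E @ E[i]
    return (sims.sum() - 1.0) / (len(E) - 1)   # drop the self-similarity term

# Cheap estimate: cosine similarity to the modality mean vector.
mean_vec = E.mean(axis=0)

def conformity_estimate(i):
    return float(E[i] @ mean_vec) / np.linalg.norm(mean_vec)

print(conformity_exact(0), conformity_estimate(0))
```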

Project page

Camera-ready paper

Code