Jonathan Brokman, Itay Gershon, Omer Hofman, Guy Gilboa, Roman Vainshtein
Memorization in generative text-to-image diffusion models is a phenomenon in which the model outputs near-verbatim reproductions of training images instead of novel, valid generations. This poses privacy and copyright risks, and remains difficult to prevent without harming prompt fidelity. We present a mid-generation, geometry-informed criterion that detects, and subsequently helps mitigate, memorized outputs. Our method examines the manifold of the natural-image distribution as learned by the diffusion model, via a memorization criterion that admits a local-curvature interpretation. Tracking this criterion along the generative trajectory reveals the typical geometric structures traversed during generation, and this is harnessed as a geometry-aware indicator that distinguishes memorized from valid generations. Notably, our criterion uses only the direction of the normalized score field, unlike prior magnitude-based methods; by combining direction and magnitude, we improve the mid-generation detection state of the art by %. Beyond detecting memorization, we use this indicator as a plug-in to a mitigation policy that steers trajectories away from memorized basins while preserving alignment with the text prompt. Empirically, this yields an improved fidelity–memorization trade-off over competing methods. By linking memorization to magnitude-invariant geometric signatures of the generative process, our work opens a new direction for understanding and systematically mitigating failure modes in diffusion models. Official code: https://bit.ly/4ndeISd
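
The sketch below is a minimal, hypothetical illustration of the kind of magnitude-invariant, direction-based trajectory statistic the abstract describes: it normalizes per-step score (noise) predictions, measures how sharply the direction turns between consecutive sampling steps, and flags a generation when the trajectory is anomalously straight. It is not the paper's exact criterion; the function names, the threshold, and the placeholder tensors standing in for real model outputs are all assumptions for illustration.

```python
# Hypothetical sketch of a direction-only (magnitude-invariant) trajectory
# statistic for memorization detection during diffusion sampling.
# Assumption: `eps_preds` holds the model's noise/score predictions collected
# at each denoising step; real usage would gather these inside the sampler loop.

import torch


def normalized_direction(eps: torch.Tensor) -> torch.Tensor:
    """Flatten a score/noise prediction and keep only its direction (unit vector)."""
    v = eps.flatten()
    return v / (v.norm() + 1e-12)


def direction_trajectory_statistic(eps_preds: list) -> torch.Tensor:
    """
    Curvature-style summary of the score-direction trajectory:
    cosine similarity between successive normalized score directions.
    Values near 1 mean the trajectory barely turns; the (assumed) hypothesis
    here is that such straight paths correlate with memorized basins.
    """
    dirs = torch.stack([normalized_direction(e) for e in eps_preds])  # (T, D)
    return (dirs[1:] * dirs[:-1]).sum(dim=-1)                          # (T-1,)


def flag_memorization(eps_preds, threshold: float = 0.95, warmup: int = 5) -> bool:
    """Flag a generation mid-trajectory if early-step mean similarity is high.
    The threshold and warmup window are illustrative, not calibrated values."""
    cos_sims = direction_trajectory_statistic(eps_preds)
    return cos_sims[:warmup].mean().item() > threshold


if __name__ == "__main__":
    # Placeholder tensors with the shape of latent-diffusion noise predictions
    # (C, H, W); these stand in for outputs collected from an actual model.
    torch.manual_seed(0)
    fake_trajectory = [torch.randn(4, 64, 64) for _ in range(20)]
    print(direction_trajectory_statistic(fake_trajectory)[:5])
    print("flagged:", flag_memorization(fake_trajectory))
```

In a mitigation setting, a statistic like this could be checked every few sampling steps and used to trigger a corrective policy (e.g., perturbing the prompt embedding or guidance) when a trajectory is flagged, which is the plug-in role the abstract attributes to its indicator.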