Roy, Yossi and me attended ICML 2025 in Vancouver. Enjoying a good weather, it was great hearing all about the latest and greatest in ML and showing people our works on CLIP embedding analysis and likelihood estimation.







Roy, Yossi and me attended ICML 2025 in Vancouver. Enjoying a good weather, it was great hearing all about the latest and greatest in ML and showing people our works on CLIP embedding analysis and likelihood estimation.
Anna Rosenberg, John Kennedy, Zohar Keidar, Yehoshua Y. Zeevi and Guy Gilboa, Philosophical Transactions of the Royal Society A, 2025.
Abstract
Solving computer vision problems through machine learning, one often encounters lack of sufficient training data. To mitigate this, we propose the use of ensembles of weak learners based on spectral total-variation (STV) features (Gilboa G. 2014 A total variation spectral framework for scale and texture analysis. SIAM J. Imaging Sci. 7, 1937–1961. (doi:10.1137/130930704)). The features are related to nonlinear eigenfunctions of the total-variation subgradient and can characterize well textures at various scales. It was shown (Burger M, Gilboa G, Moeller M, Eckardt L, Cremers D. 2016 Spectral decompositions using one-homogeneous functionals. SIAM J. Imaging Sci. 9, 1374–1408. (doi:10.1137/15m1054687)) that, in the one-dimensional case, orthogonal features are generated, whereas in two dimensions the features are empirically lowly correlated. Ensemble learning theory advocates the use of lowly correlated weak learners. We thus propose here to design ensembles using learners based on STV features. To show the effectiveness of this paradigm, we examine a hard real-world medical imaging problem: the predictive value of computed tomography (CT) data for high uptake in positron emission tomography (PET) for patients suspected of skeletal metastases. The database consists of 457 scans with 1524 unique pairs of registered CT and PET slices. Our approach is compared with deep-learning methods and to radiomics features, showing STV learners perform best (AUC=0.87), compared with neural nets (AUC=0.75) and radiomics (AUC=0.79). We observe that fine STV scales in CT images are especially indicative of the presence of high uptake in PET.
Roy Betser, Meir-Yossef Levi, Guy Gilboa
Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025
Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce \textit{Whitened CLIP}, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the whitened embeddings statistics can be well approximated as a standard normal distribution, thus, the log-likelihood is estimated simply by the square Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free and performed using a pre-computed whitening matrix, hence, is very fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions.
Meir Yossef Levi, Guy Gilboa
Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025
Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains.
We investigate the geometry of this embedding, which is still not well understood, and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of having this structure, allowing to better embed instances according to their uncertainty during contrastive training.
Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty.
A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We prove this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP’s modality gap optimizes the matching of the conformity distributions of image and text.
Nice trip to Singapore.
Omer and I presented our poster at ICLR 2025 on “Manifold Induced Biases for Zero-shot and Few-shot Detection of Generated Images” by
Yossi presents his work on “Robustifying Point Cloud Networks by Refocusing” at 3DV in Singapore, March 2025.
Abatract:
Distinguishing between real and AI-generated images, commonly referred to as ‘image detection’, presents a timely and significant challenge. Despite extensive research in the (semi-)supervised regime, zero-shot and few-shot solutions have only recently emerged as promising alternatives. Their main advantage is in alleviating the ongoing data maintenance, which quickly becomes outdated due to advances in generative technologies. We identify two main gaps: (1) a lack of theoretical grounding for the methods, and (2) significant room for performance improvements in zero-shot and few-shot regimes. Our approach is founded on understanding and quantifying the biases inherent in generated content, where we use these quantities as criteria for characterizing generated images. Specifically, we explore the biases of the implicit probability manifold, captured by a pre-trained diffusion model. Through score-function analysis, we approximate the curvature, gradient, and bias towards points on the probability manifold, establishing criteria for detection in the zero-shot regime. We further extend our contribution to the few-shot setting by employing a mixture-of-experts methodology. Empirical results across 20 generative models demonstrate that our method outperforms current approaches in both zero-shot and few-shot settings. This work advances the theoretical understanding and practical usage of generated content biases through the lens of manifold analysis.
Jonathan Brokman, Amit Giloni, Omer Hofman, Roman Vainshtein, Hisashi Kojima, and Guy Gilboa, Int. Conf. on Scale Space and Variational Methods, 2025
Abstract:
Diffusion models, today’s leading image generative models, estimate the score function, i.e. the gradient of the log probability of (perturbed) data samples, without direct access to the underlying probability distribution. This work investigates whether the estimated score function can be leveraged to compute higher-order differentials, namely p-Laplace operators. We show here these operators can be employed to identify memorized training data. We propose a numerical p-Laplace approximation based on the learned score functions, showing its effectiveness in identifying key features of the probability landscape. We analyze the structured case of Gaussian mixture models, and demonstrate the results carry-over to image generative models, where memorization identification based on the p-Laplace operator is performed for the first time.
Minimizing Quotient Regularization Model
Chao Wang, Jean-Francois Aujol, Guy Gilboa, Yifei Lou
Inverse Problems and Imaging 2024