ICLR2024
Primary Area: applications to neuroscience & cognitive science
Keywords: compositional generalization compositionality representation learning out-of-distribution generalization multimodal reasoning
Scores: [ 6 3 8 ]
The advent of the Transformer has led to the development of large language models (LLM), which appear to demonstrate human-like capabilities. To assess the generality of this class of models and a variety of other base neural network architectures to multimodal domains, we evaluated and compared their capacity for multimodal generalization. We introduce a multimodal question-answer benchmark to evaluate three specific types of out-of-distribution (OOD) generalization performance: distractor generalization (generalization in the presence of distractors), systematic compositional generalization (generalization to new task permutations), and productive compositional generalization (generalization to more complex tasks with deeper dependencies). While we found that most architectures faired poorly on most forms of generalization (e.g., RNNs and standard Transformers), models that leveraged cross-attention mechanisms between input domains, such as the Perceiver, fared better. Our positive results demonstrate that for multimodal distractor and systematic generalization, cross-attention is an important mechanism to integrate multiple sources of information. On the other hand, all architectures failed in productive generalization, suggesting fundamental limitations of existing architectures for specific types of multimodal OOD generalization.These results demonstrate the strengths and limitations of specific architectural components underlying modern neural models for multimodal reasoning. Finally, we provide Generic COG (gCOG), a configurable benchmark with several multimodal generalization splits, for future studies to explore.
Primary Area: applications to neuroscience & cognitive science
Keywords: perceptual scale perceptual distance gaussian texture naturalistic textures texture interpolation
Scores: [ 8 5 5 3 ]
Perception is often viewed as a process that transforms physical variables, external to an observer, into internal psychological variables. Such a process can be modeled by a function coined perceptual scale. The perceptual scale can be deduced from psychophysical measurements that consist in comparing the relative differences between stimuli (i.e. difference scaling experiments). However, this approach is often overlooked by the modeling and experimentation communities. Here, we demonstrate the value of measuring the perceptual scale of classical (spatial frequency, orientation) and less classical physical variables (interpolation between textures) by embedding it in recent probabilistic modeling of perception. First, we show that the assumption that an observer has an internal representation of univariate parameters such as spatial frequency or orientation while stimuli are high-dimensional does not lead to contradictory predictions when following the theoretical framework. Second, we show that the measured perceptual scale corresponds to the transduction function hypothesized in this framework. In particular, we demonstrate that it is related to the Fisher information of the generative model that underlies perception and we test the predictions given by the generative model of different stimuli in a set a of difference scaling experiments. Our main conclusion is that the perceptual scale is mostly driven by the stimulus power spectrum. Finally, we propose that this measure of perceptual scale is a way to push further the notion of perceptual distances by estimating the perceptual geometry of images i.e. the path between images instead of simply the distance between those.
Primary Area: applications to neuroscience & cognitive science
Keywords: timescales recurrent neural networks memory tasks curriculum learning
Scores: [ 8 8 6 5 ]
Recurrent neural networks (RNNs) in the brain and in silico excel at solving tasks with intricate temporal dependencies. Long timescales required for solving such tasks can arise from properties of individual neurons (single-neuron timescale, τ, e.g., membrane time constant in biological neurons) or recurrent interactions among them (network-mediated timescale). However, the contribution of each mechanism for optimally solving memory-dependent tasks remains poorly understood. Here, we train RNNs to solve N-parity and N-delayed match-to-sample tasks with increasing memory requirements controlled by N by simultaneously optimizing recurrent weights and $\tau$s. We find that for both tasks RNNs develop longer timescales with increasing N, but depending on the learning objective, they use different mechanisms. Two distinct curricula define learning objectives: sequential learning of a single-N (single-head) or simultaneous learning of multiple $N$s (multi-head). Single-head networks increase their τ with N and are able to solve tasks for large N, but they suffer from catastrophic forgetting. However, multi-head networks, which are explicitly required to hold multiple concurrent memories, keep τ constant and develop longer timescales through recurrent connectivity. Moreover, we show that the multi-head curriculum increases training speed and network stability to ablations and perturbations, and allows RNNs to generalize better to tasks beyond their training regime. This curriculum also significantly improves training GRUs and LSTMs for large-N tasks. Our results suggest that adapting timescales to task requirements via recurrent interactions allows learning more complex objectives and improves the RNN's performance.
Primary Area: applications to neuroscience & cognitive science
Keywords: Spiking Neural Networks Rate and Temporal Information Hybrid Adversarial Attack
Scores: [ 6 6 8 8 ]
Spiking Neural Networks (SNNs) have received widespread attention in academic communities due to their superior spatio-temporal processing capabilities and energy-efficient characteristics. With further in-depth application in various fields, the vulnerability of SNNs under adversarial attack has become a focus of concern. In this paper, we draw inspiration from two mainstream learning algorithms of SNNs and observe that SNN models reserve both rate and temporal information. To better understand the capabilities of these two types of information, we conduct a quantitative analysis separately for each. In addition, we note that the retention degree of temporal information is related to the parameters and input settings of spiking neurons. Building on these insights, we propose a hybrid adversarial attack based on rate and temporal information (HART), which allows for dynamic adjustment of the rate and temporal attributes. Experimental results demonstrate that compared to previous works, HART attack can achieve significant superiority under different attack scenarios, data types, network architecture, time-steps, and model hyper-parameters. These findings call for further exploration into how both types of information can be effectively utilized to enhance the reliability of SNNs.
Primary Area: applications to neuroscience & cognitive science
Keywords: synaptic weight distributions synaptic plasticity biologically plausible learning mirror descent
Scores: [ 1 8 5 8 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: Spiking neural networks excitation inhibition mechanism Long-Term Potentiation Long-Term Depression
Scores: [ 5 8 8 5 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: memory transformers architecture inference cognitive model
Scores: [ 8 6 6 ]
How humans and machines make sense of current inputs for relation reasoning and question-answering, and put the perceived information into context of our past memories, has been a challenging conundrum in cognitive science and artificial intelligence. Inspired by human brain's memory system and cognitive architectures, we propose a PMI framework that consists of perception, memory and inference components. Notably, the memory module comprises working and long-term memory, with the latter endowed with a higher-order structure to retain more accumulated knowledge and experiences. Through a differentiable competitive write access, current perceptions update working memory, which is later merged with long-term memory via outer product associations, averting memory overflow and minimizing information conflicts. In the inference module, relevant information is retrieved from two separate memory origins and associatively integrated to attain a more comprehensive and precise interpretation of current perceptions. We exploratively apply our PMI to improve prevailing Transformers and CNN models on question-answering tasks like bAbI-20k and Sort-of-CLEVR datasets, as well as relation reasoning and image classification tasks, and in each case, our PMI enhancements consistently outshine their original counterparts significantly. Visualization analyses reveal that memory consolidation, along with the interaction and integration of information from diverse memory sources, substantially contributes to the model effectiveness on inference tasks.
Primary Area: applications to neuroscience & cognitive science
Keywords: Computational neuroscience recurrent neural networks learning connectivity structure inductive bias rich and lazy learning deep learning theory neural representations
Scores: [ 8 6 8 5 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: RNNs Traveling Waves Memory Sequence Modeling
Scores: [ 6 6 8 5 ]
Traveling waves of neural activity have been observed throughout the brain at a diversity of regions and scales; however, their precise computational role is still debated. One physically grounded hypothesis suggests that the cortical sheet may act like a wave-field capable of invertibly storing a short-term memory of sequential stimuli through induced waves traveling across the cortical surface, and indeed many experimental results from neuroscience correlate wave activity with memory tasks. To date, however, the computational implications of this idea have remained hypothetical due to the lack of a simple recurrent neural network architecture capable of exhibiting such waves. In this work, we introduce a model to fill this gap, which we denote the Wave-RNN (wRNN), and demonstrate how such an architecture indeed efficiently encodes the recent past through a suite of synthetic memory tasks where wRNNs learn faster and reach significantly lower error than wave-free counterparts. We further explore the implications of this memory storage system on more complex sequence modeling tasks such as sequential image classification and find that wave-based models not only again outperform comparable wave-free RNNs while using significantly fewer parameters, but additionally perform comparably to more complex gated architectures such as LSTMs and GRUs.
Primary Area: applications to neuroscience & cognitive science
Keywords: neuroscience brain fmri captions image synthesis visual cortex computational neuroscience
Scores: [ 8 6 8 6 ]
Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may potentially bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.
Primary Area: applications to neuroscience & cognitive science
Keywords: Multi-subject visual neural decoding representational similarity analysis CLIP transformer
Scores: [ 5 6 8 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: representational geometry shape metrics dissimilarity metrics
Scores: [ 8 10 6 6 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: Temporal Batch Normalization Spiking Neural Networks
Scores: [ 6 8 8 6 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: Mesh deformation optimal transport cortical surface reconstruction computer vision medical imaging.
Scores: [ 6 6 8 8 ]
Mesh deformation plays a pivotal role in many 3D vision tasks including dynamic simulations, rendering, and reconstruction. However, defining an efficient discrepancy between predicted and target meshes remains an open problem. A prevalent approach in current deep learning is the set-based approach which measures the discrepancy between two surfaces by comparing two randomly sampled point-clouds from the two meshes with Chamfer pseudo-distance. Nevertheless, the set-based approach still has limitations such as lacking a theoretical guarantee for choosing the number of points in sampled point-clouds, and the pseudo-metricity and the quadratic complexity of the Chamfer divergence. To address these issues, we propose a novel metric for learning mesh deformation. The metric is defined by sliced Wasserstein distance on meshes represented as probability measures that generalize the set-based approach. By leveraging probability measure space, we gain flexibility in encoding meshes using diverse forms of probability measures, such as continuous, empirical, and discrete measures via \textit{varifold} representation. After having encoded probability measures, we can compare meshes by using the sliced Wasserstein distance which is an effective optimal transport distance with linear computational complexity and can provide a fast statistical rate for approximating the surface of meshes. To the end, we employ a neural ordinary differential equation (ODE) to deform the input surface into the target shape by modeling the trajectories of the points on the surface. Our experiments on cortical surface reconstruction demonstrate that our approach surpasses other competing methods in multiple datasets and metrics.
Primary Area: applications to neuroscience & cognitive science
Keywords: Spiking Neural Networks Learnable Multi-hierarchical Model Spatio-Temporal Back-propagation
Scores: [ 8 6 6 5 ]
Spiking Neural Networks (SNNs) have garnered considerable attention due to their energy efficiency and unique biological characteristics. However, the widely adopted Leaky Integrate-and-Fire (LIF) model, as the mainstream neuron model in current SNN research, has been revealed to exhibit significant deficiencies in deep-layer gradient calculation and capturing global information on the time dimension. In this paper, we propose the Learnable Multi-hierarchical (LM-H) model to address these issues by dynamically regulating its membrane-related factors. We point out that the LM-H model fully encompasses the information representation range of the LIF model while offering the flexibility to adjust the extraction ratio between historical and current information. Additionally, we theoretically demonstrate the effectiveness of the LM-H model and the functionality of its internal parameters, and propose a progressive training algorithm tailored specifically for the LM-H model. Furthermore, we devise an efficient training framework for our novel advanced model, encompassing hybrid training and time-slicing online training. Through extensive experiments on various datasets, we validate the remarkable superiority of our model and training algorithm compared to previous state-of-the-art approaches.
Primary Area: applications to neuroscience & cognitive science
Keywords: Spiking Neural Networks Network Pruning
Scores: [ 8 3 8 6 ]
Spiking Neural Networks (SNNs) have emerged as energy-efficient alternatives to Artificial Neural Networks (ANNs) when deployed on neuromorphic chips. While recent studies have demonstrated the impressive performance of deep SNNs on challenging tasks, their energy efficiency advantage has been diminished. Existing methods targeting energy consumption reduction do not fully exploit sparsity, whereas powerful pruning methods can achieve high sparsity but are not directly targeted at energy efficiency, limiting their effectiveness in energy saving. Furthermore, none of these works fully exploit the sparsity of neurons or the potential for unstructured neuron pruning in SNNs. In this paper, we propose a novel pruning framework that combines unstructured weight pruning with unstructured neuron pruning to maximize the utilization of the sparsity of neuromorphic computing, thereby enhancing energy efficiency. To the best of our knowledge, this is the first application of unstructured neuron pruning to deep SNNs. Experimental results demonstrate that our method achieves impressive energy efficiency gains. The sparse network pruned by our method with only 0.63% remaining connections can achieve a remarkable 91 times increase in energy efficiency compared to the original dense network, requiring only 8.5M SOPs for inference, with merely 2.19% accuracy loss on the CIFAR-10 dataset. Our work suggests that deep and dense SNNs exhibit high redundancy in energy consumption, highlighting the potential for targeted SNN sparsification to save energy.
Primary Area: applications to neuroscience & cognitive science
Keywords: cognitive modeling large language models neural networks cognitive psychology decision-making
Scores: [ 8 8 8 8 ]
Large language models are powerful systems that excel at many tasks, ranging from translation to mathematical reasoning. Yet, at the same time, these models often show unhuman-like characteristics. In the present paper, we address this gap and ask whether large language models can be turned into cognitive models. We find that -- after finetuning them on data from psychological experiments -- these models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains. In addition, we show that their representations contain the information necessary to model behavior on the level of individual subjects. Finally, we demonstrate that finetuning on multiple tasks enables large language models to predict human behavior in a previously unseen task. Taken together, these results suggest that large, pre-trained models can be adapted to become models of human cognition, which opens up future research directions toward building more general cognitive models.
Primary Area: applications to neuroscience & cognitive science
Keywords: EEG object recognition contrastive learning brain-computer interface
Scores: [ 3 8 8 8 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: Visual prostheses Spiking recurrent neural network Dynamic vision sensor Biological phosphene model
Scores: [ 6 6 8 5 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: hippocampus neuroscience cognitive science deep reinforcement learning representation learning prediction
Scores: [ 8 8 8 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: Gaussian Processes Latent Variable Models Variational Autoencoders Neuroscience
Scores: [ 8 3 8 5 5 ]
Characterizing the relationship between neural population activity and behavioral data is a central goal of neuroscience. While latent variable models (LVMs) are successful in describing high-dimensional data, they are typically only designed for a single type of data, making it difficult to identify structure shared across different experimental data modalities. Here, we address this shortcoming by proposing an unsupervised LVM which extracts shared and independent latents for distinct, simultaneously recorded experimental modalities. We do this by combining Gaussian Process Factor Analysis (GPFA), an interpretable LVM for neural spiking data with temporally smooth latent space, with Gaussian Process Variational Autoencoders (GP-VAEs), which similarly use a GP prior to characterize correlations in a latent space, but admit rich expressivity due to a deep neural network mapping to observations. We achieve interpretability in our model by partitioning latent variability into components that are either shared between or independent to each modality. We parameterize the latents of our model in the Fourier domain, and show improved latent identification using this approach over standard GP-VAE methods. We validate our model on simulated multi-modal data consisting of Poisson spike counts and MNIST images that scale and rotate smoothly over time. We show that the multi-modal GP-VAE (MM-GPVAE) is able to not only identify the shared and independent latent structure across modalities accurately, but provides good reconstructions of both images and neural rates on held-out trials. Finally, we demonstrate our framework on two real world multi-modal experimental settings: Drosophila whole-brain calcium imaging alongside tracked limb positions, and Manduca sexta spike train measurements from ten wing muscles as the animal tracks a visual stimulus.
Primary Area: applications to neuroscience & cognitive science
Keywords: neural networks non-negative tensor factorization dynamic community detection functional connectome
Scores: [ 6 6 8 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: neural spikes Bayesian nonparametric bi-clustering
Scores: [ 6 8 6 ]
Modern neural recording techniques allow neuroscientists to obtain spiking activity of multiple neurons from different brain regions over long time periods, which requires new statistical methods to be developed for understanding structure of the large-scale data. In this paper, we develop a bi-clustering method to cluster the neural spiking activity spatially and temporally, according to their low-dimensional latent structures. The spatial (neuron) clusters are defined by the latent trajectories within each neural population, while the temporal (state) clusters are defined by (populationally) synchronous local linear dynamics shared with different periods. To flexibly extract the bi-clustering structure, we build the model non-parametrically, and develop an efficient Markov chain Monte Carlo (MCMC) algorithm to sample the posterior distributions of model parameters. Validating our proposed MCMC algorithm through simulations, we find the method can recover unknown parameters and true bi-clustering structures successfully. We then apply the proposed bi-clustering method to multi-regional neural recordings under different experiment settings, where we find that simultaneously considering latent trajectories and spatial-temporal clustering structures can provide us with a more accurate and interpretable result. Overall, the proposed method provides scientific insights for large-scale (counting) time series with elongated recording periods, and it can potentially have application beyond neuroscience.
Primary Area: applications to neuroscience & cognitive science
Keywords: Spiking Neural Network Spike Calibration Transformer
Scores: [ 8 8 6 6 ]
Spiking neural networks (SNNs) are energy-efficient and hold great potential for large-scale inference. Since training SNNs from scratch is costly and has limited performance, converting pretrained artificial neural networks (ANNs) to SNNs is an attractive approach that retains robust performance without additional training data and resources. However, while existing conversion methods work well on convolution networks, emerging Transformer models introduce unique mechanisms like self-attention and test-time normalization, leading to non-causal non-linear interactions unachievable by current SNNs. To address this, we approximate these operations in both temporal and spatial dimensions, thereby providing the first SNN conversion pipeline for Transformers. We propose \textit{Universal Group Operators} to approximate non-linear operations spatially and a \textit{Temporal-Corrective Self-Attention Layer} that approximates spike multiplications at inference through an estimation-correction approach. Our algorithm is implemented on a pretrained ViT-B/32 from CLIP, inheriting its zero-shot classification capabilities, while improving control over conversion losses. To our knowledge, this is the first direct training-free conversion of a pretrained Transformer to a purely event-driven SNN, promising for neuromorphic hardware deployment.
Primary Area: applications to neuroscience & cognitive science
Keywords: neuroscience neural dynamics dynamical systems decision making
Scores: [ 5 6 6 6 6 ]
Understanding how multiple brain regions interact to produce behavior is a major challenge in systems neuroscience, with many regions causally implicated in common tasks such as sensory processing and decision making. A precise description of interactions between regions remains an open problem. Moreover, neural dynamics are nonlinear and non-stationary. Here, we propose MR-SDS, a multiregion, switching nonlinear state space model that decomposes global dynamics into local and cross-communication components in the latent space. MR-SDS includes directed interactions between brain regions, allowing for estimation of state-dependent communication signals, and accounts for sensory inputs effects, history effects, and heterogeneity across days and animals. We show that our model accurately recovers latent trajectories, vector fields underlying switching nonlinear dynamics, and cross-region communication profiles in three simulations. We then apply our method to two large-scale, multi-region neural datasets involving mouse decision making. The first includes hundreds of neurons per region, recorded simultaneously at single-cell-resolution across 3 distant cortical regions. The second is a mesoscale widefield dataset of 8 adjacent cortical regions imaged across both hemispheres. On these multi-region datasets, our model outperforms existing piece-wise linear multi-region models and reveals multiple distinct dynamical states and a rich set of cross-region communication profiles.
Primary Area: applications to neuroscience & cognitive science
Keywords: GPT Pretraining Transformers Multimodal Multitask Neuron Neuroscience Neuronal Spikes Brain Cortex Human Mice Biology
Scores: [ 6 8 6 5 ]
State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these data sets require new tools for analysis. Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data into an auto-regressive spatiotemporal generation problem. Neuroformer is a multimodal, multitask generative pre-trained transformer (GPT) model that is specifically designed to handle the intricacies of data in systems neuroscience. It scales linearly with feature size, can process an arbitrary number of modalities, and is adaptable to downstream tasks, such as predicting behavior. We first trained Neuroformer on simulated datasets, and found that it both accurately predicted simulated neuronal circuit activity, and also intrinsically inferred the underlying neural circuit connectivity, including direction. When pretrained to decode neural responses, the model predicted the behavior of a mouse with only few-shot fine-tuning, suggesting that the model begins learning how to do so directly from the neural representations themselves, without any explicit supervision. We used an ablation study to show that joint training on neuronal responses and behavior boosted performance, highlighting the model's ability to associate behavioral and neural representations in an unsupervised manner. These findings show that Neuroformer can analyze neural datasets and their emergent properties, informing the development of models and hypotheses associated with the brain.
Primary Area: applications to neuroscience & cognitive science
Keywords: emergent communication conversation signaling games language evolution
Scores: [ 8 6 5 ]
Research on conversation has put emphasis on the importance of a multi-level communication system, in which the interlocutors aim to establish and maintain common ground. In natural conversations, repair mechanisms such as clarification requests are frequently used to improve mutual understanding.Here we explore the effects of conversational repair on languages emerging in signaling games. We extend the basic Lewis signaling game setup with a feedback channel that allows for the transmission of messages backwards from the receiver to the sender. Further, we add noise to the communication channel so that repair mechanisms become necessary for optimal performance.We find that for models that were trained with a feedback channel the sender agents produce less compositional messages. However, they still achieve a substantially higher generalization performance, putting to question the role of compositionality for generalization.These findings generalize also to a more realistic case involving naturalistic images in a guessing game setup.More broadly, this study provides an important step towards the creation of signaling games that more closely resemble the conditions under which human languages emerged.
Primary Area: applications to neuroscience & cognitive science
Keywords: computational neuroscience probabilistic coding neural sampling priors
Scores: [ 6 5 6 6 ]
Despite many successful examples in which probabilistic inference can account for perception, we have little understanding of how the brain represents and uses structured priors that capture the complexity of natural input statistics. Here we construct a recurrent circuit model that can implicitly represent priors over latent variables, and combine them with sensory and contextual sources of information to encode task-specific posteriors. Inspired by the recent success of diffusion models as means of learning and using priors over images, our model uses dendritic nonlinearities optimized for denoising, and stochastic somatic integration with the degree of noise modulated by an oscillating global signal. Combining these elements into a recurrent network yields a dynamical system that samples from the prior at a rate prescribed by the period of the global oscillator. Additional inputs reflecting sensory or top-down contextual information alter these dynamics to generate samples from the corresponding posterior, with different input gating patterns selecting different inference tasks. We demonstrate that this architecture can sample from low dimensional nonlinear manifolds and multimodal posteriors. Overall, the model provides a new framework for circuit-level representation of probabilistic information, in a format that facilitates flexible inference.
Primary Area: applications to neuroscience & cognitive science
Keywords: Cognitive Science parallelization
Scores: [ 3 6 6 6 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: foundation model fMRI
Scores: [ 6 6 6 6 ]
We introduce the Brain Language Model (BrainLM), a foundation model for brain activity dynamics trained on 6,700 hours of fMRI recordings. Utilizing self-supervised masked-prediction training, BrainLM demonstrates proficiency in both fine-tuning and zero-shot inference tasks. Fine-tuning allows for the accurate prediction of clinical variables like age, anxiety, and PTSD as well as forecasting of future brain states. Critically, the model generalizes well to entirely new external cohorts not seen during training. In zero-shot inference mode, BrainLM can identify intrinsic functional networks directly from raw fMRI data without any network-based supervision during training. The model also generates interpretable latent representations that reveal relationships between brain activity patterns and cognitive states. Overall, BrainLM offers a versatile and interpretable framework for elucidating the complex spatiotemporal dynamics of human brain activity. It serves as a powerful "lens" through which massive repositories of fMRI data can be analyzed in new ways, enabling more effective interpretation and utilization at scale. The work demonstrates the potential of foundation models to advance computational neuroscience research.
Primary Area: applications to neuroscience & cognitive science
Keywords: computational neuroscience offline reactivation replay recurrent neural networks path integration noise
Scores: [ 6 6 6 5 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: EEG brain-computer interface representation learning
Scores: [ 8 8 6 ]
The current electroencephalogram (EEG) based deep learning models are typically designed for specific datasets and applications in brain-computer interaction (BCI), limiting the scale of the models and thus diminishing their perceptual capabilities and generalizability. Recently, Large Language Models (LLMs) have achieved unprecedented success in text processing, prompting us to explore the capabilities of Large EEG Models (LEMs). We hope that LEMs can break through the limitations of different task types of EEG datasets, and obtain universal perceptual capabilities of EEG signals through unsupervised pre-training. Then the models can be fine-tuned for different downstream tasks. However, compared to text data, the volume of EEG datasets is generally small and the format varies widely. For example, there can be mismatched numbers of electrodes, unequal length data samples, varied task designs, and low signal-to-noise ratio. To overcome these challenges, we propose a unified foundation model for EEG called Large Brain Model (LaBraM). LaBraM enables cross-dataset learning by segmenting the EEG signals into EEG channel patches. Vector-quantized neural spectrum prediction is used to train a semantically rich neural tokenizer that encodes continuous raw EEG channel patches into compact neural codes. We then pre-train neural Transformers by predicting the original neural codes for the masked EEG channel patches. The LaBraMs were pre-trained on about 2,500 hours of various types of EEG signals from around 20 datasets and validated on multiple different types of downstream tasks. Experiments on abnormal detection, event type classification, emotion recognition, and gait prediction show that our LaBraM outperforms all compared SOTA methods in their respective fields. Our code will be released.
Primary Area: applications to neuroscience & cognitive science
Keywords: Brain-Computer Interfaces Geometric Deep Learning Self-Supervised Learning Neuroscience
Scores: [ 8 6 8 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: knowledge tracing interpretable representations knowledge graphs probabilistic models variational inference continual learning
Scores: [ 6 5 8 8 ]
Intelligent tutoring systems optimize the selection and timing of learning materials to enhance understanding and long-term retention. This requires estimates of both the learner's progress ("knowledge tracing"; KT), and the prerequisite structure of the learning domain ("knowledge mapping"). While recent deep learning models achieve high KT accuracy, they do so at the expense of the interpretability of psychologically-inspired models. In this work, we present a solution to this trade-off. PSI-KT is a hierarchical generative approach that explicitly models how both individual cognitive traits and the prerequisite structure of knowledge influence learning dynamics, thus achieving interpretability by design. Moreover, by using scalable Bayesian inference, PSI-KT targets the real-world need for efficient personalization even with a growing body of learners and interaction data. Evaluated on three datasets from online learning platforms, PSI-KT achieves superior multi-step predictive accuracy and scalable inference in continual-learning settings, all while providing interpretable representations of learner-specific traits and the prerequisite structure of knowledge that causally supports learning. In sum, predictive, scalable and interpretable knowledge tracing with solid knowledge mapping lays a key foundation for effective personalized learning to make education accessible to a broad, global audience.
Primary Area: applications to neuroscience & cognitive science
Keywords: Forward-only learning Biologically inspired learning Artificial neural networks Analytical characterization
Scores: [ 5 5 8 8 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: peripheral vision object detection dataset foveation psychophysics
Scores: [ 3 6 8 ]
Evaluating deep neural networks (DNNs) as models of human perception has given rich insights into both human visual processing and representational properties of DNNs. We extend this work by analyzing how well DNNs perform compared to humans when constrained by peripheral vision -- which limits human performance on a variety of tasks, but also benefits the visual system significantly. We evaluate this by (1) modifying the Texture Tiling Model (TTM), a well tested model of peripheral vision to be more flexibly used with DNNs, (2) generating a large dataset which we call COCO-Periph that contains images transformed to capture the information available in human peripheral vision, and (3) comparing DNNs to humans at peripheral object detection using a psychophysics experiment. Our results show that common DNNs underperform at object detection compared to humans when simulating peripheral vision with TTM. Training on COCO-Periph begins to reduce the gap between human and DNN performance and leads to small increases in corruption robustness, but DNNs still struggle to capture human-like sensitivity to peripheral clutter. Our work brings us closer to accurately modeling human vision, and paves the way for DNNs to mimic and sometimes benefit from properties of human visual processing.
Primary Area: applications to neuroscience & cognitive science
Keywords: Hidden markov models generalized linear models functional connectivity inference
Scores: [ 8 3 8 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: computational neuroscience interpretable dynamics motor control animal behavior dynamical systems system identification unsupervised learning zebrafish
Scores: [ 6 6 5 6 ]
A central objective in neuroscience is to understand how the brain orchestrates movement. Recent advances in automated tracking technologies have made it possible to document behavior with unprecedented temporal resolution and scale, generating rich datasets which can be exploited to gain insights into the neural control of movement. One common approach is to identify stereotypical motor primitives using cluster analysis. However, this categorical description can limit our ability to model the effect of more continuous control schemes. Here we take a control theoretic approach to behavioral modeling and argue that movements can be understood as the output of a controlled dynamical system. Previously, models of movement dynamics, trained solely on behavioral data, have been effective in reproducing observed features of neural activity. These models addressed specific scenarios where animals were trained to execute particular movements upon receiving a prompt. In this study, we extend this approach to analyze the full natural locomotor repertoire of an animal: the zebrafish larva. Our findings demonstrate that this repertoire can be effectively generated through a sparse control signal driving a latent Recurrent Neural Network (RNN). Our model's learned latent space preserves key kinematic features and disentangles different categories of movements. To further interpret the latent dynamics, we used balanced model reduction to yield a simplified model. Collectively, our methods serve as a case study for interpretable system identification, and offer a novel framework for understanding neural activity in relation to movement.
Primary Area: applications to neuroscience & cognitive science
Keywords: Hippocampal formation; episodic memory; relational learning
Scores: [ 6 8 8 5 ]
Motivated by a recent neuroscientific hypothesis, some theoretical studies have accounted for neural cognitive maps in the rodent hippocampal formation as a representation of the general relational structure across task environments. However, despite their remarkable results, it is unclear whether their account can be extended to more general settings beyond spatial random-walk tasks in 2D environments. To address this question, we construct a novel cognitive model that performs memory-based decision-making tasks, inspired by previous human studies, for learning abstract relational structures. Building on previous approaches of modular architecture, we develop a learning algorithm that performs reward-guided relational inference across different entity domains, while maintaining dynamic binding between abstract relations and concrete entities using our specific memory mechanism enabling content replacement. Our experiments show (i) the capability of our model to capture relational structures that can generalize over new domains with unseen entities, (ii) the difficulty of our task that leads previous models, including Neural Turing Machine and vanilla Transformer, to complete failure, and (iii) the similarity of performance and internal representations of our model to recent human behavioral and fMRI experimental data, a cognitive map representation in the human hippocampal formation.
Primary Area: applications to neuroscience & cognitive science
Keywords: EEG Emotion Recognition Multi-modal Domain Adaptation cross-subject
Scores: [ 8 8 6 6 ]
The research on human emotion under electroencephalogram (EEG) is an emerging field in which cross-subject emotion recognition (ER) is a promising but challenging task. Many approaches attempt to find emotionally relevant domain-invariant features using domain adaptation (DA) to improve the accuracy of cross-subject ER. Two problems exist with these methods. First, only single-modal data (EEG) are utilized, ignoring the complementarity between multi-modal physiological signals. Second, these methods aim to completely match the signal feature between different domains, which is difficult due to the extreme individual differences of EEG. To solve these problems, we introduced the complementarity of multi-modal physiological signals and proposed a new method for cross-subject ER that does not align the distribution of signal features but rather the distribution of spatio-temporal relationships between features. We design a Variational Bayesian Heterogeneous Graph Neural Network (VBH-GNN) with Relationship Distribution Adaptation (RDA). The RDA first aligns the domains by expressing the model space as a posterior distribution of a heterogeneous graph (HetG) for a given source domain through Bayesian graph inference. Then RDA transforms the HetG into an emotion-specific graph to further align the domains for the downstream ER task. Extensive experiments on two public datasets, DEAP and Dreamer, show that our VBH-GNN outperforms state-of-the-art methods.
Primary Area: applications to neuroscience & cognitive science
Keywords: spiking neural networks online training
Scores: [ 6 8 8 6 ]
Spiking neural networks (SNNs), attributed to the binary, event-driven nature of spikes, possess heightened biological plausibility and enhanced energy efficiency on neuromorphic hardware compared to analog neural networks (ANNs). Mainstream SNN training schemes apply backpropagation-through-time (BPTT) with surrogate gradients to replace the non-differentiable spike emitting process during backpropagation. While achieving competitive performance, the requirement for storing intermediate information at all time-steps incurs higher memory consumption and fails to fulfill the online property crucial to biological brains. Our work focuses on online training techniques, aiming for memory efficiency while preserving biological plausibility. The limitation of not having access to future information in early time steps in online training has constrained previous efforts to incorporate advantageous modules such as batch normalization. To address this problem, we propose Online Spiking Renormalization (OSR) to ensure consistent parameters between testing and training, and Online Threshold Stabilizer (OTS) to stabilize neuron firing rates across time steps. Furthermore, we design a novel online approach to compute the variable mean and variance over time for OSR. Experiments conducted on various datasets demonstrate the proposed method's superior performance among SNN online training algorithms.
Primary Area: applications to neuroscience & cognitive science
Keywords: critical learning periods deep neural networks gradient descent linear networks
Scores: [ 8 6 10 5 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: brain simulator brain simulation computational neuroscience brain-inspired computing
Scores: [ 6 10 8 6 6 ]
Brain simulation builds dynamical models to mimic the structure and functions of the brain, while brain-inspired computing (BIC) develops intelligent systems by learning from the structure and functions of the brain. The two fields are intertwined and should share a common programming framework to facilitate each other's development. However, none of the existing software in the fields can achieve this goal, because traditional brain simulators lack differentiability for training, while existing deep learning (DL) frameworks fail to capture the biophysical realism and complexity of brain dynamics. In this paper, we introduce BrainPy, a differentiable brain simulator developed using JAX and XLA, with the aim of bridging the gap between brain simulation and BIC. BrainPy expands upon the functionalities of JAX, a powerful AI framework, by introducing complete capabilities for flexible, efficient, and scalable brain simulation. It offers a range of sparse and event-driven operators for efficient and scalable brain simulation, an abstraction for managing the intricacies of synaptic computations, a modular and flexible interface for constructing multi-scale brain models, and an object-oriented just-in-time compilation approach to handle the memory-intensive nature of brain dynamics. We showcase the efficiency and scalability of BrainPy on benchmark tasks, highlight its differentiable simulation for biologically plausible spiking models, and discuss its potential to support research at the intersection of brain simulation and BIC.
Primary Area: applications to neuroscience & cognitive science
Keywords: Efficient coding object representation dropout robustness human fMRI occipitotemporal cortex cognitive neuroscience distributed coding
Scores: [ 6 6 6 6 ]
According to the efficient coding hypothesis, neural populations encode information optimally when representations are high-dimensional and uncorrelated. However, such codes may carry a cost in terms of generalization and robustness. Past empirical studies of early visual cortex (V1) in rodents have suggested that this tradeoff indeed constrains sensory representations. However, it remains unclear whether these insights generalize across the hierarchy of the human visual system, and particularly to object representations in high-level occipitotemporal cortex (OTC). To gain new empirical clarity, here we develop a family of object recognition models with parametrically varying dropout proportion p, which induces systematically varying dimensionality of internal responses (while controlling all other inductive biases). We find that increasing dropout produces an increasingly smooth, low-dimensional representational space. Optimal robustness to lesioning is observed at around 70% dropout, after which both accuracy and robustness decline. Representational comparison to large-scale 7T fMRI data from occipitotemporal cortex in the Natural Scenes Dataset reveals that this optimal degree of dropout is also associated with maximal emergent neural predictivity. Finally, using new techniques for achieving denoised estimates of the eigenspectrum of human fMRI responses, we compare the rate of eigenspectrum decay between model and brain feature spaces. We observe that the match between model and brain representations is associated with a common balance between efficiency and robustness in the representational space. These results suggest that varying dropout may reveal an optimal point of balance between the efficiency of high-dimensional codes and the robustness of low dimensional codes in hierarchical vision systems.
Primary Area: applications to neuroscience & cognitive science
Keywords: neural dynamics transfer learning distribution alignment neuroscience few-shot learning
Scores: [ 6 8 6 6 ]
Large scale inference models are widely used in neuroscience to extract latent representations from high-dimensional neural recordings. Due to the statistical heterogeneities between sessions and animals, a new model is trained from scratch to infer the underlying dynamics for each new dataset. This is computationally expensive and does not fully leverage all the available data. Moreover, as these models get more complex, they can be challenging to train. In parallel, it is becoming common to use pre-trained models in the machine learning community for few shot and transfer learning. One major hurdle that prevents the re-use of generative models in neuroscience is the complex spatio-temporal structure of neural dynamics within and across animals. Interestingly, the underlying dynamics identified from different datasets on the same task are qualitatively similar. In this work, we exploit this observation and propose a source-free and unsupervised alignment approach that utilizes the learnt dynamics and enables the re-use of trained generative models. We validate our approach on simulations and show the efficacy of the alignment on neural recordings from the motor cortex obtained during a reaching task.
Primary Area: applications to neuroscience & cognitive science
Keywords: clustering discriminative stimuli interpretable optimization expectation-maximization functional cell types digital twins feature visualization pre-image search maximally exciting image
Scores: [ 5 6 5 6 ]
Identifying cell types and understanding their functional properties is crucial for unraveling the mechanisms underlying perception and cognition. For example in the retina, functional types can be identified by a carefully selected and manually curated battery of stimuli. However, this requires expert domain knowledge and biases the procedure towards previously known cell types. In the visual cortex, it is still unknown what functional types exist and how to identify them. Thus, for unbiased identification of the functional cell types in retina and visual cortex, new approaches are needed. Here we propose an optimization-based clustering approach using deep predictive models to obtain functional clusters of neurons using Maximally Discriminative Stimuli (MDS). Our approach alternates between stimulus optimization with cluster reassignment akin to an expectation-maximization algorithm. The algorithm recovers functional clusters in mouse retina, marmoset retina and macaque visual area V4. This demonstrates that our approach can successfully find discriminative stimuli across species, stages of the visual system and recording techniques. Presenting maximally discriminative stimuli during data acquisition allows for on-the-fly assignment to functional cell types, and paves the way for experiments that were previously limited by experimental time. Crucially, MDS are interpretable: they visualize the distinctive stimulus patterns that most unambiguously identify a specific type of neuron. We will make our code avail- able online upon publication.
Primary Area: applications to neuroscience & cognitive science
Keywords: Spiking Neural Networks Delays Neuromorphic Computing Speech Recognition
Scores: [ 6 8 8 6 ]
Primary Area: applications to neuroscience & cognitive science
Keywords: brain decoding neuroimaging image generation visual perception
Scores: [ 6 8 8 6 ]
In the past five years, the use of generative and foundational AI systems has greatly improved the decoding of brain activity. Visual perception, in particular, can now be decoded from functional Magnetic Resonance Imaging (fMRI) with remarkable fidelity. This neuroimaging technique, however, suffers from a limited temporal resolution ($\approx$0.5,Hz) and thus fundamentally constrains its real-time usage. Here, we propose an alternative approach based on magnetoencephalography (MEG), a neuroimaging device capable of measuring brain activity with high temporal resolution ($\approx$5,000 Hz). For this, we develop an MEG decoding model trained with both contrastive and regression objectives and consisting of three modules: i) pretrained embeddings obtained from the image, ii) an MEG module trained end-to-end and iii) a pretrained image generator. Our results are threefold: Firstly, our MEG decoder shows a 7X improvement of image-retrieval over classic linear decoders. Second, late brain responses to images are best decoded with DINOv2, a recent foundational image model. Third, image retrievals and generations both suggest that MEG signals primarily contain high-level visual features, whereas the same approach applied to 7T fMRI also recovers low-level features. Overall, these results provide an important step towards the decoding - in real time - of the visual processes continuously unfolding within the human brain.
Primary Area: applications to neuroscience & cognitive science
Keywords: Spiking neural networks Spiking transformer Transformer-based SNNs Neuromorphic chips Spike-driven self-attention
Scores: [ 5 6 6 ]
Neuromorphic computing, which exploits Spiking Neural Networks (SNNs) on neuromorphic chips, is a promising energy-efficient alternative to traditional AI. CNN-based SNNs are the current mainstream of neuromorphic computing. By contrast, no neuromorphic chips are designed especially for Transformer-based SNNs, which have just emerged, and their performance is only on par with CNN-based SNNs, offering no distinct advantage. In this work, we propose a general Transformer-based SNN architecture, termed as Meta-SpikeFormer", whose goals are: (1) Lower-power, supports the spike-driven paradigm that there is only sparse addition in the network; (2) Versatility, handles various vision tasks; (3) High-performance, shows overwhelming performance advantages over CNN-based SNNs; (4) Meta-architecture, provides inspiration for future next-generation Transformer-based neuromorphic chip designs. Specifically, we extend the Spike-driven Transformer in \citet{yao2023spike} into a meta architecture, and explore the impact of structure, spike-driven self-attention, and skip connection on its performance. On ImageNet-1K, Meta-SpikeFormer achieves 80.0% top-1 accuracy (55M), surpassing the current state-of-the-art (SOTA) SNN baselines (66M) by 3.7%. This is the first direct training SNN backbone that can simultaneously supports classification, detection, and segmentation, obtaining SOTA results in SNNs. Finally, we discuss the inspiration of the meta SNN architecture for neuromorphic chip design.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: natural language processing large language model prompting application in healthcare privacy
Scores: [ 6 6 6 6 ]
Large language models (LLMs) demonstrate remarkable medical expertise, but data privacy concerns impede their direct use in healthcare environments. Although offering improved data privacy protection, domain-specific small language models (SLMs) often underperform LLMs, emphasizing the need for methods that reduce this performance gap while alleviating privacy concerns. In this paper, we present a simple yet effective method that harnesses LLMs' medical proficiency to boost SLM performance in medical tasks under privacy−restricted scenarios. Specifically, we mitigate patient privacy issues by extracting keywords from medical data and prompting the LLM to generate a medical knowledge-intensive context by simulating clinicians' thought processes. This context serves as additional input for SLMs, augmenting their decision-making capabilities. Our method significantly enhances performance in both few-shot and full training settings across three medical knowledge-intensive tasks, achieving up to a 22.57% increase in absolute accuracy compared to SLM fine-tuning without context, and sets new state-of-the-art results in two medical tasks within privacy-restricted scenarios. Further out-of-domain testing and experiments in two general domain datasets showcase its generalizability and broad applicability.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Weather Forecasting Typhoon Trajectory Forecasting Tropical Cyclone Climate Change
Scores: [ 8 8 6 3 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Equivariant Graph Neural Network rigid body protein-protein docking interface fitting
Scores: [ 5 8 8 3 ]
The study of rigid protein-protein docking plays an essential role in a variety of tasks such as drug design and protein engineering. Recently, several learning-based methods have been proposed for the task, exhibiting much faster docking speed than those computational methods. In this paper, we propose a novel learning-based method called ElliDock, which predicts an elliptic paraboloid to represent the protein-protein docking interface. To be specific, our model estimates elliptic paraboloid interfaces for the two input proteins respectively, and obtains the roto-translation transformation for docking by making two interfaces coincide. By its design, ElliDock is independently equivariant with respect to arbitrary rotations/translations of the proteins, which is an indispensable property to ensure the generalization of the docking process. Experimental evaluations show that ElliDock achieves the fastest inference time among all compared methods, and outperforms state-of-the-art learning-based methods, like DiffDock-PP and Alphafold-Multimer, for particularly antibody-antigen docking.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: generative modeling langevin mcmc energy-based models score-based models protein design protein discovery
Scores: [ 8 8 8 ]
We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: equivariance graph neural networks long range
Scores: [ 6 8 5 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Large Language Models prompt retrieval domain feedback conversation drug editing drug optimization controllable generation small molecule peptide protein
Scores: [ 6 6 6 ]
Recent advancements in conversational large language models (LLMs), such as ChatGPT, have demonstrated remarkable promise in various domains, including drug discovery. However, existing works mainly focus on investigating the capabilities of conversational LLMs on chemical reactions and retrosynthesis. While drug editing, a critical task in the drug discovery pipeline, remains largely unexplored. To bridge this gap, we propose ChatDrug, a framework to facilitate the systematic investigation of drug editing using LLMs. ChatDrug jointly leverages a prompt module, a retrieval and domain feedback module, and a conversation module to streamline effective drug editing. We empirically show that ChatDrug reaches the best performance on all 39 drug editing tasks, encompassing small molecules, peptides, and proteins. We further demonstrate, through 10 case studies, that ChatDrug can successfully identify the key substructures for manipulation, generating diverse and valid suggestions for drug editing. Promisingly, we also show that ChatDrug can offer insightful explanations from a domain-specific perspective, enhancing interpretability and enabling informed decision-making.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: neural ODE time-series forecasting climate prediction physics-informed ML
Scores: [ 8 8 8 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Model based optimization protein engineering
Scores: [ 6 8 5 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: protein-protein interactions protein design generalization self-supervised learning equivariant 3D representations
Scores: [ 3 6 6 8 6 ]
Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics. While machine learning approaches have substantially advanced the field, they often struggle to generalize beyond training data in practical scenarios. The contributions of this work are three-fold. First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage PPIRef to pre-train PPIformer, a new SE(3)-equivariant model, generalizing across diverse protein-binder variants. We fine-tune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on the new non-leaking splits of the standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing staphylokinase thrombolytic activity.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: inverse design generative design PDE physical simulation compositional
Scores: [ 8 8 6 6 ]
Inverse design, where we seek to design input variables in order to optimize an underlying objective function, is an important problem that arises across fields such as mechanical engineering to aerospace engineering. Inverse design is typically formulated as an optimization problem, with recent works leveraging optimization across learned dynamics models. However, as models are optimized they tend to fall into adversarial modes, preventing effective sampling. We illustrate that by instead optimizing over the learned energy function captured by the diffusion model, we can avoid such adversarial examples and significantly improve design performance. We further illustrate how such a design system is compositional, enabling us to combine multiple different diffusion models representing subcomponents of our desired system to design systems with every specified component. In an N-body interaction task and a challenging 2D multi-airfoil design task, we demonstrate that by composing the learned diffusion model at test time, our method allows us to design initial states and boundary shapes that are more complex than those in the training data. Our method outperforms state-of-the-art neural inverse design method by an average of 41.5% in prediction MAE and 14.3% in design objective for the N-body dataset and discovers formation flying to minimize drag in the multi-airfoil design task.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Generative models for materials diffusion models density function theory
Scores: [ 6 8 5 6 ]
Generative models trained on internet-scale data are capable of generating novel and realistic texts, images, and videos. A natural next question is whether these models can advance science, for example by generating novel stable materials. Traditionally, models with explicit structures (e.g., graphs) have been used in modeling structural relationships in scientific data (e.g., atoms and bonds in crystals), but generating structures can be difficult to scale to large and complex systems. Another challenge in generating materials is the mismatch between standard generative modeling metrics and downstream applications. For instance, common metrics such as the reconstruction error do not correlate well with the downstream goal of discovering novel stable materials. In this work, we tackle the scalability challenge by developing a unified crystal representation that can represent any crystal structure (UniMat), followed by training a diffusion probabilistic model on these UniMat representations. Our empirical results suggest that despite the lack of explicit structure modeling, UniMat can generate high fidelity crystal structures from larger and more complex chemical systems, outperforming previous graph-based approaches under various generative modeling metrics. To better connect the generation quality of materials to downstream applications, such as discovering novel stable materials, we propose additional metrics for evaluating generative models of materials, including per-composition formation energy and stability with respect to convex hulls through decomposition energy from Density Function Theory (DFT). Lastly, we show that conditional generation with UniMat can scale to previously established crystal datasets with up to millions of crystals structures, outperforming random structure search (the current leading method for structure discovery) in discovering new stable materials.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: quantum properties estimation pre-training fine-tuning
Scores: [ 8 8 8 8 ]
Properties estimation for quantum systems is crucial for addressing quantum many-body problems in physics and chemistry. Recently, task-specific deep learning models have exhibited an enhanced capacity to estimate the properties, surpassing the performance of conventional statistical approaches. However, with rapid escalation of quantum computers, existing learning-based models fall short in learning from explosion of quantum data generated by the systems under different physical conditions. Inspired by the triumphs of Large Language Models in Natural Language Processing and Computer Vision, we introduce Q-TAPE, a task-agnostic pre-trained model that 1) facilitates learning of the rich information from diverse quantum systems with different physical conditions in a fully unsupervised fashion; 2) delivers high performance with limited training data, mitigating the cost for quantum data collection and reducing the time for convergence for different supervised tasks. Extensive experiments demonstrate the promising efficacy of Q-TAPE in various tasks including classifying quantum phases of matter on Rydberg atom model and predicting two-body correlation function on anisotropic Heisenberg model. Source code will be made publicly available.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: generative model large language model stable materials AI for science
Scores: [ 6 5 6 ]
Deep learning models have drastically accelerated materials discovery by accelerating predictive computational simulations like density functional theory (DFT). Large open computational materials databases such as the Materials Project or OQMD contain O(106) known structures, and it is now straightforward to search those databases for materials with exciting properties. However, these databases are limited to experimentally known materials or candidates discovered in high-throughput computational campaigns. Many state-of-the-art engineering advances in solar photovaltaics, battery electrodes, and catalysts are made by discovering materials with outstanding properties that have not yet been discovered. Generative models are a natural solution to expand families of interest through sampling. While popular methods are typically constructed from variational autoencoders or diffusion models, we propose fine-tuning large language models for generation of stable materials. While unorthodox, fine-tuning large language models on text-encoded atomistic data is simple to implement yet reliable, with around 90% of sampled structures obeying physical constraints on atom positions and charges. Using energy of hull calculations from both learned ML potentials and gold-standard DFT calculations, we show that our strongest model (fine-tuned LLaMA-2 70B) can generate materials predicted to be metastable at about twice the rate (49% vs 28%) of CDVAE, a competing diffusion model. Because of text prompting's inherent flexibility, our models can simultaneously be used for unconditional generation of stable material, infilling of partial structures and text-conditional generation. Finally, we show that language models' ability to capture key symmetries of crystal structures improves with model scale, suggesting that the biases of pretrained LLMs are surprisingly well-suited for atomistic data.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Metagenomics Binning Computational Genomics Graph Neural Networks
Scores: [ 5 3 5 ]
Metagenomics studies genomic material derived from mixed microbial communities in diverse environments, holding considerable significance for both human health and environmental sustainability. Metagenomic binning refers to the clustering of genomic subsequences obtained from high-throughput DNA sequencing into distinct bins, each representing a constituent organism within the community. Mainstream binning methods primarily rely on sequence features such as composition and abundance, making them unable to effectively handle sequences shorter than 1,000 bp and inherent noise within sequences. Several binning tools have emerged, aiming to enhance binning outcomes by using the assembly graph generated by assemblers, which encodes valuable overlapping information among genomic sequences. However, existing assembly graph-based binners mainly focus on simplified contig-level assembly graphs that are recreated from assembler’s original graphs, unitig-level assembly graphs. The simplification reduces the resolution of the connectivity information in original graphs. In this paper, we design a novel binning tool named UnitigBin, which leverages representation learning on unitig-level assembly graphs while adhering to heterophilious constraints imposed by single-copy marker genes, ensuring that constrained contigs cannot be grouped together. Extensive experiments conducted on synthetic and real datasets demonstrate that UnitigBin significantly surpasses state-of-the-art binning tools.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Physics-Informed Machine Learning PDEs differentiable optimization neural networks mixture of experts constrained optimization
Scores: [ 6 6 6 6 6 ]
Imposing known physical constraints, such as conservation laws, during the training of neural networks introduces an inductive bias that can improve accuracy, convergence, and data efficiency for physical problems. While such constraints can be softly imposed via loss function penalties, recent advancements in differentiable physics and optimization improve accuracy by integrating constrained optimization within neural network training, enabling a stricter adherence to physical constraints. However, imposing hard constraints significantly increases computational and memory costs, especially for complex dynamical systems, as it requires solving the differentiable optimization problem over a large number of points in the spatiotemporal domain. To address this challenge, we develop a scalable approach to enforce hard physical constraints using Mixture-of-Experts (MoE). Our approach imposes the constraint over smaller decomposed domains, each of which is solved by an expert'' through differentiable optimization. During training, each expert independently performs a localized backpropagation step by leveraging the implicit function theorem; the independence of each expert allows for parallelization across multiple GPUs. Compared to standard differentiable optimization, our scalable approach achieves greater accuracy on challenging non-linear systems, concurrently improving training stability and requiring significantly less computation time during both training and inference stages.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: EHR Prediction Personalized Knowledge Graph Graph Neural Network
Scores: [ 6 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: mechanical metamaterials lattices elasticity GNN equivariant positive definite energy conservation
Scores: [ 6 5 8 5 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Protein Representation Learning Protein Structure
Scores: [ 8 6 5 6 ]
Protein structure representation learning is the foundation for promising applications in drug discovery, protein design, and protein function prediction. However, there remains a need for a robust, standardised benchmark to track the progress of new and established methods with greater granularity and relevance to downstream applications. In this work, we introduce a comprehensive and open benchmark suite for evaluating protein structure representation learning methods.We provide several pre-training methods, downstream tasks and pre-training corpora comprised of both experimental and predicted structures, offering a balanced challenge to representation learning algorithms. These tasks enable the systematic evaluation of the quality of the learned embeddings, the structural and functional relationships captured, and their usefulness in downstream tasks. We benchmark state-of-the-art protein-specific and generic geometric Graph Neural Networks and the extent to which they benefit from different types of pre-training. We find that pre-training consistently improves the performance of both rotation-invariant and equivariant models, and that equivariant models seem to benefit even more from pre-training compared to invariant models.We aim to establish a common ground for the machine learning and computational biology communities to collaborate, compare, and advance protein structure representation learning. By providing a standardised and rigorous evaluation platform, we expect to accelerate the development of novel methodologies and improve our understanding of protein structures and their functions. The codebase incorporates several engineering contributions which considerably reduces the barrier to entry for pre-training and working with large structure-based datasets. Our benchmark is available at: https://anonymous.4open.science/r/ProteinWorkshop-B8F5/
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Multiple Instance Learning Histopathology Nearest Neighbors Graph Representation
Scores: [ 8 6 8 8 ]
The visual examination of tissue biopsy sections is fundamental for cancer diagnosis, with pathologists analyzing sections at multiple magnifications to discern tumor cells and their subtypes. However, existing attention-based multiple instance learning (MIL) models, used for analyzing Whole Slide Images (WSIs) in cancer diagnostics, often overlook the contextual information of tumor and neighboring tiles, leading to misclassifications. To address this, we propose the Context-Aware Multiple Instance Learning (CAMIL) architecture. CAMIL incorporates neighbor-constrained attention to consider dependencies among tiles within a WSI and integrates contextual constraints as prior knowledge into the MIL model. We evaluated CAMIL on subtyping non-small cell lung cancer (TCGA-NSCLC) and detecting lymph node (CAMELYON16) metastasis, achieving test AUCs of 0.959% and 0.975%, respectively, outperforming other state-of-the-art methods. Additionally, CAMIL enhances model interpretability by identifying regions of high diagnostic value
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: protein design discrete optimization protein engineering markov chain monte carlo graph signal processing
Scores: [ 6 6 6 3 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Molecular Modeling Quantum Chemistry Fragmentation Non-Local Interactions
Scores: [ 8 5 6 6 ]
Computational simulation of chemical and biological systems using ab initio molecular dynamics has been a challenge over decades. Researchers have attempted to address the problem with machine learning and fragmentation-based methods. However, the two approaches fail to give a satisfactory description of long-range and many-body interactions, respectively. Inspired by fragmentation-based methods, we propose the Long-Short-Range Message-Passing (LSR-MP) framework as a generalization of the existing equivariant graph neural networks (EGNNs) with the intent to incorporate long-range interactions efficiently and effectively. We apply the LSR-MP framework to the recently proposed ViSNet and demonstrate the state-of-the-art results with up to 40% MAE reduction for molecules in MD22 and Chignolin datasets. Consistent improvements to various EGNNs will also be discussed to illustrate the general applicability and robustness of our LSR-MP framework. The code for our experiments and trained model weights could be found at https://anonymous.4open.science/r/LSRM-760E/.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Quantum Computing Reinforcement Learning Quantum Chemistry Quantum Architecture Search Optimization
Scores: [ 5 5 6 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: protein-protein docking energy-based model geometric deep learning energy-function
Scores: [ 6 8 3 6 ]
Protein complex formation, a pivotal challenge in contemporary biology, has recently gained interest from the machine learning community, particularly concerning protein-ligand docking tasks. In this paper, we delve into the equally crucial but comparatively under-investigated domain of protein-protein docking. Specifically, we propose a geometric deep learning framework, termed EBMDock, which employs statistical potential as its energy function. This approach produces a probability distribution over docking poses, such that the identified docking pose aligns with a minimum point in the energy landscape. We employ a differential algorithm grounded in Langevin dynamics to efficiently sample from the docking pose distribution. Additionally, we incorporate energy-based training using contrastive divergence, enhancing both performance and stability. Empirical results demonstrate that our approach achieves superior performance on two benchmark datasets DIPS and DB5.5. Furthermore, the results suggest EBMDock can serve as an orthogonal enhancement to existing methods. Source code will be released.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: ML PDE solver gaussian process PINN
Scores: [ 6 5 6 6 ]
Machine learning based solvers have garnered much attention in physical simulation and scientific computing, with a prominent example, physics-informed neural networks (PINNs). However, PINNs often struggle to solve high-frequency and multi-scale PDEs, which can be due to the spectral bias during neural network training. To address this problem, we resort to the Gaussian process (GP) framework. To flexibly capture the dominant frequencies, we model the power spectrum of the PDE solution with a student t mixture or Gaussian mixture. We then apply inverse Fourier transform to obtain the covariance function (according to the Wiener-Khinchin theorem). The covariance derived from the Gaussian mixture spectrum corresponds to the known spectral mixture kernel. We are the first to discover its rationale and effectiveness for PDE solving. Next, we estimate the mixture weights in the log domain, which we show is equivalent to placing a Jeffreys prior. It automatically induces sparsity, prunes excessive frequencies, and adjusts the remaining toward the ground truth. Third, to enable efficient and scalable computation on massive collocation points, which are critical to capture high frequencies, we place the collocation points on a grid, and multiply our covariance function at each input dimension. We use the GP conditional mean to predict the solution and its derivatives so as to fit the boundary condition and the equation itself. As a result, we can derive a Kronecker product structure in the covariance matrix. We use Kronecker product properties and multilinear algebra to greatly promote computational efficiency and scalability, without any low-rank approximations. We show the advantage of our method in systematic experiments.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: transformers graph neural networks electronic health records
Scores: [ 5 8 5 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Self-supervised learning Representation learning Foundation models Biosignals Wearable devices Health Photoplethysmography PPG Electrocardiogram ECG
Scores: [ 8 5 8 6 ]
Tracking biosignals is crucial for monitoring wellness and preempting the development of severe medical conditions. Today, wearable devices can conveniently record various biosignals, creating the opportunity to monitor health status without disruption to one's daily routine. Despite widespread use of wearable devices and existing digital biomarkers, the absence of curated data with annotated medical labels hinders the development of new biomarkers to measure common health conditions. In fact, medical datasets are usually small in comparison to other domains, which is an obstacle for developing neural network models for biosignals. To address this challenge, we have employed self-supervised learning using the unlabeled sensor data collected under informed consent from the large longitudinal Apple Heart and Movement Study (AHMS) to train foundation models for two common biosignals: photoplethysmography (PPG) and electrocardiogram (ECG) recorded on Apple Watch. We curated PPG and ECG datasets from AHMS that include data from ${\sim} 141$K participants spanning ∼3 years. Our self-supervised learning framework includes participant level positive pair selection, stochastic augmentation module and a regularized contrastive loss optimized with momentum training, and generalizes well to both PPG and ECG modalities. We show that the pre-trained foundation models readily encode information regarding participants' demographics and health conditions. To the best of our knowledge, this is the first study that builds foundation models using large-scale PPG and ECG data collected via wearable consumer devices \textendash prior works have commonly used smaller-size datasets collected in clinical and experimental settings. We believe PPG and ECG foundation models can enhance future wearable devices by reducing the reliance on labeled data and hold the potential to help the users improve their health.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: DNA Genome Language Model Foundation Model Benchmark
Scores: [ 6 8 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: multimodal molecular pretraining molecular representation learning modality blending
Scores: [ 6 5 8 5 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: chemistry reinforcement learning language models
Scores: [ 8 6 8 6 ]
Reinforcement learning (RL) over text representations can be effective for finding high-value policies that can search over graphs. However, RL requires careful structuring of the search space and algorithm design to be effective in this challenge. Through extensive experiments, we explore how different design choices for text grammar and algorithmic choices for training can affect an RL policy's ability to generate molecules with desired properties. We arrive at a new RL-based molecular design algorithm (ChemRLformer) and perform a thorough analysis using 25 molecule design tasks, including computationally complex protein docking simulations. From this analysis, we discover unique insights in this problem space and show that ChemRLformer achieves state-of-the-art performance while being more straightforward than prior work by demystifying which design choices are actually helpful for text-based molecule design.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: meta-learning physical systems multi-task learning interpretable deep learning identifiability electrostatics robotics control reinforcement learning scientific discovery
Scores: [ 5 6 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Protein Deign Graph Finetuning
Scores: [ 6 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: equivariant neural networks graph neural networks computational physics transformer networks
Scores: [ 6 6 6 ]
Equivariant Transformers such as Equiformer have demonstrated the efficacy of applying Transformers to the domain of 3D atomistic systems. However, they are limited to small degrees of equivariant representations due to their computational complexity. In this paper, we investigate whether these architectures can scale well to higher degrees. Starting from Equiformer, we first replace SO(3) convolutionswith eSCN convolutions to efficiently incorporate higher-degree tensors. Then, to better leverage the power of higher degrees, we propose three architectural improvements – attention re-normalization, separable S2 activation and separable layer normalization. Putting these all together, we propose EquiformerV2, which outperforms previous state-of-the-art methods on large-scale OC20 dataset by up to 9% on forces, 4% on energies, offers better speed-accuracy trade-offs, and 2$\times$ reduction in DFT calculations needed for computing adsorption energies. Additionally, EquiformerV2 trained on only OC22 dataset outperforms GemNet-OC trained on both OC20 and OC22 datasets, achieving much better data efficiency. Finally, we compare EquiformerV2 with Equiformer on QM9 and OC20 S2EF-2Mdatasets to better understand the performance gain brought by higher degrees.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: implicit neural representations field reconstruction sparse observation
Scores: [ 6 6 6 6 ]
Reliably reconstructing physical fields from sparse sensor data is a challenge that frequenty arises in many scientific domains. In practice, the process generating the data is often not known to sufficient accuracy. Therefore, there is a growing interest in the deep neural network route to the problem. In this work, we present a novel approach that learns a continuous representation of the field using implicit neural representations (INR). Specifically, after factorizing spatiotemporal variability into spatial and temporal components using the technique of separation of variables, the method learns relevant basis functions from sparsely sampled irregular data points to thus develop a continuous representation of the data. In experimental evaluations, the proposed model outperforms recent INR methods, offering superior reconstruction quality on simulation data from a state of the art climate model and on a second dataset that comprises of ultra-high resolution satellite-based sea surface temperature field.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: material generation spacegroup diffusion generative models
Scores: [ 6 8 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Diffusion Models Generative Modeling Protein Design Normal Mode Analysis
Scores: [ 5 5 6 6 ]
Current protein generative models are able to design novel backbones with desired shapes or functional motifs. However, despite the importance of a protein’s dynamical properties for its function, conditioning on dynamical properties remains elusive. We present a new approach to protein generative modeling by leveraging Normal Mode Analysis that enables us to capture dynamical properties too. We introduce a method for conditioning the diffusion probabilistic models on protein dynamics, specifically on the lowest non-trivial normal mode of oscillation. Our method, similar to the classifier guidance conditioning, formulates the sampling process as being driven by conditional and unconditional terms. However, unlike previous works, we approximate the conditional term with a simple analytical function rather than an external neural network, thus making the eigenvector calculations approachable. We present the corresponding SDE theory as a formal justification of our approach. We extend our framework to conditioning on structure and dynamics at the same time, enabling scaffolding of the dynamical motifs. We demonstrate the empirical effectiveness of our method by turning the open-source unconditional protein diffusion model Genie into the conditional model with no retraining. Generated proteins exhibit the desired dynamical and structural properties while still being biologically plausible. Our work represents a first step towards incorporating dynamical behaviour in protein design and may open the door to designing more flexible and functional proteins in the future.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: drug discovery foundation model multi-modal learning knowledge graph
Scores: [ 8 6 6 8 ]
Foundation models (FMs) are able to leverage large volumes of unlabeled data to demonstrate superior performance across a wide range of tasks. However, FMs developed for biomedical domains have largely remained unimodal, i.e., independently trained and used for tasks on protein sequences alone, small molecule structures alone, or clinical data alone.To overcome this limitation of biomedical FMs, we present BioBridge, a novel parameter-efficient learning framework, to bridge independently trained unimodal FMs to establish multimodal behavior. BioBridge achieves it by utilizing Knowledge Graphs (KG) to learn transformations between one unimodal FM and another without fine-tuning any underlying unimodal FMs.Our empirical results demonstrate that BioBridge canbeat the best baseline KG embedding methods (on average by ∼76.3%) in cross-modal retrieval tasks. We also identify BioBridge demonstrates out-of-domain generalization ability by extrapolating to unseen modalities or relations. Additionally, we also show that BioBridge presents itself as a general-purpose retriever that can aid biomedical multimodal question answering as well as enhance the guided generation of novel drugs.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Bioinformatics Protein-Protein Interaction Protein Sequence-Structure Co-Modeling
Scores: [ 8 6 3 ]
Protein-Protein Interactions (PPIs) are fundamental in various biological processes and play a key role in life activities. The growing demand and cost of experimental PPI assays require computational methods for efficient PPI prediction. While existing methods rely heavily on protein sequence for PPI prediction, it is the protein structure that is the key to determine the interactions. However, the complexity of protein structure modeling hinders its application to large-scale PPI prediction. To address this challenge and take both protein modalities into account, we define the microenvironment of an amino acid residue by its sequence and structural contexts, which describe the surrounding chemical properties and geometric features. In addition, microenvironments defined in previous work are largely based on experimentally assayed physicochemical properties, for which the "vocabulary" is usually extremely small. This makes it difficult to cover the diversity and complexity of microenvironments. In this paper, we propose Microenvironment-Aware Protein Embedding for PPI prediction (MPAE-PPI), which encodes microenvironments into chemically meaningful discrete codes via a sufficiently large microenvironment vocabulary" (i.e., codebook). Moreover, we propose a novel pre-training strategy, namely Masked Codebook Modeling (MCM), to capture the dependencies between different microenvironments by randomly masking the codebook and reconstructing the input. With the learned microenvironment codebook, we can reuse it as an off-the-shelf tool to efficiently and effectively encode proteins of different sizes and functions for large-scale PPI prediction. Extensive experiments show that MAPE-PPI can scale to PPI prediction with millions of PPIs with superior trade-offs between effectiveness and computational efficiency than the state-of-the-art competitors.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: neural PDE solvers adaptive moving mesh neural operators Monge-Ampère equation
Scores: [ 6 6 6 ]
Recently, neural networks have been extensively employed to solve partial differential equations (PDEs) in physical system modeling. While major studies focus on learning system evolution on predefined static mesh discretizations, some methods utilize reinforcement learning or supervised learning techniques to create adaptive and dynamic meshes, due to the dynamic nature of these systems. However, these approaches face two primary challenges: (1) the need for expensive optimal mesh data, and (2) the change of the solution space's degree of freedom and topology during mesh refinement. To address these challenges, this paper proposes a neural PDE solver with a neural mesh adapter. To begin with, we introduce a novel data-free neural mesh adaptor, called Data-free Mesh Mover (DMM), with two main innovations. Firstly, it is an operator that maps the solution to adaptive meshes and is trained using the Monge-Ampère equation without optimal mesh data. Besides, it dynamically changes the mesh by moving existing nodes rather than adding or deleting nodes and edges. Theoretical analysis shows that meshes generated by DMM have the lowest interpolation error bound. Based on DMM, to efficiently and accurately model dynamic systems, we develop a moving mesh based neural PDE solver (MM-PDE) that embeds the moving mesh with a two-branch architecture and a learnable interpolation framework to preserve information within the data. Empirical experiments demonstrate that our method generates suitable meshes and considerably enhances accuracy when modeling widely considered PDE systems.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Phylogenetic Inference GFlowNets Bayesian Inference Deep Generative Modeling
Scores: [ 6 8 8 5 ]
Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Retrosynthetic planning route evaluation reinforcement learning
Scores: [ 6 8 6 6 ]
Retrosynthetic planning is a sequential decision-making process of identifying synthetic routes from the available building block materials to reach a desired target molecule.Though existing planning approaches show promisingly high solving rates and low costs, the trivial route cost evaluation via pre-trained forward reaction prediction models certainly falls short of real-world chemical practice.An alternative option is to annotate the actual cost of a route, such as yield, through chemical experiments or input from chemists, while this often leads to substantial query costs.In order to strike the balance between query costs and route quality evaluation, we propose an Active Retrosynthetic Planning (ARP) framework that remains compatible with the established retrosynthetic planners.On one hand, the proposed ARP trains an actor that decides whether to query the cost of a reaction; on the other hand, it resorts to a critic to estimate the value of a molecule with its preceding reaction cost as input. Those molecules with low reaction costs are preferred to expand first.We apply our framework to different existing approaches on both the benchmark and an expert dataset and demonstrate that it outperforms the existing state-of-the-art approach by 6.2% in route quality while reducing the query cost by 12.8%.In addition, ARP consistently plans high-quality routes with either abundant or sparse annotations.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Materials Science Attention Networks Transformer Physics-Informed ML
Scores: [ 5 8 8 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Retrosynthesis planning chemistry search
Scores: [ 6 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: mathematics arithmetic transformers explainability
Scores: [ 5 6 8 5 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: backpropagation through time unrolling differentiable physics differentiable simulators optimization
Scores: [ 3 8 8 8 ]
Of all the vector fields surrounding the minima of recurrent learning setups, the gradient field with its exploding and vanishing updates appears a poor choice for optimization, offering little beyond efficient computability. We seek to improve this suboptimal practice in the context of physics simulations, where backpropagating feedback through many unrolled time steps is considered crucial to acquiring temporally coherent behavior. The alternative vector field we propose follows from two principles: physics simulators, unlike neural networks, have a balanced gradient flow and certain modifications to the backpropagation pass leave the positions of the original minima unchanged. As any modification of backpropagation decouples forward and backward pass, the rotation-free character of the gradient field is lost. Therefore, we discuss the negative implications of using such a rotational vector field for optimization and how to counteract them. Our final procedure is easily implementable via a sequence of gradient stopping and component-wise comparison operations, which do not negatively affect scalability. Our experiments on three control problems show that especially as we increase the complexity of each task, the unbalanced updates from the gradient can no longer provide the precise control signals necessary while our method still solves the tasks.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: pde generative diffusion navier-stokes cfd
Scores: [ 8 8 5 6 ]
Simulations of turbulent flows in 3D are one of the most expensive simulations in computational fluid dynamics (CFD). Many works have been written on surrogate models to replace numerical solvers for fluid flows with faster, learned, autoregressive models. However, the intricacies of turbulence in three dimensions necessitate training these models with very small time steps, while generating realistic flow states requires either long roll-outs with many steps and significant error accumulation or starting from a known, realistic flow state—something we aimed to avoid in the first place. Instead, we propose to approach turbulent flow simulation as a generative task directly learning the manifold of all possible turbulent flow states without relying on any initial flow state. For our experiments, we introduce a challenging 3D turbulence dataset of high-resolution flows and detailed vortex structures caused by various objects and derive two novel sample evaluation metrics for turbulent flows. On this dataset, we show that our generative model captures the distribution of turbulent flows caused by unseen objects and generates high-quality, realistic samples amenable for downstream applications without access to any initial state.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: AI for PDEs; physical simulation neural operators boundary-embedded
Scores: [ 6 8 5 6 8 ]
Elliptic partial differential equations (PDEs) are a major class of time-independent PDEs that play a key role in many scientific and engineering domains such as fluid dynamics, plasma physics, and solid mechanics. Recently, neural operators have emerged as a promising technique to solve elliptic PDEs more efficiently by directly mapping the input to solutions. However, existing networks typically neglect complex geometries and inhomogeneous boundary values present in the real world. Here we introduce Boundary-Embedded Neural Operators (BENO), a novel neural operator architecture that embeds the complex geometries and inhomogeneous boundary values into the solving of elliptic PDEs. Inspired by classical Green's function, BENO consists of two Graph Neural Networks (GNNs) for interior source term and boundary values, respectively. Furthermore, a Transformer encoder maps the global boundary geometry into a latent vector which influences each message passing layer of the GNNs. We test our model and strong baselines extensively in elliptic PDEs with complex boundary conditions. We show that all existing baseline methods fail to learn the solution operator. In contrast, our model, endowed with boundary-embedded architecture, outperforms state-of-the-art neural operators and strong baselines by an average of 60.96%.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Neural operator PDEs semi-linear evolution sequential learning filtering data assimilation
Scores: [ 5 8 5 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Structure-based drug design molecule optimization diffusion models 3D molecule generation
Scores: [ 6 5 6 8 6 ]
Recently, 3D generative models have shown promising performances in structure-based drug design by learning to generate ligands given target binding sites. However, only modeling the target-ligand distribution can hardly fulfill one of the main goals in drug discovery -- designing novel ligands with desired properties, e.g., high binding affinity, easily synthesizable, etc. This challenge becomes particularly pronounced when the target-ligand pairs used for training do not align with these desired properties. Moreover, most existing methods aim at solving de novo design task, while many generative scenarios requiring flexible controllability, such as R-group optimization and scaffold hopping, have received little attention. In this work, we propose DecompOpt, a structure-based molecular optimization method based on a controllable and decomposed diffusion model. DecompOpt presents a new generation paradigm which combines optimization with conditional diffusion models to achieve desired properties while adhering to the molecular grammar. Additionally, DecompOpt offers a unified framework covering both de novo design and controllable generation. To achieve so, ligands are decomposed into substructures which allows fine-grained control and local optimization. Experiments show that DecompOpt can efficiently generate molecules with improved properties than strong de novo baselines, and demonstrate great potential in controllable generation tasks.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Multi-objective Diffusion Model 3D Molecule
Scores: [ 8 6 6 6 ]
Searching for novel and diverse molecular candidates is a critical undertaking in drug and material discovery. Existing approaches have successfully adapted the diffusion model, the most effective generative model in image generation, to create 1D SMILES strings, 2D chemical graphs, or 3D molecular conformers. However, these methods are not efficient and flexible enough to generate 3D molecules with multiple desired properties, as they require additional training for the models for each new property or even a new combination of existing properties. Moreover, some properties may potentially conflict, making it impossible to find a molecule that satisfies all of them simultaneously. To address these challenges, we present a training-free conditional 3D molecular generation algorithm based on off-the-shelf unconditional diffusion models and property prediction models. The key techniques include modeling the loss of property prediction models as energy functions, considering the property relation between multiple conditions as a probabilistic graph, and developing a stable posterior estimation for computing the conditional score function. We conducted experiments on both single-objective and multi-objective 3D molecule generation, focusing on quantum properties, and compared our approach with the trained or fine-tuned diffusion models. Our proposed model achieves superior performance in generating molecules that meet the conditions, without any additional training cost.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: physics-informed machine learning operator preconditioning deep learning neural network training
Scores: [ 5 6 5 6 8 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: AI4PDE; Neural Operator; Data Generation; Krylov Subspace
Scores: [ 8 6 6 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: generalization molecular docking protein-ligand binding diffusion models benchmark bootstrapping self-training
Scores: [ 6 5 5 8 6 ]
Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, it is critical that docking methods generalize well across the proteome. However, existing benchmarks fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that machine learning-based docking models have very weak generalization abilities even when combined with various data augmentation strategies. Instead, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between a diffusion and a confidence model. Unlike previous self-training methods from other domains, we directly exploit the multi-resolution generation process of diffusion models using rollouts and confidence scores to reduce the generalization gap. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Structural Inference AI4Science
Scores: [ 10 6 6 ]
This paper introduces a novel approach to structural inference, combining a variational dynamics encoder with partial correlation coefficients. In contrast to prior methods, our approach leverages variational inference to encode node dynamics within latent variables, and structural reconstruction relies on the calculation of partial correlation coefficients derived from these latent variables.This unique design endows our method with scalability and extends its applicability to both one-dimensional and multi-dimensional feature spaces.Furthermore, by reorganizing latent variables according to temporal steps, our approach can effectively reconstruct directed graph structures. We validate our method through extensive experimentation on twenty datasets from a benchmark dataset and biological networks. Our results showcase the superior scalability, accuracy, and versatility of our proposed approach compared to existing methods.Moreover, experiments conducted on noisy data affirm the robustness of our method.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: variational quantum algorithm quantum machine learning symmetry-preserving circuit
Scores: [ 5 6 8 8 ]
With the arrival of the Noisy Intermediate-Scale Quantum (NISQ) era, Variational Quantum Algorithms (VQAs) have emerged as popular approaches to obtain possible quantum advantage in the relatively near future. In particular, how to effectively incorporate the common symmetries in physical systems as hard constraints in VQAs remains a critical and open question. In this paper, we revisit the Hamming Weight (HW) preserving ansatz and establish the links from ansatz to various symmetries and constraints, which both enlarges the usage of HW preserving ansatz and provides a coherent solution for constrained VQAs. Meanwhile, we utilize the quantum optimal control theory and quantum overparameterization theory to analyze the capability and expressivity of HW preserving ansatz and verify these theoretical results on unitary approximation problem. We further conduct detailed numerical experiments on two well-studied symmetry-preserving problems, namely ground state energy estimation and feature selection in machine learning. The superior performance demonstrates the efficiency and supremacy of the proposed HW preserving ansatz on constrained VQAs.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Weakly Supervised Object Detection Limited Annotation Time Bounding Box Regression Electron Microscopy
Scores: [ 5 6 5 6 ]
Current state-of-the-art methods for object detection rely on annotated bounding boxes of large data sets for training. However, obtaining such annotations is expensive and can require up to hundreds of hours of manual labor. This poses a challenge, especially since such annotations can only be provided by experts, as they require knowledge about the scientific domain. To tackle this challenge, we propose a domain-specific weakly supervised object detection algorithm that only relies on image-level annotations, which are significantly easier to acquire. Our method distills the knowledge of a pre-trained model, on the task of predicting the presence or absence of a virus in an image, to obtain a set of pseudo-labels that can be used to later train a state-of-the-art object detection model. To do so, we use an optimization approach with a shrinking receptive field to extract virus particles directly without specific network architectures. Through a set of extensive studies, we show how the proposed pseudo-labels are easier to obtain, and, more importantly, are able to outperform other existing weak labeling methods, and even ground truth labels, in cases where the time to obtain the annotation is limited.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: neural operators
Scores: [ 6 8 5 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: atomic property prediction pre-training 3D atomic pre-training graph neural networks multi-task learning molecules materials
Scores: [ 5 8 5 5 ]
The role of machine learning in computing atomic properties is expanding rapidly for a wide range of applications from healthcare to climate change. One important ingredient that has enabled this development is the creation of large and diverse molecular datasets. Given the extreme computational cost of these datasets, an important question moving forward is: Can we limit the need for exhaustive large dataset creation by pre-training a foundation style model over multiple chemical domains to generate transferable atomic representations for downstream fine-tuning tasks? Generalization across the entire molecular space is challenging due to the range and complexity of atomic interactions that exist. In this paper, we present Joint Multi-domain Pre-training (JMP), a robust supervised pre-training strategy that utilizes data from multiple chemical domains, $\sim$120 million examples in total. We demonstrate state-of-the-art results across many targets of the rMD17, QM9, MatBench, QMOF, SPICE, and MD22 datasets. Finally, we conduct ablations to study the impact of different components of JMP on downstream performance.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Sharpness-Aware Minimization Molecular Graph
Scores: [ 6 6 6 6 ]
Sharpness-aware minimization (SAM) has received increasing attention in computer vision since it can effectively eliminate the sharp local minima from the training trajectory and mitigate generalization degradation. However, SAM requires two sequential gradient computations during the optimization of each step: one to obtain the perturbation gradient and the other to obtain the updating gradient. Compared with the base optimizer (e.g., Adam), SAM doubles the time overhead due to the additional perturbation gradient. By dissecting the theory of SAM and observing the training gradient of the molecular graph transformer, we propose a new algorithm named GraphSAM, which reduces the training cost of SAM and improves the generalization performance of graph transformer models. There are two key factors that contribute to this result: (i) \textit{gradient approximation}: we use the updating gradient of the previous step to approximate the perturbation gradient at the intermediate steps smoothly (\textbf{increases efficiency}); (ii) \textit{loss landscape approximation}: we theoretically prove that the loss landscape of GraphSAM is limited to a small range centered on the expected loss of SAM (\textbf{guarantees generalization performance}). The extensive experiments on six datasets with different tasks demonstrate the superiority of GraphSAM, especially in optimizing the model update process.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: docking path prediction protein complex structure prompt learning
Scores: [ 8 5 5 3 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Physics-informed Neural Network Fluid Dynamics Conservation Law Partial Differential Equation Conditional Normalizing Flows Bird-Migration
Scores: [ 6 8 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Drug Design Molecule Generation Deep Learning Computational Biology
Scores: [ 8 8 8 8 ]
Advanced generative model (\textit{e.g.}, diffusion model) derived from simplified continuity assumptions of data distribution, though showing promising progress, has been difficult to apply directly to geometry generation applications due to the \textit{multi-modality} and \textit{noise-sensitive} nature of molecule geometry. This work introduces Geometric Bayesian Flow Networks (GeoBFN), which naturally fits molecule geometry by modeling diverse modalities in the differentiable parameter space of distributions. GeoBFN maintains the SE-(3) invariant density modeling property by incorporating equivariant inter-dependency modeling on parameters of distributions and unifying the probabilistic modeling of different modalities. Through optimized training and sampling techniques, we demonstrate that GeoBFN achieves state-of-the-art performance on multiple 3D molecule generation benchmarks in terms of generation quality (90.87% molecule stability in QM9 and 85.6% atom stability in GEOM-DRUG\footnote{The scores are reported at 1k sampling steps for fair comparison, and our scores could be further improved if sampling sufficiently longer steps.}). GeoBFN can also conduct sampling with any number of steps to reach an optimal trade-off between efficiency and quality (\textit{e.g.}, 20$\times$ speedup without sacrificing performance).
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Experimental Design Supervised Feature Selection Multi-Channel Imaging Hyperspectral Imaging Magnetic Resonance Imaging (MRI)
Scores: [ 6 6 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: molecule spherical harmonics equivariant symmetry generation
Scores: [ 6 6 6 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Partial differential equations Physics simulation Dynamics learning
Scores: [ 8 6 6 6 ]
We consider using deep neural networks to solve time-dependent partial differential equations (PDEs), where multi-scale processing is crucial for modeling complex, time-evolving dynamics. While the U-Net architecture with skip connections is commonly used by prior studies to enable multi-scale processing, our analysis shows that the need for features to evolve across layers results in temporally misaligned features in skip connections, which limits the model’s performance. To address this limitation, we propose SineNet, consisting of multiple sequentially connected U-shaped network blocks, referred to as waves. In SineNet, high-resolution features are evolved progressively through multiple stages, thereby reducing the amount of misalignment within each stage. We furthermore analyze the role of skip connections in enabling both parallel and sequential processing of multi-scale information. Our method is rigorously tested on multiple PDE datasets, including the Navier-Stokes equations and shallow water equations, showcasing the advantages of our proposed approach over conventional U-Nets with a comparable parameter budget. We further demonstrate that increasing the number of waves in SineNet while maintaining the same number of parameters leads to a monotonically improved performance. The results highlight the effectiveness of SineNet and the potential of our approach in advancing the state-of-the-art in neural PDE solver design.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Symbolic Mathematics Pre-training Transformers Symbolic Regression Deep Learning
Scores: [ 8 6 8 8 ]
In an era where symbolic mathematical equations are indispensable for modeling complex natural phenomena, scientific inquiry often involves collecting observations and translating them into mathematical expressions. Recently, deep learning has emerged as a powerful tool for extracting insights from data. However, existing models typically specialize in either numeric or symbolic domains, and are usually trained in a supervised manner tailored to specific tasks. This approach neglects the substantial benefits that could arise from a task-agnostic unified understanding between symbolic equations and their numeric counterparts. To bridge the gap, we introduce SNIP, a Symbolic-Numeric Integrated Pre-training, which employs joint contrastive learning between symbolic and numeric domains, enhancing their mutual similarities in the pre-trained embeddings. By performing latent space analysis, we observe that SNIP provides cross-domain insights into the representations, revealing that symbolic supervision enhances the embeddings of numeric data and vice versa. We evaluate SNIP across diverse tasks, including symbolic-to-numeric mathematical property prediction and numeric-to-symbolic equation discovery, commonly known as symbolic regression. Results show that SNIP effectively transfers to various tasks, consistently outperforming fully supervised baselines and competing strongly with established task-specific methods, especially in few-shot learning scenarios where available data is limited.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: CConv GNN SPH Lagrangian Fluids Fourier Methods
Scores: [ 6 6 5 6 ]
Learning physical simulations has been an essential and central aspect of many recent research efforts in machine learning, particularly for Navier-Stokes-based fluid mechanics. Classic numerical solvers have traditionally been computationally expensive and challenging to use in inverse problems, whereas Neural solvers aim to address both concerns through machine learning. We propose a general formulation for continuous convolutions using separable basis functions as a superset of existing methods and evaluate a large set of basis functions in the context of (a) a compressible 1D SPH simulation, (b) a weakly compressible 2D SPH simulation, and (c) an incompressible 2D SPH Simulation. We demonstrate that even and odd symmetries included in the basis functions are key aspects of stability and accuracy.Our broad evaluation shows that Fourier-based continuous convolutions outperform all other architectures regarding accuracy and generalization. Finally, using these Fourier-based networks, we show that prior inductive biases, such as window functions, are no longer necessary. An implementation of our approach, as well as complete datasets and solver implementations, is available at REDACTED FOR DOUBLE-BLIND REVIEW.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: dynamical systems bifurcations topological invariance Hopf bifurcation physics-informed machine learning augmentation RNA velocity repressilator pancreatic endocrinogenesis
Scores: [ 6 3 6 6 ]
Dynamical systems across the sciences, from electrical circuits to ecological networks, undergo qualitative and often catastrophic changes in behavior called \textit{bifurcations} when their underlying parameters cross a threshold. Existing methods predict oncoming catastrophes from time-series in individual systems but struggle both to categorize qualitative dynamical regimes across diverse systems and to generalize to real data. To address this challenge, we propose a data-driven, physically-informed deep-learning framework for classifying dynamical regimes and characterizing bifurcation boundaries based on the extraction of topologically invariant features. We focus on the paradigmatic case of the supercritical Hopf bifurcation, which is used to model periodic dynamics across a wide range of applications. Our convolutional attention method is trained with data augmentations that encourage the learning of topological invariants which can be used to detect bifurcation boundaries in unseen systems and to design models of biological systems like oscillatory gene regulatory networks. We further demonstrate our method's use in analyzing real data, recovering distinct proliferation and differentiation dynamics along pancreatic endocrinogenesis trajectory in gene expression space based on single-cell data. Our method provides valuable insights into the qualitative, long-term behavior of a wide range of dynamical systems as well as detect bifurcations or catastrophic transitions in large-scale physical and biological systems.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: energy minimization conformational optimization
Scores: [ 8 6 6 6 ]
Molecular conformation optimization is crucial to computer-aided drug discovery and materials design.Traditional energy minimization techniques rely on iterative optimization methods that use molecular forces calculated by a physical simulator (oracle) as anti-gradients.However, this is a computationally expensive approach that requires many interactions with a physical simulator.One way to accelerate this procedure is to replace the physical simulator with a neural network.Despite recent progress in neural networks for molecular conformation energy prediction, such models are prone to distribution shift, leading to inaccurate energy minimization.We find that the quality of energy minimization with neural networks can be improved by providing optimization trajectories as additional training data.Still, it takes around 5×105 additional conformations to match the physical simulator's optimization quality.In this work, we present the Gradual Optimization Learning Framework (GOLF) for energy minimization with neural networks that significantly reduces the required additional data.The framework consists of an efficient data-collecting scheme and an external optimizer.The external optimizer utilizes gradients from the energy prediction model to generate optimization trajectories, and the data-collecting scheme selects additional training data to be processed by the physical simulator. Our results demonstrate that the neural network trained with GOLF performs \textit{on par} with the oracle on a benchmark of diverse drug-like molecules using $50$x less additional data.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: molecular conformation prediction molecule modeling graph neural network graph transformer
Scores: [ 3 8 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Meta Learning Physics Informed Neural Network (PINN) Adaptive Sampler Partial Differential Equations
Scores: [ 6 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Latent Dynamical Model Invariant Decomposition Neural Network Spatio-Temporal Attention
Scores: [ 5 6 6 6 ]
We propose a new model class aimed at predicting dynamical trajectories from high-dimensional empirical data. This is done by combining variational autoencoders and spatio-temporal attention within a framework designed to enforce certain scientifically-motivated invariances.The models allow inference of systembehaviour at any continuous time and generalization well beyond the data distributions seen during training.Furthermore, the models do not require anexplicit neural ODE formulation, making them efficient and highly scalable in practice.We study behaviour through simple theoretical analyses and extensive experiments on synthetic and real-world datasets. The latter investigate the ability to predict the trajectories of very complicated systems based on finite data and show that the proposed approaches can outperform existing neural-dynamical models.We study also more general inductive bias in the context of transfer to data obtained under entirely novel system interventions. Overall, our results provide a new framework for efficiently learning complicated dynamics in a data-driven manner, with potential applications in a wide range of fields including physics, biology, and engineering.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Physics-Informed Neural Networks Transformer Self-Attention
Scores: [ 5 5 8 8 ]
Physics-Informed Neural Networks (PINNs) have emerged as a promising deep learning framework for approximating numerical solutions to partial differential equations (PDEs). However, conventional PINNs, relying on multilayer perceptrons (MLP), neglect the crucial temporal dependencies inherent in practical physics systems and thus fail to propagate the initial condition constraints globally and accurately capture the true solutions under various scenarios. In this paper, we introduce a novel Transformer-based framework, termed PINNsFormer, designed to address this limitation. PINNsFormer can accurately approximate PDE solutions by utilizing multi-head attention mechanisms to capture temporal dependencies. PINNsFormer transforms point-wise inputs into pseudo sequences and replaces point-wise PINNs loss with a sequential loss. Additionally, it incorporates a novel activation function, \texttt{Wavelet}, which anticipates Fourier decomposition through deep neural networks. Empirical results demonstrate that PINNsFormer achieves superior generalization ability and accuracy across various scenarios, including PINNs failure modes and high-dimensional PDEs. Moreover, PINNsFormer offers flexibility in integrating existing learning schemes for PINNs, further enhancing its performance.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: deformable image registration plug-and-play priors deep equilibrium models iterative algorithms
Scores: [ 8 8 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Long Range Interaction; Molecular Graph;
Scores: [ 6 8 6 ]
Graph Neural Networks (GNNs) have been widely adopted for drug discovery with molecular graphs. Nevertheless, current GNNs are mainly good at leveraging short-range interactions (SRI) but struggle to capture long-range interactions (LRI), both of which are crucial for determining molecular properties. To tackle this issue, we propose a method that implicitly projects all original atoms into a few \textit{Neural Atoms}, which abstracts the collective information of atomic groups within a molecule. Specifically, we explicitly exchange the information among neural atoms and project them back to the atoms’ representations as an enhancement. With this mechanism, neural atoms establish the communication channels among distant nodes, effectively reducing the interaction scope of arbitrary node pairs into a single hop. To provide an inspection of our method from a physical perspective, we reveal its connection with the traditional LRI calculation method, Ewald Summation. We conduct extensive experiments on three long-range graph benchmarks, covering both graph-level and link-level tasks on molecular graphs. We empirically justify that our method can be equipped with an arbitrary GNN and help to capture LRI.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: proteins conformational sampling diffusion models score-based models generative modeling equivariant network
Scores: [ 6 6 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: large language model multimodal medical imaging chest X-ray bidirectional instruction-tuning vision-question answering
Scores: [ 5 8 6 6 6 ]
Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual input/output. This direction of research is particularly relevant to medical imaging because accurate medical image analysis and generation consist of a combination of reasoning based on visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing (encoding or generating) networks and LLMs; but presumably, in order to achieve maximum reasoning potential of LLMs on visual information as well, visual and language features should be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) require not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work on the transformer and VQ-GAN combination for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM’s existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text-generation and text-based image-generation. We show that our LLM-CXR trained in this approach shows better image-text alignment in both CXR understanding and generation tasks while being smaller in size compared to previously developed models that perform a narrower range of tasks.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: crystal material property prediction geometric graph representation learning
Scores: [ 6 5 3 6 ]
Crystal structures are characterized by atomic bases within a primitive unit cell that repeats along a regular lattice throughout 3D space. The periodic and infinite nature of crystals poses unique challenges for geometric graph representation learning. Specifically, constructing graphs that effectively capture the complete geometric information of crystals and handle chiral crystals remains an unsolved and challenging problem. In this paper, we introduce a novel approach that utilizes the periodic patterns of unit cells to establish the lattice-based representation for each atom, enabling efficient and expressive graph representations of crystals. Furthermore, we propose ComFormer, a SE(3) transformer designed specifically for crystalline materials. ComFormer includes two variants; namely, iComFormer that employs invariant geometric descriptors of Euclidean distances and angles, and eComFormer that utilizes equivariant vector representations. Experimental results demonstrate the state-of-the-art predictive accuracy of ComFormer variants on various tasks across three widely-used crystal benchmarks.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Single-cell analysis Pretrained models AI for science
Scores: [ 8 6 6 6 ]
The current state-of-the-art single-cell pre-trained models are greatly inspired by the success of large language models. They trained transformers by treating genes as tokens and cells as sentences. However, three fundamental differences between single-cell data and natural language data are overlooked: (1) scRNA-seq data are presented as bag-of-genes instead of sequences of RNAs; (2) Cell-cell relations are more intricate and important than inter-sentence relations; and (3) The quantity of single-cell data is considerably inferior to text data, and they are very noisy. In light of these characteristics, we propose a new pre-trained model, CellPLM, which takes cells as tokens and tissues as sentences. In addition, we leverage spatially-resolved transcriptomic data in pre-training to facilitate learning cell-cell relationships and introduce a Gaussian prior distribution as an additional inductive bias to overcome data limitations. CellPLM is the first single-cell pre-trained transformer that encodes cell-cell relations and it consistently achieves state-of-the-art performance across distinct downstream tasks.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: protein structure structural biology drug discovery molecular docking
Scores: [ 8 5 6 ]
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. Moreover, the runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the Vina and Gnina scoring functions, and is more robust on computationally predicted structures.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Protein representation learning self-supervised learning implicit neural representation
Scores: [ 5 6 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: hamiltonian dynamics cross domain generalization learning physics meta learning
Scores: [ 5 6 6 6 ]
Recent advancements in deep learning for physics have focused on discovering shared representations of target systems by incorporating physics priors or inductive biases into neural networks. While effective, these methods are confined to the system domain in which the type of system remains consistent and thus cannot ensure the adaptation to new, or unseen physical systems governed by different laws. For example, a neural network trained on a mass-spring system cannot guarantee the accurate prediction of the behavior of a two-body system or any other system with different physical laws. In this work, we take a significant leap forward by targeting cross domain generalization within the field of Hamiltonian dynamics. We model our system with a graph neural network and employ a meta learning algorithm to enable the model to gain experience over a distribution of tasks and make it adapt to new physics. Our approach aims to learn a unified Hamiltonian representation that is generalizable across multiple system domains, thereby overcoming the limitations of system-specific models. We validate our approach on a dataset comprising various physical systems and evaluate its adaptability to a new type of dynamical system with previously unseen physics. Our results demonstrate that the meta trained model not only adapts effectively to new systems but also captures a generalized Hamiltonian representation that is consistent across different physical domains.Overall, through the use of meta learning, we offer a framework that achieves cross domain generalization, providing a step towards a unified model for understanding a wide array of dynamical systems via deep learning.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Protein design Protein structure encoder Inverse folding Protein diffusion
Scores: [ 8 6 8 ]
Advances like protein diffusion have marked revolutionary progress in de novo protein design, a central topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Only a few basic encoders, like IPA, have been proposed for this scenario, exposing the frame modeling as a bottleneck. In this work, we introduce the Vector Field Network (VFN), that enables network layers to perform learnable vector computations between coordinates of frame-anchored virtual atoms, thus achieving a higher capability for modeling frames. The vector computation operates in a manner similar to a linear layer, with each input channel receiving 3D virtual atom coordinates instead of scalar values. The multiple feature vectors output by the vector computation are then used to update the residue representations and virtual atom coordinates via attention aggregation. Remarkably, VFN also excels in modeling both frames and atoms, as the real atoms can be treated as the virtual atoms for modeling, positioning VFN as a potential universal encoder. In protein diffusion (frame modeling), VFN exhibits a impressive performance advantage over IPA, excelling in terms of both designability (67.04% vs. 53.58%) and diversity (66.54% vs. 51.98%). In inverse folding(frame and atom modeling), VFN outperforms the previous SoTA model, PiFold (54.7% vs. 51.66%), on sequence recovery rate; we also propose a method of equipping VFN with the ESM model, which significantly surpasses the previous ESM-based SoTA (62.67% vs. 55.65%), LM-Design, by a substantial margin.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: graph transformer physics-based simulation mesh collision flexible
Scores: [ 6 6 6 6 ]
Recently, many mesh-based graph neural network (GNN) models have been proposed for modeling complex high-dimensional physical systems. Remarkable achievements have been made in significantly reducing the solving time compared to traditional numerical solvers. These methods are typically designed to i) reduce the computational cost in solving physical dynamics and/or ii) propose techniques to enhance the solution accuracy in fluid and rigid body dynamics. However, it remains under-explored whether they are effective in addressing the challenges of flexible body dynamics, where instantaneous collisions occur within a very short timeframe.In this paper, we present Hierarchical Contact Mesh Transformer (HCMT), which uses hierarchical mesh structures and can learn long-range dependencies (occurred by collisions) among spatially distant positions of a body --- two close positions in a higher-level mesh corresponds to two distant positions in a lower-level mesh. HCMT enables long-range interactions, and the hierarchical mesh structure quickly propagates collision effects to faraway positions. To this end, it consists of a contact mesh Transformer and a hierarchical mesh Transformer (CMT and HMT, respectively). Lastly, we propose a unique flexible body dynamics dataset, which is commonly used for product designs. We also compare the performance of several baselines using well-known benchmark datasets. Our results show that HCMT provides significant performance improvements over existing methods.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: neural operator; multigrid; universal approximation; boundary condition
Scores: [ 6 6 6 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Molecular generative models reinforcement learning natural language processing drug discovery sample-efficiency explainability
Scores: [ 3 8 8 8 ]
Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. The combined algorithm generates more high reward molecules and faster, given a fixed oracle budget. Beam Enumeration is the first method to jointly address explainability and sample efficiency for molecular design.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Molecular Representation Batch Effect Contrastive Learning Information Maximization Drug Discovery
Scores: [ 6 8 6 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Error Correction Codes Foundation Model
Scores: [ 8 3 6 8 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Retrosynthesis Reactions Chemistry Drug Discovery Markov Bridge
Scores: [ 8 6 8 6 ]
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Physics simulation interpolation
Scores: [ 8 8 8 6 8 ]
Modern techniques for physical simulations rely on numerical schemes and mesh-refinement methods to address trade-offs between precision and complexity, but these handcrafted solutions are tedious and require high computational power. Data-driven methods based on large-scale machine learning promise high adaptivity by integrating long-range dependencies more directly and efficiently. In this work, we focus on computational fluid dynamics and address the shortcomings of a large part of the literature, which are based on fixed support for computations and predictions in the form of regular or irregular grids. We propose a novel setup to perform predictions in a continuous spatial and temporal domain while being trained on sparse observations. We formulate the task as a double observation problem and propose a solution with two interlinked dynamical systems defined on, respectively, the sparse positions and the continuous domain, which allows to forecast and interpolate a solution from the initial condition. Our practical implementation involves recurrent GNNs and a spatio-temporal attention observer capable of interpolating the solution at arbitrary locations. Our model not only generalizes to new initial conditions (as standard auto-regressive models do) but also performs evaluation at arbitrary space and time locations. We evaluate on three standard datasets in fluid dynamics and compare to strong baselines, which are outperformed in classical settings and the extended new task requiring continuous predictions.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: protein complex structure prediction docking path prediction policy network reinforcement learning
Scores: [ 6 6 6 6 ]
Structure prediction of large protein complexes (a.k.a., protein multimer mod-elling, PMM) can be achieved through the one-by-one assembly using provideddimer structures and predicted docking paths. However, existing PMM methodsstruggle with vast search spaces and generalization challenges: (1) The assemblyof a N -chain multimer can be depicted using graph structured data, with eachchain represented as a node and assembly actions as edges. Thus the assemblygraph can be arbitrary acyclic undirected connected graph, leading to the com-binatorial optimization space of N^(N −2) for the PMM problem. (2) Knowledgetransfer in the PMM task is non-trivial. The gradually limited data availability asthe chain number increases necessitates PMM models that can generalize acrossmultimers of various chains. To address these challenges, we propose GAPN, aGenerative Adversarial Policy Network powered by domain-specific rewards andadversarial loss through policy gradient for automatic PMM prediction. Specifi-cally, GAPN learns to efficiently search through the immense assembly space andoptimize the direct docking reward through policy gradient. Importantly, we de-sign a adversarial reward function to enhance the receptive field of our model. Inthis way, GAPN will simultaneously focus on a specific batch of multimers andthe global assembly rules learned from multimers with varying chain numbers.Empirically, we have achieved both significant accuracy (measured by RMSDand TM-Score) and efficiency improvements compared to leading complex mod-eling software. GAPN outperforms the state-of-the-art method (MoLPC) with upto 27% improvement in TM-Score, with a speed-up of 600×.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Brownian dynamics stochastic differential equation graph neural network scientific machine learning
Scores: [ 6 8 6 6 ]
Neural networks (NNs) that exploit strong inductive biases based on physical laws and symmetries have shown remarkable success in learning the dynamics of physical systems directly from their trajectory. However, these works focus only on the systems that follow deterministic dynamics, such as Newtonian or Hamiltonian. Here, we propose a framework, namely Brownian graph neural networks (BroGNet), combining stochastic differential equations (SDEs) and GNNs to learn Brownian dynamics directly from the trajectory. We modify the architecture of BroGNet to enforce linear momentum conservation of the system, which, in turn, provides superior performance on learning dynamics as revealed empirically. We demonstrate this approach on several systems, namely, linear spring, linear spring with binary particle types, and non-linear spring systems, all following Brownian dynamics at finite temperatures. We show that BroGNet significantly outperforms proposed baselines across all the benchmarked Brownian systems. In addition, we demonstrate zero-shot generalizability of BroGNet to simulate unseen system sizes that are two orders of magnitude larger and to different temperatures than those used during training. Finally, we show that BroGNet conserves the momentum of the system resulting in superior performance and data efficiency. Altogether, our study contributes to advancing the understanding of the intricate dynamics of Brownian motion and demonstrates the effectiveness of graph neural networks in modeling such complex systems.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Machine learning for PDE spectral methods neural network differentiation spectral loss
Scores: [ 8 8 8 3 ]
We present Neural Spectral Methods, a technique to solve parametric Partial Differential Equations (PDEs), grounded in classical spectral methods. In contrast to current machine learning approaches which enforce PDE constraints by minimizing the numerical quadrature of the residuals in the spatiotemporal domain, our method uses orthogonal bases to learn PDE solutions as mappings between spectral coefficients. By leveraging Parseval's Identity, we introduce a new training strategy through a \textit{spectral loss}. This enables more efficient differentiation through the neural network, and substantially reduces training complexity. During inference time, the computational cost of our method remains constant, regardless of the spatiotemporal resolution of the domain. Our experimental results demonstrate that our method significantly outperforms previous machine learning approaches in terms of speed and accuracy by several orders of magnitude. When compared to numerical solvers of the same accuracy, our method demonstrates a 10× increase in performance speed.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Materials design diffusion model metal-organic framework carbon capture generative model AI for Science
Scores: [ 8 8 8 8 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: path planning diffusion model
Scores: [ 6 6 8 6 6 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Embodied AI Reinforcement Learning Large Language Models Foundational Models
Scores: [ 6 6 6 5 ]
We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language conditioned, massively multi-task, embodied AI problems we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement.
Primary Area: applications to robotics, autonomy, planning
Keywords: Generative simulator simulating real-world interactions planning reinforcement learning vision language models video generation
Scores: [ 8 8 8 6 ]
Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different axes (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, UniSim can emulate how humans and agents interact with the world by simulating the visual outcome of both high-level instructions such as “open the drawer” and low-level controls such as “move by x,y” from otherwise static scenes and objects. There are numerous use cases for such a real-world simulator. As an example, we use UniSim to train both high-level vision-language planners and low-level reinforcement learning policies, each of which exhibit zero-shot real-world transfer after training purely in a learned real-world simulator. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience in UniSim, opening up even wider applications.
Primary Area: applications to robotics, autonomy, planning
Keywords: Grounding LLM Learning Mode Abstractions for Manipulation Learning from Demonstration Robotics Task and Motion Planning
Scores: [ 8 6 8 ]
Grounding the abstract knowledge captured by Large Language Models (LLMs) inphysical domains remains a pivotal yet unsolved problem. Whereas prior workshave largely focused on leveraging LLMs for generating abstract plans in symbolicspaces, this work uses LLMs to guide the learning for structures and constraintsin robot manipulation tasks. Specifically, we borrow from manipulation plan-ning literature the concept of mode families, defining specific types of motionconstraints among sets of objects, to serve as an intermediate layer that connectshigh-level language representations with low-level physical trajectories. By lo-cally perturbing a small set of successful human demonstrations, we augment thedataset with additional successful executions as well as counterfactuals that failthe task. Our explanation-based learning framework trains neural network-basedclassifiers to differentiate success task executions from failures and as a by-productlearns classifiers that ground low-level states into mode families without denselabeling. This further enables us to learn structured policies for the target task.Experimental validation in both 2D continuous-space and robotic manipulationenvironments demonstrates the robustness of our mode-basedimitation methods under external perturbations.
Primary Area: applications to robotics, autonomy, planning
Keywords: human-ai interaction state abstractions
Scores: [ 5 6 6 ]
We describe a framework for using natural language to design state abstractions for imitation learning. Generalizable policy learning in high-dimensional observation spaces is facilitated by well-designed state representations, which can surface important features of an environment and hide irrelevant ones. Today, these state representations must often be manually specified, or derived from other labor-intensive labeling procedures. Our method, LGA (\textit{language-guided abstraction}), uses a combination of natural language supervision and background knowledge from language models (LMs) to automatically build state representations tailored to unseen tasks. In LGA, a user first provides a (possibly incomplete) description of a target task in natural language; next, a pre-trained LM translates this task description into a state abstraction function that masks out irrelevant features; finally, an imitation policy is trained using a small number of demonstrations and LGA-generated abstract states. Experiments on simulated robotic tasks show that LGA yields state abstractions similar to human-designed ones, but in a fraction of the time, and that these abstractions improve generalization and robustness in the presence of spurious correlations and ambiguous specifications. We illustrate the utility of the learned abstractions on mobile manipulation tasks with a Spot robot.
Primary Area: applications to robotics, autonomy, planning
Keywords: Reinforcement Learning Robot Manipulation Movement Primitives Trajectory Correlation Modeling
Scores: [ 6 6 6 6 ]
In applying Reinforcement Learning (RL) to robot trajectory generation, two key challenges commonly emerge. First, the stochastic exploration strategies of step-based RL are unable to produce high-order smooth trajectories. Second, existing methods struggle with effectively modeling movement correlations among different time steps and degrees of freedom, which are crucial for complex tasks and safety measures. Episodic RL methods address these challenges by employing parameterized trajectory generators like Movement Primitives (MP), framing the problem as contextual optimization. While effective in generating smooth trajectories and capturing some movement correlations, these methods lack in utilizing temporal structure within trajectories, resulting in suboptimal sample efficiency. We introduce the Temporally-Correlated Episodic RL (TCE) method to address these shortcomings. TCE enhances exploration efficiency by sampling multi-second trajectories in a parameterized space, thereby assuring high-order trajectory smoothness and the capture of movement correlations. For policy updates, TCE breaks down the entire trajectory into smaller segments, evaluating each segment on its distinct advantages. This nuanced approach allows us to utilize temporal information obtained during trajectory execution.We validate the effectiveness of TCE through experiments on various robotic manipulation tasks, showing its advantages over step-based and episodic RL approaches.
Primary Area: applications to robotics, autonomy, planning
Keywords: 3D Lane Detection Multi-modal
Scores: [ 10 3 6 6 6 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: sim2real system identification exploration
Scores: [ 6 8 5 8 ]
Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, real-world RL can be unsafe and sample inefficient, making it impractical in many safety-critical domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model, and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer.
Primary Area: applications to robotics, autonomy, planning
Keywords: Task Planning Object Search Deep-RL
Scores: [ 6 6 8 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: learning from demonstration dynamical systems contraction theory
Scores: [ 8 6 5 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Long-horizon robot learning reinforcement learning LLMs
Scores: [ 6 8 6 ]
Large Language Models (LLMs) are highly capable of performing planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL is capable of solving 20+ challenging single and multi-stage robotics tasks on four benchmarks at success rates of over 80% from raw visual input, out-performing language-based, classical, and end-to-end approaches. Video results and code at https://planseqlearn.github.io/
Primary Area: applications to robotics, autonomy, planning
Keywords: motion prediction autonomous driving self-supervised learning
Scores: [ 6 8 6 8 ]
Motion prediction is crucial for autonomous vehicles to operate safely in complex traffic environments. Extracting effective spatiotemporal relationships among traffic elements is key to accurate forecasting. Inspired by the successful practice of pretrained large language models, this paper presents SEPT, a modeling framework that leverages self-supervised learning to develop powerful spatiotemporal understanding for complex traffic scenes. Specifically, our approach involves three masking-reconstruction modeling tasks on scene inputs including agents' trajectories and road network, pretraining the scene encoder to capture kinematics within trajectory, spatial structure of road network, and interactions among roads and agents. The pretrained encoder is then finetuned on the downstream forecasting task. Extensive experiments demonstrate that SEPT, without elaborate architectural design or manual feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.
Primary Area: applications to robotics, autonomy, planning
Keywords: Web Navigation Foundation Models Large Language Models Instruction Finetuning Decision Making Multimodal Document Understanding
Scores: [ 5 8 5 ]
The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data.In this work, we study data-driven offline training for web agents with vision-language foundation models.We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type.WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations.We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B.Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web.We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.
Primary Area: applications to robotics, autonomy, planning
Keywords: Tactile sensing Simulation Robotic manipulation
Scores: [ 8 6 6 6 ]
We introduce DIFFTACTILE, a physics-based and fully differentiable tactile simulation system designed to enhance robotic manipulation with dense and physically-accurate tactile feedback. In contrast to prior tactile simulators which primarily focus on manipulating rigid bodies and often rely on simplified approximations to model stress and deformations of materials in contact, DIFFTACTILE emphasizes physics-based contact modeling with high fidelity, supporting simulations of diverse contact modes and interactions with objects possessing a wide range of material properties. Our system incorporates several key components, including a Finite Element Method (FEM) -based soft body model for simulating the sensing elastomer, a multi-material simulator for modeling diverse object types (such as elastic, plastic, cables) under manipulation, a penalty-based contact model for handling contact dynamics. The differentiable nature of our system facilitates gradient-based optimization for both 1) refining physical properties in simulation using real-world data, hence narrowing the sim-to-real gap, and 2) efficient learning of tactile-assisted grasping and contact-rich manipulation skills. Additionally, we introduce a method to infer the optical response of our tactile sensor to contact using an efficient pixel-based neural module. We anticipate that DIFFTACTILE will serve as a useful platform for studying contact-rich manipulations, leveraging the benefits of dense tactile feedback and differentiable physics. The source codes of DIFFTACTILE will be publicly available.
Primary Area: applications to robotics, autonomy, planning
Keywords: Optimal Control Hybrid Actions Robotics Approximate Dynamic Programming Tensor Approximation
Scores: [ 8 8 8 6 ]
Control of dynamic systems involving hybrid actions is a challenging task in robotics. To address this, we present a novel algorithm called Generalized Policy Iteration using Tensor Train (TTPI) that belongs to the class of Approximate Dynamic Programming (ADP). We use a low-rank tensor approximation technique called Tensor Train (TT) to approximate the state-value and advantage function which enables us to efficiently handle hybrid systems. We demonstrate the superiority of our approach over previous baselines for some benchmark problems with hybrid action spaces. Additionally, the robustness and generalization of the policy for hybrid systems are showcased through a real-world robotics experiment involving a non-prehensile manipulation task which is considered to be a highly challenging control problem.
Primary Area: applications to robotics, autonomy, planning
Keywords: Navigation Embodied AI Perception
Scores: [ 8 6 6 8 ]
Most recent work in goal oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy requiring to solve an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottleneck in perception, extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and SOTA performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.
Primary Area: applications to robotics, autonomy, planning
Keywords: robot learning diffusion model
Scores: [ 6 6 8 5 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Neural radiance field Multi-cameras under-calibration color inconsistency
Scores: [ 6 6 5 6 6 ]
Multi-camera setups find widespread use across various applications, such as autonomous driving, as they greatly expand sensing capabilities. Despite the fast development of Neural radiance field (NeRF) techniques and their wide applications in both indoor and outdoor scenes, applying NeRF to multi-camera systems remains very challenging. This is primarily due to the inherent under-calibration issues in multi-camera setup, including inconsistent imaging effects stemming from separately calibrated image signal processing units in diverse cameras, and system errors arising from mechanical vibrations during driving that affect relative camera poses.In this paper, we present UC-NeRF, a novel method tailored for novel view synthesis in under-calibrated multi-view camera systems.Firstly, we propose a layer-based color correction to rectify the color inconsistency in different image regions. Second, we propose virtual warping to generate more viewpoint-diverse but color-consistent virtual views for color correction and 3D recovery. Finally, a spatiotemporally constrained pose refinement is designed for more robust and accurate pose calibration in multi-camera systems.Our method not only achieves state-of-the-art performance of novel view synthesis in multi-camera setups, but also effectively facilitates depth estimation in large-scale outdoor scenes with the synthesized novel views.
Primary Area: applications to robotics, autonomy, planning
Keywords: self-driving traffic modeling autonomous vehicles simulation tracking transformer
Scores: [ 6 6 6 ]
A longstanding challenge for self-driving development is the ability to simulate dynamic driving scenarios seeded from recorded driving logs. Given an initial scene observed during a test drive, we seek the ability to sample plausible scene-consistent future trajectories for all agents in the scene, even when the actions for a subset of agents are chosen by an external source, such as a black-box self-driving planner. In order to model the complicated spatial and temporal interaction across agents in driving scenarios, we propose to tokenize the motion of dynamic agents and use tools from language modeling to model the full sequence of multi-agent actions. Our traffic model explicitly captures intra-timestep dependence between agents, which we show is essential for simulation given only a single frame of historical scene context, as well as enabling improvements when provided longer historical context. We demonstrate competitive results sampling scenarios given initializations from the Waymo Open Dataset with full autonomy as well as partial autonomy, and show that the representations learned by our model can quickly be adapted to improve performance on nuScenes, a considerably smaller dataset. We additionally use density estimates from our model to quantify the saliency of context length and intra-timestep interaction for the traffic modeling task.
Primary Area: applications to robotics, autonomy, planning
Keywords: reinforcement learning self-supervised learning contrastive learning goal-conditioned RL offline RL robotics
Scores: [ 5 8 8 8 ]
Robotic systems that rely primarily on self-supervised learning have the potential to decrease the amount of human annotation and engineering effort required to learn control strategies. In the same way that prior robotic systems have leveraged self-supervised techniques from computer vision (CV) and natural language processing (NLP), our work builds on prior work showing that the reinforcement learning (RL) itself can be cast as a self-supervised problem: learning to reach any goal without human-specified rewards or labels. Despite the seeming appeal, little (if any) prior work has demonstrated how self-supervised RL methods can be practically deployed on robotic systems. By first studying a challenging simulated version of this task, we discover design decisions about architectures and hyperparameters that increase the success rate by 2×. These findings lay the groundwork for our main result: we demonstrate that a self-supervised RL algorithm based on contrastive learning can solve real-world, image-based robotic manipulation tasks, with tasks being specified by a single goal image provided after training.
Primary Area: applications to robotics, autonomy, planning
Keywords: Autonomous Agent Meta Programming Multi-Agent Society Group Intelligence
Scores: [ 8 8 3 ]
Recently, remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Previous LLM-based multi-agent systems can already solve simple dialogue tasks. More complex tasks, however, face challenges through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems.
Primary Area: applications to robotics, autonomy, planning
Keywords: Robot Learning Geometric Deep Learning Robotic Manipulation Equivariant deep learning
Scores: [ 6 6 8 5 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Human Motion Driven Control Legged Locomotion Unsupervised Reinforcement Learning
Scores: [ 6 6 8 ]
Human motion driven control (HMDC) is an effective approach for generating natural and compelling robot motions while preserving high-level semantics. However, establishing the correspondence between humans and robots with different body structures is not straightforward due to the mismatches in kinematics and dynamics properties, which causes intrinsic ambiguity to the problem. Many previous algorithms approach this motion retargeting problem with unsupervised learning, which requires the prerequisite skill sets. However, it will be extremely costly to learn all the skills without understanding the given human motions, particularly for high-dimensional robots. In this work, we introduce CrossLoco, a guided unsupervised reinforcement learning framework that simultaneously learns robot skills and their correspondence to human motions. Our key innovation is to introduce a cycle-consistency-based reward term designed to maximize the mutual information between human motions and robot states. We demonstrate that the proposed framework can generate compelling robot motions by translating diverse human motions, such as running, hopping, and dancing. We quantitatively compare our CrossLoco against the manually engineered and unsupervised baseline algorithms along with the ablated versions of our framework and demonstrate that our method translates human motions with better accuracy, diversity, and user preference. We also showcase its utility in other applications, such as synthesizing robot movements from
Primary Area: applications to robotics, autonomy, planning
Keywords: Logic rule human-robot interaction and collaboration
Scores: [ 8 5 6 6 ]
We present a principled framework aimed at utilizing logical rules to enhance human-robot perception and collaboration effectively. Logical rules not only offer interpretable explanations for predictions but also possess the ability to generalize to various tasks, making them invaluable for learning. In this paper, we introduce a set of logic rules as knowledge, automatically derived from observational data, to deduce human goals and guide human-like agents in successful collaboration. We treat logic rules as latent variables and simultaneously trains a rule generator and a reasoning predictor utilizing these rules. In the first stage, we evaluate the posterior distribution over the latent rule content by learning rule embeddings from observed triplets, jointly representing entities and relations in a unified embedding space. These learned embeddings enable us to calculate a confidence score for each rule, indicating its consistency with observed triplets. In the second stage, we jointly optimize the rule generator and model parameters by maximizing the current expected log-likelihood. Extensive experiments validate each component of our framework, and results on multiple benchmarks demonstrate that our model outperforms the majority of existing approaches.
Primary Area: applications to robotics, autonomy, planning
Keywords: 3D open set segmentation neural radiance fields VLM features
Scores: [ 6 5 8 5 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: human trajectory prediction robot navigation autonomous driving attention mechanism
Scores: [ 6 5 5 3 ]
Accurate human trajectory prediction is crucial for applications such as autonomous vehicles, robotics, and surveillance systems. Yet, existing models often fail to fully leverage the non-verbal social cues human subconsciously communicate when navigating the space. To address this, we introduce \textit{Social-Transmotion}, a generic model that exploits the power of transformers to handle diverse and numerous visual cues, capturing the multi-modal nature of human behavior. We translate the idea of a prompt from Natural Language Processing (NLP) to the task of human trajectory prediction, where a prompt can be a sequence of x-y coordinates on the ground, bounding boxes or body poses. This, in turn, augments trajectory data, leading to enhanced human trajectory prediction.Our model exhibits flexibility and adaptability by capturing spatiotemporal interactions between pedestrians based on the available visual cues, whether they are poses, bounding boxes, or a combination thereof.By the masking technique, we ensure our model's effectiveness even when certain visual cues are unavailable, although performance is further boosted with the presence of comprehensive visual data.We delve into the merits of using 2d versus 3d poses, and a limited set of poses. Additionally, we investigate the spatial and temporal attention map to identify which keypoints and frames of poses are vital for optimizing human trajectory prediction.Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
Primary Area: applications to robotics, autonomy, planning
Keywords: LLM Complex Reasoning Language Model Reasoning
Scores: [ 6 6 8 ]
Large Language Models (LLMs) have achieved remarkable success in reasoning tasks with the development of prompting methods. However, existing prompting approaches cannot reuse insights of solving similar problems and suffer from accumulated errors in multi-step reasoning, since they prompt LLMs to reason \textit{from scratch}.To address these issues, we propose \textbf{\textit{Thought Propagation} (TP)}, which explores the analogous problems and leverages their solutions to enhance the complex reasoning ability of LLMs.These analogous problems are related to the input one, with reusable solutions and problem-solving strategies.Thus, it is promising to propagate insights of solving previous analogous problems to inspire new problem-solving. To achieve this, TP first prompts LLMs to propose and solve a set of analogous problems that are related to the input one. Then, TP reuses the results of analogous problems to directly yield a new solution or derive a knowledge-intensive plan for execution to amend the initial solution obtained from scratch.TP is compatible with existing prompting approaches, allowing plug-and-play generalization and enhancement in a wide range of tasks without much labor in task-specific prompt engineering. Experiments across three challenging tasks demonstrate TP enjoys a substantial improvement over the baselines by an average of 12% absolute increase in finding the optimal solutions in Shortest-path Reasoning, 13% improvement of human preference in Creative Writing, and 15% enhancement in the task completion rate of LLM-Agent Planning.
Primary Area: applications to robotics, autonomy, planning
Keywords: discrete diffusion; world model; autonomous driving
Scores: [ 6 8 6 10 6 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Autonmous Driving Large Language Model Embodied AI
Scores: [ 6 6 8 5 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: LLM Code Generation Robotic Simulation Multi-task Policy Learning
Scores: [ 8 8 8 8 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Visual Robot Manipulation Video Generative Pre-Training Causal Transformer
Scores: [ 8 6 5 3 ]
Generative pre-trained models have demonstrated remarkable effectiveness in language and vision domains by learning useful representations. In this paper, we extend the scope of this effectiveness by showing that visual robot manipulation can significantly benefit from large-scale video generative pre-training. We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly finetuned on robot data after pre-trained on a large-scale video dataset. We perform extensive experiments on the challenging CALVIN benchmark and a real robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%. When trained on 10% data of the full dataset, GR-1 achieves a success rate of 77.8%, while the best baseline method achieves 66.8%. In the zero-shot generalization setting, GR-1 improves the success rate from 53.3% to 85.4%. In real robot experiments, GR-1 also outperforms the comparing baseline method. We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation. Code will be made available.
Primary Area: applications to robotics, autonomy, planning
Keywords: Image Matching Pose Estimation 3D Reconstruction
Scores: [ 6 8 10 8 ]
Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types (e.g., indoor vs. outdoor) and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel videos. These labels are filtered by robust fitting, and then enhanced by propagating them to distant frames. The final model is trained on propagated data with strong augmentations. Not relying on complex 3D reconstruction makes GIM much more efficient and less likely to fail than standard SfM-and-MVS based frameworks. We also propose ZEB, the first zero-shot evaluation benchmark for image matching. By mixing data from diverse domains, ZEB can thoroughly assess the cross-domain generalization performance of different methods. Experiments demonstrate the effectiveness and generality of GIM. Applying GIM consistently improves the zero-shot performance of 3 state-of-the-art image matching architectures as the number of downloaded videos increases (Fig. 1 (a)); with 50 hours of YouTube videos, the relative zero-shot performance improves by 8.4% − 18.1%. GIM also enables generalization to extreme cross-domain data such as Bird Eye View (BEV) images of projected 3D point clouds (Fig. 1 (c)). More importantly, our single zero-shot model consistently outperforms domain-specific baselines when evaluated on downstream tasks inherent to their respective domains. The code will be released upon acceptance.
Primary Area: applications to robotics, autonomy, planning
Keywords: Video-Based Policy Video Dense Correspondence
Scores: [ 6 8 10 6 ]
In this work, we present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments from few video demonstrations without using any action annotations. Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals. By synthesizing videos that “hallucinate” robot executing actions and in combination with dense correspondences between frames, our approach can infer the closed-formed action to execute to an environment without the need of any explicit action labels. This unique capability allows us to train the policy solely based on RGB videos and deploy learned policies to various robotic tasks. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day.
Primary Area: applications to robotics, autonomy, planning
Keywords: Laneline detection autonomous driving topology reasoning
Scores: [ 6 5 5 8 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: differentiable physics simulation thin-shell object manipulation
Scores: [ 8 8 8 8 8 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: pretraining robotics manipulation object representation representation learning
Scores: [ 8 5 5 10 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Language Agent AI Agent Reinforcement Learning
Scores: [ 6 8 3 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Fleet Learning Weight Merging Multi-task Policy Learning
Scores: [ 8 5 8 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: dexterous manipulaiton one-shot manipulation transfer distilled feature field implicit field
Scores: [ 6 6 6 6 ]
Humans excel at transferring manipulation skills across diverse object shapes, poses, and appearances due to their understanding of semantic correspondences between different instances. To endow robots with a similar high-level understanding, we develop a DFF for 3D scenes, leveraging large 2D vision models to distill semantic features from multiview images. While current research demonstrates advanced performance in reconstructing DFF from dense views, the development of learning a DFF from sparse views is relatively nascent, despite its prevalence in numerous manipulation tasks with fixed cameras. In this work, we introduce \method, a novel method for acquiring view-consistent 3D Distilled Feature Field from sparse RGBD observations, enabling one-shot learning of dexterous manipulations that are transferable to novel scenes. Specifically, we map the image features to the 3D point cloud, allowing for propagation across the 3D space to establish a dense feature field. At the core of SparseDFF is a lightweight feature refinement network, optimized with a contrastive loss between pairwise views after back-projecting the image features onto the 3D point cloud. Additionally, we implement a point-pruning mechanism to augment feature continuity within each local neighborhood. By establishing coherent feature fields on both source and target scenes, we devise an energy function that facilitates the minimization of feature discrepancies w.r.t. the end-effector parameters between the demonstration and the target manipulation. We evaluate our approach using a dexterous hand, mastering real-world manipulations on both rigid and deformable objects, and showcase robust generalization in the face of object and scene-context variations.
Primary Area: applications to robotics, autonomy, planning
Keywords: Embodied Agents Large Language Models Task Planning
Scores: [ 5 8 5 3 ]
This paper studies close-loop task planning, which refers to the process of generating a sequence of skills (a plan) to accomplish a specific goal while adapting the plan based on real-time observations.Recently, prompting Large Language Models (LLMs) to generate actions iteratively has become a prevalent paradigm due to its superior performance and user-friendliness.However, this paradigm is plagued by two inefficiencies: high token consumption and redundant error correction, both of which hinder its scalability for large-scale testing and applications.To address these issues, we propose Tree-Planner, which reframes task planning with LLMs into three distinct phases: plan sampling, action tree construction, and grounded deciding.Tree-Planner starts by using an LLM to sample a set of potential plans before execution, followed by the aggregation of them to form an action tree.Finally, the LLM performs a top-down decision-making process on the tree, taking into account real-time environmental information.Experiments show that Tree-Planner achieves state-of-the-art performance while maintaining high efficiency.By decomposing LLM queries into a single plan-sampling call and multiple grounded-deciding calls,a considerable partof the prompt are less likely to be repeatedly consumed. As a result, token consumption is reduced by 92.2% compared to the previously best-performing model.Additionally, by enabling backtracking on the action tree as needed, the correction process becomes more flexible, leading to a 40.5% decrease in error corrections.
Primary Area: applications to robotics, autonomy, planning
Keywords: Scene Flow Distillation Scaling
Scores: [ 5 5 8 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: robotics robot learning robot manipulation task representation behavior cloning multitask imitation learning goal conditioning
Scores: [ 5 8 8 ]
Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform new tasks that would otherwise be challenging to perform. We find that trajectory sketches strike a balance between being detailed enough to express low-level motion-centric guidance while being coarse enough to allow the learned policy to interpret the trajectory sketch in the context of situational visual observations. In addition, we show how trajectory sketches can provide a useful interface to communicate with robotic policies -- they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image-generating or waypoint-generating methods. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies, when provided the same training data.
Primary Area: applications to robotics, autonomy, planning
Keywords: Large Language Models Embodied Intelligence Multi-Agent Cooperation Human-AI Interaction Communication
Scores: [ 6 6 6 8 ]
In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution. Thus building a Cooperative Embodied Language Agent CoELA, who can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current Open LMs like LLAMA-2 still underperform, we fine-tune a CoLLAMA with data collected with our agents and show how they can achieve promising performance. We also conducted a user study for human-agent interaction and discovered that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation. Videos can be found on the project website https://llm-co.github.io/CoELA/ .
Primary Area: applications to robotics, autonomy, planning
Keywords: Robot learning Preference learning Visual reward learning Representation alignment
Scores: [ 6 3 6 6 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Koopman Autoencoders Dynamical Systems Sequence Modeling Inference-time Methods Planning Unsupervised Learning Representation Learning Robotics
Scores: [ 6 8 6 6 8 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Planning Hierarchical Planning Language Models Video Models Long-Horizon Planning
Scores: [ 6 8 8 6 ]
We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains -- from multi-object rearrangement, to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
Primary Area: applications to robotics, autonomy, planning
Keywords: interactive segmentation 3D instance segmentation point cloud
Scores: [ 6 5 8 3 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Web Navigation Web Automation Large Language Models Language Model Agents Tool Use Program Synthesis
Scores: [ 5 8 8 8 ]
Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation.However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML.We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions.WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those.We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization.We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.
Primary Area: applications to robotics, autonomy, planning
Keywords: Reinforcement Learning Quadrupedal Locomotion Internal Model
Scores: [ 5 8 6 6 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Navigation Embodied AI Perception
Scores: [ 6 8 6 ]
Agents navigating in 3D environments require some form of memory, which should hold a compact and actionable representation of the history of observations useful for decision taking and planning. In most end-to-end learning approaches the representation is latent and usually does not have a clearly defined interpretation, whereas classical robotics addresses this with scene reconstruction resulting in some form of map, usually estimated with geometry and sensor models and/or learning. In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task and without explicitly optimizing reconstruction. The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub episodes branching out from a waypoint and, most importantly, without any direct visual observation. We argue and show that the blindness property is important and forces the (trained) latent representation to be the only means for planning. With probing experiments we show that the learned representation optimizes navigability and not reconstruction. On downstream tasks we show that it is robust to changes in distribution, in particular the sim2real gap, which we evaluate with a real physical robot in a real office building, significantly improving performance.
Primary Area: applications to robotics, autonomy, planning
Keywords: Embodied AI Continual Learning
Scores: [ 6 6 6 6 ]
In learning an embodied agent executing daily tasks via language directives, the literature largely assumes that the agent learns all training data at the beginning. We argue that such a learning scenario is less realistic, since a robotic agent is supposed to learn the world continuously as it explores and perceives it. To take a step towards a more realistic embodied agent learning scenario, we propose two continual learning setups for embodied agents; learning new behaviors (Behavior Incremental Learning, Behavior-IL) and new environments (Environment Incremental Learning, Environment-IL) For the tasks, previous ‘data prior’ based continual learning methods maintain logits for the past tasks. However, the stored information is often insufficiently learned information and requires task boundary information, which might not always be available. Here, we propose to update them based on confidence scores without task boundary information (i.e., task-free) in a moving average fashion, named Confidence-Aware Moving Average (CAMA). In the proposed challenging Behavior-IL and Environment-IL setups, our simple CAMA outperforms prior arts in our empirical validations by noticeable margins.
Primary Area: applications to robotics, autonomy, planning
Keywords: Tracking defence spatial-temporal implicit representation languange-image model
Scores: [ 5 6 6 ]
Visual object tracking plays a critical role in visual-based autonomous systems, as it aims to estimate the position and size of the object of interest within a live video. Despite significant progress made in this field, state-of-the-art (SOTA) trackers often fail when faced with adversarial perturbations in the incoming frames. This can lead to significant robustness and security issues when these trackers are deployed in the real world. To achieve high accuracy on both clean and adversarial data, we propose building a spatial-temporal continuous representation using the semantic text guidance of the object of interest. This novel continuous representation enables us to reconstruct incoming frames to maintain semantic and appearance consistency with the object of interest and its clean counterparts. As a result, our proposed method successfully defends against different SOTA adversarial tracking attacks while maintaining high accuracy on clean data. In particular, our method significantly increases tracking accuracy under adversarial attacks with around 90% relative improvement on UAV123, which is even higher than the accuracy on clean data.
Primary Area: applications to robotics, autonomy, planning
Keywords: Planning and Learning Learning Abstractions Compositional Generalization Robotic Manipulation
Scores: [ 6 8 5 3 ]
This paper presents a framework for learning state and action abstractions in sequential decision-making domains. Our framework, planning abstraction from language (PARL), utilizes language-annotated demonstrations to automatically discover a symbolic and abstract action space and induce a latent state abstraction based on it. PARL consists of three stages: 1) recovering object-level and action concepts, 2) learning state abstractions, abstract action feasibility, and transition models, and 3) applying low-level policies for abstract actions. During inference, given the task description, PARL first makes abstract action plans using the latent transition and feasibility functions, then refines the high-level plan using low-level policies. PARL generalizes across scenarios involving novel object instances and environments, unseen concept compositions, and tasks that require longer planning horizons than settings it is trained on.
Primary Area: applications to robotics, autonomy, planning
Keywords: Combinatorial Optimization Neural Architecture Multi-agent Path Planning
Scores: [ 6 3 6 6 ]
Multi-agent path finding (MAPF) is the combinatorial problem of planning optimal collision-avoiding paths for multiple agents, with application to robotics, logistics, and transportation. Though many recent learning-based works have focused on large-scale combinatorial problems by guiding their decomposition into sequences of smaller subproblems, the combined spatiotemporal and time-restricted nature of MAPF poses a particular challenge for learning-based guidance of iterative approaches like large neighborhood search (LNS), which is already a state-of-the-art approach for MAPF even without learning. We address this challenge of neural-guided LNS for MAPF by designing an architecture which interleaves convolution and attention to efficiently represent MAPF subproblems, enabling practical guidance of LNS in benchmark settings. We demonstrate the speedup of our method over existing state-of-the-art LNS-based methods for MAPF as well as the robustness of our method to unseen settings. Our proposed method expands the horizon of effective deep learning-guided LNS methods into multi-path planning problems, and our proposed representation may be more broadly applicable for representing path-wise interactions.
Primary Area: applications to robotics, autonomy, planning
Keywords: Human-Scene Interaction Chain-of-Contacts Unified LLM
Scores: [ 10 5 8 6 ]
Primary Area: applications to robotics, autonomy, planning
Keywords: Autonomous Driving 3D Domain Transfer Zero-shot 3D Detection
Scores: [ 6 5 6 6 ]
Domain shifts such as sensor type changes and geographical situation variations are prevalent in Autonomous Driving (AD), which poses a challenge since AD model relying on the previous-domain knowledge can be hardly directly deployed to a new domain without additional costs. In this paper, we provide a new perspective and approach of alleviating the domain shifts, by proposing a Reconstruction-Simulation-Perception (ReSimAD) scheme. Specifically, the implicit reconstruction process is based on the knowledge from the previous old domain, aiming to convert the domain-related knowledge into domain-invariant representations, e.g., 3D scene-level meshes. Besides, the point clouds simulation process of multiple new domains is conditioned on the above reconstructed 3D meshes, where the target-domain-like simulation samples can be obtained, thus reducing the cost of collecting and annotating new-domain data for the subsequent perception process. For experiments, we consider different cross-domain situations such as Waymo-to-KITTI, Waymo-to-nuScenes, Waymo-to-ONCE, etc, to verify the zero-shot target-domain perception using ReSimAD. Results demonstrate that our method is beneficial to boost the domain generalization ability, even promising for 3D pre-training. Code and simulated points are available for result reproduction.
Primary Area: applications to robotics, autonomy, planning
Keywords: Quantization 3D Object Detection Autonomous Driving
Scores: [ 6 6 6 6 ]
Due to highly constrained computing power and memory, deploying 3D lidar-based detectors on edge devices equipped in autonomous vehicles and robots poses a crucial challenge. Being a convenient and straightforward model compression approach, Post-Training Quantization (PTQ) has been widely adopted in 2D vision tasks. However, applying it directly to 3D lidar-based tasks inevitably leads to performance degradation. As a remedy, we propose an effective PTQ method called LiDAR-PTQ, which is particularly curated for 3D lidar detection (both SPConv-based and SPConv-free). Our LiDAR-PTQ features three main components, (1) a sparsity-based calibration method to determine the initialization of quantization parameters, (2) an adaptive rounding-to-nearest operation to minimize the layerwise reconstruction error, (3) a Task-guided Global Positive Loss (TGPL) to reduce the disparity between the final predictions before and after quantization. Extensive experiments demonstrate that our LiDAR-PTQ can achieve state-of-the-art quantization performance when applied to CenterPoint (both Pillar-based and Voxel-based). To our knowledge, for the very first time in lidar-based 3D detection tasks, the PTQ INT8 model's accuracy is almost the same as the FP32 model while enjoying 3X inference speedup. Moreover, our LiDAR-PTQ is cost-effective being 6X faster than the quantization-aware training method. The code will be released.
Primary Area: causal reasoning
Keywords: Heterogeneous Treatment Effect Estimation Conditional Average Treatment Effect Causal Inference Model Selection
Scores: [ 5 8 8 8 ]
Primary Area: causal reasoning
Keywords: counterfactual density estimation kernel Stein discrepancy causal inference kernel methods
Scores: [ 6 8 5 6 ]
Primary Area: causal reasoning
Keywords: Causal Inference Variational AutoEncoder Front Door Criterion
Scores: [ 6 6 6 5 ]
Primary Area: causal reasoning
Keywords: Causal inference Design of experiments Interference Random graph Spillover effects Treatment effects Potential outcomes
Scores: [ 5 6 5 6 ]
Interference is ubiquitous when conducting causal experiments over social networks. Except for certain network structures, causal inference on the network in the presence of interference is difficult due to the entanglement between the treatment assignments and the interference levels. In this article, we conduct causal inference under interference on an observed, sparse but connected network, and we propose a novel design of experiments based on an independent set. Compared to conventional designs, the independent-set design focuses on an independent subset of data and controls their interference exposures through the assignments to the rest (auxiliary set). The independent-set design enhances the performance of causal estimators by trading sample quantity for sample quality. We show the capacity of our approach for various causal inference tasks, justify its superiority over conventional methods, and illustrate the empirical performance through simulations.
Primary Area: causal reasoning
Keywords: counterfactual domain causal representation learning
Scores: [ 6 5 6 6 ]
Answering counterfactual queries has many important applications such as knowledge discovery and explainability, but is challenging when causal variables are unobserved and we only see a projection onto an observation space, for instance, image pixels. One approach is to recover the latent Structural Causal Model (SCM), but this typically needs unrealistic assumptions, such as linearity of the causal mechanisms. Another approach is to use naïve ML approximations, such as generative models, to generate counterfactual samples; however, these lack guarantees of accuracy. In this work, we strive to strike a balance between practicality and theoretical guarantees by focusing on a specific type of causal query called domain counterfactuals, which hypothesizes what a sample would have looked like if it had been generated in a different domain (or environment). Concretely, by only assuming invertibility, sparse domain interventions and access to observational data from different domains, we aim to improve domain counterfactual estimation both theoretically and practically with less restrictive assumptions. We define domain counterfactually equivalent models and prove necessary and sufficient properties for equivalent models that provide a tight characterization of the domain counterfactual equivalence classes. Building upon this result, we prove that every equivalence class contains a model where all intervened variables are at the end when topologically sorted by the causal DAG, i.e., all non-intervened variables have non-intervened ancestors. This surprising result suggests that a model design that only allows intervention in the last k latent variables may improve model estimation for counterfactuals. We then test this model design on extensive simulated and image-based experiments which show the sparse canonical model indeed improves counterfactual estimation over baseline non-sparse models.
Primary Area: causal reasoning
Keywords: treatment effect estimation neural differential equation variational Bayes medicine
Scores: [ 6 8 6 6 ]
Treatment effect estimation in continuous time is crucial for personalized medicine. However, existing methods for this task are limited to point estimates of the potential outcomes, whereas uncertainty estimates have been ignored. Needless to say, uncertainty quantification is crucial for reliable decision-making in medical applications. To fill this gap, we propose a novel Bayesian neural controlled differential equation (BNCDE) for treatment effect estimation in continuous time. In our BNCDE, the time dimension is modeled through a coupled system of neural controlled differential equations and neural stochastic differential equations, where the neural stochastic differential equations allow for tractable variational Bayesian inference. Thereby, for an assigned sequence of treatments, our BNCDE provides meaningful posterior predictive distributions of the potential outcomes. To the best of our knowledge, ours is the first tailored neural method to provide uncertainty estimates of treatment effects in continuous time. As such, our method is of direct practical value for promoting reliable decision-making in medicine.
Primary Area: causal reasoning
Keywords: Causal Inference Variational AutoEncoder Instrumental Variable
Scores: [ 8 6 8 5 ]
Primary Area: causal reasoning
Keywords: Treatment Effects over Time
Scores: [ 5 8 5 8 8 ]
Inferring unbiased treatment effects has received widespread attention in the machine learning community. In recent years, our community has proposed numerous solutions in standard settings, high-dimensional treatment settings, and even longitudinal settings. While very diverse, the solution has almost always relied on neural networks for inference and simultaneous correction of assignment bias. New approaches typically build on top of previous approaches by proposing new (or refined) architectures and learning algorithms. However, the end result—a neural-network-based inference machine—remains unchallenged. In this paper, we introduce a different type of solution in the longitudinal setting: a closed-form and human-readable ordinary differential equation (ODE). While we still rely on continuous optimization to learn an ODE, the resulting inference machine is no longer a neural network. Doing so yields several advantages such as interpretability, irregular sampling, and a different set of identification assumptions. Above all, we consider the introduction of a completely new type of solution to be our most important contribution as it may spark entirely new innovations in treatment effects in general. We facilitate this by formulating our contribution as a framework that can transform any ODE discovery method into a treatment effects method.
Primary Area: causal reasoning
Keywords: causality generalisation causal discovery domain adaptation out-of-distribution generalization
Scores: [ 8 6 10 8 ]
It has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise under distributional shifts, or if other inductive biases are sufficient. We answer this question, showing that any agent capable of satisfying a regret bound under a large set of distributional shifts must have learned an approximate causal model of the data generating process, which converges to the true causal model for optimal agents. We discuss the implications of this result for several research areas including transfer learning and causal inference.
Primary Area: causal reasoning
Keywords: Causal Reasoning Causal Discovery Structural Causal Models Large Language Models
Scores: [ 6 3 8 8 ]
Primary Area: causal reasoning
Keywords: Causal inference covariate balance nonparametric estimation observational studies weighting method
Scores: [ 8 6 5 ]
Primary Area: causal reasoning
Keywords: causal discovery latent variable model structure learning
Scores: [ 8 6 6 6 ]
Primary Area: causal reasoning
Keywords: Causal Discovery Structure Learning Federated Learning Heterogeneous Data
Scores: [ 5 5 8 5 8 ]
Primary Area: causal reasoning
Keywords: Dynamical Causality Reinforcement Learning
Scores: [ 5 6 3 8 ]
Primary Area: causal reasoning
Keywords: Causal machine learning treatment effect estimation sensitivity analysis unobserved confounding uncertainty estimation
Scores: [ 6 6 8 6 ]
Primary Area: causal reasoning
Keywords: Causal Inference Curriculum Learning Reinforcement Learning
Scores: [ 8 6 3 6 ]
Primary Area: causal reasoning
Keywords: Gene regulatory network Single-cell RNA-sequencing Dropout Zero-inflated data Causal model Causal discovery Nonparametric
Scores: [ 8 8 6 ]
Primary Area: causal reasoning
Keywords: causal inference representation learning individualized treatment effect estimation
Scores: [ 8 8 5 8 ]
Primary Area: causal reasoning
Keywords: causality bayesian optimization
Scores: [ 6 6 6 6 ]
Primary Area: causal reasoning
Keywords: Structure Learning Causal Discovery Generative Model Variational Inference Differential Equation
Scores: [ 6 6 8 8 ]
Discovering the underlying relationships among variables from temporal observations has been a longstanding challenge in numerous scientific disciplines, including biology, finance, and climate science. The dynamics of such systems are often best described using continuous-time stochastic processes. Unfortunately, most existing structure learning approaches assume that the underlying process evolves in discrete-time and/or observations occur at regular time intervals. These mismatched assumptions can often lead to incorrect learned structures and models. In this work, we introduce a novel structure learning method, SCOTCH, which combines neural stochastic differential equations (SDE) with variational inference to infer a posterior distribution over possible structures. This continuous-time approach can naturally handle both learning from and predicting observations at arbitrary time points. Theoretically, we establish sufficient conditions for an SDE and SCOTCH to be structurally identifiable, and prove its consistency under infinite data limits. Empirically, we demonstrate that our approach leads to improved structure learning performance on both synthetic and real-world datasets compared to relevant baselines under regular and irregular sampling intervals.
Primary Area: causal reasoning
Keywords: causal effect continuous treatment kernel function doubly robust
Scores: [ 6 6 8 ]
Primary Area: causal reasoning
Keywords: treatment effect estimation continuous treatment measurement error
Scores: [ 8 8 6 ]
Estimating treatment effects has numerous real-world applications in various fields, such as epidemiology and political science. While much attention has been devoted to addressing the challenge using fully observational data, there has been comparatively limited exploration of this issue in cases when the treatment is not directly observed. In this paper, we tackle this problem by developing a general variational framework, which is flexible to integrate with advanced neural network-based approaches, to identify the average dose-response function (ADRF) with the continuously valued error-contaminated treatment. Our approach begins with the formulation of a probabilistic data generation model, treating the unobserved treatment as a latent variable. In this model, we leverage a learnable density estimation neural network to derive its prior distribution conditioned on covariates. This module also doubles as a generalized propensity score estimator, effectively mitigating selection bias arising from observed confounding variables. Subsequently, we calculate the posterior distribution of the treatment, taking into account the observed measurement and outcome. To mitigate the impact of treatment error, we introduce a re-parametrized treatment value, replacing the error-affected one, to make more accurate predictions regarding the outcome. To demonstrate the adaptability of our framework, we incorporate two state-of-the-art ADRF estimation methods and rigorously assess its efficacy through extensive simulations and experiments using semi-synthetic data.
Primary Area: causal reasoning
Keywords: counterfactual generation individual treatment effect synthetic data Gaussian mixture model
Scores: [ 6 6 3 6 ]
Primary Area: causal reasoning
Keywords: causality causal discovery latent variables
Scores: [ 5 6 8 5 ]
Most traditional causal discovery approaches typically assume the absence of latent variables, a simplification that often does not align with real-world situations. Recently, there has been a surge of causal discovery methods that explicitly consider latent variables. While causal discovery with latent variables aims to reveal causal relations between observed variables in the presence of latent variables, latent causal structure learning seeks to identify latent variables and infer their causal relations, typically entailing strong distributional and graphical assumptions. In this paper, we endeavor to recover the whole causal structure involving both latent and observed variables under relatively milder assumptions. Specifically, we introduce two sets of assumptions, one allows arbitrary distribution and requires only one pure child per latent variable, the other requires no pure child and imposes the non-Gaussianity requirement on only a small subset of variables, and both of them allow causal edges between observed variables. Under either of them, we prove that the whole causal structure of linear latent variable models is identifiable. Our proof is constructive, leading to both theoretically sound and computationally efficient algorithms, which first identify latent variables from observed data and then infer causal relations between any two variables.
Primary Area: causal reasoning
Keywords: Causal Discovery; Latent Variables
Scores: [ 8 6 8 8 ]
Primary Area: causal reasoning
Keywords: Causal Information Stochastic Neural Network Adaptive Stochastic Gradient MCMC Sparsity Missing Data
Scores: [ 8 6 6 ]
Primary Area: causal reasoning
Keywords: instrument variable experiment design indirect experiments adaptive design
Scores: [ 6 6 8 6 ]
Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for \textit{direct} experiments, in this paper we take the initial steps towards enhancing sample efficiency for \textit{indirect} experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments.
Primary Area: causal reasoning
Keywords: causality extrapolation exogenous variables causal representation learning identifiable representation learning control functions instrumental variables invariance
Scores: [ 8 8 8 8 ]
Primary Area: causal reasoning
Keywords: expertise model selection balancing representations treatment effect estimation
Scores: [ 8 8 5 6 ]
Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients,we can tell treatments prescribed more frequently are likely to be more effective. Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and “expertise” is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise—particularly the type of expertise the decision-makers of a domain are likely to have—can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection.
Primary Area: datasets and benchmarks
Keywords: Large language models Autonomous agents Reasoning Evaluation Benchmark
Scores: [ 6 8 6 8 3 ]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments.We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting.Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and OSS competitors.We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents.Training on code and high quality multi-turn alignment data could improve agent performance.Datasets, environments, and an integrated evaluation package for AgentBench are released.
Primary Area: datasets and benchmarks
Keywords: off-policy evaluation offline reinforcement learning offline policy selection risk-return tradeoff
Scores: [ 8 5 8 5 ]
Primary Area: datasets and benchmarks
Keywords: uncertainty segmentation validation
Scores: [ 8 8 6 8 ]
Primary Area: datasets and benchmarks
Keywords: dataset documentation data-centric AI computational social science
Scores: [ 5 8 5 8 ]
Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face - one of the largest platforms for sharing and collaborating on ML models and datasets - as a prominent case study. By analyzing all 7,433 dataset documentation on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity: While 86.0% of the top 100 downloaded dataset cards fill out all sections suggested by Hugging Face community, only 7.9% of dataset cards with no downloads complete all these sections. (2) A granular examination of each section within the dataset card reveals that the practitioners seem to prioritize Dataset Description and Dataset Structure sections, accounting for 36.2% and 33.6% of the total card length, respectively, for the most downloaded datasets. In contrast, the Considerations for Using the Data section receives the lowest proportion of content, accounting for just 2.1% of the text. (3) By analyzing the subsections within each section and utilizing topic modeling to identify key topics, we uncover what is discussed in each section, and underscore significant themes encompassing both technical and social impacts, as well as limitations within the Considerations for Using the Data section. (4) Our findings also highlight the need for improved accessibility and reproducibility of datasets in the Usage sections. (5) In addition, our human annotation evaluation emphasizes the pivotal role of comprehensive dataset content in shaping individuals' perceptions of a dataset card's overall quality. Overall, our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis and underlines the need for more thorough dataset documentation in machine learning research.
Primary Area: datasets and benchmarks
Keywords: Document Understanding Dataset Segmentation Detection OCR Captioning
Scores: [ 8 6 6 6 ]
We introduce ADoPD, a large-scale document page decomposition dataset for document understanding, encompassing document entity segmentation, text detection, tagging, and captioning. ADoPD stands out with its novel document taxonomy, meticulously crafted through a data-driven approach enriched by both large-scale pretrained models and human expertise. Our dataset achieves diversity by combining outlier detection with a human-in-the-loop approach. This significant contribution advances the field of document analysis, deepening our insights into document structures and substantially enhancing document processing and analysis techniques. The amalgamation of data-driven exploration, thorough annotation, and the human-in-the-loop methodology paves the way for innovative improvements in document analysis capabilities and the advancement of document processing applications. We conduct a comprehensive evaluation of ADoPD using various methods and demonstrate its effectiveness.
Primary Area: datasets and benchmarks
Keywords: federated learning distributed learning domain generalization out-of-distribution generalization benchmarking data paritioning.
Scores: [ 8 6 6 ]
While prior federated learning (FL) methods mainly consider client heterogeneity, we focus on the Federated Domain Generalization (DG) task, which introduces train-test heterogeneity in the FL context.Existing evaluations in this field are limited in terms of the scale of the clients and dataset diversity.Thus, we propose a Federated DG benchmark that aim to test the limits of current methods with high client heterogeneity, large numbers of clients, and diverse datasets. Towards this objective, we introduce a novel data partitioning method that allows us to distribute any domain dataset among few or many clients while controlling client heterogeneity. We then introduce and apply our methodology to evaluate 13 Federated DG methods, which include centralized DG methods adapted to the FL context, FL methods that handle client heterogeneity, and methods designed specifically for Federated DG on 7 datasets.Our results suggest that, despite some progress, significant performance gaps remain in Federated DG, especially when evaluating with a large number of clients, high client heterogeneity, or more realistic datasets. Furthermore, our extendable benchmark code will be publicly released to aid in benchmarking future Federated DG approaches.
Primary Area: datasets and benchmarks
Keywords: Large Language Models Natural Language Inference Causal Reasoning Inferring Causation from Correlation Benchmark Dataset
Scores: [ 6 6 6 ]
Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in the recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g. commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 400K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task. This shortcoming is somewhat mitigated when we try to re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize – they can only perform causal inference in in-distribution settings when variable names and textual expressions used in the queries are similar to those in the training set, but fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs, and would be helpful in guiding future research on improving LLMs’ pure reasoning ability and generalizability.
Primary Area: datasets and benchmarks
Keywords: Synthetic Data Object re-ID Benchmarks
Scores: [ 6 6 8 6 ]
For object re-identification (re-ID), learning from synthetic data has become a promising strategy to cheaply acquire large-scale annotated datasets and effective models, with few privacy concerns. Many interesting research problems arise from this strategy, e.g., how to reduce the domain gap between synthetic source and real-world target. To facilitate developing more new approaches in learning from synthetic data, we introduce the Alice benchmarks, large-scale datasets providing benchmarks as well as evaluation protocols to the research community. Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID. We collected and annotated two challenging real-world target datasets: AlicePerson and AliceVehicle, captured under various illuminations, image resolutions, etc. As an important feature of our real target, the clusterability of its training set is not manually guaranteed to make it closer to a real domain adaptation test scenario. Correspondingly, we reuse existing PersonX and VehicleX as synthetic source domains. The primary goal is to train models from synthetic data that can work effectively in the real world. In this paper, we detail the settings of Alice benchmarks, provide an analysis of existing commonly-used domain adaptation methods, and discuss some interesting future directions. An online server will be set up for the community to evaluate methods conveniently and fairly.
Primary Area: datasets and benchmarks
Keywords: Biological sequence analysis enhancer annotation gene finding gene annotation Language model genome modelling benchmark LLM embeddings representations DNA
Scores: [ 3 6 6 5 ]
The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a BENchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://anonymous.4open.science/r/BEND-8C42/README.md
Primary Area: datasets and benchmarks
Keywords: large language models instruction tuning evaluation benchmark instruction following
Scores: [ 8 6 8 ]
As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever-increasing list of models. This paper investigates the efficacy of these “LLM evaluators”, particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to theinstructions. We introduce a challenging meta-evaluation benchmark, LLMBAR, designed to test the ability of an LLM evaluator to discern instruction-following outputs. The authors curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that could mislead an LLM evaluator. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBAR and even the highest-scoring LLM evaluators have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBAR, we hope to offer more insight into the behavior of LLM evaluators and foster research in developing better instruction-following models.
Primary Area: datasets and benchmarks
Keywords: human feedback text to image generative AI image quality scoring
Scores: [ 5 8 8 6 ]
Social reward as a form of community recognition provides a strong source of motivation for users of online platforms to actively engage and contribute with content to accumulate peers approval. In the realm of text-conditioned image syn- thesis, the recent surge in progress has ushered in a collaborative era where users and AI systems coalesce to refine visual creations. This co-creative process in the landscape of online social networks empowers users to craft original visual art- works seeking for community validation. Nevertheless, assessing these models in the context of collective community preference introduces distinct challenges. Ex- isting evaluation methods predominantly center on limited size user studies guided by image quality and alignment with prompts. This work pioneers a paradigm shift, unveiling Social Reward - an innovative reward modeling framework that leverages implicit feedback from social network users engaged in creative editing of generated images. We embark on an extensive journey of dataset curation and refinement, drawing from an anonymous online visual creation and editing platform (referred to as Platform A in this paper), yielding the first million-user- scale dataset of implicit human preferences for user-generated visual art named PA Image-Social. Our analysis exposes the shortcomings of current metrics in modeling community creative preference of text-to-image models’ outputs, compelling us to introduce a novel predictive model explicitly tailored to address these limitations. Rigorous quantitative experiments and user study show that our Social Reward model aligns better with social popularity than existing metrics. Furthermore, we utilize Social Reward to fine-tune text-to-image models, yielding images that are more favored by not only Social Reward, but also other established metrics. These findings highlight the relevance and effectiveness of Social Reward in assessing community appreciation for AI-generated artworks, establishing a closer alignment with users’ creative goals: creating popular visual art.
Primary Area: datasets and benchmarks
Keywords: low-resource languages indigenous languages endangered languages long context field linguistics unseen tasks large language models machine translation benchmark
Scores: [ 6 8 8 ]
Large language models (LLMs) can perform impressive feats with in-context learning or lightweight finetuning. It is natural to wonder how well these models adapt to genuinely new tasks, but how does one find tasks that are unseen in internet-scale training sets? We turn to a field that is explicitly motivated and bottlenecked by a scarcity of web data: low-resource languages. In this paper, we introduce MTOB (Machine Translation from One Book), a benchmark for learning to translate between English and Kalamang—a language with less than 200 speakers and therefore virtually no presence on the web—using several hundred pages of field linguistics reference materials. This task framing is novel in that it asks a model to learn a language from a single human-readable book of grammar explanations, rather than a large mined corpus of in-domain data, more akin to L2 language learning than L1 language acquisition. We demonstrate that baselines using current LLMs are promising but fall short of human performance, achieving 44.7 chrF on Kalamang to English translation and 45.8 chrF on English to Kalamang translation, compared to 51.6 and 57.0 chrF by a human who learned Kalamang from the same reference materials. We hope that MTOB will help measure LLM capabilities along a new dimension, and that the methods developed to solve it could help expand access to language technology for underserved communities by leveraging qualitatively different kinds of data than traditional machine translation.
Primary Area: datasets and benchmarks
Keywords: VFL Attacks and Defenses Benchmark Platform
Scores: [ 8 3 8 ]
Primary Area: datasets and benchmarks
Keywords: Benchmark STEM Multimodal Vision-language models Language models
Scores: [ 6 6 6 ]
We introduce a new challenge to test the STEM skills of neural models. Unlike existing datasets, our dataset requires the understanding of multimodal vision-language information. Our dataset features one of the largest and most comprehensive datasets for the challenge. It includes 448 skills and 1,073,146 questions spanning all STEM (science, technology, engineering, math) subjects. Compared to existing datasets that often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and ChatGPT to our dataset. Results show that the recent model advances only help master a very limited number of lower grade-level skills (2.5% in the third grade) in our dataset. In fact, these models are still well below (averaging 54.7%) the performance of elementary students, not to mention near expert-level performance. To understand and increase the performance on our dataset, we teach the models on a training split of our dataset. Even though we observe improved performance, the model performance remains relatively low compared to average elementary students. To solve STEM problems, we will need novel algorithmic innovations from the community. The code and dataset are available at https://anonymous.4open.science/r/STEM-Dataset-ICLR-2024 and will be made publicly available.
Primary Area: datasets and benchmarks
Keywords: Large Langauge Models Agent Prompt Tunning Time series forcasting
Scores: [ 8 8 5 ]
We introduce SocioDojo, an open-ended lifelong learning environment for developing ready-to-deploy autonomous agents capable of performing human-like analysis and decision-making on societal topics such as economics, finance, politics, and culture. It consists of (1) information sources from news, social media, reports, etc., (2) a knowledge base built from books, journals, and encyclopedias, plus a toolbox of Internet and knowledge graph search interfaces, (3) 30K high-quality time series in finance, economy, society, and polls, which support a novel task called "hyperportfolio", that can reliably and scalably evaluate societal analysis and decision-making power of agents, inspired by portfolio optimization with time series as assets to "invest". We also propose a novel Analyst-Assistant-Actuator architecture for the hyperportfolio task, and a Hypothesis & Proof prompting for producing in-depth analyses on input news, articles, etc. to assist decision-making. We perform experiments and ablation studies to explore the factors that impact performance. The results show that our proposed method achieves improvements of 32.4% and 30.4% compared to the state-of-the-art method in the two experimental settings.
Primary Area: datasets and benchmarks
Keywords: newborn controlled rearing object recognition object segmentation reinforcement learning benchmark Turing test development
Scores: [ 6 8 8 3 ]
Primary Area: datasets and benchmarks
Keywords: semantic parsing text-to-code ambiguity NLP calibration
Scores: [ 8 6 6 8 ]
Despite the frequent challenges posed by ambiguity when representing meaning via natural language, it is often ignored or deliberately removed in tasks mapping language to formally-designed representations, which generally assume a one-to-one mapping between linguistic and formal representations. We attempt to address this shortcoming by introducing AmP, a framework, dataset, and challenge for translating ambiguous natural language to formal representations like logic and code. We define templates and generate data for five well-documented linguistic ambiguities.Using AmP, we investigate how several few-shot text-to-code systems handle ambiguity, introducing three new metrics.We find that large pre-trained models perform poorly at capturing the distribution of possible meanings without deliberate instruction.However, models are able to capture the distribution well when ambiguity is attested in their inputs. These results motivate a call for including ambiguity explicitly in datasets and promote considering the distribution of possible outputs when evaluating systems. We release our data and code.
Primary Area: datasets and benchmarks
Keywords: doamin generalization model accuracy prediction testbeds
Scores: [ 6 6 6 8 ]
Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. We also discuss other fields that would benefit from CIFAR-10-W.
Primary Area: datasets and benchmarks
Keywords: large language model code completion benchmark
Scores: [ 5 8 6 ]
Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems.
Primary Area: datasets and benchmarks
Keywords: ICU Intensive Care Unit EHR ML Time Series Patient Monitoring Clinical ML Benchmark Multi-Center MIMIC eICU HiRID AmsterdamUMCdb
Scores: [ 6 8 5 6 6 ]
Primary Area: datasets and benchmarks
Keywords: language-guided agents; web automation; benchmark
Scores: [ 5 6 8 ]
With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that \ours can be used to measure such progress.\footnote{Code, data, environment reproduction instructions, video demonstrations are available in the supplementary.}
Primary Area: datasets and benchmarks
Keywords: intent recognition multimodal dataset out-of-scope detection multi-turn conversations benchmark framework benchmark evaluation
Scores: [ 6 8 6 6 ]
Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. However, most existing multimodal intent benchmark datasets are limited in scale and suffer from difficulties in handling out-of-scope samples that arise in multi-turn conversational interactions. In this paper, we introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 high-quality dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes, across text, video, and audio modalities. In addition to more than 9,300 in-scope samples, it also includes over 5,700 out-of-scope samples appearing in multi-turn contexts, which naturally occur in real-world open scenarios, enhancing its practical applicability. Furthermore, we provide comprehensive information on the speakers in each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, as well as in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods incorporating nonverbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remains a substantial challenge. Notably, powerful large language models exhibit a significant performance gap compared to humans, highlighting the limitations of machine learning methods in the advanced cognitive intent understanding task. We believe that MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research in human-machine conversational interactions, and significantly facilitating related applications.
Primary Area: datasets and benchmarks
Keywords: Chip Design Machine Learning Dataset
Scores: [ 6 5 6 6 ]
Primary Area: datasets and benchmarks
Keywords: embodied AI decision making
Scores: [ 8 8 6 5 ]
Recent advances in high-fidelity virtual environments serve as one of the major driving forces for building intelligent embodied agents to perceive, reason and interact with the physical world. Typically, these environments remain unchanged unless agents interact with them. However, in real-world scenarios, agents might also face dynamically changing environments characterized by unexpected events and need to rapidly take action accordingly. To remedy this gap, we propose a new simulated embodied benchmark, called HAZARD, specifically designed to assess the decision-making abilities of embodied agents in dynamic situations. HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind, and specifically supports the utilization of large language models (LLMs) to assist common sense reasoning and decision-making. This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines, including reinforcement learning (RL), rule-based, and search-based methods in dynamically changing environments. As a first step toward addressing this challenge using large language models, we further develop an LLM-based agent and perform an in-depth analysis of its promise and challenge of solving these challenging tasks.
Primary Area: datasets and benchmarks
Keywords: Social Interaction Agent Social intelligence Large Language Models Evaluation Theory of Mind
Scores: [ 8 6 6 ]
Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.
Primary Area: datasets and benchmarks
Keywords: contamination memorization llm codeforces project euler datasets benchmarks training cutoff
Scores: [ 8 8 6 5 ]
Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time.Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmark in the age of LLMs which train on webscale data.
Primary Area: datasets and benchmarks
Keywords: vision-language benchmark spatio-temporal grounding video-language models benchmark evaluation zero-shot
Scores: [ 6 6 6 6 ]
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs’ grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
Primary Area: datasets and benchmarks
Keywords: constrained text generation large language models compositional benchmark
Scores: [ 6 8 6 10 ]
Text generation under constraints have seen increasing interests in natural language processing, especially with the rapidly improving capabilities of large language models. However, existing benchmarks for constrained generation usually focus on fixed constraint types (e.g. generate a sentence containing certain words) that have proved to be easy for state-of-the-art models like GPT-4. We present COLLIE, a grammar-based framework that allows the specification of rich, compositional constraints with diverse generation levels (word, sentence, paragraph, passage) and modeling challenges (e.g. language understanding, logical reasoning, counting, semantic planning). We also develop tools for automatic extraction of task instances given a constraint structure and a raw text corpus. Using COLLIE, we compile the COLLIE-v1 dataset with 1,132 instances comprising 13 constraint structures. We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performances to reveal shortcomings. COLLIE is designed to be extensible and lightweight, and we hope the community finds it useful to develop more complex constraints and evaluations in the future.
Primary Area: datasets and benchmarks
Keywords: LLMs VLMs Benchmark
Scores: [ 5 5 8 ]
Primary Area: datasets and benchmarks
Keywords: Ethereum Twitter GNN
Scores: [ 8 5 6 ]
While numerous public blockchain datasets are available, their utility is constrained by a singular focus on blockchain data. This constraint limits the incorporation of relevant social network data into blockchain analysis, thereby diminishing the breadth and depth of insight that can be derived.To address the above limitation, we introduce ETGraph, a novel dataset that authentically links Ethereum and Twitter, marking the first and largest dataset of its kind. ETGraph combines Ethereum transaction records (2 million nodes and 30 million edges) and Twitter following data (1 million nodes and 3 million edges), bonding 30,667 Ethereum addresses with verified Twitter accounts sourced from OpenSea. Detailed statistical analysis on ETGraph highlights the structural differences between Twitter-matched and non-Twitter-matched Ethereum addresses. Extensive experiments, including Ethereum link prediction, wash-trading Ethereum addresses detection, and Twitter-Ethereum matching link prediction, emphasize the significant role of Twitter data in enhancing Ethereum analysis. ETGraph is available at https://etgraph.deno.dev/.
Primary Area: datasets and benchmarks
Keywords: large language model multi-turn interaction learning from feedback reinforcement learning from human feedback instruction tuning
Scores: [ 5 8 8 6 ]
Primary Area: datasets and benchmarks
Keywords: web-scale dataset natural language processing large language model reasoning AI for math
Scores: [ 5 5 8 ]
There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitive web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, open-sourced and released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.
Primary Area: datasets and benchmarks
Keywords: image description evaluation referenceless metrics multimodal evaluation
Scores: [ 6 6 6 ]
Referenceless metrics (e.g., CLIPScore) use pretrained vision--language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence.
Primary Area: datasets and benchmarks
Keywords: Large Language Models Retrieval Augmented Generation Program Synthesis Program Optimization Fine-Tuning Goal-Conditioning Data Augmentation
Scores: [ 8 5 8 8 ]
With the waning of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code.Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks.To that end, we introduce a framework for adapting LLMs to high-level program optimization.First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs, accompanied by extensive unit tests.A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements".To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry.Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play.A combination of these techniques achieves an average speedup of 5.65 times on CodeLlama-13B and 6.86 times on GPT-3.5, surpassing the best human performance (4.06 times).We find our proposed performance-conditioned generation is particularly effective at improving performance as well as increasing the fraction of optimized programs.
Primary Area: datasets and benchmarks
Keywords: data-centric AI hardness characterization data-centric ML DMLR benchmarking
Scores: [ 8 6 8 1 8 ]
Primary Area: datasets and benchmarks
Keywords: Time series Causal discovery Neural networks Benchmark Dataset
Scores: [ 6 8 8 5 ]
Primary Area: datasets and benchmarks
Keywords: Referring Object Detection; Visual Grounding; Foundation Models
Scores: [ 5 6 6 6 ]
We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. While deriving from referring expressions (REC), the instructions we leverage are greatly diversified to encompass common user intentions related to object detection. For one image, we produce tremendous instructions that refer to every single object and different combinations of multiple objects. Each instruction and its corresponding object bounding boxes (bbxs) constitute one training data pair. In order to encompass common detection expressions, we involve emerging vision-language model (VLM) and large language model (LLM) to generate instructions guided by text prompts and object bbxs, as the generalizations of foundation models are effective to produce human-like expressions (e.g., describing object property, category, and relationship). We name our constructed dataset as InDET. It contains images, bbxs and generalized instructions that are from foundation models. Our InDET is developed from existing REC datasets and object detection datasets, with the expanding potential that any image with object bbxs can be incorporated through using our InstructDET method. By using our InDET dataset, we show that a conventional ROD model surpasses existing methods on standard REC datasets and our InDET test set. Our data-centric method InstructDET, with automatic data expansion by leveraging foundation models, directs a promising field that ROD can be greatly diversified to execute common object detection instructions.
Primary Area: datasets and benchmarks
Keywords: Few-shot learning Visual reasoning Open world learning
Scores: [ 8 8 5 6 ]
Primary Area: datasets and benchmarks
Keywords: Vertical federated learning benchmark feature correlation feature importance
Scores: [ 6 8 6 6 ]
Primary Area: datasets and benchmarks
Keywords: Large Language Models Evaluation Data Contamination
Scores: [ 8 6 6 6 ]
Primary Area: datasets and benchmarks
Keywords: Large Language Models Benchmark
Scores: [ 8 8 8 3 ]
Primary Area: datasets and benchmarks
Keywords: neural architecture search robustness benchmark generalization theory
Scores: [ 6 6 6 ]
Primary Area: datasets and benchmarks
Keywords: diabetes management continuous glucose monitors (CGM) glucose trajectory prediction artificial pancreas systems public datasets standardized tasks benchmark models glycemic control
Scores: [ 5 6 8 ]
The rising rates of diabetes necessitate innovative methods for its management. Continuous glucose monitors (CGM) are small medical devices that measure blood glucose levels at regular intervals providing insights into daily patterns of glucose variation. Forecasting of glucose trajectories based on CGM data holds the potential to substantially improve diabetes management, by both refining artificial pancreas systems and enabling individuals to make adjustments based on predictions to maintain optimal glycemic range. Despite numerous methods proposed for CGM-based glucose trajectory prediction, these methods are typically evaluated on small, private datasets, impeding reproducibility, further research, and practical adoption. The absence of standardized prediction tasks and systematic comparisons between methods has led to uncoordinated research efforts, obstructing the identification of optimal tools for tackling specific challenges. As a result, only a limited number of prediction methods have been implemented in clinical practice.
Primary Area: datasets and benchmarks
Keywords: language model evaluation benchmark constraint satisfaction information retrieval retrieval-augmented architecture
Scores: [ 8 6 6 5 ]
We study the ability of state-of-the art models to answer constraint satisfaction queries for information retrieval (e.g., “a list of ice cream shops in San Diego”). In the past, such queries were considered as tasks that could only be solved via web-search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many current retrieval benchmarks are either saturated or do not measure constraint satisfaction. Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. Our extended experiments on GPT4 and GPT3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which exacerbate as information popularity decreases. While context availability mitigates irrelevant information, it is not helpful for satisfying constraints, identifying fundamental barriers to constraint satisfaction. We open source our contributions to foster further research on improving constraint satisfaction abilities of future models.
Primary Area: datasets and benchmarks
Keywords: Out-of-distribution Detection Failure Detection Object Discovery Novelty Detection Robustness
Scores: [ 8 6 6 6 ]
Primary Area: datasets and benchmarks
Keywords: Language models Natural language processing Software engineering
Scores: [ 5 6 8 6 ]
Primary Area: datasets and benchmarks
Keywords: graph neural network explanations factual explanations counterfactual explanations
Scores: [ 6 6 8 3 ]
Numerous explainability methods have been proposed to shed light on the inner workings of GNNs. Despite the inclusion of empirical evaluations in all the proposed algorithms, the interrogative aspects of these evaluations lack diversity. As a result, various facets of explainability pertaining to GNNs, such as a comparative analysis of counterfactual reasoners, their stability to variational factors such as different GNN architectures, noise, stochasticity in non-convex loss surfaces, feasibility amidst domain constraints, and so forth, have yet to be formally investigated. Motivated by this need, we present a benchmarking study on perturbation-based explainability methods for GNNs, aiming to systematically evaluate and compare a wide range of explainability techniques. Among the key findings of our study, we identify the Pareto-optimal methods that exhibit superior efficacy and stability in the presence of noise. Nonetheless, our study reveals thatall algorithms are affected by stability issues when faced with noisy data. Furthermore, we have established that the current generation of counterfactual explainers often fails to provide feasible recourses due to violations of topological constraints encoded by domain-specific considerations. Overall, this benchmarking study empowers stakeholders in the field of GNNs with a comprehensive understanding of the state-of-the-art explainability methods, potential research problems for further enhancement, and the implications of their application in real-world scenarios.
Primary Area: datasets and benchmarks
Keywords: Interactive Environments Benchmark Reinforcement Learning Language Agent Multi-agent
Scores: [ 6 8 8 ]
The generalization of decision-making agents encompasses two fundamental elements: learning from past experiences and reasoning in novel contexts. However, the predominant emphasis in most interactive environments is on learning, often at the expense of complexity in reasoning. In this paper, we introduce CivRealm, an environment inspired by the Civilization game. Civilization's profound alignment with human history and society necessitates extensive learning, while its ever-changing situations demand strong reasoning to generalize. Particularly, CivRealm sets up an imperfect-information general-sum game with a changing number of players; it presents a plethora of complex features, challenging the agent to deal with open-ended stochastic environments that require diplomacy and negotiation skills. Within CivRealm, we provide interfaces for two typical agent types: tensor-based agents that focus on learning, and language-based agents that emphasize reasoning. To catalyze further research, we present initial results for both paradigms. The canonical RL-based agents exhibit reasonable performance in mini-games, whereas both RL- and LLM-based agents struggle to make substantial progress in the full game. Overall, CivRealm stands as a unique challenge to explore the frontiers of both learning and reasoning for decision-making agents.
Primary Area: datasets and benchmarks
Keywords: fairness benchmark
Scores: [ 8 5 8 6 ]
This paper introduces the Fair Fairness Benchmark (\textsf{FFB}), a benchmarking framework for in-processing group fairness methods. Ensuring fairness in machine learning is critical for ethical and legal compliance. However, there exist challenges in comparing and developing of fairness methods due to inconsistencies in experimental settings, lack of accessible algorithmic implementations, and limited extensibility of current fairness packages and tools. To address these issues, we introduce an open-source, standardized benchmark for evaluating in-processing group fairness methods and provide a comprehensive analysis of state-of-the-art methods to ensure different notions of group fairness. This work offers the following key contributions: the provision of flexible, extensible, minimalistic, and research-oriented open-source code; the establishment of unified fairness method benchmarking pipelines; and extensive benchmarking, which yields key insights from 45,079 experiments. We believe our work will significantly facilitate the growth and development of the fairness research community.
Primary Area: datasets and benchmarks
Keywords: CLIP Continual learning distribution shift benchmark
Scores: [ 6 5 8 6 ]
Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TIC-DataComp, TIC-YFCC, and TIC-RedCaps with over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI’s CLIP (trained on data up to 2020) loses ≈8% zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by 2.5× when compared to the standard practice of retraining from scratch.
Primary Area: datasets and benchmarks
Keywords: Large language model IT operations
Scores: [ 5 5 6 6 ]
With the rapid advancement of IT operations, managing and analyzing large data volumes efficiently for practical applications has become increasingly critical. Natural Language Processing (NLP) techniques have demonstrated remarkable capabilities in various tasks, including named entity recognition, machine translation, and dialogue systems. Recently, Large Language Models (LLMs) have achieved significant improvements across various domain-specific areas. However, there is a noticeable gap in the development of specialized Large Language Models (LLMs) tailored for IT operations. In this paper, we introduce the OWL, a large language model trained on our constructed Owl-Instruct with a wide range of IT-related information. Specifically, limited by the maximum input length, we propose the \textbf{H}omogeneous \textbf{M}arkov \textbf{C}ontext \textbf{E}xtension method (HMCE). The mixture-of-adapter strategy is leveraged to improve the parameter-efficient tuning across different domains or tasks.Further, we evaluate the performance of OWL on the Owl-Bench established by us and open IT-related benchmarks. OWL demonstrates superior performance results on IT tasks, which outperforms existing models by significant margins. Moreover, we hope that the findings of our work will provide more insights to revolutionize the techniques of IT operations with specialized LLMs.
Primary Area: datasets and benchmarks
Keywords: Large Language Models
Scores: [ 8 5 6 8 ]
Recent large language models (LLMs) have demonstrated great potential toward intelligent agents and next-gen automation, but there currently lacks a systematic benchmark for evaluating LLMs' abilities as agents. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, Minecraft. Each game features a unique setting, providing up to 20 evaluation settings and infinite environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. The distinction between the set of capabilities each game test allows us to analyze each capability separately.SmartPlay serves not only as a rigorous testing ground for evaluating the overall performance of LLM agents but also as a road-map for identifying gaps in current methodologies. We release our benchmark at github.com/LLMsmartplay/SmartPlay
Primary Area: datasets and benchmarks
Keywords: nlp dataset analaysis data-statistics data-quality PII
Scores: [ 10 5 8 5 ]
Large text corpora are the backbone of language models.However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination).In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of 16 high-level analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities---count and search---at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to 10 different corpora used to train popular language models, including C4, The Pile, and RedPajama.Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE.We open-source WIMBD code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them.
Primary Area: datasets and benchmarks
Keywords: Tool usage large language model benchmark dataset
Scores: [ 6 6 8 5 ]
Primary Area: datasets and benchmarks
Keywords: Event sequence Temporal point process open benchmarking
Scores: [ 8 3 3 10 ]
Continuous-time event sequences play a vital role in real-world domains such as healthcare, finance, online shopping, social networks, and so on. To model such data, temporal point processes (TPPs) have emerged as the most natural and competitive models, making a significant impact in both academic and application communities. Despite the emergence of many powerful models in recent years, there hasn't been a central benchmark for these models and future research endeavors. This lack of standardization impedes researchers and practitioners from comparing methods and reproducing results, potentially slowing down progress in this field. In this paper, we present EasyTPP, the first central repository of research assets (e.g., data, models, evaluation programs, documentations) in the area of event sequence modeling. Our EasyTPP makes several unique contributions to this area: a unified interface of using existing datasets and adding new datasets; a wide range of evaluation programs that are easy to use and extend as well as facilitate reproducible research; implementations of popular neural TPPs, together with a rich library of modules by composing which one could quickly build complex models. All the data and implementation can be found anonymously at Github repository: https://github.com/Anonymous0006/EasyTPP. We will actively maintain this benchmark and welcome contributions from other researchers and practitioners. Our benchmark will help promote reproducible research in this field, thus accelerating research progress as well as making more significant real-world impacts.
Primary Area: datasets and benchmarks
Keywords: Image Generation Image Editing Evaluation Benchmark Diffusion Model
Scores: [ 6 8 8 5 ]
Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image generation, etc. However, we observe huge inconsistencies in experimental conditions: datasets, inference, and evaluation metrics -- render fair comparisons difficult. This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all the conditional image generation models. Firstly, we define seven prominent tasks and curate high-quality evaluation datasets for them. Secondly, we built a unified inference pipeline to ensure fair comparison. Thirdly, we design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves a high inter-worker agreement of Krippendorff’s alpha on 76% models with a value higher than 0.4. We comprehensively evaluated a total of around 30 models and observed three key takeaways: (1) the existing models’ performance is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation, with 74% models achieving an overall score lower than 0.5. (2) we examined the claims from published papers and found 83% of them hold with a few exceptions. (3) None of the existing automatic metrics has a Spearman's correlation higher than 0.2 except subject-driven image generation. Moving forward, we will continue our efforts to evaluate newly published models and update our leaderboard to keep track of the progress in conditional image generation.
Primary Area: datasets and benchmarks
Keywords: conformer ensembles geometric learning
Scores: [ 8 5 6 ]
Primary Area: datasets and benchmarks
Keywords: video-language dataset video understanding video generation multimodal understanding
Scores: [ 6 8 6 8 ]
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLM), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
Primary Area: datasets and benchmarks
Keywords: large language models large multimodal models mathematical reasoning vision-language reasoning foundation models and their evaluations
Scores: [ 8 8 8 5 ]
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive skills in various domains, their ability for mathematical reasoning within visual contexts has not been formally examined. Equipping LLMs and LMMs with this capability is vital for general-purpose AI assistants and showcases promising potential in education, data analysis, and scientific discovery. To bridge this gap, we present MathVista, a benchmark designed to amalgamate challenges from diverse mathematical and visual tasks. We first taxonomize the key task types, reasoning skills, and visual contexts from the literature to guide our selection from 28 existing math-focused and visual question answering datasets. Then, we construct three new datasets, IQTest, FunctionQA, and PaperQA, to accommodate for missing types of visual contexts. The problems featured often require deep visual understanding beyond OCR or image captioning, and compositional reasoning with rich domain-specific tools, thus posing a notable challenge to existing models. We conduct a comprehensive evaluation of 11 prominent open-source and proprietary foundation models (LLMs, LLMs augmented with tools, and LMMs). The best-performing model, Multimodal Bard, achieves only 58% of human performance (34.8% vs 60.3%), indicating ample room for further improvement. Given this significant gap, MathVista fuels future research in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks.
Primary Area: datasets and benchmarks
Keywords: datasets dataset filtering dataset curation large scale machine learning
Scores: [ 6 5 8 8 ]
Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing larger models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI’s WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.
Primary Area: datasets and benchmarks
Keywords: data pruning dataset distillation random sampling corset selection data-efficient learning
Scores: [ 6 3 3 8 6 ]
Primary Area: datasets and benchmarks
Keywords: Large language model skill evaluation LLM benchmark emergence
Scores: [ 5 8 6 ]
As the role of LLMs shifts from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. This capability to combine skills plays an important role in (human) pedagogy and also in a recent paper on emergence phenomena (Arora & Goyal,2023). Our paper introduces an evaluation, Skill-Mix, to measure this capability. Using a list of N skills the evaluator repeatedly picks random subsets of k skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like Nk, for even modest k this evaluation will, with high probability, require the LLM to produce text it has not seen in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using the open LLaMA-2 70b model as well as GPT-4. Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. We found sizeable differences in capabilities among models ---including suspected cases of cramming for the leaderboard''--- that had not been revealed by the (much simpler) evaluations used in popular LLM leaderboards. Our methodology can flexibly change to future models and model capabilities, by expanding the set of skills being tested and increasing k. We hope Skill-Mix (which will be publicly released, including all prompts and code) may grow into an eco-system of open evaluations for AI capabilities, including in multi-modal settings.
Primary Area: datasets and benchmarks
Keywords: instruction tuning multimodal large language model hallucination datasets
Scores: [ 8 8 8 6 ]
Primary Area: datasets and benchmarks
Keywords: Calibration
Scores: [ 6 3 6 ]
Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS.
Primary Area: datasets and benchmarks
Keywords: Open-ended VQA benchmark Vision-Language VL Vision-Text VLM Vision-Language models Image classification Visual question answering Text-generating VLM
Scores: [ 8 8 5 ]
The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models’ capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.
Primary Area: datasets and benchmarks
Keywords: Benchmark; Few-shot classification; Spurious-correlation shifts
Scores: [ 8 5 8 ]
Out-of-distribution (OOD) problems in few-shot classification (FSC) occur when novel classes sampled from testing distributions differ from base classes drawn from training distributions, which considerably degrades the performance of deep learning models deployed in real-world applications. Recent studies suggest that the OOD problems in FSC mainly including: (a) cross-domain few-shot classification (CD-FSC) and (b) spurious-correlation few-shot classification (SC-FSC). Specifically, CD-FSC occurs when a classifier learns transferring knowledge from base classes drawn from \underline{seen} training distributions but recognizes novel classes sampled from unseen testing distributions. In contrast, SC-FSC arises when a classifier relies on non-causal features (or contexts) that happen to be correlated with the labels (or concepts) in base classes but such relationships no longer hold during the model deployment. Despite CD-FSC has been extensively studied, SC-FSC remains understudied due to lack of the corresponding evaluation benchmarks. To this end, we present Meta Concept Context (MetaCoCo), a benchmark with spurious-correlation shifts collected from real-world scenarios. Moreover, to quantify the extent of spurious-correlation shifts of the presented MetaCoCo, we further propose a metric by using CLIP as a pre-trained vision-language model. Extensive experiments on the proposed benchmark are performed to evaluate the state-of-the-art methods in FSC, cross-domain shifts, and self-supervised learning. The experimental results show that the performance of the existing methods degrades significantly in the presence of spurious-correlation shifts. We open-source all codes of our benchmark and hope that the proposed MetaCoCo can facilitate future research on spurious-correlation shifts problems in FSC.
Primary Area: datasets and benchmarks
Keywords: causal inference datasets benchmarks spatial confounding public health
Scores: [ 8 6 6 ]
Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozens of datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.
Primary Area: datasets and benchmarks
Keywords: large language models language model evaluation natural language processing
Scores: [ 8 8 6 ]
Evaluation of Large Language Models (LLMs) is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction. However, previous studies have mainly focused on coarse-grained evaluation (i.e. overall preference-based evaluation), which limits interpretability since it does not consider the nature of user instructions that require instance-wise skill composition. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets), a fine-grained evaluation protocol for both human-based and model-based evaluation which decomposes coarse-level scoring to a skill set-level scoring for each instruction. We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations.
Primary Area: datasets and benchmarks
Keywords: reinforcement learning jax combinatorial research
Scores: [ 5 8 6 6 ]
Primary Area: datasets and benchmarks
Keywords: new corpus user study dialogues conversations chatGPT instruction fine-tuning toxicity safe guarding AI safety
Scores: [ 5 6 6 8 ]
Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual, opt-in for anonymous collection of their chat transcripts. From this, we compiled (InThe)WildChat, a corpus of 570K user-ChatGPT conversations, which consists of over 1.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In particular, in WildChat we find that a majority of the potentially unsafe use is produced by users attempting to “jailbreak” the model using prompts posted on online platforms; these are successful more than 70% of the time for ChatGPT. Finally, because it captures a broad range of use cases, we demonstrate the dataset’s potential utility in fine-tuning state-of-the-art instruction following models. WildLlama, a chatbot fine-tuned on WildChat, outperforms the latest Vicuna model of the same size on MT-Bench, which shows that WildChat has a high utility in addition to being a source for toxicity study. We will release WildChat and WildLlama with a license that emphasizes on accountability, collaboration, and transparency. The clean portion of WildChat will be publicly available, and the portion that contains potentially unsafe content will be made available upon request with a justification for AI safety research.
Primary Area: datasets and benchmarks
Keywords: large language models dataset conversation safety benchmark
Scores: [ 6 8 8 8 ]
Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications.In this paper, we introduce RealChat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs.This dataset is collected from 210K unique IP addresses in the wild on our chat demo website.We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale.We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities.The dataset will be publicly available.
Primary Area: datasets and benchmarks
Keywords: Compression Large Language Models Pruning Quantization
Scores: [ 6 8 5 8 ]
Despite their remarkable achievements, modern Large Language Models (LLMs) encounter exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs achieving 50-60% sparsity and reducing the bit-width down to 3 or 4 bits per weight, with negligible perplexity degradation over the uncompressed baseline. As recent research efforts are focused on developing increasingly sophisticated compression methods, our work takes a step back, and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully-curated tasks to re-define the evaluation protocol for compressed LLMs, which have significant alignment with their dense counterparts, and perplexity fail to capture subtle change in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs even at ≥50% sparsity are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically access compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. All our related codes are planed to be open-sourced.
Primary Area: datasets and benchmarks
Keywords: Large Language Model World Knowledge Evolving Benchmark
Scores: [ 5 6 8 8 ]
Primary Area: datasets and benchmarks
Keywords: Transfer Learning Medical Image Analysis Organ Segmentation
Scores: [ 6 8 6 6 8 ]
The pre-training and fine-tuning paradigm has become prominent in transfer learning. For example, if the model is pre-trained on ImageNet and then fine-tuned to PASCAL, it can significantly outperform that trained directly on PASCAL. While ImageNet pre-training has shown enormous success, it is formed in 2D and the learned features are for classification tasks. Therefore, when transferring to more diverse tasks, like 3D image segmentation, its performance is inevitably compromised due to the deviation from the original ImageNet context. A significant challenge lies in the lack of large, annotated 3D datasets rivaling the scale of ImageNet for model pre-training. To overcome this challenge, we make two contributions. Firstly, we construct ImageNetCT-9K that comprises 9,262 three-dimensional computed tomography (CT) volumes with high-quality, per-voxel annotations. Secondly, we develop a suite of models that is supervised pre-trained on our ImageNetCT-9K. Our preliminary analyses indicate that the model trained only with 20 CT volumes, 640 masks, and 40 GPU hours has a transfer learning ability similar to the model trained with 5,050 CT volumes and 1,152 GPU hours. More importantly, the transfer learning ability of supervised models can further scale up with larger annotated datasets (i.e., SPT), achieving significantly better performance than all existing 3D models, irrespective of their pre-training methodologies or sources. We hope this study can facilitate collective efforts in constructing larger 3D vision datasets and more releases of supervised pre-trained models. Our code is attached as supplementary and will be publicly available.
Primary Area: datasets and benchmarks
Keywords: task planning language models benchmarking embodied agents home robots
Scores: [ 6 6 6 6 ]
Primary Area: datasets and benchmarks
Keywords: Large Language Models Resource and Evaluation Interpretability NLP Application
Scores: [ 8 6 5 8 ]
Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks. The factual knowledge acquired during pretraining and instruction tuning can be useful in various downstream tasks, such as question answering, and language generation. Unlike conventional Knowledge Bases (KBs) that explicitly store factual knowledge, LLMs implicitly store facts in their parameters. Content generated by the LLMs can often exhibit inaccuracies or deviations from the truth, due to facts that can be incorrectly induced or become obsolete over time. To this end, we aim to comprehensively evaluate the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio. Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages. Furthermore, we investigate whether LLMs are able to compose multiple facts, update factual knowledge temporally, reason over multiple pieces of facts, identify subtle factual differences, and resist adversarial examples. Extensive experiments on different sizes and types of LLMs show that existing LLMs still lack factual knowledge and suffer from various spurious correlations. We believe this is a critical bottleneck for realizing trustworthy artificial intelligence. The dataset Pinocchio and our codes will be publicly available.
Primary Area: datasets and benchmarks
Keywords: theorem proving math word problem mathematical reasoning benchmark
Scores: [ 8 6 8 ]
Recent large language models (LLMs) have witnessed significant advancement in various tasks, including mathematical reasoning and theorem proving. As these two tasks require strict and formal multi-step inference, they are appealing domains for exploring the reasoning ability of LLMs but still formidable challenges. Previous studies such as Chain-of-Thought (CoT) have revealed the effectiveness of intermediate steps guidance. However, such step-wise annotation requires heavy labor, leading to insufficient training steps for current benchmarks. To fill this gap, this work introduces MUSTARD, a data generation framework that masters uniform synthesis of theorem and proof data of high quality and diversity. MUSTARD synthesizes data in three stages: (1) It samples a few mathematical concept seeds as the problem category. (2) Then it prompts a generative language model with the sampled concepts to obtain both the problems and their step-wise formal solutions. (3) Lastly, the framework utilizes a proof assistant (e.g., Lean Prover) to filter the valid proofs. With the proposed MUSTARD, we present a theorem-and-proof benchmark MUSTARDSAUCE with 7,335 valid data points. Each data point contains an informal statement, an informal proof, and a translated formal proof that passes the prover validation. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data. We further apply the MUSTARDSAUCE for fine-tuning smaller language models. The fine-tuned Llama 2-7B achieves improvements by 20.9% on zero-shot inference on GSM8K and achieves 8.7 of pass@1 on mathlib, which again demonstrates the data quality and their effectiveness on mathematical tasks.
Primary Area: datasets and benchmarks
Keywords: audio audio-language compositional reasoning
Scores: [ 6 6 8 6 ]
A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose \textbf{CompA}, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
Primary Area: datasets and benchmarks
Keywords: multimodal representation learning visual-language models embodied question answering
Scores: [ 8 8 8 8 ]
Primary Area: datasets and benchmarks
Keywords: summarization evaluation long context prompting LLM
Scores: [ 8 10 8 8 ]
Summarizing book-length documents ($>$100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than the oft-repetitive ones generated by LLaMA 2. Incremental updating yields lower BooookScore but higher level of detail than hierarchical merging, a trade-off sometimes preferred by human annotators. We release code and annotations after blind review to spur more principled research on book-length summarization.
Primary Area: datasets and benchmarks
Keywords: Imitation Learning Benchmark Datasets Diverse Behaviors
Scores: [ 8 6 8 ]
Primary Area: datasets and benchmarks
Keywords: Label errors dataset cleaning AI safety toxicity harmless language models
Scores: [ 6 6 6 5 ]
Primary Area: datasets and benchmarks
Keywords: LLM evaluation
Scores: [ 5 8 8 ]
Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our findings reveal that PandaLM-7B offers a performance comparable to both GPT-3.5 and GPT-4. Impressively, PandaLM-70B surpasses their performance. PandaLM enables the evaluation of LLM to be fairer but with less cost, evidenced by significant improvements achieved by models tuned through PandaLM compared to their counterparts trained with default Alpaca's hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage.
Primary Area: datasets and benchmarks
Keywords: Embodied AI Simulation
Scores: [ 8 6 6 ]
Primary Area: datasets and benchmarks
Keywords: language model evaluation dynamic evaluation alignment cooperative AI agency
Scores: [ 5 3 8 ]
Primary Area: datasets and benchmarks
Keywords: Instruction Tuning Large Language Models Automatic Data Generation
Scores: [ 6 6 5 8 ]
Primary Area: datasets and benchmarks
Keywords: Intuitive physics physical reasoning
Scores: [ 6 6 8 6 ]
Primary Area: datasets and benchmarks
Keywords: Benchmark Vision-Language Large Language Models Low-level Vision Image Quality Assessment
Scores: [ 8 8 6 ]
The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Mean Field Games Equilibrium Learning Networks Large Graphs Multi Agent Systems
Scores: [ 6 8 6 ]
Learning the behavior of large agent populations is an important task for numerous research areas. Although the field of multi-agent reinforcement learning (MARL) has made significant progress towards solving these systems, solutions for many agents often remain computationally infeasible and lack theoretical guarantees. Mean Field Games (MFGs) address both of these issues and can be extended to Graphon MFGs (GMFGs) to include network structures between agents. Despite their merits, the real world applicability of GMFGs is limited by the fact that graphons only capture dense graphs. Since most empirically observed networks show some degree of sparsity, such as power law graphs, the GMFG framework is insufficient for capturing these network topologies. Thus, we introduce the novel concept of Graphex MFGs (GXMFGs) which builds on the graph theoretical concept of graphexes. Graphexes are the limiting objects to sparse graph sequences that also have other desirable features such as the small world property. Learning equilibria in these games is challenging due to the rich and sparse structure of the underlying graphs. To tackle these challenges, we design a new learning algorithm tailored to the GXMFG setup. This hybrid graphex learning approach leverages that the system mainly consists of a highly connected core and a sparse periphery. After defining the system and providing a theoretical analysis, we state our learning approach and demonstrate its learning capabilities on both synthetic graphs and real-world networks. This comparison shows that our GXMFG learning algorithm successfully extends MFGs to a highly relevant class of hard, realistic learning problems that are not accurately addressed by current MARL and MFG methods.
Primary Area: general machine learning (i.e., none of the above)
Keywords: parameter sharing model compression pruning
Scores: [ 3 5 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: recommender systems collaborative filtering scalability
Scores: [ 8 8 6 ]
Excellent tail performance is crucial for modern machine learning tasks, such as algorithmic fairness, class imbalance, and risk-sensitive decision making, as it ensures the effective handling of challenging samples within a dataset. Tail performance is also a vital determinant of success for personalized recommender systems to reduce the risk of losing users with low satisfaction. This study introduces a "safe" collaborative filtering method that prioritizes recommendation quality for less-satisfied users rather than focusing on the average performance. Our approach minimizes the conditional value at risk (CVaR), which represents the average risk over the tails of users' loss. To overcome computational challenges for web-scale recommender systems, we develop a robust yet practical algorithm that extends the most scalable method, implicit alternating least squares (iALS). Empirical evaluation on real-world datasets demonstrates the excellent tail performance of our approach while maintaining competitive computational efficiency.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Sparse Mixture-of-Experts Multilingual Machine Translation Language Guided Routing
Scores: [ 8 8 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Explainability Surrogate Models Neural Tangent Kernel Deep Learning Attribution
Scores: [ 8 8 8 5 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Low-rank plus Quantized Matrix Decomposition Efficient Language Model Finetuning
Scores: [ 5 6 8 8 ]
We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component, which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Across standard NLP benchmarks, our low-rank plus quantized matrix decomposition approach (LQ-LoRA) is found to perform well against strong baselines.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated Learning Federated Recommendation System
Scores: [ 6 8 8 ]
Building recommendation systems via federated learning (FL) is a new emerging challenge for next-generation Internet service. Existing FL models share item embedding across clients while keeping the user embedding private and local on the client side. However, identical item embedding cannot capture users' individual differences in perceiving the same item and may lead to poor personalization. Moreover, dense item embedding in FL results in expensive communication costs and latency. To address these challenges, we propose \underline{\textbf{Fed}}erated \underline{\textbf{R}}ecommendation with \underline{\textbf{A}}dditive \underline{\textbf{P}}ersonalization (\textbf{FedRAP}), which learns a global view of items via FL and a personalized view locally on each user. FedRAP encourages a sparse global view to save FL's communication cost and enforces the two views to be complementary via two regularizers. We propose an effective curriculum to learn the local and global views progressively with increasing regularization weights. To produce recommendations for a user, FedRAP adds the two views together to obtain a personalized item embedding. FedRAP achieves the best performance in FL setting on multiple benchmarks. It outperforms recent federated recommendation methods and several ablation study baselines. \looseness-1
Primary Area: general machine learning (i.e., none of the above)
Keywords: large language model preference optimization language model alignment
Scores: [ 6 6 5 ]
Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized online Reinforcement Learning from Human Feedback (RLHF). Recently, offline methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have emerged as attractive alternatives, offering improvements in stability and scalability while maintaining competitive performance. SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. The absence of a reward model in DPO constrains its ability to sample preference pairs from the optimal policy. Meanwhile, SLiC can only sample preference pairs from the SFT policy. To address these limitations, we introduce a novel approach called Statistical Rejection Sampling Optimization (RSO) designed to source preference data from the target optimal policy using rejection sampling, enabling a more accurate estimation of the optimal policy. We also propose a unified framework that enhances the loss functions used in both SLiC and DPO from a preference modeling standpoint. Through extensive experiments across diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO as evaluated by both Large Language Models (LLMs) and human raters.
Primary Area: general machine learning (i.e., none of the above)
Keywords: graph neural networks positional encoding stability
Scores: [ 6 5 6 5 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: hyperspectral imaging optical modulation real-time detection vision transformer pre-acquisition modulation learnable mask weight binarization
Scores: [ 6 6 6 ]
Bandwidth constraints during signal acquisition frequently impede real-time detection applications. Hyperspectral data is a notable example, whose vast volume compromises real-time hyperspectral detection. To tackle this hurdle, we introduce a novel approach leveraging pre-acquisition modulation to reduce the acquisition volume. This modulation process is governed by a deep learning model, utilizing prior information. Central to our approach is LUM-ViT, a Vision Transformer variant. Uniquely, LUM-ViT incorporates a learnable under-sampling mask tailored for pre-acquisition modulation. To further optimize for optical calculations, we propose a kernel-level weight binarization technique and a three-stage fine-tuning strategy. Our evaluations reveal that, by sampling a mere 10% of the original image pixels, LUM-ViT maintains the accuracy loss within 1.8% on the ImageNet classification task. The method sustains near-original accuracy when implemented on real-world optical hardware, demonstrating its practicality.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Multi-View Stereo Transformer Depth Estimation
Scores: [ 6 5 5 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Generalized Linear Models Learning with Varying Data Differential Equations
Scores: [ 8 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Graph Visualization Optimization Scalability Graph Neural Networks
Scores: [ 6 6 6 5 ]
Graph Visualization, also known as Graph Drawing, aims to find geometric embeddings of graphs that optimize certain criteria. Stress is a widely used metric; stress is minimized when every pair of nodes is positioned at their shortest path distance. However, stress optimization presents computational challenges due to its inherent complexity and is usually solved using heuristics in practice. We introduce a scalable Graph Neural Network (GNN) based Graph Drawing framework with sub-quadratic runtime that can learn to optimize stress. Inspired by classical stress optimization techniques and force-directed layout algorithms, we create a coarsening hierarchy for the input graph. Beginning at the coarsest level, we iteratively refine and un-coarsen the layout, until we generate an embedding for the original graph. To enhance information propagation within the network, we propose a novel positional rewiring technique based on intermediate node positions. Our empirical evaluation demonstrates that the framework achieves state-of-the-art performance while remaining scalable.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Structured Sparse Pruning Projection
Scores: [ 6 6 6 ]
Wide networks usually yield better accuracy than their narrower counterpart at the expense of the massive mult cost.To break this tradeoff, we advocate a novel concept of Structured Activation Sparsification, dubbed SAS, which boosts accuracy without increasing computation by utilizing the projected sparsity in activation maps with a specific structure. Concretely, the projected sparse activation is allowed to have N nonzero value among M consecutive activations.Owing to the local structure in sparsity, the wide matmul between a dense weight and the sparse activation is executed as an equivalent narrow matmul between a dense weight and dense activation, which is compatible with NVIDIA's SparseTensorCore developed for the N:M structured sparse weight.In extensive experiments, we demonstrate that increasing sparsity monotonically improves accuracy (up to 7% on CIFAR10) without increasing the mult count.Furthermore, we show that structured sparsification of activation scales better than that of weight given the same computational budget.
Primary Area: general machine learning (i.e., none of the above)
Keywords: vision transformers efficiency redundacy in attention maps improved throughput
Scores: [ 6 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Computational Neuroscience Phenomenological Neuron Modeling Cortical Neurons Recurrent Neural Networks Machine Learning Biologically Inspired Modelling
Scores: [ 5 8 6 8 ]
Biological cortical neurons are remarkably sophisticated computational devices,temporally integrating their vast synaptic input over an intricate dendritic tree,subject to complex, nonlinearly interacting internal biological processes. A recentstudy proposed to characterize this complexity by fitting accurate surrogate modelsto replicate the input-output relationship of a detailed biophysical cortical pyramidalneuron model and discovered it needed temporal convolutional networks (TCN)with millions of parameters. Requiring these many parameters, however, couldbe the result of a misalignment between the inductive biases of the TCN andcortical neuron’s computations. In light of this, and with the aim to explorethe computational implications of leaky memory units and nonlinear dendriticprocessing, we introduce the Expressive Leaky Memory (ELM) neuron model, abiologically inspired phenomenological model of a cortical neuron. Remarkably, byexploiting a few such slowly decaying memory-like hidden states and two-layerednonlinear integration of synaptic input, our ELM neuron can accurately matchthe aforementioned input-output relationship with under ten-thousand trainableparameters. To further assess the computational ramifications of our neuron design,we evaluate on various tasks with demanding temporal structures, including theLong Range Arena (LRA) datasets, as well as a novel neuromorphic dataset basedon the Spiking Heidelberg Digits dataset (SHD-Adding). Leveraging a largernumber of memory units with sufficiently long timescales, and correspondinglysophisticated synaptic integration, the ELM neuron proves to be competitive onboth datasets, reliably outperforming the classic Transformer or Chrono-LSTMarchitectures on latter, even solving the Pathfinder-X task with over 70% accuracy(16k context length). These findings indicate the importance of inductive biasesfor efficient surrogate neuron models and the potential for biologically motivatedmodels to enhance performance in challenging machine learning tasks.
Primary Area: general machine learning (i.e., none of the above)
Keywords: federated learning personalized federated learning bayesian coreset submodularity variational inference coresets optimization
Scores: [ 8 5 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Deep neural network random initialisation sparsity gaussian process
Scores: [ 6 6 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: adversarial attacks adversarial training adversarial purification
Scores: [ 6 5 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: coreset selection data pruning data graph message passing data distillation
Scores: [ 6 6 6 5 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: multiagent debate large language models inter-model communication embedding representation
Scores: [ 5 6 6 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Pruning Fusion
Scores: [ 8 8 6 8 6 ]
Pruning is one of the mainstream methods to compress over-parameterized neural networks, resulting in significant practical benefits. Recently, another line of work has explored the direction of fusion, i.e. merging, independently trained neural networks. Here, we seek to marry the two approaches in a bid to combine their advantages into a single approach, which we term `Intra-Fusion'. Specifically, we implicitly utilize the pruning criteria to result in more informed fusion. Agnostic to the choice of a specific neuron-importance metric, Intra-Fusion can typically prune an additional considerable amount of the parameters while retaining the same accuracy as the standard pruning approach. Additionally, we explore how fusion can be added to the pruning process to significantly decrease the training time while maintaining competitive performance. We benchmark our results for various networks on commonly used datasets such as CIFAR10, CIFAR100, and ImageNet. More broadly, we hope that the proposed approach invigorates exploration into a fresh alternative to the predominant compression approaches.
Primary Area: general machine learning (i.e., none of the above)
Keywords: transformers mixture models linear regression
Scores: [ 6 8 6 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Non-Decomposable Objectives Long-Tail Learning Semi-Supervised Learning
Scores: [ 6 8 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Wasserstein distance Federated Learning ; Triangle inequality
Scores: [ 8 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: tabular tabular data architecture deep learning neural networks
Scores: [ 3 6 6 8 ]
Deep learning (DL) models for tabular data problems (e.g. classification, regression) are currently receiving increasingly more attention from researchers.However, despite the recent efforts, the non-DL algorithms based on gradient-boosted decision trees (GBDT) remain a strong go-to solution for these problems.One of the research directions aimed at improving the position of tabular DL involves designing so-called retrieval-augmented models.For a target object, such models retrieve other objects (e.g. the nearest neighbors) from the available training data and use their features and labels to make a better prediction.In this work, we present TabR -- essentially, a feed-forward network with a novel k-Nearest-Neighbors-like component in the middle. On a set of public benchmarks with datasets up to several million objects, TabR marks a big step forward for tabular DL: it demonstrates the best average performance among tabular DL models, becomes the new state-of-the-art on several datasets, and even outperforms GBDT models on the recently proposed "GBDT-friendly" benchmark (see Figure 1).Among the novel findings and technical details powering TabR, the main ones lie in the attention-like mechanism that is responsible for retrieving the nearest neighbors and extracting valuable signal from them.In addition to the much higher performance, TabR is simple and significantly more efficient compared to prior retrieval-based tabular DL models.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Dataset Distillation
Scores: [ 8 6 8 6 ]
The ultimate goal of Dataset Distillation is to synthesize a small synthetic dataset such that a model trained on this synthetic set will perform equally well as a model trained on the full, real dataset. Until now, no method of Dataset Distillation has reached this completely lossless goal, in part due to the fact that previous methods only remain effective when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, it seems that to achieve truly loss dataset distillation, we must develop a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory-matching, or optimizing the synthetic data to induce similar long-term training dynamics as the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set since there are fewer examples wherein to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets since there are now enough samples to represent the necessary complex patterns. Based on our findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets will be released.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Progressive Learning Large Language Models Model Growth
Scores: [ 8 6 6 ]
Accelerating large language model pre-training is a critical issue in present research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule’s efficiency is underexplored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements on training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that is independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performances. Code is publicly available.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated Learning Label Deficiency
Scores: [ 6 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: hypergraphs hypergraph projection learning-based reconstruction
Scores: [ 6 8 8 8 ]
We study the implications of the modeling choice to use a graph, instead of a hypergraph, to represent real-world interconnected systems whose constituent relationships are of higher order by nature. Such a modeling choice typically involves an underlying projection process that maps the original hypergraph onto a graph, and is prevalent in graph-based analysis. While hypergraph projection can potentially lead to loss of higher-order relations, there exists very limited studies on the consequences of doing so, as well as its remediation. This work fills this gap by doing two things: (1) we develop analysis based on graph and set theory, showing two ubiquitous patterns of hyperedges that are root to structural information loss in all hypergraph projections; we also quantify the combinatorial impossibility of recovering the lost higher-order structures if no extra help is provided; (2) we still seek to recover the lost higher-order structures in hypergraph projection, and in light of (1)'s findings we make reasonable assumptions to allow the help of some prior knowledge of the application domain. Under this problem setting, we develop a learning-based hypergraph reconstruction method based on an important statistic of hyperedge distributions that we find. Our reconstruction method is systematically evaluated on 8 real-world datasets under different settings, and exhibits consistently top performance. We also demonstrate benefits of the reconstructed hypergraphs through use cases of protein rankings and link predictions.
Primary Area: general machine learning (i.e., none of the above)
Keywords: large language models tool making tool using serving efficiency
Scores: [ 5 5 6 3 ]
Recent research has highlighted the potential of large language models (LLMs)to improve their problem-solving capabilities with the aid of suitable externaltools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMscreate their own reusable tools for problem-solving. Our approach consists of twophases: 1) tool making: an LLM acts as the tool maker that crafts tools for a setof tasks, where a tool is implemented as a Python utility function. 2) tool using:another LLM acts as the tool user, which applies the tool built by the tool makerfor problem-solving. The tool user can be either the same or a different LLMfrom the tool maker. On the problem-solving server side, tool-making enablescontinual tool generation and caching as new requests emerge. This frameworkenables subsequent requests to access cached tools via their corresponding APIs,enhancing the efficiency of task resolution. Beyond enabling LLMs to create theirown tools, our framework also uncovers intriguing opportunities to optimize theserving cost of LLMs: Recognizing that tool-making requires more sophisticatedcapabilities, we assign this task to a powerful, albeit resource-intensive, model.Conversely, the simpler tool-using phase is delegated to a lightweight model. Thisstrategic division of labor allows the once-off cost of tool-making to be spreadover multiple instances of tool-using, significantly reducing average costs whilemaintaining strong performance. Furthermore, our method offers a functionalcache through the caching and reuse of tools, which stores the functionality ofa class of requests instead of the natural language responses from LLMs, thusextending the applicability of the conventional cache mechanism. We evaluateour approach across various complex reasoning tasks, including Big-Bench tasks.With GPT-4 as the tool maker and GPT-3.5 as the tool user, LATM demonstratesperformance equivalent to using GPT-4 for both roles, but with a significantlyreduced inference cost.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Neural Architecture Search Training-free NAS Bayesian Optimization Precision@K Partial Monitoring
Scores: [ 5 3 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: consistency language models self-critique
Scores: [ 6 8 6 ]
As of September 2023, ChatGPT correctly answers “what is 7+8” with 15, but when asked “7+8=15, True or False” it responds with “False”. This inconsistency between generating and validating an answer is prevalent in language models (LMs) and erodes trust. In this paper, we propose a framework for measuring the consistency between generation and validation (which we call generator-validator consistency, or GV-consistency), finding that even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time. To improve the consistency of LMs, we propose to finetune on the filtered generator and validator responses that are GV-consistent, and call this approach consistency fine-tuning. We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%, and the improvement extrapolates to unseen tasks and domains (e.g., GV-consistency for positive style transfers extrapolates to unseen styles like humor). In addition to improving consistency, consistency fine-tuning improves both generator quality and validator accuracy without using any labeled data. Evaluated across 6 tasks, including math questions, knowledge-intensive QA, and instruction following, our method improves the generator quality by 16% and the validator accuracy by 6.3% across all tasks.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Structured Matrix Block Low Rank Low Rank Efficient Neural Network Transformer Fourier Dirichlet Kernel FFT Boxcar Pruning Compression
Scores: [ 6 5 6 ]
This paper investigates efficient deep neural networks (DNNs) to replace dense unstructured weight matrices with structured ones that possess desired properties. The challenge arises because the optimal weight matrix structure in popular neural network models is obscure in most cases and may vary from layer to layer even in the same network. Prior structured matrices proposed for efficient DNNs were mostly hand-crafted without a generalized framework to systematically learn them. To address this issue, we propose a generalized and differentiable framework to learn efficient structures of weight matrices by gradient descent. We first define a new class of structured matrices that covers a wide range of structured matrices in the literature by adjusting the structural parameters. Then, the frequency-domain differentiable parameterization scheme based on the Gaussian-Dirichlet kernel is adopted to learn the structural parameters by proximal gradient descent. Finally, we introduce an effective initialization method for the proposed scheme. Our method learns efficient DNNs with structured matrices, achieving lower complexity and/or higher performance than prior approaches that employ low-rank, block-sparse, or block-low-rank matrices.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large Language Models Sparsity Activation Function ReLU Activation Function
Scores: [ 6 8 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: post-training quantization large language model Affine Transformation
Scores: [ 8 5 8 8 ]
The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training.Existing PTQ methods for LLMs limit the optimization scope to scaling transformations between pre- and post-quantization weights. This constraint results in significant errors after quantization, particularly in low-bit configurations. In this paper, we advocate for the direct optimization using equivalent Affine transformations in PTQ (AffineQuant). This approach extends the optimization scope and thus significantly minimizing quantization errors. Additionally, by employing the corresponding inverse matrix, we can ensure equivalence between the pre- and post-quantization outputs of PTQ, thereby maintaining its efficiency and generalization capabilities. To ensure the invertibility of the transformation during optimization, we further introduce a gradual mask optimization method. This method initially focuses on optimizing the diagonal elements and gradually extends to the other elements. Such an approach aligns with the Levy-Desplanques theorem, theoretically ensuring invertibility of the transformation. As a result, significant performance improvements are evident across different LLMs on diverse datasets. Notably, these improvements are most pronounced when using very low-bit quantization, enabling the deployment of large models on edge devices. To illustrate, we attain a C4 perplexity of 14.89 ({ 10.00$\downarrow$} vs 24.89 in OmniQuant) on the LLaMA-$7$B model of W$2$A$16$ quantization. AffineQuant significantly outperforms OmniQuant on smaller models, achieving a perplexity of 42.29 ({ 33.14$\downarrow$} vs 75.43 in OmniQuant) when using 2-bit 128-group quantization for OPT-$125$M, which setting a new state-of-the-art benchmark for PTQ in LLMs. Codes are available in the supplementary materials.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Legendre Memory Unit Spiking Neural Network Recurrent Neural Network
Scores: [ 6 6 8 3 ]
Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily resource-constrained. Thus motivated, many researchers have proposed reformulating the transformer models as RNN modules which modify the self-attention computation with explicit states. However, these approaches often incur significant performance degradation.The ultimate goal is to develop a model that has the following properties: parallel training, streaming and low-cost inference, and state-of-the-art (SOTA) performance. In this paper, we propose a new direction to achieve this goal. We show how architectural modifications to a fully-sequential recurrent model can help push its performance toward Transformer models while retaining its sequential processing capability. Specifically, inspired by the recent success of Legendre Memory Units (LMU) in sequence learning tasks, we propose LMUFormer, which augments the LMU with convolutional patch embedding and convolutional channel mixer. Moreover, we present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules while simultaneously reducing the computing complexity. We evaluated our architectures on multiple sequence datasets. Of particular note is our performance on the Speech Commands V2 dataset (35 classes). In comparison to SOTA transformer-based models within the ANN domain, our LMUFormer demonstrates comparable performance while necessitating a remarkable 70× reduction in parameters and a substantial 140× decrement in FLOPs. Furthermore, when benchmarked against extant low-complexity SNN variants, our model establishes a new SOTA with an accuracy of 96.12%. Additionally, owing to our model's proficiency in real-time data processing, we are able to achieve a 32.03% reduction in sequence length, all while incurring an inconsequential decline in performance.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Self-consistency Chain-of-Thoughts Multi-Step Reasoning Large Language Models
Scores: [ 8 5 5 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated Learning Device Heterogeneity Cross-silo Federated Learning
Scores: [ 8 6 6 5 ]
Cross-silo federated learning offers a promising solution to collaboratively train robust and generalized AI models without compromising the privacy of local datasets, e.g., healthcare, financial, as well as scientific projects that lack a centralized data facility. Nonetheless, because of the disparity of computing resources among different clients (i.e., device heterogeneity), synchronous federated learning algorithms suffer from degraded efficiency when waiting for straggler clients. Similarly, asynchronous federated learning algorithms experience degradation in the convergence rate and final model accuracy on non-identically and independently distributed (non-IID) heterogeneous datasets due to stale local models and client drift. To address these limitations in cross-silo federated learning with heterogeneous clients and data, we propose FedCompass, an innovative semi-asynchronous federated learning algorithm with a computing power-aware scheduler on the server side, which adaptively assigns varying amounts of training tasks to different clients using the knowledge of the computing power of individual clients. FedCompass ensures that multiple locally trained models from clients are received almost simultaneously as a group for aggregation, effectively reducing the staleness of local models. At the same time, the overall training process remains asynchronous, eliminating prolonged waiting periods from straggler clients. Using diverse non-IID heterogeneous distributed datasets, we demonstrate that FedCompass achieves faster convergence and higher accuracy than other asynchronous algorithms while remaining more efficient than synchronous algorithms when performing federated learning on heterogeneous clients.
Primary Area: general machine learning (i.e., none of the above)
Keywords: machine learning adversarial malware smoothing robustness
Scores: [ 8 6 6 6 5 ]
Machine Learning (ML) models have been utilized for malware detection for over two decades. Consequently, this ignited an ongoing arms race between malware authors and antivirus systems, compelling researchers to propose defenses for malware-detection models against evasion attacks. However, most if not all existing defenses against evasion attacks suffer from sizable performance degradation and/or can defend against only specific attacks, which makes them less practical in real-world settings. In this work, we develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection. Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables. After showing how DRSM is theoretically robust against attacks with contiguous adversarial bytes, we verify its performance and certified robustness experimentally, where we observe only marginal accuracy drops as the cost of robustness. To our knowledge, we are the first to offer certified robustness in the realm of static detection of malware executables. More surprisingly, through evaluating DRSM against 9 empirical attacks of different types, we observe that the proposed defense is empirically robust to some extent against a diverse set of attacks, some of which even fall out of the scope of its original threat model. In addition, we collected 15.5K recent benign raw executables from diverse sources, which will be made public as a dataset called PACE (Publicly Accessible Collection(s) of Executables) to alleviate the scarcity of publicly available benign datasets for studying malware detection and provide future research with more representative data of the time.
Primary Area: general machine learning (i.e., none of the above)
Keywords: data valuation probabilistic values approximation distributional Shapley
Scores: [ 8 6 8 8 ]
The family of probabilistic values, axiomatically-grounded and proposed in cooperative game theory, has recently received much attention in data valuation. However, it is often computationally expensive to compute exactly (exponential w.r.t. N, the number of data being valuated). Existing generic estimators cost O(N2ϵ2logNδ) utility evaluations to achieve an (ϵ, δ)-approximation under the 2-norm, while faster estimators have been developed recently for special cases (e.g., empirically for the Shapley value and theoretically for the Banzhaf value). In this work, based on a connection between probabilistic values and least square regressions, we propose two generic estimators for the whole family of probabilistic values that both cost O(Nϵ2logNδ) utility evaluations, largely extending the scope of this currently best complexity bound. Moreover, we show that each distributional value, proposed by Ghorbani et al. (2020) to alleviate the inconsistency of probabilistic values when using distinct databases, can also be cast as optimizing a similar least square regression. This observation makes it the first-time theoretically-grounded to train value estimators such that the distributional value of each unseen data point can be evaluated in a single forward pass. Our experiments verify the faster convergence of our proposed estimators, and demonstrate the effectiveness at learning distributional values.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Robust time series forecasting; learning with noisy labels
Scores: [ 6 5 5 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: extrapolation OOD generalization deep neural networks decision-making
Scores: [ 8 6 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: transfer learning batch normalization efficient training
Scores: [ 8 6 8 8 ]
Convolution-BatchNorm (ConvBN) blocks are integral components in various computer vision tasks and other domains. A ConvBN block can operate in three modes: Train, Eval, and Deploy. While the Train mode is indispensable for training models from scratch, the Eval mode is suitable for transfer learning and beyond, and the Deploy mode is designed for the deployment of models. This paper focuses on the trade-off between stability and efficiency in ConvBN blocks: Deploy mode is efficient but suffers from training instability; Eval mode is widely used in transfer learning but lacks efficiency. To solve the dilemma, we theoretically reveal the reason behind the diminished training stability observed in the Deploy mode. Subsequently, we propose a novel Tune mode to bridge the gap between Eval mode and Deploy mode. The proposed Tune mode is as stable as Eval mode for transfer learning, and its computational efficiency closely matches that of the Deploy mode. Through extensive experiments in object detection, classification, and adversarial example generation across 5 datasets and 12 model architectures, we demonstrate that the proposed Tune mode retains the performance while significantly reducing GPU memory footprint and training time, thereby contributing efficient ConvBN blocks for transfer learning and beyond. Our method has been integrated into both PyTorch (general machine learning framework) and MMCV/MMEngine (computer vision framework). Practitioners just need one line of code to enjoy our efficient ConvBN blocks thanks to PyTorch's builtin machine learning compilers.
Primary Area: general machine learning (i.e., none of the above)
Keywords: BNN Hoyer regularizer gradient descent FLOPs object detection
Scores: [ 5 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: large-scale graphs signal sampling graphons
Scores: [ 8 8 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated Learning Adaptive Gradient Methods
Scores: [ 6 6 6 ]
Federated learning (FL) is an emerging learning paradigm in which a set of distributed clients learns a task under the coordination of a central server. The FedAvg algorithm is one of the most widely used methods to solve FL problems. In FedAvg, the learning rate is a constant rather than changing adaptively. Adaptive gradient methods have demonstrated superior performance over the constant learning rate schedules in non-distributed settings, and they have recently been adapted to FL. However, the majority of these methods are designed for unconstrained settings. Meanwhile, many crucial FL applications, like disease diagnosis and biomarker identification, often rely on constrained formulations such as Lasso and group Lasso. It remains an open question as to whether adaptive gradient methods can be effectively applied to FL problems with constrains. In this work, we introduce \textbf{FedDA}, a novel adaptive gradient framework for FL. This framework utilizes a restarted dual averaging technique and is compatible with a range of gradient estimation methods and adaptive learning rate schedules. Specifically, an instantiation of our framework \textbf{FedDA-MVR} achieves gradient complexity ~O(K−1ϵ−1.5) and communication complexity ~O(K−0.25ϵ−1.25) for finding a stationary point ϵ in the constrained setting. We conduct experiments over both constrained and unconstrained tasks to confirm the effectiveness of our approach.
Primary Area: general machine learning (i.e., none of the above)
Keywords: multimodal personalization user modeling generative modeling instruction tuning large language model
Scores: [ 8 6 5 6 ]
Developing a universal model that can effectively harness heterogeneous resources and respond to a wide range of personalized needs has been a longstanding community aspiration. Our daily choices, especially in domains like fashion and retail, are substantially shaped by multi-modal data, such as pictures and textual descriptions. These modalities not only offer intuitive guidance but also cater to personalized user preferences. However, the predominant personalization approaches mainly focus on ID or text-based recommendation problem, failing to comprehend the information spanning various tasks or modalities. In this paper, our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP), which effectively leverages multi-modal data while eliminating the complexities associated with task- and modality-specific customization. We argue that the advancements in foundational generative modeling have provided the flexibility and effectiveness necessary to achieve the objective. In light of this, we develop a generic and extensible personalization generative framework, that can handle a wide range of personalized needs including item recommendation, product search, preference prediction, explanation generation, and further user-guided image generation. Our methodology enhances the capabilities of foundational language models for personalized tasks by seamlessly ingesting interleaved cross-modal user history information, ensuring a more precise and customized experience for users. To train and evaluate the proposed multi-modal personalized tasks, we also introduce a novel and comprehensive benchmark covered a variety of user requirements. Our experiments on the real-world benchmark showcase the model's potential, outperforming competitive methods specialized for each task.
Primary Area: general machine learning (i.e., none of the above)
Keywords: autoencoder matrix completion collaborative filtering
Scores: [ 8 6 8 8 ]
Neural networks have shown promising performance in collaborative filtering and matrix completion but the theoretical analysis is limited and there is still room for improvement in terms of the accuracy of recovering missing values. This paper presents a neuron-enhanced autoencoder matrix completion (AEMC-NE) method and applies it to collaborative filtering. Our AEMC-NE adds an element-wise autoencoder to each output of the main autoencoder to enhance the reconstruction capability. Thus it can adaptively learn an activation function for the output layer to approximate possibly complicated response functions in real data. We provide theoretical analysis for AEMC-NE as well as AEMC to investigate the generalization ability of autoencoder and deep learning in matrix completion, considering both missing completely at random and missing not at random. We show that the element-wise neural network has the potential to reduce the generalization error bound, the data sparsity can be useful, and the prediction performance is closely related to the difference between the numbers of variables and samples. The numerical results on synthetic data and benchmark datasets demonstrated the effectiveness of AEMC-NE in comparison to many baselines.
Primary Area: general machine learning (i.e., none of the above)
Keywords: lossless compression arithmetic coding language models scaling laws in-context learning
Scores: [ 6 6 6 6 ]
It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Pruning Deep Learning Neural Networks Interpretability Loss landscapes Optimization Kurtosis
Scores: [ 6 1 8 5 ]
Pruning is an effective method to reduce the size of deep neural network models, maintain accuracy, and, in some cases, improve the network's overall performance. However, the mechanisms underpinning pruning remain unclear. Why can different methods prune by different percentages yet achieve similar performance? Why can we not prune at the start of training? Why are some models more amenable to being pruned than others? Given a model, what is the maximum amount it can be pruned before significantly affecting the performance? This paper explores and answers these questions from the global unstructured magnitude pruning perspective with one epoch of fine-tuning. We develop the idea that cosine similarity is an effective proxy measure for functional similarity between the parent and the pruned network. We prove that the L1 pruning method is optimal when pruning by cosine similarity. We show that the higher the kurtosis of a model's parameter distribution, the more it can be pruned while maintaining performance. Finally, we present a simple method to determine the optimal amount by which a network can be L1-pruned based on its parameter distribution.
Primary Area: general machine learning (i.e., none of the above)
Keywords: out-of-distribution detection learnability
Scores: [ 6 6 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: early-exit dynamic networks efficient inference
Scores: [ 8 8 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Combinatorial Optimization Branch and Bound Deep Symbolic Optimization Learn to Optimize Machine Learning for Combinatorial Optimization
Scores: [ 6 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: adversarial defense resampling implicit representation
Scores: [ 5 6 6 8 ]
We introduce a novel approach to counter adversarial attacks, namely, image resampling. Image resampling transforms a discrete image into a new one, simulating the process of scene recapturing or rerendering as specified by a geometrical transformation. The underlying rationale behind our idea is that image resampling can alleviate the influence of adversarial perturbations while preserving essential semantic information, thereby conferring an inherent advantage in defending against adversarial attacks. To validate this concept, we present a comprehensive study on leveraging image resampling to defend against adversarial attacks. We have developed basic resampling methods that employ interpolation strategies and coordinate shifting magnitudes. Our analysis reveals that these basic methods can partially mitigate adversarial attacks. However, they come with apparent limitations: the accuracy of clean images noticeably decreases, while the improvement in accuracy on adversarial examples is not substantial.We propose implicit representation-driven image resampling (IRAD) to overcome these limitations. First, we construct an implicit continuous representation that enables us to represent any input image within a continuous coordinate space. Second, we introduce SampleNet, which automatically generates pixel-wise shifts for resampling in response to different inputs. Furthermore, we can extend our approach to the state-of-the-art diffusion-based method, accelerating it with fewer time steps while preserving its defense capability. Extensive experiments demonstrate that our method significantly enhances the adversarial robustness of diverse deep models against various attacks while maintaining high accuracy on clean images.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Speech Continuation Spoken Question Answering
Scores: [ 8 8 6 5 ]
We present a novel approach to adapting pre-trained large language models (LLMs) to perform question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text pairs, enabling a `cross-modal' chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM as demonstrated through spoken QA datasets.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Sparse Training Network Science Epitopological Learning Network Automata Network Percolation Hyperbolic Network
Scores: [ 8 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: bias-variance decomposition ensemble deep learning
Scores: [ 8 6 6 8 ]
Classical wisdom in machine learning holds that the generalization error can be decomposed into bias and variance, and these two terms exhibit a \emph{trade-off}. However, in this paper, we show that for an ensemble of deep learning based classification models, bias and variance are \emph{aligned} at a sample level, where squared bias is approximately \emph{equal} to variance for correctly classified sample points. We present empirical evidence confirming this phenomenon in a variety of deep learning models and datasets. Moreover, we study this phenomenon from two theoretical perspectives: calibration and neural collapse. We first show theoretically that under the assumption that the models are well calibrated, we can observe the bias-variance alignment. Second, starting from the picture provided by the neural collapse theory, we show an approximate correlation between bias and variance.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Speech-to-Speecn Translatiom Audio Language Model Voice Clone
Scores: [ 8 5 3 8 ]
With the huge success of GPT models in natural language processing, there is a growing interest in applying language modeling approaches to speech tasks.Currently, the dominant architecture in speech-to-speech translation (S2ST) remains the encoder-decoder paradigm, creating a need to investigate the impact of language modeling approaches in this area. In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems. Our framework comprises three decoder-only language models: a translation language model, a duration language model, and a speech synthesis language model. These language models employ different types of prompts to extract learned information effectively. By utilizing unsupervised semantic units, our framework can transfer semantic information across these models, making it applicable even to unwritten languages. We evaluate our system on Chinese → English and English → Spanish language pairs. Experimental results demonstrate that PolyVoice outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality.Speech samples are available at \url{https://polyvoice.github.io}.
Primary Area: general machine learning (i.e., none of the above)
Keywords: federated learning multi-domain federated learning
Scores: [ 5 6 5 8 ]
Federated learning (FL) enhances data privacy with collaborative in-situ training on decentralized clients. Nevertheless, FL encounters challenges due to non-independent and identically distributed (non-i.i.d) data, leading to potential performance degradation and hindered convergence. While prior studies predominantly addressed the issue of skewed label distribution, our research addresses a crucial yet frequently overlooked problem known as multi-domain FL. In this scenario, clients' data originate from diverse domains with distinct feature distributions, instead of label distributions. To address the multi-domain problem in FL, we propose a novel method called Federated Learning Without Normalizations (FedWon). FedWon draws inspiration from the observation that batch normalization (BN) faces challenges in effectively modeling the statistics of multiple domains, while existing normalization techniques possess their own limitations. In order to address these issues, FedWon eliminates the normalization layers in FL and reparameterizes convolution layers with scaled weight standardization. Through extensive experimentation on five datasets and five models, our comprehensive experimental results demonstrate that FedWon surpasses both FedAvg and the current state-of-the-art method (FedBN) across all experimental setups, achieving notable accuracy improvements of more than 10% in certain domains. Furthermore, FedWon is versatile for both cross-silo and cross-device FL, exhibiting robust domain generalization capability, showcasing strong performance even with a batch size as small as 1, thereby catering to resource-constrained devices. Additionally, FedWon can also effectively tackle the challenge of skewed label distribution.
Primary Area: general machine learning (i.e., none of the above)
Keywords: equivariance invariance polynomials non-compact special linear group data augmentation
Scores: [ 8 8 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Directed Acyclic Graphs Structure Learning
Scores: [ 6 6 5 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated Learning Deep Learning
Scores: [ 6 8 6 8 ]
Federated Learning (FL) models typically suffer from client drift caused by heterogeneous data, where data distributions vary with clients. To this end, advanced works mainly focus on manipulating exist gradients to obtain more similar client models. In this paper, we propose a different view of client drift and correct it by producing better local models. First, we analyze the generalization contribution of local training and conclude that the generalization contribution of local training is bounded by the conditional Wasserstein distance between clients' data distributions. Then, we propose FedImpro, to constructs similar conditional distributions for local training. Specifically, FedImpro decouples the model into high-level and low-level parts and trains the high-level part on reconstructed feature distributions, causing promoted generalization contribution and alleviated gradient dissimilarity of FL. Experimental results demonstrate that FedImpro can help FL defend against data heterogeneity and improve model generalization
Primary Area: general machine learning (i.e., none of the above)
Keywords: Active Learning Neural Networks
Scores: [ 6 6 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Alignment Safety AGI position paper
Scores: [ 5 3 8 5 ]
AI systems based on deep learning have reached or surpassed human performance in a range of narrow domains. In coming decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks. In this position paper, we examine the technical difficulty of fine-tuning hypothetical AGI systems based on pretrained deep models to pursue goals that are aligned with human interests. We argue that, if trained like today's most capable models, AGI systems could learn to act deceptively to receive higher reward, learn internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. AGIs with these properties would be difficult to align and may appear aligned even when they are not.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Visual Document Understanding Optical Character Recognition Mathematical Expression Recognition Information Extraction
Scores: [ 5 3 8 10 ]
Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human- readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.
Primary Area: general machine learning (i.e., none of the above)
Keywords: deep learning hypernetwork stability optimization
Scores: [ 5 8 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: directed acyclic graph concomitant scale estimation causal discovery graph structure learning
Scores: [ 3 8 6 ]
We deal with the combinatorial problem of learning directed acyclic graph (DAG) structure from observational data adhering to a linear structural equation model (SEM). Leveraging advances in differentiable, nonconvex characterizations of acyclicity, recent efforts have advocated a continuous constrained optimization paradigm to efficiently explore the space of DAGs. Most existing methods employ lasso-type score functions to guide this search, which (i) require expensive penalty parameter retuning when the \emph{unknown} SEM noise variances change across problem instances; and (ii) implicitly rely on limiting homoscedasticity assumptions. In this work, we propose a new convex score function for sparsity-aware learning of linear DAGs, which incorporates concomitant estimation of scale and thus effectively decouples the sparsity parameter from noise levels. Regularization via a smooth, nonconvex acyclicity penalty term yields CoLiDE ($\textbf{Co}$ncomitant $\textbf{Li}$near $\textbf{D}$AG $\textbf{E}$stimation), a regression-based criterion amenable to efficient gradient computation and closed-form estimation of exogenous noise levels in heteroscedastic scenarios. Our algorithm outperforms state-of-the-art methods without incurring added complexity, especially when the DAGs are larger and the noise level profile is heterogeneous. We also find CoLiDE exhibits enhanced stability manifested via reduced standard deviations in several domain-specific metrics, underscoring the robustness of our novel linear DAG estimator.
Primary Area: general machine learning (i.e., none of the above)
Keywords: In-context learning Transformers Large language models Boolean functions
Scores: [ 8 6 6 8 ]
In order to understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued functions. However, the limitations of Transformers in implementing learning algorithms, and their ability to learn other forms of algorithms are not well understood. Additionally, the degree to which these capabilities are confined to attention-based models is unclear. Furthermore, it remains to be seen whether the insights derived from these stylized settings can be extrapolated to pretrained Large Language Models (LLMs). In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks. Additionally, we find that certain attention-free models perform (almost) identically to Transformers on a range of tasks. (b) When provided a teaching sequence, i.e. a set of examples that uniquely identifies a function in a class, we show that Transformers learn more sample-efficiently. Interestingly, our results show that Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples. (c) Lastly, we show that extant LLMs, e.g. LLaMA-2, GPT-4, can compete with nearest-neighbor baselines on prediction tasks that are guaranteed to not be in their training set.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Machine learning GNN Graph Learning
Scores: [ 8 6 6 ]
This paper introduces a framework for formally establishing a connection between a portion of an algebraic language and a Graph Neural Network (GNN). The framework leverages Context-Free Grammars (CFG) to organize algebraic operations into generative rules that can be translated into a GNN layer model. As CFGs derived directly from a language tend to contain redundancies in their rules and variables, we present a grammar reduction scheme. By applying this strategy, we define a CFG that conforms to the third-order Weisfeiler-Lehman (3-WL) test using the matricial language MATLANG. From this 3-WL CFG, we derive a GNN model, named G$^2$N$^2$, which is provably 3-WL compliant. Through various experiments, we demonstrate the superior efficiency of G$^2$N$^2$ compared to other 3-WL GNNs across numerous downstream tasks. Specifically, one experiment highlights the benefits of grammar reduction within our framework.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Sparse Mixture-of-Experts Efficiency Merging Compression
Scores: [ 8 8 3 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Defense; DNN Transferability; Neural Architecture Search
Scores: [ 3 8 6 ]
Deep neural network (DNN) models, despite their impressive performance, are vulnerable to exploitation by attackers who attempt to adapt them to other tasks for their own benefit. Current defense strategies mainly address this vulnerability at the model parameter level, leaving the potential of architectural-level defense largely unexplored. This paper, for the first time, addresses the issue of model protection by reducing transferability at the architecture level. Specially, we present a novel neural architecture search (NAS)-enabled algorithm that employs zero-cost proxies and evolutionary search, to design model architectures with low transferability. Our method, namely ArchLock, aims to achieve high performance on the source task, while degrading the performance on target tasks, i.e., locking the transferability of a DNN model.To achieve efficient cross-task search without having access to the training data owned by the attackers, we utilize zero-cost proxies to speed up architecture evaluation and simulate potential target task embeddings to assist cross-task search with a binary performance predictor. Extensive experiments on NAS-Bench-201 and TransNAS-Bench-101 demonstrate that ArchLock reduces transferability by up to 30% and 50%, respectively, with negligible performance degradation on source tasks (<2%).
Primary Area: general machine learning (i.e., none of the above)
Keywords: Neural Architecture Search Diffusion Model Meta Learning
Scores: [ 6 6 6 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Machine Learning Transformers Large Language Models
Scores: [ 6 5 6 5 8 ]
Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and parameter scaling. Additionally, we discuss the challenges associated with length generalization. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction loss for rapidly eliciting arithmetic capabilities.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Rejection abstention loss function consistency learning theory decontextualization natural language processing
Scores: [ 6 6 6 ]
We study the problem of classification with a reject option for a fixed predictor, crucial to natural language processing. We introduce a new problem formulation for this scenario, and an algorithm minimizing a new surrogate loss function. We provide a complete theoretical analysis of the surrogate loss function with a strong H-consistency guarantee. For evaluation, we choose the \textit{decontextualization} task, and provide a manually-labelled dataset of 2,000 examples. Our algorithm significantly outperforms the baselines considered, with a ∼25% improvement in coverage when halving the error rate, which is only ∼3% away from the theoretical limit.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Out-of-distribution generalization
Scores: [ 8 8 5 ]
Generalization to out-of-distribution (OOD) data is a critical challenge in machine learning. Ensemble-based methods, like weight space ensembles that interpolate model parameters, have been shown to achieve superior OOD performance. However, the underlying mechanism for their effectiveness remains unclear. In this study, we closely examine WiSE-FT, a popular weight space ensemble method that interpolates between a pre-trained and a fine-tuned model. We observe an unexpected FalseFalseTrue" phenomenon, in which WiSE-FT successfully corrects many cases where each individual model makes incorrect predictions, which contributes significantly to its OOD effectiveness. To gain further insights, we conduct theoretical analysis in a multi-class setting with a large number of spurious features. Our analysis predicts the above phenomenon and it further shows that ensemble-based models reduce prediction errors in the OOD settings by utilizing a more diverse set of spurious features. Contrary to the conventional wisdom that focuses on learning invariant features for better OOD performance, our findings suggest that incorporating a large number of diverse spurious features weakens their individual contributions, leading to improved overall OOD generalization performance. Empirically we demonstrate the effectiveness of utilizing diverse spurious features on a MultiColorMNIST dataset, and our experimental results are consistent with the theoretical analysis. Building upon the new theoretical insights into the efficacy of ensemble methods, we further identify an issue of WiSE-FT caused by the overconfidence of fine-tuned models in OOD situations. This overconfidence magnifies the fine-tuned model's incorrect prediction, leading to deteriorated OOD ensemble performance. To remedy this problem, we propose a novel method called BAlaNced averaGing (BANG) to mitigate the overconfidence problem, which significantly enhances the OOD performance of WiSE-FT.
Primary Area: general machine learning (i.e., none of the above)
Keywords: game theory stochastic optimization nash equilbrium normal-form game x-armed bandits
Scores: [ 8 8 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Time series analysis Time series forecasting Complex-valued neural network
Scores: [ 8 8 8 8 ]
In this paper, we introduce FITS, a lightweight yet powerful model for time series analysis. Unlike existing models that directly process raw time-domain data, FITS operates on the principle that time series can be manipulated through interpolation in the complex frequency domain, achieving performance comparable to state-of-the-art models for time series forecasting and anomaly detection tasks. Notably, FITS accomplishes this with a svelte profile of just about 10k parameters, making it ideally suited for edge devices and paving the way for a wide range of applications. The code is available for review at: \url{https://anonymous.4open.science/r/FITS}.
Primary Area: general machine learning (i.e., none of the above)
Keywords: neural networks learning dynamics permutation symmetry loss landscape
Scores: [ 6 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: language model natural language processing inductive reasoning
Scores: [ 8 8 8 8 ]
The ability to derive the underlying principles from a handful of observations and then generalize to novel situations---known as inductive reasoning---is central to human intelligence. Prior work suggests that language models (LMs) often fall short on inductive reasoning, despite achieving impressive success on research benchmarks. In this work, we conduct a systematic study of the inductive reasoning capabilities of LMs through iterative hypothesis refinement, a technique that more closely mirrors the human inductive process than standard input-output prompting. Iterative hypothesis refinement employs a three-step process: proposing, selecting, and refining hypotheses in the form of textual rules. By examining the intermediate rules, we observe that LMs are phenomenal hypothesis proposers (i.e., generating candidate rules), and when coupled with a (task-specific) symbolic interpreter that is able to systematically filter the proposed set of rules, this hybrid approach achieves strong results across inductive reasoning benchmarks that require inducing causal relations, language-like instructions, and symbolic concepts. However, they also behave as puzzling inductive reasoners, showing notable performance gaps in rule induction (i.e., identifying plausible rules) and rule application (i.e., applying proposed rules to instances), suggesting that LMs are proposing hypotheses without being able to actually apply the rules. Through extensive empirical and human analyses, we further reveal several discrepancies between the inductive reasoning processes of LMs and humans, shedding light on both the potentials and limitations of using LMs in inductive reasoning tasks.
Primary Area: general machine learning (i.e., none of the above)
Keywords: In-context learning Large language models Repetition
Scores: [ 6 5 5 6 ]
This paper explores the elusive mechanism underpinning in-context learning in Large Language Models (LLMs). Our work provides a novel perspective by examining in-context learning via the lens of surface repetitions. We quantitatively investigate the role of surface features in text generation, and empirically establish the existence of token co-occurrence reinforcement, a principle that strengthens the relationship between two tokens based on their contextual co-occurrences. By investigating the dual impacts of these features, our research illuminates the internal workings of in-context learning and expounds on the reasons for its failures. This paper provides an essential contribution to the understanding of in-context learning and its potential limitations, providing a fresh perspective on this exciting capability.
Primary Area: general machine learning (i.e., none of the above)
Keywords: loss reweighting epistemic uncertainty bi-level optimization model calibration bayesian neural networks
Scores: [ 6 6 8 ]
Predictive uncertainty--a model’s self-awareness regarding its accuracy on an input--is key for both building robust models via training interventions and for test-time applications such as selective classification. We propose a novel instance-conditional reweighting approach that captures predictive uncertainty using an auxiliary network, and unifies these train- and test-time applications. The auxiliary network is trained using a meta-objective in a bilevel optimization framework. A key contribution of our proposal is the meta-objective of minimizing dropout variance, an approximation of Bayesian predictive uncertainty, We show in controlled experiments that we effectively capture diverse specific notions of uncertainty through this meta-objective, while previous approaches only capture certain aspects. These results translate to significant gains in real-world settings–selective classification, label noise, domain adaptation, calibration–and across datasets–Imagenet, Cifar100, diabetic retinopathy, Camelyon, WILDs, Imagenet-C,-A,-R, Clothing-1.6M, etc. For Diabetic Retinopathy, we see upto 3.4%/3.3% accuracy & AUC gains over SOTA in selective classification. We also improve upon large-scale pretrained models such as PLEX.
Primary Area: general machine learning (i.e., none of the above)
Keywords: In-context Learning; Out-of-distribution Generalization; Reliability
Scores: [ 5 5 5 ]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The recent progress in large-scale generative models has further expanded their use in real-world language applications. However, the critical challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. While previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-specific fine-tuned Language Models (SLMs) to improve LLMs' in-context learning during the inference stage. Our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. Using our proposed plug-in method, enhanced versions of Llama 2 and ChatGPT surpass their original versions regarding generalizability and factuality. We offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and LLM outputs across 9 distinct tasks. Our empirical analysis sheds light on the advantages of incorporating discriminative models into LLMs and highlights the potential of our methodology in fostering more reliable LLMs.
Primary Area: general machine learning (i.e., none of the above)
Keywords: in-context learning mechanistic interpretability language models induction heads
Scores: [ 8 8 10 10 ]
Transformer models exhibit in-context learning: the ability to accurately predict the response to a novel query based on illustrative examples in the input sequence, which contrasts with traditional in-weights learning of query-output relationships. What aspects of the training data distribution and architecture favor in-context vs in-weights learning? Recent work has shown that specific distributional properties inherent in language, such as burstiness, large dictionaries and skewed rank-frequency distributions, control the trade-off or simultaneous appearance of these two forms of learning. We first show that these results are recapitulated in a minimal attention-only network trained on a simplified dataset. In-context learning (ICL) is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning. By identifying progress measures that precede in-context learning and targeted experiments, we construct a two-parameter model of an induction head which emulates the full data distributional dependencies displayed by the attention-based network. A phenomenological model of induction head formation traces its abrupt emergence to the sequential learning of three nested logits enabled by an intrinsic curriculum. We propose that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL, which is implemented by nested nonlinearities sequentially learned during training.
Primary Area: general machine learning (i.e., none of the above)
Keywords: time series sparse system identification long term prediction
Scores: [ 8 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: generalization prompt engineering PAC-Bayes foundation models
Scores: [ 6 8 8 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large Language Models Hallucination Detection LogDet of Covariance Matrix Eigenvalues Feature Clipping
Scores: [ 8 6 6 6 ]
Knowledge hallucination have raised widespread concerns for the security and reliability of deployed LLMs. Previous efforts in detecting hallucinations have been employed at logit-level uncertainty estimation or language-level self-consistency evaluation, where the semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' \textbf{IN}ternal \textbf{S}tates for halluc\textbf{I}nation \textbf{DE}tection (\textbf{INSIDE}). In particular, a simple yet effective \textbf{EigenScore} metric is proposed to better evaluate responses' self-consistency, which exploits the eigenvalues of responses' covariance matrix to measure the semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Deep Learning
Scores: [ 6 8 6 8 ]
Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those from classification, generative models, and representation learning. The standard approach to data augmentation combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along key semantic axes present in the data. Current augmentations cannot alter the high-level semantic attributes, such as animal species present in a scene, to enhance the diversity of data. We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Graph Neural Networks Uncertainity Quantification Calibration
Scores: [ 6 5 6 ]
While graph neural networks (GNNs) are widely used for node and graph representation learning tasks, the reliability of GNN uncertainty estimates under distribution shifts remains relatively under-explored. Indeed, while \textit{post-hoc} calibration strategies can be used to improve in-distribution calibration, they need not also improve calibration under distribution shift. However, techniques which produce GNNs with better \textit{intrinsic} uncertainty estimates are particularly valuable, as whey can always be combined with post-hoc strategies later. Therefore, in this work, we propose G-$\Delta$UQ, a novel training framework designed to improve intrinsic GNN uncertainty estimates. Our framework adapts the principle of stochastic data centering to graph data through novel graph anchoring strategies, and is able to support partially stochastic GNNs. While, the prevalent wisdom is that fully stochastic networks are necessary to obtain reliable estimates, we find that the structure induced by our anchoring strategies renders this unnecessary and allows us to support \gduq~ on pretrained models. Indeed, through extensive evaluation under covariate, concept and graph size shifts, we show that G-$\Delta$UQ leads to better calibrated GNNs for node and graph classification. Further, it also improves performance on other uncertainty-based tasks like out-of-distribution detection and generalization gap estimation. Overall, our work provides insights into uncertainty estimation for GNNs, and demonstrates the utility of G-$\Delta$UQ in obtaining reliable estimates.
Primary Area: general machine learning (i.e., none of the above)
Keywords: large language models knowledge graphs reasoning
Scores: [ 8 6 8 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: large language model evolutionary algorithm prompt engineering natural language processing
Scores: [ 6 6 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Unstructured pruning Sparse neural networks Lottery ticket hypothesis Weight averaging Bayesian neural networks
Scores: [ 6 6 6 8 6 ]
Given the ever-increasing size of modern neural networks, the significance of sparse architectures has surged due to their accelerated inference speeds and minimal memory demands. When it comes to global pruning techniques, Iterative Magnitude Pruning (IMP) still stands as a state-of-the-art algorithm despite its simple nature, particularly in extremely sparse regimes. In light of the recent finding that the two successive matching IMP solutions are linearly connected without a loss barrier, we propose Sparse Weight Averaging with Multiple Particles (SWAMP), a straightforward modification of IMP that achieves performance comparable to an ensemble of two IMP solutions. For every iteration, we concurrently train multiple sparse models, referred to as particles, using different batch orders yet the same matching ticket, and then weight average such models to produce a single mask. We demonstrate that our method consistently outperforms existing baselines across different sparsities through extensive experiments on various neural network structures and data.
Primary Area: general machine learning (i.e., none of the above)
Keywords: deep learning sparsity disparate impact constrained optimization pruning fairness
Scores: [ 6 8 6 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: calibration reliability ECE theory
Scores: [ 6 6 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: multi-label classification complex performance metrics macro-at-k extreme classification
Scores: [ 8 8 6 8 ]
We consider the optimization of complex performance metrics in multi-label classification under the population utility framework. We mainly focus on metrics linearly decomposable into a sum of binary classification utilities applied separately to each label with an additional requirement of exactly k labels predicted for each instance. These macro-at-k'' metrics possess desired properties for extreme classification problems with long tail labels. Unfortunately, the at-k constraint couples the otherwise independent binary classification tasks, leading to a much more challenging optimization problem than standard macro-averages. We provide a statistical framework to study this problem, prove the existence and the form of the optimal classifier, and propose a statistically consistent and practical learning algorithm based on the Frank-Wolfe method. Interestingly, our main results concern even more general metrics being non-linear functions of label-wise confusion matrices. Empirical results provide evidence for the competitive performance of the proposed approach.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Tabular Data; Architectures; Ensembles; Decision Trees; Gradient Descent
Scores: [ 6 8 6 6 ]
Despite the success of deep learning for text and image data, tree-based ensemble models are still state-of-the-art for machine learning with heterogeneous tabular data. However, there is a significant need for tabular-specific gradient-based methods due to their high flexibility. In this paper, we propose GRANDE, $\text{GRA}die\text{N}$t-Based $\text{D}$ecision Tree $\text{E}$nsembles, a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent. GRANDE is based on a dense representation of tree ensembles, which affords to use backpropagation with a straight-through operator to jointly optimize all model parameters. Our method combines axis-aligned splits, which is a useful inductive bias for tabular data, with the flexibility of gradient-based optimization. Furthermore, we introduce an advanced instance-wise weighting that facilitates learning representations for both, simple and complex relations, within a single model. We conducted an extensive evaluation on a predefined benchmark with 19 classification datasets and demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Sliced Wasserstein Monte Carlo Methods Point-Cloud Quasi-Monte Carlo Optimal Transport
Scores: [ 8 8 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: factuality hallucination language model dpo
Scores: [ 6 6 5 6 ]
The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. However, language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations', which can harmfully perpetuate myths and misconceptions. Further, manual fact-checking of model responses is a time-consuming process, making human factuality labels expensive to acquire. In this work, we leverage two key recent innovations in NLP to fine-tune language models to be more factual without human labeling, targeting more open-ended generation settings than past work. First, several recent works have proposed methods for scoring the factuality of open-ended text derived from consistency with an external knowledge base or simply a large model's confidence scores. Second, the Direct Preference Optimization algorithm enables straightforward fine-tuning of language models on objectives other than supervised imitation, using a preference ranking over possible model responses. We show that learning from preference rankings generated by either automated criterion significantly improves the factuality of Llama-2 on held-out topics (percent of generated claims that are correct) compared with existing RLHF procedures or decoding strategies targeted at factuality, showing over 50% and 20-30% error reduction for biographies and medical questions respectively.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Attention Retrieval Tree
Scores: [ 5 5 8 8 ]
Cross Attention is a popular method for retrieving information from a set of context tokens for making predictions. At inference time, for each prediction, Cross Attention scans the full set of O(N) tokens. In practice, however, often only a small subset of tokens are required for good performance. Methods such as Perceiver IO are cheap at inference as they distill the information to a smaller-sized set of latent tokens L<N on which cross attention is then applied, resulting in only O(L) complexity. However, in practice, as the number of input tokens and the amount of information to distill increases, the number of latent tokens needed also increases significantly. In this work, we propose Tree Cross Attention (TCA) - a module based on Cross Attention that only retrieves information from a logarithmic O(log(N)) number of tokens for performing inference. TCA organizes the data in a tree structure and performs a tree search at inference time to retrieve the relevant tokens for prediction. Leveraging TCA, we introduce ReTreever, a flexible architecture for token-efficient inference. We show empirically that Tree Cross Attention (TCA) performs comparable to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient. Furthermore, we compare ReTreever against Perceiver IO, showing significant gains while using the same number of tokens for inference.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Dataset Distillation Reconstruction Attacks Neural Tangent Kernel
Scores: [ 6 8 6 ]
Modern deep learning requires large volumes of data, which could contain sensitive or private information that cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effective regime which datapoints are susceptible to reconstruction. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover the \emph{entire training set} in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. Next, we study the nature of easily-reconstructed images. We show that both theoretically and empirically, reconstructed images tend to outliers'' in the dataset, and that these reconstruction attacks can be used for \textit{dataset distillation}, that is, we can retrain on reconstructed images and obtain high predictive accuracy.
Primary Area: general machine learning (i.e., none of the above)
Keywords: OOD Generalization Hypersphere
Scores: [ 3 6 6 ]
Out-of-distribution (OOD) generalization is critical for machine learning models deployed in the real world. However, achieving this can be fundamentally challenging, as it requires the ability to learn invariant features across different domains or environments. In this paper, we propose a novel framework for OOD generalization, which provably learns domain-invariant representations in a hyperspherical space. In particular, our hyperspherical learning algorithm is guided by intra-class variation and inter-class separation principles---ensuring that features from the same class (across different training domains) are closely aligned with their class prototypes, while different class prototypes are maximally separated. We also provide theoretical justifications on how our prototypical learning objective improves the OOD generalization bound. Through extensive experiments on challenging OOD benchmarks, we demonstrate that our approach outperforms competitive baselines and achieves superior performance.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Personalized federated learning Data heterogeneity Long-tailed Learning
Scores: [ 5 6 8 5 ]
Federated Long-Tailed Learning (Fed-LT), a paradigm wherein data collected from decentralized local clients manifests a globally prevalent long-tailed distribution, has garnered considerable attention in recent times. In the context of Fed-LT, existing works have predominantly centered on addressing the data imbalance issue to enhance the efficacy of the generic global model while neglecting the performance at the local level. In contrast, conventional Personalized Federated Learning (pFL) techniques are primarily devised to optimize personalized local models under the presumption of a balanced global data distribution. This paper introduces an approach termed Federated Local and Generic Model Training in Fed-LT (FedLoGe), which enhances both local and generic model performance through the integration of representation learning and classifier alignment within a neural collapse framework. Our investigation reveals the feasibility of employing a shared backbone as a foundational framework for capturing overarching global trends, while concurrently employing individualized classifiers to encapsulate distinct refinements stemming from each client’s local features. Building upon this discovery, we establish the Static Sparse Equiangular Tight Frame Classifier (SSE-C), inspired by neural collapse principles that naturally prune extraneous noisy features and foster the acquisition of potent data representations. Furthermore, leveraging insights from imbalance neural collapse's classifier norm patterns, we develop Global and Local Adaptive Feature Realignment (GLA-FR) via an auxiliary global classifier and personalized Euclidean norm transfer to align global features with client preferences. Extensive experimental results on CIFAR-10/100-LT, ImageNet, and iNaturalist demonstrate the advantage of our method over state-of-the-art pFL and Fed-LT approaches.
Primary Area: general machine learning (i.e., none of the above)
Keywords: recurrent neural networks real-time recurrent learning online recurrent learning reinforcement learning actor-critic policy gradients
Scores: [ 6 6 6 8 ]
Real-time recurrent learning (RTRL) for sequence-processing recurrent neural networks (RNNs) offers certain conceptual advantages over backpropagation through time (BPTT). RTRL requires neither caching past activations nor truncating context, and enables online learning. However, RTRL's time and space complexity make it impractical. To overcome this problem, most recent work on RTRL focuses on approximation theories, while experiments are often limited to diagnostic settings. Here we explore the practical promise of RTRL in more realistic settings. We study actor-critic methods that combine RTRL and policy gradients, and test them in several subsets of DMLab-30, ProcGen, and Atari-2600 environments. On DMLab memory tasks, our system trained on fewer than 1.2B environmental frames is competitive with or outperforms well-known IMPALA and R2D2 baselines trained on 10B frames. To scale to such challenging tasks, we focus on certain well-known neural architectures with element-wise recurrence, allowing for tractable RTRL without approximation. Importantly, we also discuss rarely addressed limitations of RTRL in real-world applications, such as its complexity in the multi-layer case.
Primary Area: general machine learning (i.e., none of the above)
Keywords: symbolic regression dynamical systems differential equations transformer
Scores: [ 8 3 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Time Series Embedding LLM
Scores: [ 6 8 5 5 ]
This work summarizes two ways to accomplish Time-Series (TS) tasks in today's Large Language Model (LLM) context: LLM-for-TS (model-centric) designs and trains a fundamental large model, or fine-tunes a pre-trained LLM for TS data; TS-for-LLM (data-centric) converts TS into a model-friendly representation to enable the pre-trained LLM to handle TS data. Given the lack of data, limited resources, semantic context requirements, and so on, this work focuses on TS-for-LLM, where we aim to activate LLM's ability for TS data by designing a TS embedding method suitable for LLM. The proposed method is named TEST. It first tokenizes TS, builds an encoder to embed TS via instance-wise, feature-wise, and text-prototype-aligned contrast, where the TS embedding space is aligned to LLM’s embedding layer space, then creates soft prompts to make LLM more open to that embeddings, and finally implements TS tasks using the frozen LLM. We also demonstrate the feasibility of TS-for-LLM through theory and experiments. Experiments are carried out on TS classification, forecasting, and representation tasks using eight frozen LLMs with various structures and sizes. The results show that the pre-trained LLM with TEST strategy can achieve better or comparable performance than today's SOTA TS models, and offers benefits for few-shot and generalization. By treating LLM as the pattern machine, TEST can endow LLM's ability to process TS data without compromising language ability. We hope that this study will serve as a foundation for future work to support TS+LLM progress.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Episodic Memory Spatial Inference Prediction Generation Reinforcement Learning
Scores: [ 6 8 6 8 ]
Episodic memory plays a crucial role in various cognitive processes, such as the ability to mentally recall past events. While cognitive science emphasizes the significance of spatial context in the formation and retrieval of episodic memory, the current primary approach to implementing episodic memory in AI systems is through transformers that store temporally ordered experiences, which overlooks the spatial dimension. As a result, it is unclear how the underlying structure could be extended to incorporate the spatial axis beyond temporal order alone and thereby what benefits can be obtained. To address this, this paper explores the use of Spatially-Aware Transformer models that incorporate spatial information. These models enable the creation of place-centric episodic memory that considers both temporal and spatial dimensions. Adopting this approach, we demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks. Additionally, we propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning that aims to optimize efficiency of memory utilization. Our experiments demonstrate the advantages of our proposed model in various environments and across multiple downstream tasks, including prediction, generation, reasoning, and reinforcement learning. The source code for our models and experiments will be available at \href{https://github.com/spatially_aware_transformer}{https://github.com/spatially_aware_transformer}.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Post-Training Pruning Combinatorial Optimization Large Language Models Inference Acceleration
Scores: [ 8 6 6 6 ]
With the rapid growth of large language models (LLMs), there is increasing demand for memory and computation for LLMs. Recent efforts on post-training pruning of LLMs aim to reduce the model size and computation, yet the performance is still sub-optimal. In this paper, we present a plug-and-play solution for post-training pruning of LLMs.The proposed solution has two innovative components: 1) Relative Importance and Activations (RIA), a new pruning metric that jointly considers the weight and activations efficiently on LLMs; and 2) Channel Permutation, a new approach to maximally preserve important weights under N:M sparsity.The proposed two components can be readily combined to further enhance the N:M structuredly pruned LLMs.Our empirical experiments show that RIA alone can already surpass all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA ranging from 7B to 65B. Furthermore, N:M structured pruning with channel permutation can even outperform the original LLaMA2 70B on zero-shot tasks, together with practical speed-up on specific hardware.
Primary Area: general machine learning (i.e., none of the above)
Keywords: geospatial knowledge extraction large language models pre-trained language models natural language processing
Scores: [ 6 5 8 6 3 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: active learning uncertainty closeness disagree metric diversity
Scores: [ 6 6 8 10 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Sliced Wasserstein Monte Carlo Methods Point-Cloud Generative Models Optimal Transport
Scores: [ 6 6 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: in-context learning large language models active learning
Scores: [ 8 6 3 5 8 ]
In-context learning is a promising paradigm that utilizes in-context examples as prompts for the predictions of large language models. These prompts are crucial for achieving strong performance. However, since the prompts need to be sampled from a large volume of annotated examples, finding the right prompt may result in high annotation costs. To address this challenge, this paper introduces an influence-driven selective annotation method that aims to minimize annotation costs while improving the quality of in-context examples. The essence of our method is to select a pivotal subset from a large-scale unlabeled data pool to annotate for the subsequent sampling of prompts. Specifically, a directed graph is first constructed to represent unlabeled data. Afterward, the influence of candidate unlabeled subsets is quantified with a diffusion process. A simple yet effective greedy algorithm for unlabeled data selection is lastly introduced. It iteratively selects the data if it provides a maximum marginal gain with respect to quantified influence. Compared with previous efforts on selective annotations, our influence-driven method works in an end-to-end manner, avoids an intractable explicit balance between data diversity and representativeness, and enjoys theoretical support. Experiments confirm the superiority of the proposed method on various benchmarks, achieving better performance under lower time consumption during subset selection.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Fast Adversarial Training Catastrophic Overfitting Local Linearity
Scores: [ 5 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Long-tailed learning data augmentation imbalance fairness
Scores: [ 8 6 5 8 ]
Real-world tasks are universally associated with training samples that exhibit a long-tailed class distribution, and traditional deep learning models are not suitable for fitting this distribution, thus resulting in a biased trained model. To surmount this dilemma, massive deep long-tailed learning studies have been proposed to achieve inter-class fairness models by designing sophisticated sampling strategies or improving existing model structures and loss functions. Habitually, these studies tend to apply data augmentation strategies to improve the generalization performance of their models. However, this augmentation strategy applied to balanced distributions may not be the best option for long-tailed distributions. For a profound understanding of data augmentation, we first theoretically analyze the gains of traditional augmentation strategies in long-tailed learning, and observe that augmentation methods cause the long-tailed distribution to be imbalanced again, resulting in an intertwined imbalance: inherent data-wise imbalance and extrinsic augmentation-wise imbalance, i.e., two 'birds' co-exist in long-tailed learning. Motivated by this observation, we propose an adaptive Dynamic Optional Data Augmentation (DODA) to address this intertwined imbalance, i.e., one 'stone' simultaneously 'kills' two 'birds', which allows each class to choose appropriate augmentation methods by maintaining a corresponding augmentation probability distribution for each class during training. Extensive experiments across mainstream long-tailed recognition benchmarks (e.g., CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018) prove the effectiveness and flexibility of the DODA in overcoming the intertwined imbalance. The code is available inhttps://anonymous.4open.science/r/Code-for-DODA-FE12.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Machine Translation Evaluation
Scores: [ 6 8 8 5 ]
Traditionally, Machine Translation (MT) Evaluation has been treated as a regression problem—producing an absolute translation-quality score. However, this approach has two limitations: i) the scores lack interpretability and human annotators struggle with giving consistent scores; ii) most scoring methods are based on(reference, translation) pairs, limiting their applicability in real-world scenarioswhere references are absent. In practice, we often care about whether a new MTsystem is better or worse than some competitors. In addition, reference-free MTevaluation is increasingly practical and necessary. However, these two practicalconsiderations have yet to be jointly explored. In this work, we formulate thereference-free MT evaluation into a pairwise ranking problem. Given the sourcesentence and a pair of translations, our system predicts which translation is better. In addition to proposing this new formulation, we further show that this newparadigm can demonstrate superior performance by merely using indirect supervision from natural language inference and weak supervision from our syntheticdata. In the context of reference-free evaluation, our system, trained without anyhuman annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21. On a more challengingbenchmark ACES, which contains fine-grained evaluation criteria such as addition, omission, and mistranslation errors our system marks state-of-the-art againstreference-free as well as reference-based baselines.
Primary Area: general machine learning (i.e., none of the above)
Keywords: in-context learning neural architectures
Scores: [ 5 6 5 6 ]
What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps towards answering this question. In particular, we evaluate fifteen model architectures across a suite of synthetic in-context learning tasks. The selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, and state-space models. We discover that all considered architectures can perform in-context learning under certain conditions. However, contemporary architectures are found to be the best performing, especially as task complexity grows. Additionally, our follow-up experiments delve into various factors that influence in-context learning. We observe varied sensitivities among architectures with respect to hyperparameter settings. Our study of training dynamics reveals that certain architectures exhibit a smooth, progressive learning trajectory, while others demonstrate periods of stagnation followed by abrupt mastery of the task. Finally, and somewhat surprisingly, we find that several state-space model variants are more robust in-context learners than transformers; since state-space models have constant-sized memory footprints at inference time, this result opens the future possibility of scaling up in-context learning to vastly larger numbers of in-context examples.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Data Selection Large Language Model Optimal Transport Data-centric AI Data Efficiency
Scores: [ 8 5 3 6 ]
This work focuses on leveraging and selecting from vast, unlabeled, open data to \emph{pre-fine-tune} a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced \footnote{Code repository: \url{https://anonymous.4open.science/r/DV4LLM-D761/}}. While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
Primary Area: general machine learning (i.e., none of the above)
Keywords: sparsity pruning natural robustness flatness sharpness generalization deep learning out of distribution
Scores: [ 6 6 6 6 ]
Robustness and compactness are two essential attributes of deep learning models that are deployed in the real world. The goals of robustness and compactness may seem to be at odds, since robustness requires generalization across domains, while the process of compression exploits specificity in one domain. We introduce \textit{Adaptive Sharpness-Aware Pruning (AdaSAP)}, which unifies these goals through the lens of network sharpness. The AdaSAP method produces sparse networks that are robust to input variations which are \textit{unseen at training time}. We achieve this by strategically incorporating weight perturbations in order to optimize the loss landscape. This allows the model to be both primed for pruning and regularized for improved robustness. AdaSAP improves the robust accuracy of pruned models on image classification by up to +6% on ImageNet C and +4% on ImageNet V2, and on object detection by +4% on a corrupted Pascal VOC dataset, over a wide range of compression ratios, pruning criteria, and network architectures, outperforming recent pruning art by large margins.
Primary Area: general machine learning (i.e., none of the above)
Keywords: quantization sparsity large language models
Scores: [ 6 8 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: network pruning sparsity large language models network architectures outlier features
Scores: [ 6 6 5 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: web navigation large language models prompting
Scores: [ 8 6 8 ]
Building agents with large language models (LLMs) for computer control is a burgeoning research area, where the agent receives the states, e.g., texts or webpages, of the computer and generate actions to complete the complex tasks. Previous methods leverage in-context learning (ICL), using a few exemplars to prompt LLMs for computer control. However, their performance is hindered by several issues: i) The limited context length of LLMs and complex states restrict the number of exemplars, for a single webpage could occupy all the context. ii) The exemplars in current methods, such as high-level plans and multi-choice questions, cannot represent the full trajectories of the controlling process, leading to suboptimal performance in many-step tasks, e.g., more than 20 steps. iii) Existing methods require task-specific exemplars and neglect the reuse of experiences from similar tasks, leading to the poor generalizability to new tasks. To address these challenges, we introduce Synapse, an agent which incorporates trajectory-as-exemplar prompting with the associated memory for solving computer control tasks. Our contributions are three-fold. First, we propose the state abstraction method, which abstracts the task-relevant information from the raw state, thereby allowing more exemplars as the contexts of LLMs. Second, we propose the trajectory-as-exemplar (TaE) prompting method to prompt the LLM with the full trajectories of the task, including the abstracted states and user actions, significantly improving performance in many-step decision-making. Third, we introduce the associated TaE memory to store the trajectories and retrieve relevant trajectories as exemplars from the memory via similarity search, thus significantly improving the generalizability to novel tasks. We conduct the experiments on MiniWoB++ and the more realistic benchmark Mind2Web. In MiniWoB++, Synapse achieves 99.2% (10% relative improvement) average success rate across 64 tasks with demonstrations from merely 48 tasks. Notably, Synapse is the first prompting method to solve book-flight in MiniWoB++. Synapse also exhibits a 53% relative improvement in average step success rate over previous state-of-the-art prompting scheme in Mind2Web.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Combinatorial Optimization Boolean Satisfiability Problem Graph Generation Graph Matching
Scores: [ 5 5 8 ]
The Boolean satisfiability problem (SAT) stands as a canonical NP-complete combinatorial optimization (CO) problem, with wide impact on both theoretical and industrial scenarios. In particular, the scarcity of real-world SAT instances and their usefulness for tuning SAT solvers underscore the necessity for effective and efficient ways of hard instance generation, whereas existing methods either struggle to maintain plausible hardness or suffer from limited applicability. Different from the typical construction-based methods, this paper introduces an adaptive and efficient graph interpolation approach that in place modifies the raw structure of graph-represented SAT instance by replacing it with a counterpart from another instance. Specifically, our method involves a two-stage matching and mixing pipeline. The matching aims to find a correspondence map of literal nodes from two instance graphs via learned features from a matching network; while the mixing stage involves iteratively exchanging clause pairs with the highest correspondence scores until a specified replacement ratio is achieved. We further show that under our matching-mixing framework, moderate randomness can avoid hardness degradation of SAT instances by introducing Gumbel noise. Experimental results show the superiority of the proposed method with both resemblance in structure and hardness, as well as general applicability in an efficient way. Source code will be released.
Primary Area: general machine learning (i.e., none of the above)
Keywords: NeRF; NeRF Compression; 3D Data Compression
Scores: [ 6 8 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Audio-Visual Multi-Modal Time-Frequency-Domain Speech-Separation Model-Compression
Scores: [ 8 6 8 8 ]
Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the MACs. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Memory efficient training Activation-compressed training Average Quantization NLP Transformer
Scores: [ 6 6 6 6 ]
One of the key challenges in deep neural network training is the substantial amount of GPU memory required to store activations obtained in the forward pass. Various Activation-Compressed Training (ACT) schemes have been proposed to mitigate this issue; however, it is challenging to adopt those approaches in recent transformer-based large language models (LLMs), which experience significant performance drops when the activations are deeply compressed during training. In this paper, we introduce ALAM, a novel ACT framework that utilizes average quantization and a lightweight sensitivity calculation scheme, enabling large memory saving in LLMs while maintaining training performance. We first demonstrate that compressing activations into their group average values minimizes the gradient variance. Employing this property, we propose Average Quantization which provides high-quality deeply compressed activations with an effective precision of less than 1 bit and improved flexibility of precision allocation. In addition, we present a cost-effective yet accurate sensitivity calculation algorithm that solely relies on the L2 norm of parameter gradients, substantially reducing memory overhead due to sensitivity calculation. In experiments, the ALAM framework significantly reduces activation memory without compromising accuracy, achieving up to a 12.5$\times$ compression rate in LLMs.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Knowledge Distillation Bilevel Optimization Re-weighting
Scores: [ 8 8 5 5 ]
Knowledge distillation aims to train a compact student network using soft supervision from a larger teacher network and hard supervision from ground truths. However, determining an optimal knowledge fusion ratio that balances these supervisory signals remains challenging. Prior methods generally resort to a constant or heuristic-based fusion ratio, which often falls short of a proper balance. In this study, we introduce a novel adaptive method for learning a sample-wise knowledge fusion ratio, exploiting both the correctness of teacher and student, as well as how well the student mimics the teacher on each sample. Our method naturally leads to the \textit{intra-sample} trilateral geometric relations among the student prediction (S), teacher prediction (T), and ground truth (G). To counterbalance the impact of outliers, we further extend to the \textit{inter-sample} relations, incorporating the teacher's global average prediction (¯T) for samples within the same class. A simple neural network then learns the implicit mapping from the intra- and inter-sample relations to an adaptive, sample-wise knowledge fusion ratio in a bilevel-optimization manner. Our approach provides a simple, practical, and adaptable solution for knowledge distillation that can be employed across various architectures and model sizes. Extensive experiments demonstrate consistent improvements over other loss re-weighting methods on image classification, attack detection, and click-through rate prediction.
Primary Area: general machine learning (i.e., none of the above)
Keywords: 3D referring 3D visual grounding localization 3D
Scores: [ 6 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Spatial and temporal graph neural network bias missing values time series forecasting
Scores: [ 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Vision generalists Text-to-image models Multi-task learning
Scores: [ 3 8 6 5 ]
Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.
Primary Area: general machine learning (i.e., none of the above)
Keywords: The Expressive Power of Transformers with Chain of Thought
Scores: [ 6 8 8 8 ]
Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps make them recognize exactly the class of polynomial-time solvable problems---the first exact characterization of a type of transformers in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer’s chain of thought or scratchpad impacts its reasoning power.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated learning data heterogeneity
Scores: [ 6 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated Learning Parameter-efficient Fine-tuning Differential Privacy Large Language Model
Scores: [ 6 5 5 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Hebbian learning neuromorphic computing continual learning spiking neural networks
Scores: [ 8 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Combinatorial Optimization; Mixed Integer Programming; Presolving
Scores: [ 8 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Algorithm Design Diversity OOD Generalization Spurious Correlation Understanding Neural Networks
Scores: [ 5 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Fusion Transformers
Scores: [ 8 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Network Pruning Model Compression Optimal Transport Wasserstein Distance Deep Learning
Scores: [ 6 6 6 8 6 ]
This study unveils a cutting-edge technique for neural network pruning that judiciously addresses noisy gradients during the computation of the empirical Fisher Information Matrix (FIM). We introduce an entropic Wasserstein regression (EWR) formulation, capitalizing on the geometric attributes of the optimal transport (OT) problem. This is analytically showcased to excel in noise mitigation by adopting neighborhood interpolation across data points. The unique strength of the Wasserstein distance is its intrinsic ability to strike a balance between noise reduction and covariance information preservation. Extensive experiments performed on various networks show comparable performance of the proposed method with state-of-the-art (SoTA) network pruning algorithms. Our proposed method outperforms the SoTA when the network size or the target sparsity is large, the gain is even larger with the existence of noisy gradients, possibly from noisy data, analog memory, or adversarial attacks. Notably, our proposed method achieves a gain of 6% improvement in accuracy and 8% improvement in testing loss for MobileNetV1 with less than one-fourth of the network parameters remaining.
Primary Area: general machine learning (i.e., none of the above)
Keywords: multi-agent systems imitation learning credit assignment learning from human data
Scores: [ 6 8 8 8 5 ]
AI agents are commonly trained with large datasets of demonstrations of human behavior.However, not all behaviors are equally safe or desirable.Desired characteristics for an AI agent can be expressed by assigning desirability scores, which we assume are assigned to collective trajectories, but not to individual behaviors.For example, in a dataset of vehicle interactions, these scores might relate to the number of incidents that occurred. We first assess the effect of each individual agent's behavior on the collective desirability score, e.g., assessing how likely an agent is to cause incidents.This allows us to afterward only imitate agents with desired behavior, e.g., only imitating agents that are unlikely to cause incidents. To enable this, we propose the concept of an agent's \textit{Exchange Value}, which quantifies an individual agent's contribution to the collective desirability score. This is expressed as the expected change in desirability score when substituting the agent for a randomly selected agent.We propose additional methods for estimating Exchange Values from real-world datasets, enabling us to learn aligned imitation policies that outperform relevant baselines.
Primary Area: general machine learning (i.e., none of the above)
Keywords: large language model self-debugging
Scores: [ 6 6 6 6 ]
Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose self-debugging, which teaches a large language model to debug its predicted program. In particular, we demonstrate that self-debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on the code correctness or error messages, the model is able to identify its mistakes by leveraging code execution and explaining the generated code in natural language. Self-debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, self-debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest level by 9%. On TransCoder and MBPP where unit tests are available, self-debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, self-debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10$\times$ candidate programs.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federate Learning Learning rate Hyperparameter Hypergradient
Scores: [ 5 8 8 8 ]
The theoretical landscape of federated learning (FL) undergoes rapid evolution, but its practical application encounters a series of intricate challenges, and hyperparameter optimization is one of these critical challenges. Amongst the diverse adjustments in hyperparameters, the adaptation of the learning rate emerges as a crucial component, holding the promise of significantly enhancing the efficacy of FL systems. In response to this critical need, this paper presents FedHyper, a novel hypergradient-based learning rate adaptation algorithm specifically designed for FL. FedHyper serves as a universal learning rate scheduler that can adapt both global and local rates as the training progresses. In addition, FedHyper not only showcases unparalleled robustness to a spectrum of initial learning rate configurations but also significantly alleviates the necessity for laborious empirical learning rate adjustments. We provide a comprehensive theoretical analysis of FedHyper’s convergence rate and conduct extensive experiments on vision and language benchmark datasets. The results demonstrate that FEDHYPER consistently converges 1.1-3× faster than FedAvg and the competing baselines while achieving superior final accuracy. Moreover, FEDHYPER catalyzes a remarkable surge in accuracy, augmenting it by up to 15% compared to FedAvg under suboptimal initial learning rate settings.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Structure Learning Continuous Acyclic Optimization Directed Acyclic Graphs DAG
Scores: [ 6 5 6 6 5 ]
The structure learning problem consists of fitting data generated by a Directed Acyclic Graph (DAG) to correctly reconstruct its arcs. In this context, differentiable approaches constrain or regularize an optimization problem with a continuous relaxation of the acyclicity property. The computational cost of evaluating graph acyclicity is cubic on the number of nodes and significantly affects scalability. In this paper, we introduce COSMO, a constraint-free continuous optimization scheme for acyclic structure learning. At the core of our method lies a novel differentiable approximation of an orientation matrix parameterized by a single priority vector. Differently from previous works, our parameterization fits a smooth orientation matrix and the resulting acyclic adjacency matrix without evaluating acyclicity at any step. Despite this absence, we prove that COSMO always converges to an acyclic solution. In addition to being asymptotically faster, our empirical analysis highlights how COSMO performance on graph reconstruction compares favorably with competing structure learning methods.
Primary Area: general machine learning (i.e., none of the above)
Keywords: robust model selection multi-modal learning multi-step reasoning with LLM
Scores: [ 6 5 6 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: time series imputation; time series interpolation; information bottleneck
Scores: [ 8 6 6 6 ]
Time series imputation presents a significant challenge because it requires capturing the underlying temporal dynamics from partially observed time series data. Among the recent successes of imputation methods based on generative models, the information bottleneck (IB) framework offers a well-suited theoretical foundation for multiple imputations, allowing us to account for the uncertainty associated with the imputed values. However, directly applying the IB framework to time series data without considering their temporal context can lead to a substantial loss of temporal dependencies, which, in turn, can degrade the overall imputation performance. To address such a challenge, we propose a novel conditional information bottleneck (CIB) approach for time series imputation, which aims to mitigate the potentially negative consequences of the regularization constraint by focusing on reducing the redundant information conditioned on the temporal context. We provide a theoretical analysis of its effect by adapting variational decomposition. We use the resulting insight and propose a novel deep learning method that can approximately achieve the proposed CIB objective for time series imputation as a combination of evidence lower bound and novel temporal kernel-enhanced contrastive optimization. Our experiments, conducted on multiple real-world datasets, consistently demonstrate that our method significantly improves imputation performance (including both interpolation and extrapolation), and also enhances classification performance based on the imputed values.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Multivariate time series forecasting channel dependence lead-lag relationships
Scores: [ 6 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: prompt prefix LLM fine-tuning theory
Scores: [ 8 6 6 6 ]
Context-based fine-tuning methods like prompting, in-context learning, soft prompting (prompt tuning) and prefix-tuning have gained popularity as they often match the performance of full fine-tuning with a fraction of the parameters. Despite their empirical successes, there is little theoretical understanding of how these techniques influence the internal computation of the model and their expressiveness limitations. We show that despite the continuous embedding space being much more expressive than the discrete token space, soft-prompting and prefix-tuning are strictly less expressive than full fine-tuning. Concretely, context-based fine-tuning cannot change the relative attention pattern over the content and can only bias the outputs of an attention layer in a fixed direction. While this means that fine-tuning techniques such as prompting, in-context learning, soft prompting and prefix-tuning can successfully elicit or combine skills already present in the pretrained model, they cannot learn tasks requiring new attention patterns.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Position Embeddin Length Extrapolation Large Language Model Natural Language Processing
Scores: [ 8 6 10 ]
The extrapolation capability of Large Language Models (LLMs) based on Rotary Position Embedding \cite{su2021roformer} is currently a topic of considerable interest. The mainstream approach to addressing extrapolation with LLMs involves modifying RoPE by replacing 10000, the rotary base of θn=10000−2n/d in the original RoPE, with a larger value and providing longer fine-tuning text. In this work, we first observe that fine-tuning a RoPE-based LLM with either a smaller or larger base in pre-training context length could significantly enhance its extrapolation performance. After that, we propose \textbf{\textit{Scaling Laws of RoPE-based Extrapolation}}, a unified framework from the periodic perspective, to describe the relationship between the extrapolation performance and base value as well as tuning context length. In this process, we also explain the origin of the RoPE-based extrapolation issue by \textbf{\textit{critical dimension for extrapolation}}. Besides these observations and analyses, we achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B \citep{touvron2023llama2}.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large Language Models Self-Correction Reasoning
Scores: [ 8 8 6 5 ]
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance might even degrade post self-correction. Drawing from these insights, we offer suggestions for future research and practical applications in this field.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Vanishing Gradients Reinforcement Finetuning Supervised Finetuning Language Models
Scores: [ 6 6 5 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: leverage score sampling active learning polynomial regression differential equations pivotal sampling
Scores: [ 8 8 6 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Backdoor Defense Data Poisoning
Scores: [ 6 6 6 ]
Deep neural networks have been widely used in many critical applications, such as autonomous vehicles and medical diagnosis. However, their security is threatened by backdoor attack, which is achieved by adding artificial patterns to specific training data. Existing defense strategies primarily focus on using reverse engineering to reproduce the backdoor trigger generated by attackers and subsequently repair the DNN model by adding the trigger into inputs and fine-tuning the model with ground-truth labels. However, once the trigger generated by the attackers is complex and invisible, the defender can not successfully reproduce the trigger. Consequently, the DNN model will not be repaired since the trigger is not effectively removed.In this work, we propose Adversarial Feature Map Pruning for Backdoor~(FMP) to mitigate backdoor from the DNN. Different from existing defense strategies, which focus on reproducing backdoor triggers, FMP tries to prune the backdoor feature maps, which are trained to extract backdoor information from the inputs. After pruning these backdoor feature maps, FMP will fine-tune the model with a secure subset of training data. Our experiments demonstrate that, compared to existing defense strategies, FMP can effectively reduce the Attack Success Rate (ASR) even against the most complex and invisible attack triggers~(e.g., FMP decreases the ASR to 2.86% in CIFAR10, 19.2%-65.41% lower than previous arts). Second, unlike conventional defense methods that tend to exhibit low Robust Accuracy (i.e., the model's accuracy on the poisoned data), FMP achieves higher RA, indicating its superiority in maintaining model performance while mitigating the effects of backdoor attacks~(e.g., FMP obtains 87.40% RA in CIFAR10). Third, compared to existing feature map pruning techniques, FMP can cover more backdoor feature maps~(e.g., FMP removes 83.33% of backdoor feature maps from the model in the CIFAR10 & BadNet scenario).
Primary Area: general machine learning (i.e., none of the above)
Keywords: out-of-distribution generalization distribution shifts spurious correlation noise robustness
Scores: [ 5 6 3 5 ]
Despite the rapid development of machine learning algorithms for domain generalization (DG), there is no clear empirical evidence that the existing DG algorithms outperform the classic empirical risk minimization (ERM) across standard benchmarks. To better understand this phenomenon, we investigate whether there are benefits of DG algorithms over ERM through the lens of label noise.Specifically, our finite-sample analysis reveals that label noise exacerbates the effect of spurious correlations for ERM, undermining generalization. Conversely, we illustrate that DG algorithms exhibit implicit label-noise robustness during finite-sample training even when spurious correlation is present.Such desirable property helps mitigate spurious correlations and improve generalization in synthetic experiments. However, additional comprehensive experiments on real-world benchmark datasets indicate that label-noise robustness does not necessarily translate to better performance compared to ERM. We conjecture that the failure mode of ERM arising from spurious correlations may be less prevalent in practice.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large Language Model Operations Research
Scores: [ 6 8 8 6 ]
Large language models (LLMs) have emerged as powerful techniques for various NLP tasks, such as mathematical reasoning and plan generation. In this paper, we study automatic modeling and programming for complex operation research (OR) problems, so as to alleviate the heavy dependence on domain experts and benefit a spectrum of industry sectors. We present the first LLM-based solution, namely Chain-of-Experts (CoE), a novel multi-agent cooperative framework to enhance reasoning capabilities. Specifically, each agent is assigned a specific role and endowed with domain knowledge related to OR. We also introduce a conductor to orchestrate these agents via forward thought construction and backward reflection mechanism. Furthermore, we release a benchmark dataset (ComplexOR) of complex OR problems to facilitate OR research and community development. Experimental results show that CoE significantly outperforms the state-of-the-art LLM-based approaches both on LPWP and ComplexOR.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Neural Networks Semidefinite Programming Lipschitz Constant Estimation
Scores: [ 6 6 6 ]
Recently, semidefinite programming (SDP) techniques have shown great promise in providing accurate Lipschitz bounds for neural networks. Specifically, the LipSDP approach (Fazlyab et al., 2019) has received much attention and provides the least conservative Lipschitz upper bounds that can be computed with polynomial time guarantees. However, one main restriction of LipSDP is that its formulation requires the activation functions to be slope-restricted on [0,1], preventing its further use for more general activation functions such as GroupSort, MaxMin, and Householder. One can rewrite MaxMin activations for example as residual ReLU networks. However, a direct application of LipSDP to the resultant residual ReLU networks is conservative and even fails in recovering the well-known fact that the MaxMin activation is 1-Lipschitz. Our paper bridges this gap and extends LipSDP beyond slope-restricted activation functions. To this end, we provide novel quadratic constraints for GroupSort, MaxMin, and Householder activations via leveraging their underlying properties such as sum preservation. Our proposed analysis is general and provides a unified approach for estimating ℓ2 and ℓ∞ Lipschitz bounds for a rich class of neural network architectures, including non-residual and residual neural networks and implicit models, with GroupSort, MaxMin, and HouseHolder activations. Finally, we illustrate the utility of our approach with a variety of experiments and show that our proposed SDPs generate less conservative Lipschitz bounds in comparison to existing approaches.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Hyperparameter Transfer Neural Network Architecture Neural Network Initialization Learning Rate
Scores: [ 6 6 6 8 ]
Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Current works try to automatically optimize or design principles of hyperparameters, such that they can generalize to diverse unseen scenarios. However, most designs of principles or optimization methods are agnostic to the choice of network structures, and thus largely ignore the impact of neural architectures on hyperparameters. In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture, which includes the network depth, width, convolutional kernel size, and connectivity patterns. By pursuing every parameter to be maximally updated with the same mean squared change in pre-activations, we can generalize our initialization and learning rates across MLPs (multi-layer perception) and CNNs (convolutional neural network) with sophisticated graph topologies. We verify our principles with comprehensive experiments. More importantly, our strategy further sheds light on advancing current benchmarks for architecture design. A fair comparison of AutoML algorithms requires accurate network rankings. However, we demonstrate that network rankings can be easily changed by better training networks in benchmarks with our architecture-aware learning rates and initialization.
Primary Area: general machine learning (i.e., none of the above)
Keywords: set learning calibration overconfidence class imbalance long-tailed classification low data classification calibration safety
Scores: [ 8 6 3 8 ]
Model overconfidence and poor calibration are common in machine learning and difficult to account for when applying standard empirical risk minimization. In this work, we propose a novel method to alleviate these problems that we call odd-k-out learning (OKO), which minimizes the cross-entropy error for sets rather than for single examples. This naturally allows the model to capture correlations across data examples and achieves both better accuracy and calibration, especially in limited training data and class-imbalanced regimes. Perhaps surprisingly, OKO often yields better calibration even when training with hard labels and dropping any additional calibration parameter tuning, such as temperature scaling. We provide theoretical justification, establishing that OKO naturally yields better calibration, and provide extensive experimental analyses that corroborate our theoretical findings. We emphasize that OKO is a general framework that can be easily adapted to many settings and the trained model can be applied to single examples at inference time, without introducing significant run-time overhead or architecture changes.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Data Selection Interpretability Fairness Robustness
Scores: [ 6 8 6 6 6 ]
Classification models are ubiquitously deployed in society and necessitate high utility, fairness, and robustness performance. Current research efforts mainly focus on improving model architectures and learning algorithms on fixed datasets to achieve this goal. In contrast, in this paper, we address an orthogonal yet crucial problem: given a fixed convex learning model (or a convex surrogate for a non-convex model) and a function of interest, we assess what data benefits the model by interpreting the feature space, and then aim to improve performance as measured by this function. To this end, we propose the use of influence estimation models for interpreting the classifier's performance from the perspective of the data feature space. Additionally, we propose data selection approaches based on influence that enhance model utility, fairness, and robustness. Through extensive experiments on synthetic and real-world datasets, we validate and demonstrate the effectiveness of our approaches not only for conventional classification scenarios, but also under more challenging scenarios such as distribution shifts, fairness poisoning attacks, utility evasion attacks, online learning, and active learning.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Diffusion models speech generation text to speech autoregressive model
Scores: [ 5 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Tool Creation Tool Retrieval LLM Multimodality Reasoning
Scores: [ 8 6 6 ]
Large language models (LLMs) are often augmented with tools to solve complex tasks. By generating code snippets and executing them through task-specific Application Programming Interfaces (APIs), they can offload certain functions to dedicated external modules, such as image encoding and performing calculations. However, most existing approaches to augment LLMs with tools are constrainedby general-purpose APIs and lack the flexibility for tailoring them to specific tasks. In this work, we present CRAFT, a general tool creation and retrieval framework for LLMs. It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. For each task, we collect specific code solutions by promptingGPT-4 to solve the training examples. Following a validation step ensuring the correctness, these solutions are abstracted into code snippets to enhance reusability, and deduplicated for higher quality. At inference time, the language model retrieves snippets from the toolsets and then executes them or generates the output conditioning on the retrieved snippets. Our method is designed to be flexible andoffers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning. Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines. In addition, our in-depth analysis reveals that: (1) consistent performance improvement can be achieved byscaling up the number of tools and the capability of the backbone models; (2) each component of our approach contributes to the performance gains; (3) the created tools are well-structured and reliable with low complexity and atomicity.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Recommender System Causal Inference Bias Debias Balancing
Scores: [ 8 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Dataset Condensation Dataset Distillation Image Classification
Scores: [ 8 8 8 6 ]
While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 6.40% average accuracy gains on condensing CIFAR-10 to ten images per class.
Primary Area: general machine learning (i.e., none of the above)
Keywords: bayesian posterior bayesian neural networks dataset ensembles uncertainty quantification weight-space symmetries computer vision
Scores: [ 6 3 5 6 8 ]
The distribution of the weights of modern deep neural networks (DNNs) - crucial for uncertainty quantification and robustness - is an eminently complex object due to its extremely high dimensionality. This paper proposes one of the first large-scale explorations of the posterior distribution of deep Bayesian Neural Networks (BNNs), expanding our study to real-world vision tasks and architectures. Specifically, we investigate the optimal approach for approximating the posterior, analyze the connection between posterior quality and uncertainty quantification, delve into the impact of modes on the posterior, and explore methods for visualizing the posterior. Moreover, we uncover weight-space symmetries as a critical aspect for understanding the posterior. To this extent, we develop an in-depth assessment of the impact of both permutation and scaling symmetries that tend to obfuscate the Bayesian posterior. While the first type of transformation is known for duplicating modes, we explore the relationship between the latter and L2 regularization, challenging previous misconceptions. Finally, to help the community improve our understanding of the Bayesian posterior, we will release the first large-scale checkpoint dataset, including thousands of real-world models, along with our codes, after the anonymity period.
Primary Area: general machine learning (i.e., none of the above)
Keywords: pre-training fine-tuning decouple scale reward alignment
Scores: [ 8 6 6 6 ]
Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a pre-training stage that uses a very large, diverse dataset of text and a fine-tuning (sometimes, 'alignment') stage using more targeted examples of specific behaviors and/or human preferences. While it has been hypothesized that knowledge and skills come from pre-training, and fine-tuning mostly filters this knowledge and skillset, this intuition has not been rigorously tested. In this paper, we test this hypothesis with a novel methodology for scaling these two stages independently, essentially asking, What would happen if we combined the knowledge learned by a large model during pre-training with the knowledge learned by a small model during fine-tuning (or vice versa)? Using an RL-based framework derived from recent developments in learning from human preferences, we introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales. Our experiments with EFT show that scaling up fine-tuning tends to improve helpfulness, while scaling up pre-training tends to improve factuality. Further, we show that EFT enables test-time adjustment of competing behavioral factors like helpfulness and harmlessness without additional training. Finally, we find that a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling small fine-tuned models with large pre-trained models, essentially 'emulating' the result of fine-tuning the large pre-trained model. Up-scaling consistently improves helpfulness and factuality of widely used pre-trained models like Llama, Llama-2, and Falcon, without additional hyperparameters or training.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Code Generation Memory-augmented LLMs Large Language Models (LLMs) LLM coder agent
Scores: [ 8 8 6 6 8 ]
Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and logically consistent code. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long code generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation of new memories or (2) use very specialized memories that cannot adapt to other domains. This paper presents L2MAC, the first practical LLM-based stored-program automatic computer for long and consistent code generation. Its memory has two components: the instruction registry, which is populated with a prompt program to solve the user-given task, and a file store, which will contain the final and intermediate outputs. Each instruction is executed by a separate LLM instance, whose context is managed by a control unit capable of precise memory reading and writing to ensure effective interaction with the file store. These components enable L2MAC to generate virtually unbounded code structures, bypassing the constraints of the finite context window while producing code that fulfills complex user-specified requirements. We empirically show that L2MAC succeeds in generating large code bases for system design tasks where other coding methods fall short in implementing user requirements and provide insight into the reasons for this performance gap.
Primary Area: general machine learning (i.e., none of the above)
Keywords: self-attention locality sensitive hashing long-range context
Scores: [ 5 8 6 ]
We present an approximate attention mechanism named HyperAttention
to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, the quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small.HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. This development presents substantial implications for enabling LLMs to handle significantly larger contexts.
Primary Area: general machine learning (i.e., none of the above)
Keywords: active learning Lewis weight sampling machine learning deep learning
Scores: [ 5 5 6 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: parallel algorithm recurrent neural networks neural ordinary differential equations sequential models
Scores: [ 6 6 6 ]
Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherent sequential nature.For many years this bottleneck has persisted, as many thought sequential models could not be parallelized.We challenge this long-held belief with our parallel algorithm that accelerates GPU evaluation of sequential models by up to 3 orders of magnitude faster without compromising output accuracy.The algorithm does not need any special structure in the sequential models' architecture, making it applicable to a wide range of architectures.Using our method, training sequential models can be more than 10 times faster than the common sequential method without any meaningful difference in the training results.Leveraging this accelerated training, we discovered the efficacy of the Gated Recurrent Unit in a long time series classification problem with 17k time samples.By overcoming the training bottleneck, our work serves as the first step to unlock the potential of non-linear sequential models for long sequence problems.
Primary Area: general machine learning (i.e., none of the above)
Keywords: state-space models sequence models Long-Range Arena recurrent neural networks
Scores: [ 6 6 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Out-of-Domain Generalization; Domain Relations; Distribution Shift
Scores: [ 6 6 6 8 8 6 ]
Distribution shift presents a significant challenge in machine learning, where models often underperform during the test stage when faced with a different distribution than the one they were trained on. In this paper, we focus on domain shifts, which occur when the model is applied to new domains that are different from the ones it was trained on, and propose a new approach called DG. Unlike previous approaches that aim to learn a single model that is domain invariant, DG leverages domain similarities based on domain metadata to learn domain-specific models. Concretely, DG learns a set of training-domain-specific functions during the training stage and reweights them based on domain relations during the test stage. These domain relations can be directly obtained and learned from domain metadata. Under mild assumptions, we theoretically prove that using domain relations to reweight training-domain-specific functions achieves stronger out-of-domain generalization compared to the conventional averaging approach. Empirically, we evaluate the effectiveness of DG using both toy and real-world datasets for tasks such as temperature regression, land use classification, and molecule-protein binding affinity prediction. Our results show that DG consistently outperforms state-of-the-art methods.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Lossless Compression Autoregressive Model Acceleration Entropy Coding Autoencoder
Scores: [ 6 6 6 6 ]
Learned lossless data compression has garnered significant attention recently due to its superior compression ratios compared to traditional compressors. However, the computational efficiency of these models jeopardizes their practicality. This paper proposes a novel system for improving the compression ratio while maintaining computational efficiency for learned lossless data compression. Our approach incorporates two essential innovations. First, we propose the Finite-State AutoRegressive (FSAR) entropy coder, an efficient autoregressive Markov model based entropy coder that utilizes a lookup table to expedite autoregressive entropy coding. Next, we present a Straight-Through Hardmax Quantization (STHQ) scheme to enhance the optimization of discrete latent space. Our experiments show that the proposed lossless compression method could improve the compression ratio by up to 6% compared to the baseline, with negligible extra computational time. Our work provides valuable insights into enhancing the computational efficiency of learned lossless data compression, which can have practical applications in various fields. Code will be available publicly after the paper is accepted.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Low-quality image Visual recognition Classification Low-resolution Input variations
Scores: [ 6 6 6 ]
A standard practice in developing image recognition models is to train a model on a specific image resolution and then deploy it. However, in real-world inference, models often encounter images different from the training sets in resolution and/or subject to natural variations such as weather changes, noise types and compression artifacts. While traditional solutions involve training multiple models for different resolutions or input variations, these methods are computationally expensive and thus do not scale in practice. To this end, we propose a novel neural network model, parallel-structured and all-component Fourier neural operator (PAC-FNO), that addresses the problem. Unlike conventional feed-forward neural networks, PAC-FNO operates in the frequency domain, allowing it to handle images of varying resolutions within a single model. We also propose a two-stage algorithm for training PAC-FNO with a minimal modification to the original, downstream model. Moreover, the proposed PAC-FNO is ready to work with existing image recognition models. Extensively evaluating methods with seven image recognition benchmarks, we show that the proposed PAC-FNO improves the performance of existing baseline models on images with various resolutions by up to 77.1% and various types of natural variations in the images at inference.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated Learning heterogeneity convex optimization.
Scores: [ 8 5 8 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: denoising diffusion models multi-granularity time series forecasting
Scores: [ 6 6 6 ]
Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature. To address this challenge, we introduce a novel Multi-Granularity Time Series Diffusion (MG-TSD) model, which achieves state-of-the-art predictive performance by leveraging the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. The way to construct the targets is motivated by the observation that forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarse-grained representation, both of which result in a gradual loss of fine distribution features. In the study, we derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels.More importantly, our approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate that our MG-TSD model outperforms existing time series prediction methods.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Test-time training
Scores: [ 5 5 5 6 ]
Many recent efforts aim to augment language models with relevant information retrieved from a database at test time. We avoid the need for prompt engineering by directly fine-tuning the model on data retrieved at test time using its standard training setup. For this purpose, we build a large-scale distributed nearest neighbor index based on text embeddings of the Pile dataset. Given a query to a language model, our system retrieves the neighbors of the query and fine-tunes the model on the text data corresponding to those neighbors. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than twenty language modeling tasks in the Pile benchmark. For example, test-time training significantly narrows the performance gap between a small GPT2 model and a GPTNeo model, more than ten times larger, that was specifically trained to convergence on the Pile. Sufficient index quality and size, however, are important. Our work establishes a valuable first baseline for implementing test-time training in the context of large language models, opening the door to numerous promising research avenues.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Dynamic Data Pruning; Training acceleration
Scores: [ 6 8 6 8 ]
Data pruning aims to obtain lossless performances with less overall cost. A common approach is to filter out samples that make less contribution to the training. This could lead to gradient expectation bias compared to the original data. To solve this problem, we propose InfoBatch, a novel framework aiming to achieve lossless training acceleration by unbiased dynamic data pruning. Specifically, InfoBatchrandomly prunes a portion of less informative samples based on the loss distribution and rescales the gradients of the remaining samples to approximate the original gradient. As a plug-and-play and architecture-agnostic framework, InfoBatch consistently obtains lossless training results on classification, semantic segmentation, vision pertaining, and instruction fine-tuning tasks. On CIFAR10/100, ImageNet-1K, and ADE20K, InfoBatch losslessly saves 40% overall cost. For pertaining MAE and diffusion model, InfoBatch can respectively save 24.8% and 27% cost. For LLaMA instruction fine-tuning, InfoBatch is also able to save 20% cost and is compatible with coreset selection methods. The code will be made public.
Primary Area: general machine learning (i.e., none of the above)
Keywords: large language model optimizer prompting
Scores: [ 8 8 6 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Conformal prediction randomized smoothing adversarial robustness
Scores: [ 6 8 8 6 ]
Conformal prediction is a powerful tool to generate uncertainty sets with guaranteed coverage using any predictive model, under the assumption that the training and test data are i.i.d.. Recently, it has been shown that adversarial examples are able to manipulate conformal methods to construct prediction sets with invalid coverage rates, as the i.i.d. assumption is violated. To address this issue, a recent work, Randomized Smoothed Conformal Prediction (RSCP), was first proposed to certify the robustness of conformal prediction methods to adversarial noise. However, RSCP has two major limitations: (i) its robustness guarantee is flawed when used in practice and (ii) it tends to produce large uncertainty sets. To address these limitations, we first propose a novel framework called RSCP+ to provide provable robustness guarantee in evaluation, which fixes the issues in the original RSCP method. Next, we propose two novel methods, Post-Training Transformation (PTT) and Robust Conformal Training (RCT), to effectively reduce prediction set size with little computation overhead. Experimental results in CIFAR10, CIFAR100, and ImageNet suggest the baseline method only yields trivial predictions including full label set, while our methods could boost the efficiency by up to 4.36×, 5.46×, and 16.9× respectively and provide practical robustness guarantee.
Primary Area: general machine learning (i.e., none of the above)
Keywords: data selection instruction tuning large language models
Scores: [ 5 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: SSM Dimensional State Spaces Spatial Representation
Scores: [ 6 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large Language Model Tool Use Tree Search A* Search
Scores: [ 8 6 8 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: large language models token-resource utilization prompt
Scores: [ 6 8 5 6 ]
The ever-increasing token limits of large language models (LLMs) have enabled long context as input. Many LLMs are trained/fine-tuned to perform zero-shot/few-shot inference using instruction-based prompts. Crafting prompts for these LLMs typically requires the user to provide a detailed task description, demonstrations, and single example of context for inference. This regular prompt baseline is referred to as “SinglePrompt” in this paper. However, for NLP tasks where each data point for inference is not necessarily lengthy, the token countfor instructions and few-shot examples in the prompt may be considerably larger than that of the data point, resulting in lower token-resource utilization compared with encoder-based models like fine-tuned BERT. This cost-efficiency issue, affecting inference speed and compute budget, counteracts the many benefits LLMs have to offer. This paper aims to alleviate the preceding problem by batching multiple data points into a single prompt, a prompting strategy we refer to as “BatchPrompt”. This strategy increases the “density” of data points, which in turn leads to improved token utilization. Applying BatchPrompt na ̈ıvely, however, is very challenging due to significant performance degradation, as observed in our experiments. We also noticed varying inference outcomes for the same data points appearing in different positions within a prompt. Based on this observation, to address the quality issue while remain high token-resource utilization, we introduce Batch Permutation and Ensembling (BPE) for BatchPrompt, a simple majority voting way that recovers labeling quality through repeatedly permutating data positions in a batch at the price of more token usage. To counterbalance the additional token usage caused by the voting process, we further propose Self-reflection-guided EArly Stopping (SEAS), which can terminate the voting process early for data points the LLM confidently handles. Our comprehensive experimental evaluation demonstrates that BPE +SEAS can boost the performance of BatchPrompt with a striking margin on a range of popular NLP tasks, including question answering (Boolq), textual entailment (RTE), and duplicate questions identification (QQP). These performances are even competitive with/higher than single-data prompting (SinglePrompt), while BatchPrompt requires much fewer LLM calls and input tokens (For SinglePrompt v.s. BatchPrompt+BPE +SEAS with batch size 32, using just 15.7% the number of LLM calls, Boolq accuracy 90.6% → 90.9% with 27.4% tokens, QQP accuracy 87.2% → 88.4% with 18.6% tokens, RTE accuracy 91.5% → 91.1% with 30.8% tokens). We hope our simple yet effective approach will shed light on the future research of large language models. The code will be released.
Primary Area: general machine learning (i.e., none of the above)
Keywords: large language model prompting analogical reasoning
Scores: [ 8 5 5 5 ]
Chain-of-thought (CoT) prompting for language models demonstrates impressive performance across reasoning tasks, but typically needs labeled exemplars of the reasoning process. In this work, we introduce a new prompting approach, Analogical Prompting, designed to automatically guide the reasoning process of large language models. Inspired by analogical reasoning, a cognitive process in which humans draw from relevant past experiences to tackle new problems, our approach prompts language models to self-generate relevant exemplars or knowledge in the context, before proceeding to solve the given problem. This method presents several advantages: it obviates the need for labeling or retrieving exemplars, offering generality and convenience; it can also tailor the generated exemplars and knowledge to each problem, offering adaptability. Experimental results show that our approach outperforms 0-shot CoT and few-shot CoT in a variety of reasoning tasks, including math problem solving in GSM8K and MATH, code generation in Codeforces, and other reasoning tasks in BIG-Bench, with an average accuracy gain of +4%.
Primary Area: general machine learning (i.e., none of the above)
Keywords: out-of-distribution detection deep neural networks transformers representation analysis uncertainty quantification text classification
Scores: [ 6 6 5 5 ]
Effective OOD detection is crucial for reliable machine learning models, yet most current methods are limited in practical use due to requirements like access to training data or intervention in training. We present a novel method for detecting OOD data in deep neural networks based on transformation smoothness between intermediate layers of a network (BLOOD), which is applicable to pre-trained models without access to training data. BLOOD utilizes the tendency of between-layer representation transformations of in-distribution (ID) data to be smoother than the corresponding transformations of OOD data, a property that we also demonstrate empirically for Transformer networks. We evaluate BLOOD on several text classification tasks with Transformer networks and demonstrate that it outperforms methods with comparable resource requirements. Our analysis also suggests that when learning simpler tasks, OOD data transformations maintain their original sharpness, whereas sharpness increases with more complex tasks.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Federated Learning Backdoor Attack
Scores: [ 6 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: zero-shot anomaly detection; Industrial Informatics;
Scores: [ 6 5 6 3 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Evidential Neural Network hyperdomain vagueness
Scores: [ 6 6 6 6 ]
Deep neural networks (DNNs) have been shown to perform well on exclusive, multi-class classification tasks. However, when different classes have similar visual features, it becomes challenging for human annotators to differentiate them. When an image is ambiguous, such as a blurry one where an annotator can't distinguish between a husky and a wolf, it may be labeled with both classes: {husky, wolf}. This scenario necessitates the use of composite set labels. In this paper, we propose a novel framework called Hyper-Evidential Neural Network (HENN) that explicitly models predictive uncertainty caused by composite set labels in training data in the context of the belief theory called Subjective Logic (SL).By placing a Grouped Dirichlet distribution on the class probabilities, we treat predictions of a neural network as parameters of hyper-subjective opinions and learn the network that collects both single and composite evidence leading to these hyper-opinions by a deterministic DNN from data.We introduce a new uncertainty type called vagueness originally designed for hyper-opinions in SL to quantify composite classification uncertainty for DNNs.Our experiments prove that HENN outperforms its state-of-the-art counterparts based on four image datasets.The code and datasets are available at: https://shorturl.at/dhoqx.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Influence function Data valuation
Scores: [ 6 6 6 6 ]
Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled.
Primary Area: general machine learning (i.e., none of the above)
Keywords: watermarking large language models distillation
Scores: [ 6 6 6 5 ]
Language model watermarking enables reliable detection of model-generated text, which has many applications in the responsible deployment of language models. Existing watermarking strategies operate by altering the decoder of an existing language model, and the ability for a language model to directly learn to generate the watermark would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, allowing for open models to benefit from watermarking. Second, if watermarking is used to determine the provenance of generated text, an adversary can damage the reputation of a victim model by spoofing its watermark and generating harmful watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three distinct decoding-based watermarking strategies, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.
Primary Area: general machine learning (i.e., none of the above)
Keywords: automatic differentiation correctness neural networks clarke subdifferential
Scores: [ 6 8 8 8 6 ]
Forward- or reverse-mode automatic differentiation (AD) is a popular algorithm for computing the derivative of a function expressed by a program. AD always outputs the correct derivative if a program does not use any non-differentiable functions and control flows; however, it may return an arbitrary value otherwise. In this work, we investigate what AD computes for neural networks that may contain non-differentiable functions such as ReLU and maxpools. We first prove that AD always returns a generalized derivative called a Clarke subderivative for networks with pointwise activation functions, if the minibatch size is one and all non-differentiable neurons have distinct bias parameters. We show that the same conclusion does not hold otherwise, but does hold under some mild sufficient conditions. We also prove similar results for more general networks that can use maxpools and bias parameters shared across different neurons. We empirically check our sufficient conditions over popular network architectures and observe that AD almost always computes a Clarke subderivative in practical learning setups.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Deepfake Detection Backdoor Attack
Scores: [ 6 8 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Quantum machine learning kernel learning quantum kernels feature map quantum circuit design
Scores: [ 3 6 6 6 ]
Quantum kernels hold great promise for offering computational advantages over classical learners, with the effectiveness of these kernels closely tied to the design of the feature map. However, the challenge of designing effective quantum feature maps for real-world datasets, particularly in the absence of sufficient prior information, remains a significant obstacle. In this study, we present a data-driven approach that automates the design of problem-specific quantum feature maps. Our approach leverages feature-selection techniques to handle high-dimensional data on near-term quantum machines with limited qubits, and incorporates a deep neural predictor to efficiently evaluate the performance of various candidate quantum kernels. Through extensive numerical simulations on different datasets, we demonstrate the superiority of our proposal over prior methods, especially for the capability of eliminating the kernel concentration issue and identifying the feature map with prediction advantages. Our work not only unlocks the potential of quantum kernels for enhancing real-world tasks, but also highlights the substantial role of deep learning in advancing quantum machine learning.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Time series application Point Process Amortized Learning
Scores: [ 5 6 6 5 ]
We tackle the challenge of large-scale network intervention for guiding excitatory point processes, such as infectious disease spread or traffic congestion control. Our model-based reinforcement learning utilizes neural ODEs to capture how the networked excitatory point processes will evolve subject to the time-varying changes in network topology. Our approach incorporates Gradient-Descent based Model Predictive Control (GD-MPC), offering policy flexibility to accommodate prior knowledge and constraints. To address the intricacies of planning and overcome the high dimensionality inherent to such decision-making problems, we design an Amortize Network Interventions (ANI) framework, allowing for the pooling of optimal policies from history and other contexts, while ensuring a permutation equivalent property. This property enables efficient knowledge transfer and sharing across diverse contexts. Our approach has broad applications, from curbing infectious disease spread to reducing carbon emissions through traffic light optimization, and thus has the potential to address critical societal and environmental challenges.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large Language Model Compression Differentiable Quantization
Scores: [ 6 8 6 6 6 ]
Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM, they hand-craft quantization parameters, which leads to low performance and fails to deal with extremely low-bit quantization. To tackle this issue, we introduce an Omnidirectionally calibrated Quantization (OmniQuant) technique for LLMs, which achieves good performance in diverse quantization settings while maintaining the computational efficiency of PTQ by efficiently optimizing various quantization parameters. OmniQuant comprises two innovative components including Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). LWC modulates the extreme values of weights by optimizing the clipping threshold. Meanwhile, LET tackles activation outliers by shifting the challenge of quantization from activations to weights through a learnable equivalent transformation. Operating within a differentiable framework using block-wise error minimization, OmniQuant can optimize the quantization process efficiently for both weight-only and weight-activation quantization. For instance, the LLaMA-2 model family with the size of 7-70B can be processed with OmniQuant on a single A100-40G GPU within 1-16 hours using 128 samples. Extensive experiments validate OmniQuant's superior performance across diverse quantization configurations such as W4A4, W6A6, W4A16, W3A16, and W2A16. Additionally, OmniQuant demonstrates effectiveness in instruction-tuned models and delivers notable improvements in inference speed and memory reduction on real devices. Codes and models are available at https://github.com/anonymous998899/OmniQuant.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Domain Generalization; In-Context Learning
Scores: [ 6 6 6 8 ]
Two lines of work are taking the central stage in AI research. On the one hand, the community is making increasing efforts to build models that discard spurious correlations and generalize better in novel test environments. Unfortunately, the hard lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to eclectic contextual circumstances that users enforce by means of prompting. In this paper, we argue that context is environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context$\unicode{x2013}\unicode{x2013}unlabeledexamplesastheyarrive\unicode{x2013}\unicode{x2013}$allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom-in on the test environment risk minimizer, leading to significant out-of-distribution performance improvements. From all of this, two messages are worth taking home. Researchers in domain generalization should consider environment as context, and harness the adaptive power of in-context learning. Researchers in LLMs should consider context as environment, to better structure data towards generalization.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Personalized Federated Learning Invariant Learning Shortcut Learning Causality Out-of-distribution Generalization
Scores: [ 3 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Personalized Federated Learning Heterogeneous Data Feature Distribution Shift
Scores: [ 5 6 8 ]
Personalized federated learning (PFL) has emerged as a promising technique for addressing the challenge of data heterogeneity. While recent studies have made notable progress in mitigating heterogeneity associated with label distributions, the issue of effectively handling feature heterogeneity remains an open question. In this paper, we propose a personalization approach by Local-global updates Mixing (LG-Mix) via Neural Tangent Kernel (NTK)-based convergence. The core idea is to leverage the convergence rate induced by NTK to quantify the importance of local and global updates, and subsequently mix these updates based on their importance. Specifically, we find the trace of the NTK matrix can manifest the convergence rate, and propose an efficient and effective approximation to calculate the trace of a feature matrix instead of the NTK matrix. Such approximation significantly reduces the cost of computing NTK, and the feature matrix explicitly considers the heterogeneous features among samples. We have theoretically analyzed the convergence of our method in the over-parameterize regime, and experimentally evaluated our method on five datasets. These datasets present heterogeneous data features in natural and medical images. With comprehensive comparison to existing state-of-the-art approaches, our LG-Mix has consistently outperformed them across all datasets (largest accuracy improvement of 5.01%), demonstrating the outstanding efficacy of our method for model personalization.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Filter Pruning; Model Compression; Vision Transformer
Scores: [ 8 6 6 ]
Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large Language Models Jailbreak Attack Adversarial Attack
Scores: [ 6 8 6 8 ]
The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Skill-AI compatibility Agent Systems Decision-making Chess Deep RL
Scores: [ 8 8 5 6 ]
Powerful artificial intelligence systems are often used in settings where they must interact with agents that are computationally much weaker, for example when they work alongside humans or operate in complex environments where some tasks are handled by algorithms, heuristics, or other entities of varying computational power. For AI agents to successfully interact in these settings, however, achieving superhuman performance alone is not sufficient; they also need to account for suboptimal actions or idiosyncratic style from their less-skilled counterparts. We propose a formal evaluation framework for assessing the compatibility of near-optimal AI with interaction partners who may have much lower levels of skill; we use popular collaborative chess variants as model systems to study and develop AI agents that can successfully interact with lower-skill entities. Traditional chess engines designed to output near-optimal moves prove to be inadequate partners when paired with engines of various lower skill levels in this domain, as they are not designed to consider the presence of other agents. We contribute three methodologies to explicitly create skill-compatible AI agents in complex decision-making settings, and two chess game frameworks designed to foster collaboration between powerful AI agents and less-skilled partners. On these frameworks, our agents outperform state-of-the-art chess AI (based on AlphaZero) despite being weaker in conventional chess, demonstrating that skill-compatibility is a tangible trait that is qualitatively and measurably distinct from raw performance. Our evaluations further explore and clarify the mechanisms by which our agents achieve skill-compatibility.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Causality; Domain Generalization; Invariance Learning
Scores: [ 8 5 8 5 ]
Invariance learning methods aim to learn invariant features in the hope that they generalize under distributional shift. Although many tasks are naturally characterized by continuous domains, current invariance learning techniques generally assume categorically indexed domains. For example, auto-scaling in cloud computing often needs a CPU utilization prediction model that generalizes across different times (e.g., time of a day and date of a year), where `time' is a continuous domain index. In this paper, we start by theoretically showing that existing invariance learning methods can fail for continuous domain problems. Specifically, the naive solution of splitting continuous domains into discrete ones ignores the underlying relationship among domains, and therefore potentially leads to suboptimal performance. To address this challenge, we then propose Continuous Invariance Learning (CIL), which extracts invariant features across continuously indexed domains. CIL is a novel adversarial procedure which measures and controls the conditional independence between the labels and continuous domain indices given the extracted features. Our theoretical analysis demonstrates that CIL learns features that satisfy the invariant constraint with infinite samples. Empirical results on both synthetic and real-world datasets (including data collected from production systems) show that CIL consistently outperforms strong baselines among all the tasks.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Combinatorial Optimization Class-Contrained Bin Packing Problems Graph Convolution Network Cluster Decode
Scores: [ 6 6 8 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: imperfect-information games regret minimization Nash equilibrium
Scores: [ 8 8 8 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: federated domain adaptation federated learning domain adaptation machine learning
Scores: [ 6 6 6 5 ]
Federated Domain Adaptation (FDA) describes the federated learning (FL) setting where source clients and a server work collaboratively to improve the performance of a target client where limited data is available. The domain shift between the source and target domains, coupled with limited data of the target client, makes FDA a challenging problem, e.g., common techniques such as federated averaging and fine-tuning fail due to domain shift and data scarcity. To theoretically understand the problem, we introduce new metrics that characterize the FDA setting and a theoretical framework with novel theorems for analyzing the performance of server aggregation rules. Further, we propose a novel lightweight aggregation rule, Federated Gradient Projection (FedGP), which significantly improves the target performance with domain shift and data scarcity. Moreover, our theory suggests an auto-weighting scheme that finds the optimal combinations of the source and target gradients. This scheme improves both FedGP and a simpler heuristic aggregation rule. Extensive experiments verify the theoretical insights and illustrate the effectiveness of the proposed methods in practice.
Primary Area: general machine learning (i.e., none of the above)
Keywords: context compression efficient inference natural language processing transformer
Scores: [ 6 5 6 6 ]
This paper presents a novel context compression method for Transformer language models in online scenarios such as ChatGPT, where the context continually expands. As the context lengthens, the attention process requires more memory and computational resources, which in turn reduces the throughput of the language model. To this end, we propose a compressed context memory system that continually compresses the growing context into a compact memory space. The compression process simply involves integrating a lightweight adapter into the language model's forward pass during inference. Based on the compressed context memory, the language model can perform inference with reduced memory and attention operations. Through evaluations on multi-task learning, personalization, and conversation, we demonstrate that our approach achieves the performance level of a full context model with 5× smaller context memory space.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Neural Networks Transformers Self-Attention Linear Attention Scalable Transformers Efficient Attention Attention with Linear Complexity Linearized Attention Self-Attention Analysis
Scores: [ 6 5 5 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: LLM finetuning Scaling Laws Full-model finetuning Parameter efficient tuning Machine Translation Multilingual Summarization
Scores: [ 8 5 8 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Recommender system Selection Bias Neighborhood effect
Scores: [ 8 5 6 8 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Game Theory Deep Learning Representation Learning Nash Equilibrium
Scores: [ 8 5 8 3 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Neural Ordinary Differential Equations Neural Stochastic Differential Equations irregular time series data time series classification
Scores: [ 8 6 6 ]
Irregular sampling intervals and missing values in real-world time series data present challenges for conventional methods that assume consistent intervals and complete data. Neural Ordinary Differential Equations (Neural ODEs) offer an alternative approach, utilizing neural networks combined with ODE solvers to learn continuous latent representations through parameterized vector fields. Neural Stochastic Differential Equations (Neural SDEs) extend Neural ODEs by incorporating a diffusion term, although this addition is not trivial, particularly when addressing irregular intervals and missing values. Consequently, careful design of drift and diffusion functions is crucial for maintaining stability and enhancing performance, while incautious choices can result in adverse properties such as the absence of strong solutions, stochastic destabilization, or unstable Euler discretizations, significantly affecting Neural SDEs' performance. In this study, we propose three stable classes of Neural SDEs—Langevin-type SDE, Linear Noise SDE, and Geometric SDE. Then, we rigorously demonstrate their robustness in maintaining excellent performance under distribution shift, while effectively preventing overfitting. To assess the effectiveness of our approach, we conduct extensive experiments on three benchmark datasets for interpolation and classification tasks, and analyze the robustness of our methods with 30 public datasets under different missing rates. Our results demonstrate the efficacy of the proposed method in handling real-world irregular time series data.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Optimal Transport Circular Measure Probability Metrics
Scores: [ 6 6 6 ]
The optimal transport problem for measures supported on non-Euclidean spaces has recently gained ample interest in diverse applications involving representation learning. In this paper, we focus on circular probability measures, i.e., probability measures supported on the unit circle, and introduce a new computationally efficient metric for these measures, denoted as Linear Circular Optimal Transport (LCOT). The proposed metric comes with an explicit linear embedding that allows one to apply Machine Learning (ML) algorithms to the embedded measures and seamlessly modify the underlying metric for the ML algorithm to LCOT. We show that the proposed metric is rooted in the Circular Optimal Transport (COT) and can be considered the linearization of the COT metric with respect to a fixed reference measure. We provide a theoretical analysis of the proposed metric and derive the computational complexities for pairwise comparison of circular probability measures. Lastly, through a set of numerical experiments, we demonstrate the benefits of LCOT in learning representations from circular measures.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Non-convex optimization Multi-valued solution mapping Generative model Ordinary differential equation Supervised learning
Scores: [ 8 8 5 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Adversarial Attacks 3D Object Detection Autonomous Driving
Scores: [ 8 6 6 5 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Grokking deep neural networks Gaussian Process phase transitions
Scores: [ 6 6 8 6 3 ]
A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a beyond lazy-learning/Gaussian Process (GP) phenomenon involving feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN generates useful internal representations of the teacher that are sharply distinct from those before the transition.
Primary Area: general machine learning (i.e., none of the above)
Keywords: data selection data-centric AI information theory kl divergence gradient natural language processing computer vision
Scores: [ 8 8 6 6 ]
It is often advantageous to train models on a subset of the available train examples, because the examples are of variable quality or because one would like to train with fewer examples, without sacrificing performance. We present Gradient Information Optimization (GIO), a scalable, task-agnostic approach to this data selection problem that requires only a small set of (unlabeled) examples representing a target distribution. GIO begins from a natural, information-theoretic objective that is intractable in practice. Our contribution is in showing that it can be made highly scalable through a simple relaxation of the objective and a highly efficient implementation. In experiments with machine translation, spelling correction, and image recognition, we show that GIO delivers outstanding results with very small train sets. These findings are robust to different representation models and hyperparameters for GIO itself. GIO is task- and domain-agnostic and can be applied out-of-the-box to new datasets and domains.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Language models Preference optimization
Scores: [ 5 8 8 8 ]
The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model. Direct Preference Optimization (DPO) has been proposed as an alternative; and it remains equivalent to RLHF under the reverse KL regularization constraint. This paper presents f-DPO, a generalized approach to DPO by incorporating diverse divergence constraints. We show that under certain f-divergences, including Jensen-Shannon divergence, forward KL divergences and α-divergences, the complex relationship between the reward and optimal policy can also be simplified by addressing the Karush–Kuhn–Tucker conditions. This eliminates the need for estimating the normalizing constant in the Bradley-Terry model and enables a tractable mapping between the reward function and the optimal policy. Our approach optimizes LLMs to align with human preferences in a more efficient and supervised manner under a broad set of divergence constraints. Empirically, adopting these divergences ensures a balance between alignment performance and generation diversity. Importantly, our f-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE).
Primary Area: general machine learning (i.e., none of the above)
Keywords: Graph Neural Networks Explainability Interpretability Local-level explanation Instance-level explanation
Scores: [ 6 8 6 5 ]
Understanding the decision-making process of Graph Neural Networks (GNNs) is crucial to their interpretability. Most existing methods for explaining GNNs typically rely on training auxiliary models, resulting in the explanations remain black-boxed. This paper introduces Graph Output Attribution (GOAt), a novel method to attribute graph outputs to input graph features, creating GNN explanations that are faithful, discriminative, as well as stable across similar samples. By expanding the GNN as a sum of scalar products involving node features, edge features and activation patterns, we propose an efficient analytical method to compute contribution of each node or edge feature to each scalar product and aggregate the contributions from all scalar products in the expansion form to derive the importance of each node and edge. Through extensive experiments on synthetic and real-world data, we show that our method not only outperforms various state-of-the-art GNN explainers in terms of the commonly used fidelity metric, but also exhibits stronger discriminability, and stability by a remarkable margin.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large Language Models Network Sparsity
Scores: [ 6 6 6 ]
Primary Area: general machine learning (i.e., none of the above)
Keywords: Deep Neural Networks Gaussian processes Neural Networks initialisation Edge of chaos Large width limit Mean-Field
Scores: [ 5 8 6 6 ]
The infinitely wide neural network has been proven a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes that enables a rigorous analysis of the way the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al., 2018 to a larger class of initial weight distributions (which we call pseudo-iid), including the established cases of iid and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialised with pseudo-iid distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge of Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training.
Primary Area: general machine learning (i.e., none of the above)
Keywords: quantization codebooks hashlists compression efficient inference deep learning
Scores: [ 6 8 5 ]
The massive interest in deep neural networks (DNNs) for both computer vision and natural language processing has been sparked by the growth in computational power. However, this led to an increase in the memory footprint, to a point where it can be challenging to simply load a model on commodity devices such as mobile phones. To address this limitation, quantization is a favored solution as it maps high precision tensors to a low precision, memory efficient format. In terms of memory footprint reduction, its most effective variants are based on codebooks. These methods, however, suffer from two limitations. First, they either define a single codebook for each tensor, or use a memory-expensive mapping to multiple codebooks. Second, gradient descent optimization of the mapping favors jumps toward extreme values, hence not defining a proximal search. In this work, we propose to address these two limitations. First, we initially group similarly distributed neurons and leverage the re-ordered structure to either apply different scale factors to the different groups, or map weights that fall in these groups to several codebooks, without any mapping overhead. Second, stemming from this initialization, we propose a joint learning of the codebook and weight mappings that bears similarities with recent gradient-based post-training quantization techniques. Third, drawing estimation from straight-through estimation techniques, we introduce a novel gradient update definition to enable a proximal search of the codebooks and their mappings. The proposed jointly learnable codebooks and mappings (JLCM) method allows a very efficient approximation of any DNN: as such, a Llama 7B can be compressed down to 2Go and loaded on 5-year-old smartphones.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Information Retrieval Vector Database Neural Architecture Search
Scores: [ 8 8 6 ]
With the increasing number of new neural architecture designs and substantial existing neural architectures, it becomes difficult for the researchers to situate their contributions compared with existing neural architectures or establish the connections between their designs and other relevant ones. To discover similar neural architectures in an efficient and automatic manner, we define a new problem Neural Architecture Retrieval which retrieves a set of existing neural architectures which have similar designs to the query neural architecture. Existing graph pre-training strategies cannot address the computational graph in neural architectures due to the graph size and motifs. To fulfill this potential, we propose to divide the graph into motifs which are used to rebuild the macro graph to tackle these issues, and introduce multi-level contrastive learning to achieve accurate graph representation learning. Extensive evaluations on both human-designed and synthesized neural architectures demonstrate the superiority of our algorithm. Such a dataset which contains 12k real-world network architectures, as well as their embedding, is built for neural architecture retrieval.
Primary Area: general machine learning (i.e., none of the above)
Keywords: Large language models Efficient ML Query Routing
Scores: [ 6 6 3 8 ]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower cost (e.g., edge) devices, tend to lag behind in terms of response quality. Therefore in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
Primary Area: generative models
Keywords: Diffusion model; Content Suppression; Image Editing; Text Embeddings
Scores: [ 6 6 6 6 ]
The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two contributions, which we refer to as soft-weighted regularization and inference-time text embedding optimization. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively on extensive experiments, validating its effectiveness. Furthermore, our method is generalizability to both the pixel-space diffusion models (i.e. DeepFloyd-IF) and the latent-space diffusion models (i.e. Stable Diffusion).
Primary Area: generative models
Keywords: diffusion models diversity generative models
Scores: [ 8 8 8 ]
While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively.
Primary Area: generative models
Keywords: mathematical reasoning large language models code generation
Scores: [ 6 6 5 6 ]
Primary Area: generative models
Keywords: entity alignment variational autoencoder generative models knowledge graphs
Scores: [ 8 6 6 ]
Recent embedding-based methods have achieved great successes in exploiting entity alignment from knowledge graph (KG) embeddings of multiple modalities. In this paper, we study embedding-based entity alignment (EEA) from a perspective of generative models. We show that EEA shares similarities with typical generative models and prove the effectiveness of the recently developed generative adversarial network (GAN)-based EEA methods theoretically. We then reveal that their incomplete objective limits the capacity on both entity alignment and entity synthesis (i.e., generating new entities). We mitigate this problem by introducing a generative EEA (GEEA) framework with the proposed mutual variational autoencoder (M-VAE) as the generative model. M-VAE enables entity conversion between KGs and generation of new entities from random noise vectors. We demonstrate the power of GEEA with theoretical analysis and empirical experiments on both entity alignment and entity synthesis tasks.
Primary Area: generative models
Keywords: Diffusion Models; Score-based Models; Generative Models; Generative AI; Deep Generative Models; Distillation Models
Scores: [ 6 8 6 6 ]
Primary Area: generative models
Keywords: Human Motion Transfer Diffusion Model Human Animation Generative model.
Scores: [ 6 6 8 6 ]
Primary Area: generative models
Keywords: Diffusion Probabilistic Model Neural ODE Adjoint Sensitivity Method
Scores: [ 6 6 6 ]
This paper considers a ubiquitous problem underlying several applications of DPMs, i.e., optimizing the parameters of DPMs when the objective is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denoising UNet, naive gradient backpropagation requires storing the intermediate states of all iterations, resulting in extremely high memory consumption. To overcome this issue, we propose a novel method AdjointDPM, which first generates new samples from diffusion models by solving the corresponding probability-flow ODEs. It then uses the adjoint sensitivity method to backpropagate the gradients of the loss to the models' parameters (including conditioning signals, network weights, and initial noises) by solving another augmented ODE. To reduce numerical errors in both the forward generation and gradient backpropagation processes, we further reparameterize the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. AdjointDPM can effectively compute the gradients of all types of parameters in DPMs, including the network weights, conditioning text prompts, and noisy states.Finally, we demonstrate the effectiveness of AdjointDPM on several interesting tasks: guided generation via modifying sampling trajectories, finetuning DPM weights for stylization, and converting visual effects into text embeddings.
Primary Area: generative models
Keywords: program synthesis code generation large language models machine learning for code
Scores: [ 8 8 6 ]
Primary Area: generative models
Keywords: AI Alignment Large Language Models Principle-Driven Self-Align
Scores: [ 6 6 6 8 ]
Supervised Fine-Tuning (SFT) on human demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful alignment paradigm for Large Language Model (LLM) AI-assistant agents. However, a significant limitation of this approach is its substantial dependency on high-quality human annotations, making its broader application to intricate tasks challenging due to difficulties in obtaining consistent response demonstrations and task-specific response preferences. To address this issue, we present a novel alignment paradigm in this paper, termed SALMON (Self-ALignMent with principle-fOllowiNg reward models). This paradigm offers the ability to align base language models with minimal human supervision, using only a select set of human-defined principles, yet achieves superior performance. Central to our approach is a principle-following reward model. Trained on synthetic preference data, this reward model can generate reward scores based on arbitrary human-defined principles. Therefore, during the RL training phase, by merely adjusting these principles, we gain full control over the preferences of the reward model, subsequently influencing the behavior of the RL-trained policy model, and eliminating the traditional reliance on exhaustive online human preference collection. Applying our method to the LLaMA-2-70b base language model, we developed an AI assistant named Dromedary-2. With only 6 exemplars for in-context learning and 31 human-defined principles, Dromedary-2 significantly surpasses the performance of several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. We have open-sourced the code and model weights to encourage further research into aligning LLM-based AI agents with enhanced supervision efficiency, improved controllability, and scalable oversight.
Primary Area: generative models
Keywords: Generative models Diffusion models
Scores: [ 8 6 6 6 6 ]
Primary Area: generative models
Keywords: diffusion model controllable generation object detection autonomous driving
Scores: [ 6 6 6 8 ]
Diffusion models have attracted significant attention due to the remarkable ability to create content and generate data for tasks like image classification. However, the usage of diffusion models to generate the high-quality object detection data remains an underexplored area, where not only image-level perceptual quality but also geometric conditions such as bounding boxes and camera views are essential. Previous studies have utilized either copy-paste synthesis or layout-to-image (L2I) generation with specifically designed modules to encode semantic layouts. In this paper, we propose GeoDiffusion, a simple framework that can flexibly translate various geometric conditions into text prompts and empower pre-trained text-to-image (T2I) diffusion models for high-quality detection data generation. Unlike previous L2I methods, our GeoDiffusion is able to encode not only the bounding boxes but also extra geometric conditions such as camera views in self-driving scenes. Extensive experiments demonstrate GeoDiffusion outperforms previous L2I methods while maintaining 4x training time faster. To the best of our knowledge, this is the first work to adopt diffusion models for layout-to-image generation with geometric conditions and demonstrate that L2I-generated images can be beneficial for improving the performance of object detectors.
Primary Area: generative models
Keywords: Instruction tuning Large language models BERT family Natural language generation
Scores: [ 6 8 6 ]
Language modeling at scale has proven very effective and brought unprecedented success to natural language models. Many typical representatives, especially decoder-only models, e.g., BLOOM and LLaMA, and encoder-decoder models, e.g., Flan-T5 and AlexaTM, have exhibited incredible instruction-following capabilities while keeping strong task completion ability. These large language models can achieve superior performance in various tasks and even yield emergent capabilities, e.g., reasoning and universal generalization. Though the above two paradigms are mainstream and well explored, the potential of the BERT family, which are encoder-only based models and have ever been one of the most representative pre-trained models, also deserves attention, at least should be discussed. In this work, we adopt XML-R to explore the effectiveness of the BERT family for instruction following and zero-shot learning. We first design a simple yet effective strategy to utilize the encoder-only models for generation tasks and then conduct multi-task instruction tuning. Experimental results demonstrate that our fine-tuned model, Instruct-XMLR, outperforms Bloomz on all evaluation tasks and achieves comparable performance with mT0 on most tasks. Surprisingly, Instruct-XMLR also possesses strong task and language generalization abilities, indicating that Instruct-XMLR can also serve as a good instruction follower and zero-shot learner. Besides, Instruct-XMLR can accelerate decoding due to its non-autoregressive generation manner, achieving around 3 times speedup compared with current autoregressive large language models. Although we also witnessed several limitations through our experiments, such as the performance decline in long-generation tasks and the shortcoming of length prediction, Instruct-XMLR can still become a good member of the family of current large language models.
Primary Area: generative models
Keywords: Text-to-Image Model Customization Parameter-Efficient Fine-Tuning Model Evaluation
Scores: [ 6 6 6 8 ]
Text-to-image generative models have garnered immense attention for their ability to produce high-fidelity images from text prompts. Among these, Stable Diffusion distinguishes itself as a leading open-source model in this fast-growing field. However, the intricacies of fine-tuning these models pose multiple challenges from new methodology integration to systematic evaluation. Addressing these issues, this paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion), an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present a thorough framework for the systematic assessment of varied fine-tuning techniques. This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning, including hyperparameter adjustments and the evaluation with different prompt types across various concept categories. Through this comprehensive approach, our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.
Primary Area: generative models
Keywords: transformers nlp fine-tuning context window attention rotary embedding
Scores: [ 6 8 6 6 ]
Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x less tokens and 2.5x less training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing previous the state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN has been made available and reproduced online up to 128k context length.
Primary Area: generative models
Keywords: score-based generative models Barron space curse of dimensionality
Scores: [ 8 6 8 5 ]
While score-based generative models (SGMs) have achieved remarkable successes in enormous image generation tasks, their mathematical foundations are still limited. In this paper, we analyze the approximation and generalization of SGMs in learning a family of sub-Gaussian probability distributions that are defined by exponential tilting of Gaussian distributions. We prove that if the tilting exponent is a Barron function, then the distribution generated by empirical score matching approximates the target distribution in the total variation distance with a dimension-independent rate. An essential ingredient of our proof is to derive a dimension-free deep neural network approximation rate for the true score function associated to the forward process, which is interesting in its own right.
Primary Area: generative models
Keywords: LLM Adaptation Low-rank Code Generation
Scores: [ 8 8 8 8 ]
Primary Area: generative models
Keywords: question answering large language model retrieval
Scores: [ 6 8 6 6 6 ]
Large language models (LLMs) have made significant advancements in various natural language processing tasks but face challenges such as hallucinations and integration of up-to-date knowledge, which is particularly critical for question answering (QA). While incorporating new information with the retrieval of relevant passages is a promising way to improve QA with LLMs, the existing methods often require additional fine-tuning which becomes infeasible with recent LLMs. Retrieval augmentation via prompting has the potential to address this limitation, but this direction has been limitedly explored. To this end, we design a simple yet effective framework to enhance open-domain QA (ODQA) with LLMs, based on the summarized retrieval (SuRe). SuRe helps LLMs predict more grounded answers, which are well-supported by the summarization of retrieved passages that could be viewed as an explicit rationale extracted from the retrieved passages. Specifically, SuRe first constructs summaries of the retrieved passages for each of the multiple answer candidates. Then, SuRe confirms the most plausible answer from the candidate set by evaluating the validity and ranking of the generated summaries. Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.4% in exact match (EM) and 3.9% in F1 score over standard prompting approaches. SuRe also can be integrated with a broad range of retrieval methods and LLMs. Finally, the generated summaries from SuRe show additional advantages to measure the importance of retrieved passages and serve as more preferred rationales by models and humans.
Primary Area: generative models
Keywords: audio synthesis vocoder GAN
Scores: [ 6 5 8 5 ]
Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in reduntant and computionally-intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced.
Primary Area: generative models
Keywords: compression sparsification large language models
Scores: [ 6 6 6 5 5 ]
Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for OPT 66B and LLAMA-2 70B models with negligible loss in accuracy. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA-2 70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models.
Primary Area: generative models
Keywords: LLMs chain of thought verifiers human feedback reasoning
Scores: [ 8 3 5 6 ]
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
Primary Area: generative models
Keywords: Language Model RLHF Alignment Instruction Tuning
Scores: [ 6 5 6 6 6 ]
We propose Reinforcement Learning from Contrastive Distillation (RLCD), a method for aligning language models to follow principles expressed in natural language (e.g., to be more harmless) without using human feedback. RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them. Using two different prompts causes model outputs to be more differentiated on average, resulting in cleaner preference labels in the absence of human annotations. We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning. Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks—harmlessness, helpfulness, and story outline generation—and when using both 7B and 30B model scales for simulating preference data
Primary Area: generative models
Keywords: Vector Graphics Generation Code Generation Science Generation TikZ Text-to-Image
Scores: [ 6 6 8 6 ]
Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ the first large-scale TikZ dataset, consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available.
Primary Area: generative models
Keywords: 3D generation diffusion model
Scores: [ 6 6 6 ]
High-resolution 3D object generation remains a challenging task primarily due to the limited availability of comprehensive annotated training data. Recent advancements have aimed to overcome this constraint by harnessing image generative models, pretrained on extensive curated web datasets, using knowledge transfer techniques like Score Distillation Sampling (SDS).Efficiently addressing the requirements of high-resolution rendering often necessitates the adoption of latent representation-based models, such as the Latent Diffusion Model (LDM). In this framework, a significant challenge arises: To compute gradients for individual image pixels, it is necessary to backpropagate gradients from the designated latent space through the frozen components of the image model, such as the VAE encoder used within LDM. However, this gradient propagation pathway has never been optimized, remaining uncontrolled during training.We find that the unregulated gradients adversely affect the 3D model's capacity in acquiring texture-related information from the image generative model, leading to poor quality appearance synthesis.To address this overarching challenge, we propose an innovative operation termed Pixel-wise Gradient Clipping (PGC) designed for seamless integration into existing 3D generative models, thereby enhancing their synthesis quality. Specifically, we control the magnitude of stochastic gradients by clipping the pixel-wise gradients efficiently,while preserving crucial texture-related gradient directions.Despite this simplicity and minimal extra cost, extensive experiments demonstrate the efficacy of our PGCin enhancing the performance of existing 3D generative modelsfor high-resolution object rendering.
Primary Area: generative models
Keywords: watermark bias
Scores: [ 6 8 6 8 5 ]
Primary Area: generative models
Keywords: Style Transfer Schrödinger Bridge Problem Adversarial Training
Scores: [ 6 6 6 8 ]
Diffusion models are a powerful class of generative models which simulate stochastic differential equations (SDEs) to generate data from noise. Although diffusion models have achieved remarkable progress in recent years, they have limitations in the unpaired image-to-image translation tasks due to the Gaussian prior assumption. Schrödinger Bridge (SB), which learns an SDE to translate between two arbitrary distributions, have risen as an attractive solution to this problem. However, none of SB models so far have been successful at unpaired translation between high-resolution images. In this work, we propose the Unpaired Neural Schrödinger Bridge (UNSB), which expresses SB problem as a sequence of adversarial learning problems. This allows us to incorporate advanced discriminators and regularization to learn a SB between unpaired data. We demonstrate that UNSB is scalable and successfully solves various unpaired image-to-image translation tasks.
Primary Area: generative models
Keywords: text-to-image generation diffusion models high resolution generation
Scores: [ 6 6 8 6 ]
In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis. More results are available at the anonymous website: https://scalecrafter.github.io/ScaleCrafter/
Primary Area: generative models
Keywords: Diffusion Models linear Inverse problem Bayesian posterior sampling Bayesian filtering importance sampling
Scores: [ 10 6 6 3 ]
Diffusion models have achieved tremendous success in generating high-dimensional data like images, videos and audio. These models provide powerful data priors that can solve linear inverse problems in zero shot through Bayesian posterior sampling.However, exact posterior sampling for diffusion models is intractable. Current solutions often hinge on approximations that are either computationally expensive or lack strong theoretical guarantees. In this work, we introduce an efficient diffusion sampling algorithm for linear inverse problems that is guaranteed to be asymptotically accurate. We reveal a link between Bayesian posterior sampling and Bayesian filtering in diffusion models, proving the former as a specific instance of the latter. Our method, termed filtering posterior sampling, leverages sequential Monte Carlo methods to solve the corresponding filtering problem. It seamlessly integrates with all Markovian diffusion samplers, requires no model re-training, and guarantees accurate samples from the Bayesian posterior as particle counts rise. Empirical tests demonstrate that our method generates better or comparable results than leading zero-shot diffusion posterior samplers on tasks like image inpainting, super-resolution, and deblurring.
Primary Area: generative models
Keywords: language model alignment preference model Reinforcement Learning from Human Feedback (RLHF) overoptimization interpretability scalable oversight reward hacking
Scores: [ 6 6 3 8 ]
As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset.We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. CPMs allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment.Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs.Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.
Primary Area: generative models
Keywords: Generative Modeling Stochastic Optimal Control Diffusion Model
Scores: [ 8 8 8 8 ]
Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs. DMs work by constructing a Stochastic Differential Equation (SDE) in the input space (ie, position space), and using a neural network to reverse it. In this work, we introduce a novel generative modeling framework grounded in \textbf{phase space dynamics}, where a phase space is defined as {an augmented space encompassing both position and velocity.} Leveraging insights from Stochastic Optimal Control, we construct a path measure in the phase space that enables efficient sampling. {In contrast to DMs, our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation.} This early prediction sets the stage for efficient data generation by leveraging additional velocity information along the trajectory. On standard image generation benchmarks, our model yields favorable performance over baselines in the regime of small Number of Function Evaluations (NFEs). Furthermore, our approach rivals the performance of diffusion models equipped with efficient sampling techniques, underscoring its potential as a new tool generative modeling.
Primary Area: generative models
Keywords: diffusion models text-to-image generation nesting architecture progressive training
Scores: [ 5 6 8 6 ]
Diffusion models are the de-facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space, or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion (MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.
Primary Area: generative models
Keywords: likelihood-based modeling diffusion modeling density estimation
Scores: [ 8 8 8 5 ]
Primary Area: generative models
Keywords: large language model bias robustness multiple choice question
Scores: [ 5 8 8 8 ]
Multiple choice questions (MCQs) serve as a common yet important task format in the research of large language models (LLMs). This work shows that LLMs are vulnerable to option position changes in MCQs due to their inherent “selection bias”, namely, they prefer to select specific option IDs as answers (like “Option A”). Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs’ token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. To mitigate selection bias, we propose a label-free, inference-time debiasing method, called PriDe, which separates the model’s prior bias for option IDs from the overall prediction distribution. PriDe first estimates the prior by permutating option contents on a small number of test samples, which is then applied to debias the subsequent samples. We demonstrate that PriDe achieves superior debiasing effectiveness and computational efficiency to strong baselines. Furthermore, the prior estimated by PriDe is interpretable and can generalize well across different domains, highlighting its practical potential in broader scenarios.
Primary Area: generative models
Keywords: Dynamic object generation
Scores: [ 10 6 6 6 6 ]
In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an interpolation-driven consistency loss. It is optimized by minimizing the L2 distance between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. Extensive experiments show that our Consistent4D can perform competitively to prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating advantage for conventional text-to-3D generation tasks
Primary Area: generative models
Keywords: GAN GANN Text Generation NLP NLU Deep Learning Machine Learning
Scores: [ 6 6 6 8 ]
The current advancements in open domain text generation have been spearheaded by Transformer-based large language models. Leveraging efficient parallelization and vast training datasets, these models achieve unparalleled text generation capabilities. Even so, current models are known to suffer from deficiencies such as repetitive texts, looping issues, and lack of robustness. While adversarial training through generative adversarial networks (GAN) is a proposed solution, earlier research in this direction has predominantly focused on older architectures, or narrow tasks. As a result, this approach is not yet compatible with modern language models for open-ended text generation, leading to diminished interest within the broader research community. We propose a computationally efficient GAN approach for sequential data that utilizes the parallelization capabilities of Transformer models. Our method revolves around generating multiple branching sequences from each training sample, while also incorporating the typical next-step prediction loss on the original data. In this way, we achieve a dense reward and loss signal for both the generator and the discriminator, resulting in a stable training dynamic. We apply our training method to pre-trained language models, using data from their original training set but less than 0.01% of the available data. A comprehensive human evaluation shows that our method significantly improves the quality of texts generated by the model while avoiding the previously reported sparsity problems of GAN approaches. Even our smaller models outperform larger original baseline models with more than 16 times the number of parameters. Finally, we corroborate previous claims that perplexity on held-out data is not a sufficient metric for measuring the quality of generated texts.
Primary Area: generative models
Keywords: diffusion models street view generation 3D control
Scores: [ 6 8 5 5 ]
Recent advancements in diffusion models have significantly enhanced the data synthesis with 2D control. Yet, precise 3D control in street view generation, crucial for 3D perception tasks, remains elusive. Specifically, utilizing Bird's-Eye View (BEV) as the primary condition often leads to challenges in geometry control (e.g., height), affecting the representation of object shapes, occlusion patterns, and road surface elevations, all of which are essential to perception data synthesis, especially for 3D object detection tasks. In this paper, we introduce MagicDrive, a novel street view generation framework offering diverse 3D geometry controls, including camera poses, road maps, and 3D bounding boxes, together with textual descriptions, achieved through tailored encoding strategies. Besides, our design incorporates a cross-view attention module, ensuring consistency across multiple camera views. With MagicDrive, we achieve high-fidelity street-view synthesis that captures nuanced 3D geometry and various scene descriptions, enhancing tasks like BEV segmentation and 3D object detection. Project Website: https://magic-drive.github.io/
Primary Area: generative models
Keywords: language modeling; retrieval; legality of language modeling
Scores: [ 8 8 8 5 ]
The legality of training language models (LMs) on copyrighted or otherwise restricted data is under intense debate. However, as we show, model performance significantly degrades if trained only on low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage. We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on the Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g., containing copyrighted books or news) that is only queried during inference. The datastore allows use of high-risk data without training on it, supports sentence-level data attribution, and enables data producers to opt out from the model by removing content from the store. These capabilities can foster compliance with data-use regulations such as the fair use doctrine in the United States and the GDPR in the European Union. Our experiments show that the parametric LM struggles on its own with domains not covered by OLC. However, access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile, a more diverse corpus with mostly high-risk text. We also analyze which nonparametric approach works best, where the remaining errors lie, and how performance scales with datastore size. Our results suggest that it is possible to build high quality language models while mitigating legal risk.
Primary Area: generative models
Keywords: large language model knowledge distillation speculative decoding
Scores: [ 6 6 6 ]
Speculative decoding~(SD) accelerates large language model inference by employing a faster {\em draft} model for generating multiple tokens, which are then verified in parallel by the larger {\em target} model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose {\em DistillSpec} that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improve the draft and target alignment: utilizing \emph{on-policy} data generation from the draft model, and \emph{tailoring the divergence function} to the task and decoding strategy. Notably, DistillSpec yields impressive 10−45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6−10× with minimal performance drop, compared to standard decoding without distillation.
Primary Area: generative models
Keywords: diffusion models zero-shot text-to-image generative models human visual perception psychophysics cognitive science neuroscience
Scores: [ 8 8 8 ]
What is the best paradigm to recognize objects---discriminative inference (fast but potentially prone to shortcut learning) or using a generative model (slow but potentially more robust)? We build on recent advances in generative modeling that turn text-to-image models into classifiers. This allows us to study their behavior and to compare them against discriminative models and human psychophysical data.We report four intriguing emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level out-of-distribution accuracy, state-of-the-art alignment with human classification errors, and they understand certain perceptual illusions. Our results indicate that while the current dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data surprisingly well.
Primary Area: generative models
Keywords: set generation keyphrase generation
Scores: [ 6 6 8 8 ]
Generating a set of text is a common challenge for many NLP applications, for example, automatically providing multiple keyphrases for a document to facilitate user reading. Existing generative models use a sequential decoder which generates a single sequence successively, and the set generation problem is converted to sequence generation via concatenating multiple texts into a long text sequence. However, the elements of a set are unordered, which makes this scheme suffer from biased or conflicting training signals. In this paper, we propose a novel branching decoder. It can generate a dynamic number of tokens at each time-step and branch multiple generation paths. In particular, paths are generated individually so that no order dependence is required. Moreover, multiple paths can be generated in parallel which greatly reduces inference time. Experiments on three keyphrase generation datasets demonstrate that our branching decoder is more effective and efficient than the existing sequential decoder.
Primary Area: generative models
Keywords: Inpainting Decoupled Probabilistic Modeling Pixel Spread Model
Scores: [ 6 8 6 5 ]
Primary Area: generative models
Keywords: Diffusion Model; Pose-Guided Image Synthesis
Scores: [ 5 6 8 6 ]
Recent work has showcased the significant potential of diffusion models in pose-guided person image synthesis.However, owing to the inconsistency in pose between the source and target images, synthesizing an image with a distinct pose, relying exclusively on the source image and target pose information, remains a formidable challenge.This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages.Specifically, in the first stage, we design a simple prior conditional diffusion model that predicts the global features of the target image by mining the global alignment relationship between pose coordinates and image appearance.Then, the second stage establishes a dense correspondence between the source and target images using the global features from the previous stage, and an inpainting conditional diffusion model is proposed to further align and enhance the contextual features, generating a coarse-grained person image.In the third stage, we propose a refining conditional diffusion model to utilize the coarsely generated image from the previous stage as a condition, achieving texture restoration and enhancing fine-detail consistency.The three-stage PCDMs work progressively to generate the final high-quality and high-fidelity synthesized image.Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.Our code and models will be publicly available.
Primary Area: generative models
Keywords: large language models interpretability methods representation recommender systems
Scores: [ 8 6 8 5 ]
Primary Area: generative models
Keywords: Efficient diffusion models short-span attention local-feature enrichment global-content orchestration multi-scale generation
Scores: [ 6 8 6 ]
Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training.This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models.
Primary Area: generative models
Keywords: time series forecasting generative modeling deep learning diffusion model
Scores: [ 8 6 5 ]
Primary Area: generative models
Keywords: generative models diffusion model image synthesis
Scores: [ 6 6 8 8 ]
Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at https://anonymous.4open.science/r/RelayDiffusion-anonymous-1B3E.
Primary Area: generative models
Keywords: video prediction multi-modality generation diffusion model language-vision model
Scores: [ 8 6 6 6 ]
Imagining the future trajectory is the key for robots to make sound planning and successfully reach their goals. Therefore, text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.To tackle this task and empower robots with the ability to foresee the future, we propose a sample and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We enhance the U-Net and language conditioning model by incorporating computation-efficient spatial-temporal attention. Furthermore, we introduce a novel Frame Sequential Text Decomposer module that dissects a sentence's global instruction into temporally aligned sub-instructions, ensuring precise integration into each frame of generation. Our framework allows us to effectively leverage the extensive prior knowledge embedded in pretrained T2I models across the frames. With the adaptable-designed architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a few layers on a small amount of data. The experimental results on Something Something V2 (SSv2), Bridgedata and EpicKitchens-100 datasets demonstrate our superior video prediction performance with around 480-GPU hours versus CogVideo with over 12,480-GPU hours: achieving the 31% FVD improvement compared to the current SOTA model on SSv2 and 83.7% average preference in the human evaluation.
Primary Area: generative models
Keywords: Diffusion Models Structure-Based Drug Design Molecule Generation
Scores: [ 6 5 8 6 ]
Generating 3D ligand molecules that bind to specific protein targets via diffusion models has shown great promise for structure-based drug design. The key idea is to disrupt molecules into noise through a fixed forward process and learn its reverse process to generate molecules from noise in a denoising way. However, existing diffusion models primarily focus on incorporating protein-ligand interaction information solely in the reverse process, and neglect the interactions in the forward process. The inconsistency between forward and reverse processes may impair the binding affinity of generated molecules towards target protein. In this paper, we propose a novel Interaction Prior-guided Diffusion model (IPDiff) for the protein-specific 3D molecular generation by introducing geometric protein-ligand interactions into both diffusion and sampling process. Specifically, we begin by pretraining a protein-ligand interaction prior network (IPNet) by utilizing the binding affinity signals as supervision. Subsequently, we leverage the pretrained prior network to (1) integrate interactions between the target protein and the molecular ligand into the forward process for adapting the molecule diffusion trajectories (prior-shifting), and (2) enhance the binding-aware molecule sampling process (prior-conditioning). Empirical studies on CrossDocked2020 dataset show IPDiff can generate molecules with more realistic 3D structures and state-of-the-art binding affinities towards the protein targets, with up to -6.42 Avg. Vina Score, while maintaining proper molecular properties.
Primary Area: generative models
Keywords: layout generation large language model code generation
Scores: [ 5 6 6 8 ]
Primary Area: generative models
Keywords: Large Language Models; Prompt Engineering; Boosting Mechanism;
Scores: [ 8 5 6 6 5 ]
Primary Area: generative models
Keywords: Diffusion models Deep Generative Models Efficient Sampling
Scores: [ 6 8 6 5 ]
Primary Area: generative models
Keywords: Large-vocabulary Objects; 3D Generation; Transformer; Diffusion model
Scores: [ 6 6 6 ]
Creating diverse and high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on the generation of a single category or a few categories. In this paper, we introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects \textit{with a single generative model}. Notably, there are three major challenges for this large-vocabulary 3D generation: \textbf{a}) the need for expressive yet efficient 3D representation; \textbf{b}) large diversity in geometry and texture across categories; \textbf{c}) complexity in the appearances of real-world objects. To this end, we propose a novel triplane-based 3D-aware \textbf{Diff}usion model with \textbf{T}rans\textbf{F}ormer, \textbf{DiffTF}, for handling challenges via three aspects. \textbf{1}) Considering efficiency and robustness, we adopt a revised triplane representation and improve the fitting speed and accuracy. \textbf{2}) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention. It learns the cross-plane relations across different planes and aggregates the generalized 3D knowledge with specialized 3D features. \textbf{3}) In addition, we devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge in the encoded triplanes for handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance with large diversity, rich semantics, and high quality. More results are available at https://difftf.github.io/
Primary Area: generative models
Keywords: Generative Models Computer Vision Diffusion Models
Scores: [ 6 3 6 6 ]
Typical diffusion models are trained to accept a particular form of conditioning, most commonly text, and cannot be conditioned on other modalities without retraining. In this work, we propose a universal guidance algorithm that enables diffusion models to be controlled by arbitrary guidance modalities without the need to retrain any use-specific components. We show that our algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, style guidance and classifier signals.
Primary Area: generative models
Keywords: diffusion models; temporal coherency; Gaussian noise field; continuous white noise; noise transport
Scores: [ 6 8 8 8 ]
Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This either produces high-frequency flickering, or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed ∫-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses ∫-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed ∫-noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation.
Primary Area: generative models
Keywords: Controllable video generation probabilistic adaptation video diffusion black box adaptation
Scores: [ 8 6 8 3 ]
Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, similar to proprietary language models, large text-to-video models are often black boxes whose weight parameters are not publicly available, posing a significant challenge to adapting these models to specific domains such as robotics, animation, and personalized stylization. Inspired by how a large language model can be prompted to perform new tasks without access to the model weights, we investigate how to adapt a black-box pretrained text-to-video model to a variety of downstream domains without weight access to the pretrained model. In answering this question, we propose \emph{\methodname}, which leverages the score function of a large pretrained video diffusion model as a probabilistic prior to guide the generation of a task-specific small video model. Our experiments show that, by incorporating broad knowledge and fidelity of the pretrained model probabilistically, a small model with as few as 1.25% parameters of the pretrained model can generate high-quality yet domain-specific videos for a variety of downstream domains such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. As large text-to-video models starting to become available as a service similar to large language models, we advocate for private institutions to expose scores of video diffusion models as outputs in addition to generated videos to allow flexible adaptation of large pretrained text-to-video models by the general public.
Primary Area: generative models
Keywords: large language models LLMs
Scores: [ 8 5 6 5 ]
Primary Area: generative models
Keywords: Generative Models Diffusion Models Image Editing Image Composition
Scores: [ 6 6 8 ]
Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics.
Primary Area: generative models
Keywords: Large Language models Transformer model understanding
Scores: [ 8 3 10 ]
Primary Area: generative models
Keywords: quality diversity large language models derivative-free optimization AI feedback
Scores: [ 6 6 6 5 ]
In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through \emph{AI feedback}, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.
Primary Area: generative models
Keywords: Program Synthesis Programming By Example Generalization Compositional Generalization
Scores: [ 6 8 8 6 ]
Primary Area: generative models
Keywords: Transformer Transformer Decoder Decoder-Only Transformer Natural Language Processing NLP Vector Quantization VQ K-Means Clustering Causal Attention Autoregressive Attention Efficient Attention Linear-Time Attention Autoregressive Modeling Generative Modeling Gated Attention Compressive Attention Kernelized Attention Kernelizable Attention Hierarchical Attention Segment-Level Recurrent Attention Long-Context Modeling Long-Range Modeling Long-Range Dependencies Long-Term Dependencies Cached Attention Shift-Equivariant Attention
Scores: [ 8 8 6 ]
Primary Area: generative models
Keywords: Multi-modality Image Completion Diffusion Model
Scores: [ 8 6 6 ]
Vanilla image completion approaches exhibit sensitivity to large missing regions, attributed to the limited availability of reference information for plausible generation. To mitigate this, existing methods incorporate the extra cue as guidance for image completion. Despite improvements, these approaches are often restricted to employing a single modality (e.g., segmentation or sketch maps), which lacks scalability in leveraging multi-modality for more plausible completion.In this paper, we propose a novel, simple yet effective method for Multi-modal Guided Image Completion, dubbed MaGIC, which not only supports a wide range of single modality as the guidance (e.g., text, canny edge, sketch, segmentation, depth, and pose), but also adapts to arbitrarily customized combinations of these modalities (i.e., arbitrary multi-modality) for image completion.For building MaGIC, we first introduce a modality-specific conditional U-Net (MCU-Net) that injects single-modal signal into a U-Net denoiser for single-modal guided image completion. Then, we devise a consistent modality blending (CMB) method to leverage modality signals encoded in multiple learned MCU-Nets through gradient guidance in latent space. Our CMB is training-free, thereby avoids the cumbersome joint re-training of different modalities, which is the secret of MaGIC to achieve exceptional flexibility in accommodating new modalities for completion.Experiments show the superiority of MaGIC over state-of-the-art methods and its generalization to various completion tasks.
Primary Area: generative models
Keywords: generative model diffusion model controllable generation
Scores: [ 6 6 6 8 ]
Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad range of tasks. Specifically, we leverage the manifold hypothesis to refine the guided diffusion steps and introduce a shortcut algorithm in the process. We then propose two methods for on-manifold training-free guidance using pre-trained autoencoders and demonstrate that our shortcut inherently preserves the manifolds when applied to latent diffusion models. Our experiments show that MPGD is efficient and effective for solving a variety of conditional generation applications in low-compute settings, and can consistently offer up to 3.8× speed-ups with the same number of diffusion steps while maintaining high sample quality compared to the baselines.
Primary Area: generative models
Keywords: Graph generative models graph neural networks graph representation
Scores: [ 5 6 8 6 ]
Primary Area: generative models
Keywords: Diffusion Model Text-to-Image Generation Text-to-Video Editing
Scores: [ 6 8 5 6 ]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes maylimit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations.
Primary Area: generative models
Keywords: language models decoding planning game theory
Scores: [ 6 8 6 10 ]
When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using them to score or rank a set of candidate answers). These procedures sometimes yield very different predictions. How do we reconcile mutually incompatible scoring procedures to obtain coherent LM predictions? We introduce a new, a training-free, game-theoretic procedure for language model decoding. Our approach casts language model decoding as a regularized imperfect-information sequential signaling game—which we term the concensus game—in which a generator seeks to communicate an abstract correctness parameter using natural language sentences to a discriminator. We develop computational procedures for finding approximate equilibria of this game, resulting in a decoding algorithm we call equilibrium-ranking. Applied to a large number of tasks (including reading comprehension, commonsense reasoning, mathematical problem-solving, and assistive dialog), equilibrium-ranking consistently improves performance over existing LM decoding procedures. These improvements are sometimes substantial—on multiple benchmarks, we observe that applying equilibrium-ranking to LLaMA-7B outperforms the much larger LLaMA-65B and PaLM-540B models.
Primary Area: generative models
Keywords: large language models large code models instruction tuning
Scores: [ 8 8 6 ]
Primary Area: generative models
Keywords: Prompting Mechanisms Zero-Shot Text-to-Speech Multi-Sentence Prompts
Scores: [ 6 6 6 8 ]
Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage. 2) The prosodic information in prompts is highly coupled with timbre, making it untransferable to each other. This paper introduces Mega-TTS, a generic prompting mechanism for zero-shot TTS, to tackle the aforementioned challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes the prosody and timbre information into the compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS could not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but consistently outperform the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables to transfer various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found in https://boostprompt.github.io/boostprompt/.
Primary Area: generative models
Keywords: large language model efficient inference data-centric optimization parallel decoding prompt engineering
Scores: [ 6 6 6 8 5 3 ]
This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One of the major causes of the high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. In this work, motivated by the thinking and writing process of humans, we propose Skeleton-of-Thought (SoT), which first guides LLMs to generate the skeleton of the answer, and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-ups across 12 LLMs, but it can also potentially improve the answer quality on several question categories. SoT is an initial attempt at data-centric optimization for inference efficiency, and further underscores the potential of pushing LLMs to think more like a human for answer quality.
Primary Area: generative models
Keywords: diffusion model time series multiscale
Scores: [ 6 6 8 6 ]
The diffusion model has been successfully used in many computer vision applications, such as text-guided image generation and image-to-image translation. Recently, there have been attempts on extending the diffusion model for time series data. However, these extensions are fairly straightforward and do not utilize the unique properties of time series data. As different patterns are usually exhibited at multiple scales of a time series, we in this paper leverage this multi-resolution temporal structure and propose the \underline{m}ulti-\underline{r}esolution \underline{diff}usion model (\texttt{mr-Diff}). By using the seasonal-trend decomposition, we sequentially extract fine-to-coarse trends from the time series for forward diffusion. The denoising process then proceeds in an easy-to-hard non-autoregressive manner. The coarsest trend is generated first. Finer details are progressively added, using the predicted coarser trends as condition variables. Experimental results on nine real-world time series datasets demonstrate that \texttt{mr-Diff} outperforms state-of-the-art time series diffusion models. It is also better than or comparable across a wide variety of advanced time series prediction models.
Primary Area: generative models
Keywords: Generative Models Diffusion Models Image Editing
Scores: [ 6 8 6 ]
Text-guided diffusion models have become a popular tool in image synthesis, known for producing high-quality and diverse images. However, their application to editing real images often encounters hurdles primarily due to the text condition deteriorating the reconstruction quality and subsequently affecting editing fidelity. Null-text Inversion (NTI) has made strides in this area, but it fails to capture spatial context and requires computationally intensive per-timestep optimization. Addressing these challenges, we present Noise Map Guidance (NMG), an inversion method rich in a spatial context, tailored for real-image editing. Significantly, NMG achieves this without necessitating optimization, yet preserves the editing quality. Our empirical investigations highlight NMG's adaptability across various editing techniques and its robustness to variants of DDIM inversions.
Primary Area: generative models
Keywords: Code Generation Data Quality Synthetic Datasets
Scores: [ 5 8 8 ]
Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have exclusively concentrated on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered a lot of interest and multiple works have showcased its importance for improving performance. In this work, we investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs by 1.) renaming variables, 2.) modularizing and decomposing complex code into smaller helper sub-functions, and 3.) inserting natural-language based planning annotations. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on our transformed programs improves the performance by up to \textbf{30%} compared to fine-tuning on the original dataset. Additionally, we demonstrate improved performance from using a smaller amount of higher-quality data, finding that a model fine-tuned on the entire original dataset is outperformed by a model trained on one-eighth of our cleaned dataset. Even in comparison to closed-source models, our models outperform the much larger AlphaCode models.
Primary Area: generative models
Keywords: diffusion LLM
Scores: [ 6 5 6 5 ]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs. Our iterative framework offers a promising solution for enhancing text-to-image generation models' fidelity with lengthy, multifaceted descriptions, opening new possibilities for accurate and diverse image synthesis from textual inputs.
Primary Area: generative models
Keywords: Generative pretraining Multimodal generalist Foundation models In-context multimodal learning
Scores: [ 6 6 6 6 ]
We present Emu, a multimodal foundation model that seamlessly generates images and text in multimodal context. This omnivore model can take in any single- modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the leverage of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, supporting in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
Primary Area: generative models
Keywords: large language model scaling in-context learning pruning
Scores: [ 6 6 6 6 ]
We study how down-scaling large language model (LLM) size impacts LLM capabilities. We begin by measuring the effects of weight pruning – a popular technique for reducing model size – on the two abilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in context. Surprisingly, we find that existing pruning techniques affect these two abilities of LLMs differently. For example, pruning more than 30% of weights significantly decreases an LLM’s ability to recall facts presented during pre-training. Yet pruning 60-70% of weights largely preserves an LLM’s ability to process information in-context, ranging from retrieving answers based on information presented in context to learning parameterized functions such as a linear classifier based on a few examples. Moderate pruning impairs LLM’s ability to recall facts learnt from pre-training. However, its effect on model’s ability to process information presented in context is much less pronounced. The said disparate effects similarly arise when replacing the original model with a smaller dense one with reduced width and depth. This similarity suggests that model size reduction in general underpins the said disparity.
Primary Area: generative models
Keywords: large language model multimodal large language model grounding referring
Scores: [ 6 8 6 8 ]
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent text spans (i.e., referring expressions and noun phrases) as links in Markdown, i.e., [text span](bounding boxes), where object descriptions are sequences of location tokens. To train the model, we construct a large-scale dataset about grounded image-text pairs (GrIT) together with multimodal corpora. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability to downstream applications, while maintaining the conventional capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning). Kosmos-2 is evaluated on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This study sheds a light on the big convergence of language, multimodal perception, and world modeling, which is a key step toward artificial general intelligence. Code can be found in the supplementary material.
Primary Area: generative models
Keywords: Computer Vision Applications Video Editing and Synthesis Text-to-Image Diffusion Models Text-driven Video Editing
Scores: [ 6 8 6 8 ]
The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos.
Primary Area: generative models
Keywords: Generative Modelling Molecule Design Denoising Diffusion Probabilistic Models Ablation Study Equivariant Graph Neural Network 3D Molecule Generation Diffusion Model
Scores: [ 3 6 8 6 ]
Primary Area: generative models
Keywords: Generative AI Computer Vision Machine Learning
Scores: [ 5 6 8 3 ]
We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model.We demonstrate the effectiveness of JointNet by using the RGB-D diffusion as an example and through extensive experiments, showcasing its applicability in a variety of applications, including joint RGB-D generation, dense depth prediction, depth-conditioned image generation, and high-resolution 3D panorama generation.
Primary Area: generative models
Keywords: image compression diffusion models generative image models
Scores: [ 6 6 6 6 ]
Image codecs are typically optimized to trade-off bitrate vs. distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual distortion metrics or adversarial losses. To improve image quality, and to make it less dependent on the bitrate, we propose to leverage diffusion models as a decoder module, which rely on an iterative decoding process instead of feed-forward decoders trained using MSE or LPIPS distortions used in most neural codecs. In addition to conditioning the model on a vector-quantized image representation, we also condition on a global textual image description to provide additional context. We dub our model PerCo for ''perceptual compression'', and compare it to state-of-the-art codecs at bitrates from 0.1 down to 0.003 bits per pixel (bpp). The latter rate is an order of magnitude smaller than those considered in most prior work. At this bitrate a 512x768 Kodak image is encoded in just 148 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID, and that the visual quality is less dependent on the bitrate than previous methods. Image codecs are typically optimized to trade-off bitrate vs. distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual or adversarial losses. To improve image quality, and to make it less dependent on the bitrate, we propose to decode with iterative diffusion models, instead of feed-forward decoders trained using MSE or LPIPS distortions used in most neural codecs. In addition to conditioning the model on a vector-quantized image representation, we also condition on a global textual image description to provide additional context. We dub our model PerCo for ''perceptual compression'', and compare it to state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel. The latter rate is an order of magnitude smaller than those considered in most prior work. At this bitrate a 512x768 Kodak image is encoded in less than 153 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID, and that the visual quality is less dependent on the bitrate than previous methods.
Primary Area: generative models
Keywords: diffusion EDM DDIM DPM-Solver optimal stepsize
Scores: [ 6 6 6 6 ]
Primary Area: generative models
Keywords: 3D modeling generative models
Scores: [ 5 6 8 8 ]
Text-to-3D generation has made remarkable progress recently, particularly with methods based on Score Distillation Sampling (SDS) that leverages pre-trained 2D diffusion models. While the usage of classifier-free guidance is well acknowledged to be crucial for successful optimization, it is considered an auxiliary trick rather than the most essential component. In this paper, we re-evaluate the role of classifier-free guidance in score distillation and discover a surprising finding: the guidance alone is enough for effective text-to-3D generation tasks. We name this method Classifier Score Distillation (CSD), which can be interpreted as using an implicit classification model for generation. This new perspective reveals new insights for understanding existing techniques. We validate the effectiveness of CSD across a variety of text-to-3D tasks including shape generation, texture synthesis, and shape editing, achieving results superior to those of state-of-the-art methods.
Primary Area: generative models
Keywords: Unsupervised Learning Object-Centric Representation Learning Dynamic Scene Decomposition Inverse Rendering
Scores: [ 3 6 6 6 ]
Primary Area: generative models
Keywords: vision-language generative model cycle consistency
Scores: [ 8 6 6 8 ]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g. via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT ($\textbf{I}n\textbf{T}$egrating $\textbf{I}$mage $\textbf{T}$ext): an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data. ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or texts. This is achieved by enforcing cycle consistency between the original unpaired samples and the cycle-generated counterparts. For instance, it generates a caption for a given input image and then uses the caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT with unpaired datasets exhibits similar scaling behavior as using high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer (only 3M) paired image-text data.
Primary Area: generative models
Keywords: Proteins; Equivariance; Riemannian; Flow Matching; Generative models
Scores: [ 8 8 8 8 ]
The computational design of novel protein structures has the potential to impact numerous scientific disciplines greatly. Toward this goal, we introduce \foldflow, a series of novel generative models of increasing modeling power based on the flow-matching paradigm over 3D rigid motions---i.e. the group SE(3)---enabling accurate modeling of protein backbones. We first introduce FoldFlow-Base, a simulation-free approach to learning deterministic continuous-time dynamics and matching invariant target distributions on SE(3). We next accelerate training by incorporating Riemannian optimal transport to create FoldFlow-OT, leading to the construction of both more simple and stable flows. Finally, we design FoldFlow-SFM, coupling both Riemannian OT and simulation-free training to learn stochastic continuous-time dynamics over SE(3). Our family of FoldFlow, generative models offers several key advantages over previous approaches to the generative modeling of proteins: they are more stable and faster to train than diffusion-based approaches, and our models enjoy the ability to map any invariant source distribution to any invariant target distribution over SE(3). Empirically, we validate our FoldFlow, models on protein backbone generation of up to 300 amino acids leading to high-quality designable, diverse, and novel samples.
Primary Area: generative models
Keywords: Diffusion model Layout generation Constrained Optimization
Scores: [ 6 6 8 6 ]
Primary Area: generative models
Keywords: normalizing flows injective flows manifold learning maximum likelihood generative model auto encoder
Scores: [ 5 8 5 8 ]
Primary Area: generative models
Keywords: Generative Clustering Multimodal VAEs Variational Autoencoder Multimodal Learning Generative Models
Scores: [ 8 6 6 ]
Multimodal VAEs have recently gained significant attention as generative models for weakly-supervised learning with multiple heterogeneous modalities. In parallel, VAE-based methods have been explored as probabilistic approaches for clustering tasks. At the intersection of these two research directions, we propose a novel multimodal VAE model in which the latent space is extended to learn data clusters, leveraging shared information across modalities. Our experiments show that our proposed model improves generative performance over existing multimodal VAEs, particularly for unconditional generation. Furthermore, we propose a post-hoc procedure to automatically select the number of true clusters thus mitigating critical limitations of previous clustering frameworks. Notably, our method favorably compares to alternative clustering approaches, in weakly-supervised settings. Finally, we integrate recent advancements in diffusion models into the proposed method to improve generative quality for real-world images.
Primary Area: generative models
Keywords: Image generation; vision language models
Scores: [ 6 5 5 8 ]
Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.
Primary Area: generative models
Keywords: Optimal Transport Generative Adversarial Networks
Scores: [ 6 6 6 6 5 ]
Primary Area: generative models
Keywords: calibration hallucination large language model
Scores: [ 8 6 5 ]
A model is considered well-calibrated when its probability estimate aligns with the true likelihood of the output being correct. Calibrating large language models (LLMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations, a common issue of LLMs, as well as building more trustworthy models. Yet popular neural model calibration techniques are not well-suited for LLMs due to their lack of flexibility in discerning answer correctness and their high computational costs. For instance, post-processing methods, e.g., temperature scaling, are often unable to reorder the candidate generations. Moreover, training-based methods require fine-tuning the entire model, which becomes impractical due to the increasing sizes of modern LLMs. In this paper, we present Litcab, a lightweight calibration mechanism consisting of a single linear layer that takes as input the sentence representation and predicts a bias term, which is then added to the LM output logits. Litcab results with better-calibrated models, by only adding and training <2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of six open-ended question-answering (QA) tasks, covering responses ranging from short phrases to paragraphs. We test Litcab with Llama2-7B, where it improves calibration across all tasks. We further conduct a comprehensive evaluation with multiple popular open-sourced LLMs from GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2 and Vicuna models despite having much fewer parameters. (iii) Fine-tuning pretrained model (e.g., LLaMA) with samples of focused purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of fine-tuning setups.
Primary Area: generative models
Keywords: Diffusion models inverse problems
Scores: [ 8 8 6 8 ]
Diffusion models have recently emerged as powerful generative priors for solving inverse problems. However, training diffusion models in the pixel space are both data-intensive and computationally demanding, which restricts their applicability as priors for high-dimensional real-world data such as medical images. Latent diffusion models, which operate in a much lower-dimensional space, offer a solution to these challenges. However, incorporating latent diffusion models to solve inverse problems remains a challenging problem due to the nonlinearity of the encoder and decoder. To address these issues, we propose ReSample, an algorithm that can solve general inverse problems with pre-trained latent diffusion models. Our algorithm incorporates data consistency by solving an optimization problem during the reverse sampling process, a concept that we term as hard data consistency. Upon solving this optimization problem, we propose a novel resampling scheme to map the measurement-consistent sample back onto the noisy data manifold and theoretically demonstrate its benefits. Lastly, we apply our algorithm to solve a wide range of linear and nonlinear inverse problems in both natural and medical images, demonstrating that our approach outperforms existing state-of-the-art approaches, including those based on pixel-space diffusion models.
Primary Area: generative models
Keywords: diffusion video diffusion video generation tuning-free
Scores: [ 6 5 6 6 ]
With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference. Furthermore, these models only support single-text conditions, whereas real-life scenarios often require multi-text conditions as the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Then building upon the observation of noise, we propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noises for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them by window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. It is noteworthy that compared with the previous best-performing method which brought about 255% extra time cost, our method incurs only negligible time cost of approximately 17%. Generated video samples are available at the anonymous website: https://free-noise.github.io.
Primary Area: generative models
Keywords: Diffusion Models Vision-Language Image Generation
Scores: [ 6 6 6 ]
Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I) generation have made significant strides. However, the generation from generalized vision-language inputs, especially involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability of zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of image as a foreign language in image generation.''
Primary Area: generative models
Keywords: source separation probabilistic diffusion models music generation
Scores: [ 8 8 8 8 ]
In this work, we define a diffusion-based generative model capable of both music generation and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture, separating the sources), we also introduce and experiment on the partial generation task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task based on Dirac likelihood functions. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the source separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
Primary Area: generative models
Keywords: diffusion models; score estimation; neural networks; neural tangent kernels
Scores: [ 6 5 8 6 ]
Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness. A key component of these models is to learn the score function through score matching. Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. Our analysis covers both the optimization and the generalization aspects of the learning procedure. In particular, we propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to the standard supervised learning setup, the score-matching problem introduces distinct challenges, including unbounded inputs, vector-valued outputs, and an additional time variable, preventing existing techniques from being applied directly. In this paper, we show that with a properly designed neural network architecture, the score function can be accurately approximated by a reproducing kernel Hilbert space induced by neural tangent kernels. Furthermore, by applying an early-stopping rule for gradient descent and leveraging certain coupling arguments between neural network training and kernel regression, we establish the first generalization error (sample complexity) bounds for learning the score function despite the presence of noise in the observations. Our analysis is grounded in a novel parametric form of the neural network and an innovative connection between score matching and regression analysis, facilitating the application of advanced statistical and optimization techniques.
Primary Area: generative models
Keywords: language model diffusion model video generation visual tokenization
Scores: [ 8 8 8 ]
Primary Area: generative models
Keywords: Generative Evaluation Alignment
Scores: [ 6 5 5 ]
Primary Area: generative models
Keywords: 3D human motion generation diffusion models conditional generation
Scores: [ 6 6 6 8 ]
We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals. At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion. Both the spatial and realism guidance are essential and they are highly complementary for balancing control accuracy and motion realism. By combining them, OmniControl generates motions that are realistic, coherent, and consistent with the spatial constraints. Experiments on HumanML3D and KIT-ML datasets show that OmniControl not only achieves significant improvement over state-of-the-art methods on pelvis control but also shows promising results when incorporating the constraints over other joints. Our code and model weights will be publicly released.
Primary Area: generative models
Keywords: Latent Diffusion Model Text-to-Image Neural Architectures Foundation Models
Scores: [ 8 8 8 8 ]
We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models.A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study.The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favourably in terms of image quality.We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.
Primary Area: generative models
Keywords: Optimal Transport Neural Networks Generative Modelling Unpaired Learning
Scores: [ 6 6 6 6 ]
Primary Area: generative models
Keywords: Image restoration Diverse image restoration Inverse problems Posterior sampling Meaningful diversity
Scores: [ 8 6 8 ]
Image restoration problems are typically ill-posed in the sense that each degraded image can be restored in infinitely many valid ways. To accommodate this, many works generate a diverse set of outputs by attempting to randomly sample from the posterior distribution of natural images given the degraded input. Here we argue that this strategy is commonly of limited practical value because of the heavy tail of the posterior distribution. Consider for example inpainting a missing region of the sky in an image. Since there is a high probability that the missing region contains no object but clouds, any set of samples from the posterior would be entirely dominated by (practically identical) completions of sky. However, arguably, presenting users with only one clear sky completion, along with several alternative solutions such as airships, birds, and balloons, would better outline the set of possibilities. In this paper, we initiate the study of meaningfully diverse image restoration. We explore several post-processing approaches that can be combined with any diverse image restoration method to yield semantically meaningful diversity. Moreover, we propose a practical approach for allowing diffusion based image restoration methods to generate meaningfully diverse outputs, while incurring only negligent computational overhead. We conduct extensive user studies to analyze the proposed techniques, and find the strategy of reducing similarity between outputs to be significantly favorable over posterior sampling.
Primary Area: generative models
Keywords: diffusion model text-to-image text generation
Scores: [ 6 6 8 6 ]
Diffusion model based Text-to-Image has achieved impressive achievements recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image, as synthesized text often contains blurred, unreadable, or incorrect characters, making visual text generation one of the most challenging issues in this field. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model, that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages, to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced soon to improve and promote the development of text generation technology.
Primary Area: generative models
Keywords: Text-to-3D diffusion models
Scores: [ 5 5 6 8 ]
It is inherently ambiguous to lift 2D results from pre-trained diffusion models to a 3D world for text-to-3D generation. 2D diffusion models solely learn view-agnostic priors and thus lack 3D knowledge during the lifting, leading to the multi-view inconsistency problem. We find that this problem primarily stems from geometric inconsistency, and avoiding misplaced geometric structures substantially mitigates the problem in the final outputs. Therefore, we improve the consistency by aligning the 2D geometric priors in diffusion models with well-defined 3D shapes during the lifting, addressing the vast majority of the problem. This is achieved by fine-tuning the 2D diffusion model to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects. In our process, only coarse 3D information is used for aligning. This “coarse” alignment not only resolves the multi-view inconsistency in geometries but also retains the ability in 2D diffusion models to generate detailed and diversified high-quality ob-jects unseen in the 3D datasets. Furthermore, our aligned geometric priors (AGP) are generic and can be seamlessly integrated into various state-of-the-art pipelines, obtaining high generalizability in terms of unseen shapes and visual appearance while greatly alleviating the multi-view inconsistency problem. Our method represents a new state-of-the-art performance with a 85+% consistency rate by human evaluation, while many previous methods are around 30%.
Primary Area: generative models
Keywords: text-to-video generation diffusion models large language models
Scores: [ 6 6 6 ]
Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion (e.g., even lacking the ability to be prompted for objects moving from left to right). To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.
Primary Area: generative models
Keywords: Generative adversarial network optimal transport sliced Wasserstein distance
Scores: [ 6 6 ]
Primary Area: generative models
Keywords: Time Series Generation Koopman Theory; Variational Autoencoder; Generative Modeling
Scores: [ 8 5 6 6 ]
Generating realistic time series data is important for numerous engineering and scientific applications. Several existing works tackle this problem using generative adversarial networks, however, GANs are often unstable during training and suffer from mode collpase. While variational autoencoders (VAEs) are more robust to the above issues, surprisingly, they are considered less for time series generation. In this work, we introduce Koopman VAE (KVAE), a new generative framework that is based on a novel design for the model prior, and that can be optimized for either regular and irregular training data. Inspired by the Koopman theory, we represent the latent conditional prior dynamics using a linear map. Our approach enhances generative modeling with two desired features: (i) incorporating domain knowledge can be achieved by leverageing spectral tools that prescribe constraints on the eigenvalues of the linear map; and (ii) studying the qualitative behavior and stablity of the system can be performed using tools from dynamical systems theory. Our results show that KVAE outperforms state-of-the-art GAN and VAE methods across several challenging synthetic and real-world time series generation benchmarks. Whether trained on regular or irregular data, KVAE generates time series that improve both discriminative and predictive metrics. Further, we present visual evidence suggesting that KVAE learns probability density functions that better approximate the empirical ground truth distribution.
Primary Area: generative models
Keywords: Diffusion Models;text-vision Condition;
Scores: [ 5 6 6 8 ]
Originating from the diffusion phenomenon in physics that describes particle movement, the diffusion generative models inherit the characteristics of stochastic random walk in the data space along the denoising trajectory. However, the intrinsic mutual interference among image regions contradicts the need for practical downstream application scenarios where the preservation of low-level pixel information from given conditioning is desired (e.g., customization tasks like personalized generation and inpainting based on a user-provided single image). In this work, we investigate the diffusion (physics) in diffusion (machine learning) properties and propose our Cyclic One-Way Diffusion (COW) method to control the direction of diffusion phenomenon given a pre-trained frozen diffusion model for versatile customization application scenarios, where the low-level pixel information from the conditioning needs to be preserved. Notably, unlike most current methods that incorporate additional conditions by fine-tuning the base text-to-image diffusion model or learning auxiliary networks, our method provides a novel perspective to understand the task needs and is applicable to a wider range of customization scenarios in a learning-free manner. Extensive experiment results show that our proposed COW can achieve more flexible customization based on strict visual conditions in different application settings.
Primary Area: generative models
Keywords: Text-to-Image Diffusion Transformer
Scores: [ 6 8 6 8 ]
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PixArt-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PixArt-α's training speed markedly surpasses existing large-scale T2I models, e.g., PixArt-α only takes 10.8% of Stable Diffusion v1.5's training time (~675 vs. ~6,250 A100 GPU days), saving nearly \300,000($26,000vs.$320,000)andreducing90\alpha$ excels in image quality, artistry, and semantic control. We hope PixArt-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
Primary Area: generative models
Keywords: Human Image Generation Latent Structural Diffusion
Scores: [ 6 8 6 10 ]
Despite significant advances in large-scale text-to-image models, achieving hyper-realistic human image generation remains a desirable yet unsolved task. Existing models like Stable Diffusion and DALL·E 2 tend to generate human images with incoherent parts or unnatural poses. To tackle these challenges, our key insight is that human image is inherently structural over multiple granularities, from the coarse-level body skeleton to fine-grained spatial geometry. Therefore, capturing such correlations between the explicit appearance and latent structure in one model is essential to generate coherent and natural human images. To this end, we propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts. Specifically, 1) we first build a large-scale human-centric dataset, named HumanVerse, which consists of 340M images with comprehensive annotations like human pose, depth, and surface normal. 2) Next, we propose a Latent Structural Diffusion Model that simultaneously denoises the depth and surface normal along with the synthesized RGB image. Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network, where each branch in the model complements to each other with both structural awareness and textural richness. 3) Finally, to further boost the visual quality, we propose a Structure-Guided Refiner to compose the predicted conditions for more detailed generation of higher resolution. Extensive experiments demonstrate that our framework yields the state-of-the-art performance, generating hyper-realistic human images under diverse scenarios.
Primary Area: generative models
Keywords: Image editing Diffusion model
Scores: [ 8 3 8 5 ]
Diffusion-based image editing methods have achieved remarkable advances in text-driven image editing. The editing task aims to edit an input image with the original prompt to the desired image aligned with the target prompt. By comparing the original and target prompts, we can obtain numerous editing pairs, each comprising an object and its corresponding editing target. To allow editability while maintaining fidelity to the input image, existing editing methods typically involve a fixed number of inversion steps that project the whole input image to its noisier latent representation, followed by a denoising process guided by the target prompt. However, we find that the optimal number of inversion steps for achieving ideal editing results varies significantly among different editing pairs, owing to varying editing difficulties. Therefore, the current literature, which relies on a fixed number of inversion steps, produces suboptimal generation quality, especially when handling multiple editing pairs in a natural image.To this end, we propose a new image editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR), to enable object-level fine-grained editing. Specifically, we design a new search metric, which determines the optimal inversion step for each editing pair, by jointly considering the editability of the target and the fidelity of the non-editing region. When editing an image, we first search the optimal inversion step for each editing pair with our search metric and edit them separately to circumvent concept mismatch. Subsequently, we propose an additional reassembly step to seamlessly integrate the respective editing results and the non-editing region to obtain the final edited image. To systematically evaluate the effectiveness of our method, we collect two datasets for benchmarking single- and multi-object editing, respectively. Experiments demonstrate that our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios.
Primary Area: generative models
Keywords: score distillation sampling generative models text to image
Scores: [ 6 8 6 6 ]
Score Distillation Sampling (SDS) has emerged as the de facto approach for text-to-content generation in non-image domains. In this paper, we reexamine the SDS process and introduce a straightforward interpretation that demystifies the necessity for large Classifier-Free Guidance (CFG) scales, rooted in the distillation of an undesired noise term. Building upon our interpretation, we propose a novel Noise-Free Score Distillation (NFSD) process, which requires minimal modifications to the original SDS framework. Through this streamlined design, we achieve more effective distillation of pre-trained text-to-image diffusion models while using a nominal CFG scale. This strategic choice allows us to prevent the over-smoothing of results, ensuring that the generated data is both realistic and complies with the desired prompt. To demonstrate the efficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as well as several other methods.
Primary Area: generative models
Keywords: Image Synthesis Diffusion Generative AI
Scores: [ 8 8 8 8 ]
We present Stable Diffusion XL (SDXL), a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone, achieved by significantly increasing the number of attention blocks and including a second text encoder. Further, we design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. To ensure highest quality results, we also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL improves dramatically over previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators such as Midjourney.
Primary Area: generative models
Keywords: diffusion prior composition generation motion generative
Scores: [ 6 6 6 6 ]
Recent work has demonstrated the significant potential of denoising diffusion modelsfor generating human motion, including text-to-motion capabilities.However, these methods are restricted by the paucity of annotated motion data,a focus on single-person motions, and a lack of detailed control.In this paper, we introduce three forms of composition based on diffusion priors:sequential, parallel, and model composition.Using sequential composition, we tackle the challenge of long sequencegeneration. We introduce DoubleTake, an inference-time method with whichwe generate long animations consisting of sequences of prompted intervalsand their transitions, using a prior trained only for short clips.Using parallel composition, we show promising steps toward two-person generation.Beginning with two fixed priors as well as a few two-person training examples, we learn a slimcommunication block, ComMDM, to coordinate interaction between the two resulting motions.Lastly, using model composition, we first train individual priorsto complete motions that realize a prescribed motion for a given joint.We then introduce DiffusionBlending, an interpolation mechanism to effectively blend severalsuch models to enable flexible and efficient fine-grained joint and trajectory-level control and editing.We evaluate the composition methods using an off-the-shelf motion diffusion model,and further compare the results to dedicated models trained for these specific tasks.
Primary Area: generative models
Keywords: video generation diffusion models
Scores: [ 6 8 8 6 ]
Video diffusion models have recently made great progress in generation quality, but are still limited by the high memory and computational requirements. This is because current video diffusion models often attempt to process high-dimensional videos directly. To tackle this issue, we propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. The former represents the common content, and the latter represents the underlying motion in the video, respectively. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation here is the design of a compact latent space that can directly utilizes a pretrained image diffusion model, which has not been done in previous latent video diffusion models. This leads to considerably better quality generation and reduced computational costs. For instance, CMD can sample a video 7.7$\times$ faster than prior approaches by generating a video of 512$\times$1024 resolution and length 16 in 3.1 seconds. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4.
Primary Area: generative models
Keywords: texture synthesis generative model NeRF
Scores: [ 8 6 6 8 ]
Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper, we study the problem of generating high-fidelity texture given shapes of 3D assets, which has been relatively less explored compared with generic 3D shape modeling. Our goal is to facilitate a controllable texture generation process, such that one texture code can correspond to a particular appearance style independent of any input shapes from a category. We introduce Texture UV Radiance Fields (TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D shape. This allows the texture to be disentangled from the underlying shape and transferable to other shapes that share the same UV space, i.e., from the same category. We integrate the UV sphere space with the radiance field, which provides a more efficient and accurate representation of textures than traditional texture maps. We perform our experiments on synthetic and real-world object datasets where we achieve not only realistic synthesis but also substantial improvements over state-of-the-arts on texture controlling and editing.
Primary Area: generative models
Keywords: ai-image detection image watermark deepfake detection watermark attack generative models
Scores: [ 5 6 6 6 ]
In light of recent advancements in generative AI models, it has become essential to distinguish genuine content from AI-generated one to prevent the malicious usage of fake materials as authentic ones and vice versa. Various techniques have been introduced for identifying AI-generated images, with watermarking emerging as a promising approach. In this paper, we analyze the robustness of various AI-image detectors including watermarking and classifier-based deepfake detectors. For watermarking methods that introduce subtle image perturbations (i.e., low perturbation budget methods), we reveal a fundamental trade-off between the evasion error rate (i.e., the fraction of watermarked images detected as non-watermarked ones) and the spoofing error rate (i.e., the fraction of non-watermarked images detected as watermarked ones) upon an application of a diffusion purification attack. In this regime, we also empirically show that diffusion purification effectively removes watermarks with minimal changes to images. For high perturbation watermarking methods where notable changes are applied to images, the diffusion purification attack is not effective. In this case, we develop a model substitution adversarial attack that can successfully remove watermarks. Moreover, we show that watermarking methods are vulnerable to spoofing attacks where the attacker aims to have real images (potentially obscene) identified as watermarked ones, damaging the reputation of the developers. In particular, by just having black-box access to the watermarking method, we show that one can generate a watermarked noise image which can be added to the real images to have them falsely flagged as watermarked ones. Finally, we extend our theory to characterize a fundamental trade-off between the robustness and reliability of classifier-based deep fake detectors and demonstrate it through experiments.
Primary Area: generative models
Keywords: LLMs Sparse Feedback Ratings Rankings Inconsistency Evaluation
Scores: [ 6 6 6 8 ]
Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect of this design choice for the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings significantly disagree 60% for both human and AI annotators. Our subsequent analysis identifies various facets of annotator biases that explain this phenomena such as human annotators would rate denser responses higher while preferring accuracy during pairwise judgments, for a particular comparison instance. To our surprise, we observe that the choice of feedback protocol has a significant effect on the evaluation of aligned LLMs. In particular, we find that LLMs that leverage rankings data for alignment (say model X) are preferred over those that leverage ratings data (say model Y), with a rank-based evaluation protocol (is X/Y's response better than reference response?) but not with a rating-based evaluation protocol (score Rank X/Y's response on a scale of 1-7). Our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment.
Primary Area: generative models
Keywords: Large Language Model Tool Use API Use
Scores: [ 6 8 6 8 ]
Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.
Primary Area: generative models
Keywords: Joint shape matching;shape generator;implicit representation;geometric regularization
Scores: [ 6 5 8 8 ]
Primary Area: generative models
Keywords: energy-based model generative model optimal transport entropic optimal transport general optimal transport cost function
Scores: [ 6 6 6 ]
Energy-based models (EBMs) are known in the Machine Learning community for decades. Since the seminal works devoted to EBMs dating back to the noughties, there have been a lot of efficient methods which solve the generative modelling problem by means of energy potentials (unnormalized likelihood functions). In contrast, the realm of Optimal Transport (OT) and, in particular, neural OT solvers is much less explored and limited by few recent works (excluding WGAN-based approaches which utilize OT as a loss function and do not model OT maps themselves). In our work, we bridge the gap between EBMs and Entropy-regularized OT. We present a novel methodology which allows utilizing the recent developments and technical improvements of the former in order to enrich the latter. From the theoretical perspective, we prove generalization bounds for our technique. In practice, we validate its applicability in toy 2D and image domains. To showcase the scalability, we empower our method with a pre-trained StyleGAN and apply it to high-res AFHQ 512×512 unpaired I2I translation. For simplicity, we choose simple short- and long-run EBMs as a backbone of our Energy-guided Entropic OT approach, leaving the application of more sophisticated EBMs for future research.
Primary Area: generative models
Keywords: Diffusion Models Multi-Task Learning Generative Models
Scores: [ 6 6 6 5 ]
Primary Area: generative models
Keywords: Large Language Models Code Generation Code Summarization LLM Evaluation Self-Consistency
Scores: [ 8 3 8 6 ]
Primary Area: generative models
Keywords: 3D graphs latent diffusion models in/equivariant representations
Scores: [ 6 6 6 5 6 8 ]
Generating 3D graphs of \textit{symmetry-group equivariance} is of intriguing potential in broad applications from machine vision to molecular discovery.Emerging approaches adopt diffusion generative models (DGMs) with proper re-engineering to capture 3D graph distributions.In this paper, we raise an orthogonal and fundamental question of \textit{in what (latent) space we should diffuse 3D graphs}.\ding{182} We motivate the study with theoretical analysis showing that the performance bound of 3D graph diffusion could be improved in a latent space versus the original space, provided that there are (i) low dimensionality yet (ii) high quality (i.e., low reconstruction error) of the latent space, and (iii) symmetry preservation as an inductive bias of latent DGMs.\ding{183} Guided by the theoretical guidelines, we propose to perform 3D graph diffusion in a low-dimensional latent space, which is learned through cascaded 2D--3D graph autoencoders for low-error reconstruction and symmetry-group invariance.The overall pipeline is dubbed \textbf{latent 3D graph diffusion}.\ding{184} Motivated by applications in molecular discovery, we further extend latent 3D graph diffusion to conditional generation given SE(3)-invariant attributes or equivariant 3D objects.\ding{185} We also demonstrate empirically that out-of-distribution conditional generation can be further improved by regularizing the latent space via graph self-supervised learning.We validate through comprehensive experiments that our method generates 3D molecules of higher validity / drug-likeliness and comparable conformations / energetics, while being an order of magnitude faster in training. Codes will be released upon acceptance.
Primary Area: generative models
Keywords: mathematical reasoning large language models zero-shot learning code generation prompting
Scores: [ 3 8 6 8 ]
Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit $\underline{\text{c}}$ode-based $\underline{\text{s}}elf−\underline{\text{v}}$erification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on MATH dataset (53.9% → 84.3%).
Primary Area: generative models
Keywords: diffusion models efficiency network pruning
Scores: [ 8 6 6 6 ]
Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for data generation across various domains. However, a significant bottleneck is the necessity for whole-network computation during every step of the generative process, leading to high computational overheads. This paper presents a novel framework, Denoising Diffusion Step-aware Models (DDSM), to address this challenge. Unlike conventional approaches, DDSM employs a spectrum of neural networks whose sizes are adapted according to the importance of each generative step, as determined through evolutionary search. This step-wise network variation effectively circumvents redundant computational efforts, particularly in less critical steps, thereby enhancing the efficiency of the diffusion model. Furthermore, the step-aware design can be seamlessly integrated with other efficiency-geared diffusion models such as DDIMs and latent diffusion, thus broadening the scope of computational savings. Empirical evaluations demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all without compromising the generation quality. Our code and models will be publicly available.
Primary Area: generative models
Keywords: Vision Question Answering Natural Language Explanation (VQA-NLE)
Scores: [ 6 6 3 ]
Natural Language Explanation (NLE) in vision and language tasks aims to provide human-understandable explanations for the associated decision-making process. In practice, one might encounter explanations which lack informativeness or contradict visual-grounded facts, known as \textit{implausibility} and \textit{hallucination} problems, respectively. To tackle these challenging issues, we consider the task of visual question answering (VQA) and introduce \textit{Rapper}, a two-stage \textbf{R}einforced R\textbf{a}tionale-\textbf{P}rom\textbf{p}t\textbf{e}d Pa\textbf{r}adigm. By knowledge distillation, the former stage of \textit{Rapper} infuses rationale-prompting via large language models (LLMs), encouraging the rationales supported by language-based facts. As for the latter stage, a unique Reinforcement Learning from NLE Feedback (RLNF) is introduced for injecting visual facts into NLE generation. Finally, quantitative and qualitative experiments on two VL-NLE benchmarks show that \textsc{Rapper} surpasses state-of-the-art VQA-NLE methods while providing plausible and faithful NLE.
Primary Area: generative models
Keywords: Generative models Natural language generation Neural machine translation LLMs MBR decoding Quality estimation Finetuning Distillation
Scores: [ 6 6 6 6 ]
Recent research in decoding methods for Natural Language Generation (NLG) tasks has shown that MAP decoding is not optimal, because model probabilities do not always align with human preferences. Stronger decoding methods, including Quality Estimation (QE) reranking and Minimum Bayes' Risk (MBR) decoding, have since been proposed to mitigate the model-perplexity-vs-quality mismatch. While these decoding methods achieve state-of-the-art performance, they are prohibitively expensive to compute. In this work, we propose MBR finetuning and QE finetuning, which distill the quality gains from these decoding methods at training time, while using an efficient decoding algorithm at inference time. Using the canonical NLG task of Neural Machine Translation (NMT), we show that even with self-training, these finetuning methods significantly outperform the base model. Moreover, when using an external LLM as a teacher model, these finetuning methods outperform finetuning on human-generated references. These findings suggest new ways to leverage monolingual data to achieve improvements in model quality that are on par with, or even exceed, improvements from human-curated data, while maintaining maximum efficiency during decoding.
Primary Area: generative models
Keywords: Future Language Modeling Future Language Model Temporal Document History
Scores: [ 8 8 6 ]
Predicting the future is of great interest across many aspects of human activity. Businesses are interested in future trends, traders are interested in future stock prices, and companies are highly interested in future technological breakthroughs. While there are many automated systems for predicting future numerical data, such as weather, stock prices, and demand for products, there is relatively little work in automatically predicting textual data. Humans are interested in textual data predictions because it is a natural format for our consumption, and experts routinely make predictions in a textual format (Christensen et al., 2004; Tetlock & Gardner, 2015; Frick, 2015). However, there has been relatively little formalization of this general problem in the machine learning or natural language processing communities. To address this gap, we introduce the task of future language modeling: probabilistic modeling of texts in the future based on a temporal history of texts. To our knowledge, our work is the first work to formalize the task of predicting the future in this way. We show that it is indeed possible to build future language models that improve upon strong non-temporal language model baselines, opening the door to working on this important, and widely applicable problem.
Primary Area: generative models
Keywords: Large language models prompt interpretability prompt manipulation controlled LLM generation
Scores: [ 6 6 8 6 ]
Prompts play a crucial role in guiding the responses of Large Language Models (LLMs). However, the intricate role of individual tokens in prompts, known as input saliency, in shaping the responses remains largely underexplored. Existing saliency methods either misalign with LLM generation objectives or rely heavily on linearity assumptions, leading to potential inaccuracies. To address this, we propose Token Distribution Dynamics (TDD), an elegantly simple yet remarkably effective approach to unveil and manipulate the role of prompts in generating LLM outputs. TDD leverages the robust interpreting capabilities of the language model head (LM head) to assess input saliency. It projects input tokens into the embedding space and then estimates their significance based on distribution dynamics over the vocabulary. We introduce three TDD variants: forward, backward, and bidirectional, each offering unique insights into token relevance. Extensive experiments reveal that the TDD surpasses state-of-the-art baselines with a big margin in elucidating the causal relationships between prompts and LLM outputs. Beyond mere interpretation, we apply TDD to two prompt manipulation tasks for controlled text generation: zero-shot toxic language suppression and sentiment steering. Empirical results underscore TDD's proficiency in identifying both toxic and sentimental cues in prompts, subsequently mitigating toxicity or modulating sentiment in the generated content.
Primary Area: generative models
Keywords: Model Editing Diffusion Model Concept Erasure
Scores: [ 8 6 8 5 6 ]
Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine seven recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
Primary Area: generative models
Keywords: Diffusion Models Expressive Bottleneck Soft Mixture Denoising (SMD)
Scores: [ 8 8 5 6 ]
Primary Area: generative models
Keywords: Large language model safety alignment mistake analysis
Scores: [ 8 5 6 5 ]
The rapid advancement of large language models (LLMs) presents both opportunities and challenges, particularly concerning unintentional generation of harmful and toxic responses. While the traditional alignment methods strive to steer LLMs towards desired performance and shield them from malicious content, this study proposes a novel alignment strategy rooted in mistake analysis by exposing LLMs to flawed outputs purposefully and then conducting a thorough assessment to fully comprehend the internal reasons via natural language. Thus, harmful responses can be transformed into instruction tuning corpus for model alignment, and LLMs can not only be deterred from generating toxic responses but also trained to self-criticize, leveraging its innate ability to discriminate toxic content. Experimental results demonstrate that the proposed method outperforms conventional alignment techniques for safety instruction following, while maintaining superior efficiency.
Primary Area: generative models
Keywords: Diffusion Models Visual Correspondence
Scores: [ 8 8 8 8 ]
The objective for establishing dense correspondence between paired images consists of two terms: a data term and a prior term. While conventional techniques focused on defining hand-designed prior terms, which are difficult to formulate, recent approaches have focused on learning the data term with deep neural networks without explicitly modeling the prior, assuming that the model itself has the capacity to learn an optimal prior from a large-scale dataset. The performance improvement was obvious, however, they often fail to address inherent ambiguities of matching, such as textureless regions, repetitive patterns, large displacements, or noises. To address this, we propose DiffMatch, a novel conditional diffusion-based framework designed to explicitly model both the data and prior terms for dense matching. This is accomplished by leveraging a conditional denoising diffusion model that explicitly takes matching cost and injects the prior within generative process. However, limited resolution of the diffusion model is a major hindrance. We address this with a cascaded pipeline, starting with a low-resolution model, followed by a super-resolution model that successively upsamples and incorporates finer details to the matching field. Our experimental results demonstrate significant performance improvements of our method over existing approaches, and the ablation studies validate our design choices along with the effectiveness of each component. The code and pretrained weights will be available.
Primary Area: generative models
Keywords: lip-to-speech lip-reading diffusion models
Scores: [ 8 5 6 5 ]
Lip-to-speech involves generating a natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high levels of intelligibility for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism where a pre-trained automatic speech recognition (ASR ) serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech, readily perceptible while listening, and is empirically reflected in the substantial reduction of the word error rate ( WER ) metric. We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech signals compared to competing methods. Finally, we created a demo showcasing LipVoicer’s superiority in producing natural, synchronized, and intelligible speech, providing additional evidence of its effectiveness. Project page: https://lipvoicer.github.io
Primary Area: generative models
Keywords: diffusion models sampling methods exposure bias
Scores: [ 6 6 6 6 ]
Primary Area: generative models
Keywords: Panorama Outpainting Latent Diffusion Multi-modal Generation
Scores: [ 6 8 8 6 ]
Generating complete 360\textdegree{} panoramas from narrow field of view images is ongoing research as omnidirectional RGB data is not readily available. Existing GAN-based approaches face some barriers to achieving higher quality output, and have poor generalization performance over different mask types. In this paper, we present our 360\textdegree{} indoor RGB panorama outpainting model using latent diffusion models (LDM), called PanoDiffusion. We introduce a new bi-modal latent diffusion structure that utilizes both RGB and depth panoramic data during training, which works surprisingly well to outpaint depth-free RGB images during inference. We further propose a novel technique of introducing progressive camera rotations during each diffusion denoising step, which leads to substantial improvement in achieving panorama wraparound consistency. Results show that our PanoDiffusion not only significantly outperforms state-of-the-art methods on RGB panorama outpainting by producing diverse well-structured results for different types of masks, but can also synthesize high-quality depth panoramas to provide realistic 3D indoor models.
Primary Area: generative models
Keywords: diffusion model Baysian uncertainty
Scores: [ 8 6 6 6 ]
Diffusion models have impressive image generation capability, but low-quality generations still exist, and their identification remains challenging due to the lack of a proper sample-wise metric. To address this, we propose BayesDiff, a pixel-wise uncertainty estimator for generations from diffusion models based on Bayesian inference. In particular, we derive a novel uncertainty iteration principle to characterize the uncertainty dynamics in diffusion, and leverage the last-layer Laplace approximation for efficient Bayesian inference. The estimated pixel-wise uncertainty can not only be aggregated into a sample-wise metric to filter out low-fidelity images but also aids in augmenting successful generations and rectifying artifacts in failed generations in text-to-image tasks. Extensive experiments demonstrate the efficacy of BayesDiff and its promise for practical applications.
Primary Area: generative models
Keywords: multi-modality; large language models; generation; model efficiency;
Scores: [ 6 6 8 8 ]
This paper introduces an efficient strategy to transform Large Language Models(LLMs) into Multi-Modal Large Language Models (MLLMs). By conceptualizingthis transformation as a domain adaptation process, i.e., transitioning from textunderstanding to embracing multiple modalities, we intriguingly note that, withineach attention block, tuning LayerNorm suffices to yield strong performance.Moreover, when benchmarked against other tuning approaches like full parameterfinetuning or LoRA, its benefits on efficiency are substantial. For example, whencompared to LoRA on a 13B model scale, performance can be enhanced by anaverage of over 20% across five multi-modal tasks, and meanwhile, results in asignificant reduction of trainable parameters by 41.9% and a decrease in GPUmemory usage by 17.6%. On top of this LayerNorm strategy, we showcase thatselectively tuning only with conversational data can improve efficiency further.Beyond these empirical outcomes, we provide a comprehensive analysis to explorethe role of LayerNorm in adapting LLMs to the multi-modal domain and improvingthe expressive power of the model.
Primary Area: generative models
Keywords: in-context learning large language models
Scores: [ 6 6 8 6 ]
The predictions of Large Language Models (LLMs) on downstream tasks often improve significantly when including examples of the input–label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works. For example, while Xie et al. (2022) liken ICL to a general-purpose learning algorithm, Min et al. (2022b) argue ICL does not even learn label relationships from in-context examples. In this paper, we provide novel insights into how ICL leverages label information, revealing both capabilities and limitations. To ensure we obtain a comprehensive picture of ICL behavior, we study probabilistic aspects of ICL predictions and thoroughly examine the dynamics of ICL as more examples are provided. Our experiments show that ICL predictions almost always depend on in-context labels, and that ICL can learn truly novel tasks in-context. However, we also find that ICL struggles to fully overcome prediction preferences acquired from pre-training data, and, further, that ICL does not consider all in-context information equally.
Primary Area: generative models
Keywords: Denoising Diffusion Probabilistic Models Inverse Problems Generative Models Super Resolution Phase Quantification Variational Methods
Scores: [ 5 8 8 5 3 ]
Primary Area: generative models
Keywords: synthetic data differential privacy model API foundation models
Scores: [ 5 6 8 6 ]
Primary Area: generative models
Keywords: reinforcement learning RLHF diffusion models
Scores: [ 6 6 8 5 ]
Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation.
Primary Area: generative models
Keywords: Information Extraction Zero-Shot Annotation Guidelines Large Language Models LLM prompt
Scores: [ 8 5 6 6 ]
Primary Area: generative models
Keywords: Generative model idempotent energy based models
Scores: [ 8 5 3 8 ]
We propose a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely f(f(z))=f(z). The proposed model f is trained to map a source distribution (e.g, Gaussian noise) to a target distribution (e.g. realistic images) using the following objectives: (1) Instances from the target distribution should map to themselves, namely f(x)=x. We define the target manifold as the set of all instances that f maps to themselves.(2) Instances that form the source distribution should map onto the defined target manifold. This is achieved by optimizing the idempotence term, f(f(z))=f(z) which encourages the range of f(z) to be on the target manifold. Under ideal assumptions such a process provably converges to the target distribution. This strategy results in a model capable of generating an output in one step, maintaining a consistent latent space, while also allowing sequential applications for refinement. Additionally, we find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back to the target manifold. This work is a first step towards a global projector'' that enables projecting any input into a target data distribution.
Primary Area: generative models
Keywords: Schrödinger Bridge Optimal transport Entropy regularized OT Unpaired Learning
Scores: [ 8 5 5 8 8 ]
Despite the recent advances in the field of computational Schrodinger Bridges (SB), most existing SB solvers are still heavy-weighted and require complex optimization of several neural networks. It turns out that there is no principal solver which plays the role of simple-yet-effective baseline for SB just like, e.g., k-means method in clustering, logistic regression in classification or Sinkhorn algorithm in discrete optimal transport. We address this issue and propose a novel fast and simple SB solver. Our development is a smart combination of two ideas which recently appeared in the field: (a) parameterization of the Schrodinger potentials with sum-exp quadratic functions and (b) viewing the log-Schrodinger potentials as the energy functions. We show that combined together these ideas yield a lightweight, simulation-free and theoretically justified SB solver with a simple straightforward optimization objective. As a result, it allows solving SB in moderate dimensions in a matter of minutes on CPU without a painful hyperparameter selection. Our light solver resembles the Gaussian mixture model which is widely used for density estimation. Inspired by this similarity, we also prove an important theoretical result showing that our light solver is a universal approximator of SBs.
Primary Area: generative models
Keywords: Intuitive physics inverse graphics particle-based fluid simulation neural rendering
Scores: [ 6 6 6 8 ]
We introduce latent intuitive physics, a transfer learning framework for physics simulation that can infer hidden properties of fluids from a single 3D video and simulate the observed fluid in novel scenes. Our key insight is to use latent features drawn from a learnable prior distribution conditioned on the underlying particle states to capture the invisible and complex physical properties. To achieve this, we train a parametrized prior learner given visual observations to approximate the visual posterior of inverse graphics, and both the particle states and the visual posterior are obtained from a learned neural renderer. The converged prior learner is embedded in our probabilistic physics engine, allowing us to perform novel simulations on unseen geometries, boundaries, and dynamics without knowledge of the true physical parameters. We validate our model in three ways: (i) novel scene simulation with the learned visual-world physics, (ii) future prediction of the observed fluid dynamics, and (iii) supervised particle simulation. Our model demonstrates strong performance in all three tasks.
Primary Area: generative models
Keywords: 3D Open Vocabulary Conditional Generation
Scores: [ 6 6 8 ]
Recent works learn 3D representation explicitly under text-3D guidance. However, limited text-3D data restricts the vocabulary scale and text control of generations. Generators may easily fall into a stereotype concept for certain text prompts, thus losing open-vocabulary generation ability. To tackle this issue, we introduce a conditional 3D generative model, namely TextField3D. Specifically, rather than using the text prompts as input directly, we suggest to inject dynamic noise into the latent space of given text prompts, i.e., Noisy Text Fields (NTFs). In this way, limited 3D data can be mapped to the appropriate range of textual latent space that is expanded by NTFs. To this end, an NTFGen module is proposed to model general text latent code in noisy fields. Meanwhile, an NTFBind module is proposed to align view-invariant image latent code to noisy fields, further supporting image-conditional 3D generation. To guide the conditional generation in both geometry and texture, multi-modal discrimination is constructed with a text-3D discriminator and a text-2.5D discriminator. Compared to previous methods, TextField3D includes three merits: 1) large vocabulary, 2) text consistency, and 3) low latency. Extensive experiments demonstrate that our method achieves a potential open-vocabulary 3D generation capability.
Primary Area: generative models
Keywords: Consistency Models Consistency Training Diffusion Models Score-Based Generative Models Score-Based Diffusion Models Distillation
Scores: [ 8 8 6 6 ]
Consistency models are a nascent family of generative models that can sample high quality data in one step without the need for adversarial training. Current consistency models achieve optimal sample quality by distilling from pre-trained diffusion models, and employing learned metrics such as LPIPS. However, distillation limits the quality of consistency models to that of the pre-trained diffusion model, and LPIPS causes undesirable bias in evaluation. To tackle these challenges, we present improved techniques for consistency training, where consistency models learn directly from data without distillation. We delve into the theory behind consistency training and identify a previously overlooked flaw, which we address by eliminating Exponential Moving Average from the teacher consistency model. To replace learned metrics like LPIPS, we borrow Pseudo-Huber losses from robust statistics. Additionally, we introduce a new noise schedule for the consistency training objective, and propose a new curriculum for total discretization steps. Collectively, these modifications enable consistency models to achieve FID scores of 2.62 and 3.91 on CIFAR-10 and ImageNet 64×64 respectively in a single sampling step. These scores mark a 3.3$\times$ improvement compared to prior consistency training approaches. Through two-step sampling, we further reduce FID scores to 2.28 and 3.64, surpassing those obtained via distillation in both one-step and two-step settings, while narrowing the gap between consistency models and state-of-the-art generative models on both datasets.
Primary Area: generative models
Keywords: Image Generation Generative models Diffusion models
Scores: [ 6 8 8 8 ]
Primary Area: generative models
Keywords: Optical Flow Image Editing Guidance Diffusion Models
Scores: [ 6 6 8 8 ]
Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, the ability to precisely edit the layout, position, pose, and shape of objects in images with diffusion models is still difficult. To this end, we propose {\it motion guidance}, a zero-shot technique that allows a user to specify dense, complex motion fields that indicate where each pixel in an image should move. Motion guidance works by steering the diffusion sampling process with the gradients through an off-the-shelf optical flow network. Specifically, we design a guidance loss that encourages the sample to have the desired motion, as estimated by a flow network, while also being visually similar to the source image. By simultaneously sampling from a diffusion model and guiding the sample to have low guidance loss, we can obtain a motion-edited image. We demonstrate that our technique works on complex motions and produces high quality edits of real and generated images.
Primary Area: generative models
Keywords: Generative Model Synthetic EHR
Scores: [ 6 5 5 6 6 ]
Realistic synthetic electronic health records (EHRs) can be leveraged to acceler- ate methodological developments for research purposes while mitigating privacy concerns associated with data sharing. However, the training of Generative Ad- versarial Networks remains challenging, often resulting in issues like mode col- lapse. While diffusion models have demonstrated progress in generating qual- ity synthetic samples for tabular EHRs given ample denoising steps, their perfor- mance wanes when confronted with missing modalities in heterogeneous tabular EHRs data. For example, some EHRs contain solely static measurements, and some contain only contain temporal measurements, or a blend of both data types. To bridge this gap, we introduce FLEXGEN-EHR– a versatile diffusion model tai- lored for heterogeneous tabular EHRs, equipped with the capability of handling missing modalities in an integrative learning framework. We define an optimal transport module to align and accentuate the common feature space of hetero- geneity of EHRs. We empirically show that our model consistently outperforms existing state-of-the-art synthetic EHR generation methods both in fidelity by up to 3.10% and utility by up to 7.16%. Additionally, we show that our method can be successfully used in privacy-sensitive settings, where the original patient-level data cannot be shared.
Primary Area: generative models
Keywords: Diffusion Probabilistic Model Diffusion Sampler Solver Schedule
Scores: [ 6 6 6 6 6 ]
Primary Area: generative models
Keywords: Green AI Large Language Models Fine-Tuning Adaptive Backpropagation
Scores: [ 6 5 6 5 ]
Primary Area: generative models
Keywords: 3D-aware image synthesis diffusion model latent diffusion 3D-aware generative model
Scores: [ 6 8 8 6 ]
Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionallycaptures the images’ underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly,our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
Primary Area: generative models
Keywords: Text-to-3D Image-to-3D 3D Generation Efficiency
Scores: [ 8 8 10 8 ]
Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS).Though promising results have been exhibited, these methods often suffer from slow per-sample optimization, limiting their practical usage. In this paper, we propose DreamGaussian, a novel 3D content generation framework that achieves both efficiency and quality simultaneously. Our key insight is to design a generative 3D Gaussian Splatting model with companioned mesh extraction and texture refinement in UV space.In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks.To further enhance the texture quality and facilitate downstream applications, we introduce an efficient algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning stage to refine the details.Extensive experiments demonstrate the superior efficiency and competitive generation quality of our proposed approach.Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes from a single-view image, achieving approximately 10 times acceleration compared to existing methods.
Primary Area: generative models
Keywords: Large Language Models Code Instruction Fine-tuning
Scores: [ 8 5 6 6 ]
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated remarkable performance in various code-related tasks. However, different from their counterparts in the general language modeling field, the technique of instruction fine-tuning remains relatively under-researched in this domain. In this paper, we present Code Evol-Instruct, a novel approach that adapts the Evol-Instruct method to the realm of code, enhancing Code LLMs to create novel models WizardCoder. Through comprehensive experiments on five prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, DS-1000, and MultiPL-E, our models showcase outstanding performance. They consistently outperform all other open-source Code LLMs by a significant margin. Remarkably, WizardCoder 15B even surpasses the largest closed-source LLMs, including Anthropic’s Claude and Google’s Bard, on the HumanEval and HumanEval+ benchmarks. Additionally, WizardCoder 34B not only achieves a HumanEval score comparable to GPT3.5 (ChatGPT) but also surpasses it on the HumanEval+ benchmark. Furthermore, our preliminary exploration highlights the pivotal role of instruction complexity in achieving exceptional coding performance.
Primary Area: generative models
Keywords: diffusion video generation
Scores: [ 6 6 6 ]
Primary Area: generative models
Keywords: Diffusion Models Model Quantization Model Compression Efficient Models
Scores: [ 6 6 8 6 ]
Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for low-latency real-world applications is constrained by substantial computational costs and latency issues. Quantization is a dominant way to compress and accelerate diffusion models, where post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches, each bearing its own properties. While PTQ exhibits efficiency in terms of both time and data usage, it may lead to diminished performance in low bit-width settings. On the other hand, QAT can help alleviate performance degradation but comes with substantial demands on computational and data resources. To capitalize on the advantages while avoiding their respective drawbacks, we introduce a data-free, quantization-aware and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with model weights and jointly quantized to low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the requirement for training data. To further enhance performance, we introduce scale-aware optimization to address ineffective learning of QALoRA due to variations in weight quantization scales across different layers. We also employ temporal learned step-size quantization to handle notable variations in activation distributions across denoising steps. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency. Specifically, there is only a marginal 0.05 sFID increase when quantizing both weights and activations of LDM-4 to 4-bit on ImageNet 256×256. Compared to QAT-based methods, our EfficientDM also boasts a 16.2× faster quantization speed with comparable generation quality, rendering it a compelling choice for practical applications.
Primary Area: generative models
Keywords: Diffusion models NeRF 3D synthesis
Scores: [ 8 5 6 ]
Text-to-3D generation has shown rapid progress in recent days with the advent of score distillation sampling (SDS), a methodology of using pretrained text-to-2D diffusion models to optimize a neural radiance field (NeRF) in a zero-shot setting. However, the lack of 3D awareness in the 2D diffusion model often destabilizes previous methods from generating a plausible 3D scene. To address this issue, we propose 3DFuse, a novel framework that incorporates 3D awareness into the pretrained 2D diffusion model, enhancing the robustness and 3D consistency of score distillation-based methods. Specifically, we introduce a consistency injection module which constructs a 3D point cloud from the text prompt and utilizes its projected depth map at given view as a condition for the diffusion model. The 2D diffusion model, through its generative capability, robustly infers dense structure from the sparse point cloud depth map and generates a geometrically consistent and coherent 3D scene. We also introduce a new technique called semantic coding that reduces semantic ambiguity of the text prompt for improved results. Our method can be easily adapted to various text-to-3D baselines, and we experimentally demonstrate how our method notably enhances the 3D consistency of generated scenes in comparison to previous baselines, achieving state-of-the-art performance in geometric robustness and fidelity.
Primary Area: generative models
Keywords: Source Separation Deep Learning Diffusion Models Upper Bound Information Bound
Scores: [ 6 6 6 ]
The problem of speech separation, also known as the cocktail party problem,refers to the task of isolating a single speech signal from a mixture of speechsignals. Previous work on source separation derived an upper bound for thesource separation task in the domain of human speech. This bound is derived fordeterministic models. Recent advancements in generative models challenge thisbound. We show how the upper bound can be generalized to the case of randomgenerative models. Applying a diffusion model Vocoder that was pretrained tomodel single-speaker voices on the output of a deterministic separation model leadsto state-of-the-art separation results. It is shown that this requires one to combinethe output of the separation model with that of the diffusion model. In our method,a linear combination is performed, in the frequency domain, using weights that areinferred by a learned model. We show state-of-the-art results on 2, 3, 5, 10, and 20speakers on multiple benchmarks. In particular, for two speakers, our method isable to surpass what was previously considered the upper performance bound.
Primary Area: generative models
Keywords: 3D generation shape analysis and synthesis diffusion models
Scores: [ 8 5 6 5 ]
Synthesizing novel 3D models that resemble the input example as long been pursued by graphics artists and machine learning researchers. In this paper, we present Sin3DM, a diffusion model that learns the internal patch distribution from a single 3D textured shapeand generates high-quality variations with fine geometry and texture details. Training a diffusion model directly in 3D would induce large memory and computational cost. Therefore, we first compress the input into a lower-dimensional latent space and then train a diffusion model on it. Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input. The denoising network of our diffusion model has a limited receptive field to avoid overfitting, and uses triplane-aware 2D convolution blocks to improve the result quality. Aside from randomly generating new samples, our model also facilitates applications such as retargeting, outpainting and local editing. Through extensive qualitative and quantitative evaluation, we show that our method outperforms prior methods in generation quality of 3D shapes.
Primary Area: generative models
Keywords: large language models behavior simulation
Scores: [ 10 5 8 6 ]
Shannon, in his seminal paper introducing information theory, divided the communication into three levels: technical, semantic, and effectivenss. While the technical level is concerned with accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Thanks to telecommunications, the first level problem has produced great advances like the internet. Large Language Models (LLMs) make some progress towards the second goal, but the third level still remains largely untouched. The third problem deals with predicting and optimizing communication for desired receiver behavior. LLMs, while showing wide generalization capabilities across a wide range of tasks, are unable to solve for this. One reason for the underperformance could be a lack of behavior tokens'' in LLMs' training corpora. Behavior tokens define receiver behavior over a communication, such as shares, likes, clicks, purchases, retweets, \textit{etc}. While preprocessing data for LLM training, behavior tokens are often removed from the corpora as noise. Therefore, in this paper, we make some initial progress towards reintroducing behavior tokens in LLM training. The trained models, other than showing similar performance to LLMs on content understanding tasks, show generalization capabilities on behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. Using a wide range of tasks on two corpora, we show results on all these capabilities. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior.
Primary Area: generative models
Keywords: Large Language Models Hallucination Factuality Decoding Text Generation
Scores: [ 8 8 5 8 ]
Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLMs has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves the truthfulness across multiple choices tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts.
Primary Area: generative models
Keywords: Diffusion Models
Scores: [ 5 6 5 6 ]
We propose an effective denoising diffusion model for generating high-resolution images (e.g., 1024$\times$512), trained on small-size image patches (e.g., 64$\times$64). We name our algorithm Patch-DM, in which a new feature collage strategy is designed to avoid the boundary artifact when synthesizing large-size images. Feature collage systematically crops and combines partial features of the neighboring patches to predict the features of a shifted image patch, allowing the seamless generation of the entire image due to the overlap in the patch feature space. Patch-DM produces high-quality image synthesis results on our newly collected dataset of nature images (1024$\times$512), as well as on standard benchmarks of LHQ(1024$\times$ 1024), FFHQ(1024$\times$ 1024) and on other datasets with smaller sizes (256$\times$256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our method with previous patch-based generation methods and achieve state-of-the-art FID scores on all six datasets. Further, Patch-DM also reduces memory complexity compared to the classic diffusion models.
Primary Area: generative models
Keywords: Text-guided diffusion models Robustness Adversarial examiner Failure detection
Scores: [ 6 10 6 6 ]
Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables that generate vastly different, and even unrelated, outputs despite being conditioned on the same text prompt. In this work, we aim to study and understand the failure modes of TDMs in more detail. To achieve this, we propose SAGE, the first adversarial search method on TDMs that systematically explores the discrete prompt space and the high-dimensional latent space, to automatically discover undesirable behaviors and failure cases in image generation. We use image classifiers as surrogate loss functions during searching, and employ human inspections to validate the identified failures. For the first time, our method enables efficient exploration of both the discrete and intricate human language space and the challenging latent space, overcoming the gradient vanishing problem. Then, we demonstrate the effectiveness of SAGE on five widely used generative models and reveal four typical failure modes that have not been systematically studied before: (1) We find a variety of natural text prompts that generate images failing to capture the semantics of input texts. We further discuss the underlying causes and potential solutions based on the results. (2) We find regions in the latent space that lead to distorted images independent of the text prompt, suggesting that parts of the latent space are not well-structured. (3) We also find latent samples that result in natural-looking images unrelated to the text prompt, implying a possible misalignment between the latent and prompt spaces. (4) By appending a single adversarial token embedding to any input prompts, we can generate a variety of specified target objects, with minimal impact on CLIP scores, demonstrating the fragility of language representations.
Primary Area: generative models
Keywords: Distribution matching diffusion models generalized Schrödinger bridge stochastic optimal control
Scores: [ 8 8 6 6 ]
Modern distribution matching algorithms for training diffusion or flow models directly prescribe the time evolution of the marginal distributions between two boundary distributions. In this work, we consider a generalized distribution matching setup, where these marginals are only implicitly described as a solution to some task-specific objective function. The problem setup, known as the Generalized Schrödinger Bridge (GSB), appears prevalently in many scientific areas both within and without machine learning. We propose Generalized Schödinger Bridge Matching (GSBM), a new matching algorithm inspired by recent advances, generalizing them beyond kinetic energy minimization and to account for nonlinear state costs. We show that such a generalization can be cast as solving conditional stochastic optimal control, for which efficient variational approximations can be used, and further debiased with the aid of path integral theory. Compared to prior methods for solving GSB problems, our GSBM algorithm always preserves a feasible transport map between the boundary distributions throughout training, thereby enabling stable convergence and significantly improved scalability. We empirically validate our claims on an extensive suite of experimental setups, including crowd navigation, opinion depolarization, LiDAR manifolds, and image domain transfer. Our work brings new algorithmic opportunities for training diffusion models enhanced with task-specific optimality structures.
Primary Area: generative models
Keywords: Generative Models Synthetic Data Sampling Bias Diffusion Models GANs
Scores: [ 6 6 8 ]
Seismic advances in generative AI algorithms for imagery, text, and other data types have led to the temptation to use AI-synthesized data to train next-generation models. Repeating this process creates an autophagous ("self-consuming") loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis using state-of-the-art generative image models of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and whether the samples from previous-generation models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease. We term this condition Model Autophagy Disorder (MAD), by analogy to mad cow disease, and show that appreciable MADness arises in just a few generations.
Primary Area: generative models
Keywords: Controlled text generation LLM Natural Language Processing
Scores: [ 8 8 6 8 5 ]
As Large Language Models (LLMs) are deployed more widely, customization with respect to vocabulary, style and character becomes more important. In this work we introduce model arithmetic, a novel inference framework for composing and biasing LLMs without the need for model (re)training or highly specific datasets. In addition, the framework allows for more precise control of generated text than direct prompting and prior controlled text generation (CTG) techniques. Using model arithmetic, we can express prior CTG techniques as simple formulas and naturally extend them to new and more effective formulations. Further, we show that speculative sampling, a technique for efficient LLM sampling, extends to our setting. This enables highly efficient text generation with multiple composed models with only marginal overhead over a single model. Our empirical evaluation demonstrates that model arithmetic allows fine-grained control of generated text while outperforming state-of-the-art on the task of toxicity reduction. We release an open source easy-to-use implementation of our framework at [ANONYMIZED].
Primary Area: generative models
Keywords: image editing multimodal large language model
Scores: [ 8 8 6 6 ]
Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
Primary Area: generative models
Keywords: Diffusion Models Error Propagation Theoretical Explanation Regularization
Scores: [ 8 8 6 8 ]
Primary Area: generative models
Keywords: Data Generation content grounded long-form question-answering summarization
Scores: [ 5 5 6 ]
The lack of high-quality data for content-grounded generation tasks has been identified as a major obstacle to advancing these tasks. To address this gap, we propose a novel method for automatically generating high-quality content-grounded data. It consists of three stages: (a) Content Preparation, (b) Generation: creating task-specific examples from the content (e.g., question-answer pairs or summaries). (c) Filtering mechanism aiming to ensure the quality and faithfulness of the generated data. We showcase this methodology by generating large-scale data for synthetic Long-form question-answering (LFQA) and summarization. In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data – ELI5 and ASQA for LFQA and CNN-DailyMail for Summarization. We show that our models are on par with or outperforming models trained on human-generated data and consistently outperforming them in faithfulness. Finally, we applied our method to create LFQA data within the medical domain and compared a model trained on it with models trained on other domains.
Primary Area: generative models
Keywords: large language models sensitivity analysis prompt engineering evaluation prompting robustness in-context learning spurious features
Scores: [ 6 8 6 ]
As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.
Primary Area: generative models
Keywords: Graph generative models graph neural networks
Scores: [ 5 5 8 8 ]
Generating graphs from a target distribution is a significant challenge across many domains, including drug discovery and social network analysis. In this work, we introduce a novel graph generation method leveraging K2 representation, originally designed for lossless graph compression. The K2 representation enables compact generation while concurrently capturing an inherent hierarchical structure of a graph. In addition, we make contributions by (1) presenting a sequential K2 representation that incorporates pruning, flattening, and tokenization processes and (2) introducing a Transformer-based architecture designed to generate the sequence by incorporating a specialized tree positional encoding scheme. Finally, we extensively evaluate our algorithm on four general and two molecular graph datasets to confirm its superiority for graph generation.
Primary Area: generative models
Keywords: text-to-image interpretability model editing
Scores: [ 8 5 8 5 ]
Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and understand how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis for text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in the (i) UNet and (ii) text-encoder of the diffusion model. In particular, we show that unlike large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes (e.g., style} / objects). Remarkably, we find that the text-encoder in public text-to-image models such as Stable-Diffusion contains {\it only} one causal state across different visual attributes, and this is the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce a fast, data-free model editing method DiffQuickFix which can effectively edit concepts (remove or update knowledge) in text-to-image models. DiffQuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup and comparable editing performance to existing fine-tuning based editing methods.
Primary Area: generative models
Keywords: Large Language Models Table Reasoning
Scores: [ 6 8 6 ]
Large Language Models (LLMs) trained on large volumes of data excel at various natural language tasks, but they cannot handle tasks with data that has not been trained on. One solution is to use a retriever that fetches relevant information to expand LLM's knowledge scope. However, existing textual-oriented retrieved-based LLMs are not ideal on structured table data due to diversified data modalities and large table sizes. In this work, we propose OpenTab, an open-domain table reasoning framework powered by LLMs. Overall, OpenTab leverages table retriever to fetch relevant tables and then generates SQL programs to parse such tables efficiently. Utilizing the intermediate data derived from the SQL executions, it then conducts grounded inference to produce the accurate response. Extensive experimental evaluation shows that OpenTab significantly outperforms baselines in both open- and close-domain settings, achieving up to 21.5% higher accuracy. We further run detailed ablation studies to validate the efficacy of our proposed designs.
Primary Area: generative models
Keywords: Fusion of Experts Mixture of Experts Expert Models
Scores: [ 6 8 6 6 ]
Training AI models that generalize across tasks and domains has long been among the open problems driving AI research. The emergence of Foundation Models made it easier to obtain expert models for a given task, but the heterogeneity of data that may be encountered at test time often means that any single expert is insufficient. We consider the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution and formulate it as an instance of supervised learning. Our method is applicable to both discriminative and generative tasks and leads to significant performance improvements in image and text classification, text summarization, multiple-choice QA, and automatic evaluation of generated text. We also extend our method to the "frugal" setting where it is desired to reduce the number of expert model evaluations at test time.
Primary Area: generative models
Keywords: reinforcement learning large language models rlhf ood generalisation diversity
Scores: [ 3 6 8 ]
Primary Area: generative models
Keywords: Generative flow networks reinforcement learning generative models
Scores: [ 8 8 8 8 ]
This paper studies generative flow networks (GFlowNets) to sample objects from the Boltzmann energy distribution via a sequence of actions. In particular, we focus on improving GFlowNet with partial inference: training flow functions with the evaluation of the intermediate states or transitions. To this end, the recently developed forward-looking GFlowNet reparameterizes the flow functions based on evaluating the energy of intermediate states. However, such an evaluation of intermediate energies may (i) be too expensive or impossible to evaluate and (ii) even provide misleading training signals under large energy fluctuations along the sequence of actions. To resolve this issue, we propose learning energy decompositions for GFlowNets (LED-GFN). Our main idea is to (i) decompose the energy of an object into learnable potential functions defined on state transitions and (ii) reparameterize the flow functions using the potential functions. In particular, to produce informative local credits, we propose to regularize the potential to change smoothly over the sequence of actions. It is also noteworthy that training GFlowNet with our learned potential can preserve the optimal policy. We empirically verify the superiority of LED-GFN in five problems including the generation of unstructured and maximum independent sets, molecular graphs, and RNA sequences.
Primary Area: generative models
Keywords: 3D animation text-driven animation motion generation mesh deformation
Scores: [ 8 6 6 6 ]
Previous motion generation methods are limited to the pre-rigged 3D human model, hindering their applications in the animation of various non-rigged characters. In this work, we present TapMo, a Text-driven Animation PIpeline for synthesizing Motion in a broad spectrum of skeleton-free 3D characters. The pivotal innovation in TapMo is its use of shape deformation-aware features as a condition to guide the diffusion model, thereby enabling the generation of mesh-specific motions for various characters. Specifically, TapMo comprises two main components - Mesh Handle Predictor and Shape-aware Diffusion Module. Mesh Handle Predictor predicts the skinning weights and clusters mesh vertices into adaptive handles for deformation control, which eliminates the need for traditional skeletal rigging. Shape-aware Motion Diffusion synthesizes motion with mesh-specific adaptations. This module employs text-guided motions and mesh features extracted during the first stage, preserving the geometric integrity of the animations by accounting for the character's shape and deformation. Trained in a weakly-supervised manner, TapMo can accommodate a multitude of non-human meshes, both with and without associated text motions. We demonstrate the effectiveness and generalizability of TapMo through rigorous qualitative and quantitative experiments. Our results reveal that TapMo consistently outperforms existing auto-animation methods, delivering superior-quality animations for both seen or unseen heterogeneous 3D characters.
Primary Area: generative models
Keywords: large language models automatic prompt engineering in-context learning
Scores: [ 6 6 5 8 ]
Large language models (LLMs) have made impressive progress in natural language processing. These models rely on proper human instructions (or prompts) to generate suitable responses. However, the potential of LLMs are not fully harnessed by commonly-used prompting methods: many human-in-the-loop algorithms employ ad-hoc procedures for prompt selection; while auto prompt generation approaches are essentially searching all possible prompts randomly and inefficiently. We propose Evoke, an automatic prompt refinement framework. In Evoke, there are two instances of a same LLM: one as a reviewer (LLM-Reviewer), it scores the current prompt; the other as an author (LLM-Author), it edits the prompt by considering the edit history and the reviewer's feedback. Such an author-reviewer feedback loop ensures that the prompt is refined in each iteration. We further aggregate a data selection approach to Evoke, where only the hard samples are exposed to the LLM. The hard samples are more important because the LLM can develop deeper understanding of the tasks out of them, while the model may already know how to solve the easier cases. Experimental results show that Evoke significantly outperforms existing methods. For instance, in the challenging task of logical fallacy detection, Evoke scores above 80, while all other baseline methods struggle to reach 20.
Primary Area: generative models
Keywords: Generative Models Diffusion Models Infinite Resolution
Scores: [ 6 8 8 6 ]
We introduce ∞-Diff, a generative diffusion model defined in an infinite-dimensional Hilbert space that allows infinite resolution data to be modelled. By randomly sampling subsets of coordinates during training and learning to denoise the content at those coordinates, a continuous function is learned that allows sampling at arbitrary resolutions. Prior infinite-dimensional generative models use point-wise functions that require latent compression for global context. In contrast, we propose using non-local integral operators to map between Hilbert spaces, allowing spatial information aggregation; to facilitate this, we design a powerful and efficient multi-scale architecture that operates directly on raw sparse coordinates. Training on high-resolution datasets we demonstrate that high-quality diffusion models can be learned with even 8× subsampling rates, enabling substantial improvements in run-time and memory requirements, achieving significantly higher sample quality as evidenced by lower FID scores, while also being able to effectively scale to higher resolutions than the training data while retaining detail.
Primary Area: generative models
Keywords: Diffusion model Image editing Image generation
Scores: [ 6 6 6 6 6 ]
Despite the ability of existing large-scale text-to-image (T2I) diffusion models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we treat image editing as the change of feature correspondence in a pre-trained diffusion model. By leveraging feature correspondence, we develop energy functions that align with the editing target, transforming image editing operations into gradient guidance. Based on this guidance approach, we also construct multi-scale guidance that considers both semantic and geometric alignment. Furthermore, we incorporate a visual cross-attention strategy based on a memory bank design to ensure consistency between the edited result and original image. Benefiting from these efficient designs, all content editing and consistency operations come from the feature correspondence without extra model fine-tuning or additional modules. Extensive experiments demonstrate that our method has promising performance on various image editing tasks, including editing within a single image (e.g., object moving, resizing, and content dragging) and across images (e.g., appearance replacing and object pasting).
Primary Area: generative models
Keywords: Generative Model Representation Learning Sets Vector Graphics
Scores: [ 6 6 5 ]
Vector drawings are innately interactive as they preserve creational cues. Despitethis desirable property they remain relatively under explored due to the difficultiesin modeling complex vector drawings. This is in part due to the primarily sequential and auto-regressive nature of existing approaches failing to scale beyond simpledrawings. In this paper, we define generative models over highly complex vectordrawings by first representing them as “stroke-clouds” – sets of arbitrary cardinality comprised of semantically meaningful strokes. The dimensionality of thestrokes is a design choice that allows the model to adapt to a range of complexities.We learn to encode these set of strokes into compact latent codes by a probabilisticreconstruction procedure backed by De-Finetti’s Theorem of Exchangability. Theparametric generative model is then defined over the latent vectors of the encodedstroke-clouds. The resulting “Latent stroke-cloud generator (LSG)” thus capturesthe distribution of complex vector drawings on an implicit set space. We demonstrate the efficacy of our model on complex drawings (a newly created Animeline-art dataset) through a rangeof generative tasks.
Primary Area: generative models
Keywords: Text-to-3D Creation 3D Content Editing Attribute Mismatching
Scores: [ 6 6 6 5 ]
Recent text-to-3D generation methods achieve impressive 3D content creation capacity thanks to the advances in image diffusion models and optimizing strategies. However, current methods struggle to generate correct 3D content for a complex prompt in semantics, i.e., a prompt describing multiple interacted objects binding with different attributes. In this work, we propose a general framework named Progressive3D, which decomposes the entire generation into a series of locally progressive editing steps to create precise 3D content for complex prompts, and we constrain the content change to only occur in regions determined by user-defined region prompts in each editing step. Furthermore, we propose an overlapped semantic component suppression technique to encourage the optimization process to focus more on the semantic differences between prompts. Extensive experiments demonstrate that the proposed Progressive3D framework generates precise 3D content for prompts with complex semantics through progressive editing steps and is general for various text-to-3D methods driven by different 3D representations.
Primary Area: generative models
Keywords: Diffusion models safety protection
Scores: [ 6 5 6 6 ]
While generative diffusion models excel in producing high-quality images, they can also be misused to mimic authorized images, posing a significant threat to AI systems. Efforts have been made to add calibrated perturbations to protect images from diffusion-based mimicry pipelines. However, most of the existing methods are too ineffective and even impractical to be used by individual users due to their high computation and memory requirements. In this work, we present novel findings on attacking latent diffusion models (LDM) and propose new plug-and-play strategies for more effective protection. In particular, we explore the bottleneck in attacking an LDM, discovering that the encoder module rather than the denoiser module is the vulnerable point. Based on this insight, we present our strategy using Score Distillation Sampling (SDS) to double the speed of protection and reduce memory occupation by half without compromising its strength. Additionally, we provide a robust protection strategy by counterintuitively minimizing the semantic loss, which can assist in generating more natural perturbations. Finally, we conduct extensive experiments to substantiate our findings and comprehensively evaluate our newly proposed strategies. We hope our insights and protective measures can contribute to better defense against malicious diffusion-based mimicry, advancing the development of secure AI systems.
Primary Area: generative models
Keywords: Audio modeling audio generation music generation non-autoregressive models
Scores: [ 8 8 6 ]
We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of discrete audio representation, i.e., tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer encoder. During training, we predict spans of masked tokens obtained from the masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel model rescorer method. In which, we leverage an external pre-trained model to rescore and rank predictions from MAGNeT which will be then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse between autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is being decoded in parallel. We demonstrate the efficiency of MAGNeT over the task of text-to-music generation and conduct extensive empirical evaluation, considering both automatic and human studies. We show the proposed approach is comparable to the evaluated baselines while being significantly faster (x7 faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive considering latency, throughput, and generation quality. Samples are available as part of the supplemental material.
Primary Area: generative models
Keywords: large multimodal pre-training open-world embodied agent large language model
Scores: [ 6 5 5 6 ]
Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics.However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to a blindfolded text-based game.''Consequently, LLM-based agents frequently encounter challenges in intuitively comprehending their surroundings and producing responses that are easy to understand.In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation.Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback.In addition, we use a semi-automatic strategy to collect an extensive dataset comprising 850K open-world instruction pairs, empowering our model to encompass three essential functions for an agent: multimodal perception, foundational knowledge base, and skill prediction and planning.Lastly, we develop three open-world evaluation benchmarks, then carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan.Codes and datasets will be released.
Primary Area: generative models
Keywords: diffusion model long tail distribution score matching
Scores: [ 6 6 6 ]
Primary Area: generative models
Keywords: Diffusion Model Architecture Multi-Task Learning (MTL) Diffusion Models
Scores: [ 6 8 8 ]
Diffusion models generate highly realistic images through learning a multi-step denoising process, naturally embodying the principles of multi-task learning (MTL). Despite the inherent connection between diffusion models and MTL, there remains an unexplored area in designing neural architectures that explicitly incorporate MTL into the framework of diffusion models. In this paper, we present Denoising Task Routing (DTR), a simple add-on strategy for existing diffusion model architectures to establish distinct information pathways for individual tasks within a single architecture by selectively activating subsets of channels in the model. What makes DTR particularly compelling is its seamless integration of prior knowledge of denoising tasks into the framework: (1) Task Affinity: DTR activates similar channels for tasks at adjacent timesteps and shifts activated channels as sliding windows through timesteps, capitalizing on the inherent strong affinity between tasks at adjacent timesteps. (2) Task Weights: During the early stages (higher timesteps) of the denoising process, DTR assigns a greater number of task-specific channels, leveraging the insight that diffusion models prioritize reconstructing global structure and perceptually rich contents in earlier stages, and focus on simple noise removal in later stages. Our experiments demonstrate that DTR consistently enhances the performance of diffusion models across various evaluation protocols, all without introducing additional parameters. Furthermore, DTR contributes to accelerating convergence during training. Finally, we show the complementarity between our architectural approach and existing MTL optimization techniques, providing a more complete view of MTL within the context of diffusion training.
Primary Area: generative models
Keywords: diffusion model; single-view reconstruction; 3D generation; generative models
Scores: [ 8 10 8 8 6 ]
In this paper, we present a novel diffusion model called SyncDreamer that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, recent work Zero123 demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors for the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, thus making it well-suited for various 3D generation tasks such as novel-view-synthesis, text-to-3D, and image-to-3D.
Primary Area: generative models
Keywords: 3D indoor scene synthesis conditional generateive models graph diffusion models
Scores: [ 8 6 8 8 ]
Comprehending natural language instructions is a charming property for 3D indoor scene synthesis systems. Existing methods suffer from directly modeling the object distributions within a scene, thereby hindering the controllability of generation. We introduce InstructScene, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 3D scene synthesis. The proposed semantic graph prior jointly learns indoor scene appearance and layout distributions, exhibiting versatility across various generative tasks. To facilitate the benchmarking for text-driven 3D scene synthesis, we curate a high-quality dataset of scene-instruction pairs with large language and multimodal models. Extensive experimental results reveal that the proposed method surpasses existing state-of-the-art approaches by a large margin. Thorough ablation studies confirm the efficacy of crucial design components. Both our code and dataset will be publicly available after the review period.
Primary Area: generative models
Keywords: text-to-image generation overoptimization confidence calibration
Scores: [ 6 6 6 ]
Fine-tuning text-to-image models using reward functions trained on human feedback data has emerged as a powerful approach for aligning model behavior with human intent. However, excessive optimization with such reward models, which are only proxy objectives, can degrade the performance of the fine-tuned models, a phenomenon commonly referred to as reward overoptimization. We introduce the Text-Image Alignment Assessment (TIA2) benchmark, a diverse collection of text prompts, images, and human annotations, for studying the issue in depth. We evaluate several state-of-the-art reward models for text-to-image generation on our benchmark and find that they are often not well-aligned with human assessment. We empirically demonstrate that overoptimization can occur when a poorly aligned reward model is used as a fine-tuning objective. To address this, we introduce a simple method, TextNorm, for inducing confidence calibration in reward models by normalizing the scores across prompts that are semantically different from the original prompt. We demonstrate that using the confidence-calibrated scores in fine-tuning effectively reduces the risk of overoptimization.
Primary Area: generative models
Keywords: Instruction Following Language Model Decoding
Scores: [ 8 6 8 8 ]
While instruction-tuned language models have demonstrated impressive zero-shot generalization, these models often struggle to generate accurate responses when faced with instructions that fall outside their training set. This paper presents Instructive Decoding (ID), a simple yet effective approach that augments the efficacy of instruction-tuned models. Specifically, ID adjusts the logits for next-token prediction in a contrastive manner, utilizing predictions generated from a manipulated version of the original instruction, referred to as a noisy instruction. This noisy instruction aims to elicit responses that could diverge from the intended instruction yet remain plausible. We conduct experiments across a spectrum of such noisy instructions, ranging from those that insert semantic noise via random words to others like 'opposite' that elicit the deviated responses. Our approach achieves considerable performance gains across various instruction-tuned models and tasks without necessitating any additional parameter updates. Notably, utilizing 'opposite' as the noisy instruction in ID, which shows the maximum divergence from the original instruction, consistently produces the most significant performance gains across multiple models and tasks.
Primary Area: generative models
Keywords: Generative Modelling Humanoid Control Model Predictive Control Model-based Reinforcement Learning Offline Reinforcement Learning
Scores: [ 8 8 6 ]
Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations.The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, paves the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC).For 56 degrees of freedom humanoid, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviors. Further, without any learning from online interactions, it can also flexibly transfer these behaviours to solve novel downstream control tasks via planning. Notably, H-GAP excels established MPC baselines with access to the ground truth model, and is superior or comparable to offline RL methods trained for individual tasks.Finally, we do a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not computing.
Primary Area: generative models
Keywords: large language model in-context learning calibration prompt
Scores: [ 6 6 6 6 ]
Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
Primary Area: generative models
Keywords: diffusion models score-based models diversity guidance conformer generation image generation
Scores: [ 5 5 8 6 ]
In light of the widespread success of generative models, a significant amount of research has gone into speeding up their sampling time. However, generative models are often sampled multiple times to obtain a diverse set incurring in a cost that is orthogonal to sampling time. We tackle the question of how to improve diversity and sample efficiency by moving beyond the common assumption of independent samples. For this we propose particle guidance, an extension of diffusion-based generative sampling where a joint-particle time-evolving potential enforces diversity. We analyze theoretically the joint distribution that particle guidance generates, its implications on the choice of potential, and the connections with methods in other disciplines. Empirically, we test the framework both in the setting of conditional image generation, where we are able to increase diversity without affecting quality, and molecular conformer generation, where we reduce the state-of-the-art median error by 13% on average.
Primary Area: generative models
Keywords: speech pre-training speech generation generative pre-training flow matching
Scores: [ 6 8 6 3 ]
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
Primary Area: generative models
Keywords: Cascades Efficient Inference Language Models
Scores: [ 6 6 8 8 ]
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. A simple strategy to achieve more favorable cost-quality tradeoffs is cascading: here, a small model is invoked for most “easy” instances, while a few “hard” instances are deferred to the large model. While the principles underpinning effective cascading are well-studied for classification tasks — with deferral based on predicted class uncertainty favored theoretically and practically — a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either over- or under-emphasizing outputs based on their lengths. This is because LMs produce a sequence of uncertainty values, one for each output token; and moreover, the number of output tokens is variable across different examples. To mitigate the length bias, we propose to exploit the richer token-level uncertainty information implicit in generative LMs. We argue that naive predicted sequence uncertainty corresponds to a simple aggregation of these uncertainties. By contrast, we show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform such simple aggregation strategies, via experiments on a range of natural language benchmarks with FLAN-T5 models. We further show that incorporating embeddings from the smaller model and intermediate layers of the larger model can give an additional boost in the overall cost-quality tradeoff.
Primary Area: generative models
Keywords: Generative Models Graph Generative Network Graph Neural Network Probabilistic Model
Scores: [ 6 6 6 6 8 ]
Primary Area: generative models
Keywords: code data large language models reasoning capabilities
Scores: [ 8 8 8 5 ]
Large Language models (LLMs) have exhibited remarkable reasoning capabilities and become the foundation of language technologies. Inspired by the great success of code data in training LLMs, we naturally wonder at which training stage introducing code data can really help LLMs reasoning. To this end, this paper systematically explores the impact of code data on LLMs at different stages. Concretely, we introduce the code data at the pre-training stage, instruction-tuning stage, and both of them, respectively. Then, the reasoning capability of LLMs is comprehensively and fairly evaluated via six reasoning tasks. We critically analyze the experimental results and provide conclusions with insights. First, pre-training LLMs with the mixture of code and text can significantly enhance LLMs' general reasoning capability almost without negative transfer on other tasks. Besides, at the instruction-tuning stage, code data endows LLMs the task-specific reasoning capability. Moreover, the dynamic mixing strategy of code and text data assists LLMs to learn reasoning capability step-by-step during training. These insights deepen the understanding of LLMs regarding reasoning ability for their application, such as scientific question answering, legal support, etc. The source code and model parameters are released at the anonymous link:https://anonymous.4open.science/r/CodeLLM-FD25/.
Primary Area: generative models
Keywords: video generation diffusion model temproal modeling
Scores: [ 6 6 3 6 ]
Creating stable, controllable videos is a complex task due to the need for significant variation in temporal dynamics and cross-frame temporal consistency. To address this, we enhance the spatial-temporal capability and introduce a versatile video generation model, VersVideo, which leverages textual, visual, and stylistic conditions. Current video diffusion models typically extend image diffusion architectures by supplementing 2D operations (such as convolutions and attentions) with temporal operations. While this approach is efficient, it often restricts spatial-temporal performance due to the oversimplification of standard 3D operations. To counter this, we incorporate two key elements: (1) multi-excitation paths for spatial-temporal convolutions with dimension pooling across different axes, and (2) multi-expert spatial-temporal attention blocks. These enhancements boost the model's spatial-temporal performance without significantly escalating training and inference costs. We also tackle the issue of information loss that arises when a variational autoencoder is used to transform pixel space into latent features and then back into pixel frames. To mitigate this, we incorporate temporal modules into the decoder to maintain inter-frame consistency. Lastly, by utilizing the innovative denoising UNet and decoder, we develop a unified ControlNet model suitable for various conditions, including image, Canny, HED, depth, and style. Examples of the videos generated by our model can be found at https://anonymous-pages.github.io/video_demos/.
Primary Area: generative models
Keywords: large language models quantization model compression
Scores: [ 6 6 6 ]
Large Language Models (LLMs) have recently demonstrated a remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to its large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than the conventional per-output channel (OC). Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so similarly grouping the weights in the IC direction can isolate outliers to be within a group. We also find that activation outliers do not dictate quantization difficulty, and inherent weight sensitivities also exist. With per-IC quantization as a new outlier-friendly scheme, we then propose Adaptive Dimensions (AdaDim), a versatile quantization framework that can adapt to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs.
Primary Area: generative models
Keywords: diffusion model video editing text-to-video
Scores: [ 8 5 5 8 ]
Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts.A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention.Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in diffusion model's U-Net to address the inconsistency issue for text-to-video editing.Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion based text-to-video editing methods and improve their visual consistency.Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.
Primary Area: generative models
Keywords: Generative Models Iterative Training Diffusion
Scores: [ 8 5 8 6 ]
Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to the striking performance of these models combined with their ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models must contend with the reality that their training is curated from both clean data and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact on the stability of training generative models on mixed datasets of real and synthetic data. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. Building on this foundation we quantify the error incurred by iterative retraining of generative models and we also provide a radius stability in parameter space. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.
Primary Area: generative models
Keywords: flow matching generative model efficient sampling distillation responsible ML
Scores: [ 6 6 6 ]
Flow matching is a powerful framework for generating high-quality samples in various applications, especially image synthesis. However, the intensive computational demands of these models, especially during the fine-tuning process and sampling processes, pose significant challenges for low-resource scenarios. This paper introduces Bellman Optimal Step-size Straightening (BOSS) technique for distilling flow-matching generative models: it aims specifically for a few-step efficient image sampling while adhering to a computational budget constraint. First, this technique involves a dynamic programming algorithm that optimizes the step sizes of the pretrained network. Then, it refines the velocity network to match the optimal step sizes, aiming to straighten the generation paths. Extensive experimental evaluations across image generation tasks demonstrate the efficacy of BOSS in terms of both resource utilization and image quality. Our results reveal that BOSS achieves substantial gains in efficiency while maintaining competitive sample quality, effectively bridging the gap between low-resource constraints and the demanding requirements of flow-matching generative models. Our paper also fortifies the responsible development of artificial intelligence, offering a more sustainable generative model that reduces computational costs and environmental footprints.
Primary Area: generative models
Keywords: Graph problems large language models encoding graphs generative models
Scores: [ 6 6 6 6 ]
Graphs are a powerful tool for representing and analyzing complex relationships in real-world applications such as social networks, recommender systems, and computational finance. Reasoning on graphs is essential for drawing inferences about the relationships between entities in a complex system, and to identify hidden patterns and trends. Despite the remarkable progress in automated reasoning with natural text, reasoning on graphs with large language models (LLMs) remains an understudied problem. In this work, we perform the first comprehensive study of encoding graph-structured data as text for consumption by LLMs. We show that LLM performance on graph reasoning tasks varies on three fundamental levels: (1) the graph encoding method, (2) the nature of the graph task itself, and (3) interestingly, the very structure of the graph considered. These novel results provide valuable insight on strategies for encoding graphs as text. Using these insights we illustrate how the correct choice of encoders can boost performance on graph reasoning tasks inside LLMs by 4.8% to 61.8%, depending on the task.
Primary Area: generative models
Keywords: Computer Vision Video Editing Diffusion Model
Scores: [ 5 6 5 8 ]
We introduce a novel and efficient approach for text-based video-to-video editing that eliminates the need for resource-intensive per-video-per-model finetuning. At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain. Extending the Prompt-to-Prompt to videos, we efficiently generate paired samples, each with an input video and its edited counterpart. Alongside this, we introduce the Long Video Sampling Correction during sampling, ensuring consistent long videos across batches. Our method surpasses current methods like Tune-A-Video, heralding substantial progress in text-based video-to-video editing and suggesting exciting avenues for further exploration and deployment.
Primary Area: generative models
Keywords: Deep Latent Variable Models Generative Models Raven’s Progressive Matrix Abstract Visual Reasoning
Scores: [ 8 6 6 6 ]
Primary Area: generative models
Keywords: Text-to-3D Generative models Diffusion models
Scores: [ 6 8 6 ]
The advancements in automatic text-to-3D generation have been remarkable. Most existing methods use pre-trained text-to-image diffusion models to optimize 3D representations like Neural Radiance Fields (NeRFs) via latent-space denoising score matching. Yet, these methods often result in artifacts and inconsistencies across different views due to their suboptimal optimization approaches and limited understanding of 3D geometry. Moreover, the inherent constraints of NeRFs in rendering crisp geometry and stable textures usually lead to a two-stage optimization to attain high-resolution details. This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation, all in a single-stage optimization. We compute denoising scores in the text-to-image diffusion model's latent and image spaces. Instead of randomly sampling timesteps (also referred to as noise levels in denoising score matching), we introduce a novel timestep annealing approach that progressively reduces the sampled timestep throughout optimization. To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel smoothing technique that refines importance sampling weights coarse-to-fine, ensuring accurate and thorough sampling in high-density regions. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling the generation of highly detailed and view-consistent 3D assets through a single-stage training process.
Primary Area: generative models
Keywords: satellite images generative models diffusion models computer vision
Scores: [ 8 6 3 8 ]
Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets .As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, multi-spectral superrresolution and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale \textit{generative} foundation model for satellite imagery.
Primary Area: generative models
Keywords: diffusion models heavy ball momentum divergence artifacts numerical method ode solver image generation
Scores: [ 8 5 6 8 ]
Despite the remarkable success of diffusion models in image generation, slow sampling remains a persistent issue. To accelerate the sampling process, prior studies have reformulated diffusion sampling as an ODE/SDE and introduced higher-order numerical methods. However, these methods often produce divergence artifacts, especially with a low number of sampling steps, which limits the achievable acceleration. In this paper, we investigate the potential causes of these artifacts and suggest that the small stability regions of these methods could be the principal cause. To address this issue, we propose two novel techniques. The first technique involves the incorporation of Heavy Ball (HB) momentum, a well-known technique for improving optimization, into existing diffusion numerical methods to expand their stability regions. We also prove that the resulting methods have first-order convergence. The second technique, called Generalized Heavy Ball (GHVB), constructs a new high-order method that offers a variable trade-off between accuracy and artifact suppression. Experimental results show that our techniques are highly effective in reducing artifacts and improving image quality, surpassing state-of-the-art diffusion solvers on both pixel-based and latent-based diffusion models for low-step sampling. Our research provides novel insights into the design of numerical methods for future diffusion work.
Primary Area: generative models
Keywords: diffusion model noisy label robustness
Scores: [ 6 6 6 5 ]
Primary Area: generative models
Keywords: 3D Generation; Single-view 3D Reconstruction; text-to-3D
Scores: [ 10 8 6 8 ]
We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and, functioning as a denoiser, can denoise noisy multi-view images via 3D NeRF reconstruction and rendering, achieving single-stage 3D generation in the 2D diffusion denoising process. We train DMV3D on large-scale multi-view image datasets of extremely diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://dmv3d.github.io/.
Primary Area: generative models
Keywords: motion generation multi-modality dance generation human human-interaction GPT reinforcement learning
Scores: [ 8 3 8 6 ]
Primary Area: generative models
Keywords: Text-to-3D generation Diffusion model Score-based Generative Model Score Distillation Sampling
Scores: [ 6 8 6 5 ]
Recent progress in text-to-3D generation has been achieved through the utilization of score distillation methods: they make use of the pre-trained text-to-image (T2I) diffusion models by distilling via the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance the text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to-3D optimization as a multi-view image-to-image translation problem, and propose a solution by approximating the probability flow. By leveraging the proposed novel optimization algorithm, we design DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality and high-resolution (i.e., 1024×1024) 3D contents. For example, we demonstrate that DreamFlow is 5 times faster than the existing state-of-the-art text-to-3D method, while producing more photorealistic 3D contents.
Primary Area: generative models
Keywords: LLMs Large Language Models Question Answering Generalization Knowledge Representation Logical Inference Relations
Scores: [ 6 6 8 6 ]
Primary Area: generative models
Keywords: Retrieval Augmented Language Models Information Retrieval summarization QA
Scores: [ 8 8 6 6 ]
Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.
Primary Area: generative models
Keywords: Modular skill learning Multi-task learning Parameter-Efficient Fine-Tuning
Scores: [ 6 6 6 6 ]
Modular skill learning is an emerging direction in the field of Parameter Efficient Fine-Tuning (PEFT), as it enables neural networks to better organize and clarify various aspects of knowledge, leading to improved knowledge transfer for new tasks. In this paper, we introduce a novel approach that categorizes skills into shared domain skills and specialized skills, with the skill parameters being highly parameterized using low-rank or sparse techniques. Each task is associated with an exclusive specialized skill while also benefiting from shared domain skills. Moreover, tasks can selectively utilize specialized skills from other tasks as needed. To facilitate this approach, we propose a skill assignment matrix that can be jointly learned, and the task network is instantiated based on the skill parameters. To evaluate the effectiveness of our approach, we conducted extensive experiments on the Super Natural Instructions and SuperGLUE datasets. Our results demonstrate that compared to fully-shared, task-specific, or skill-indistinguishable baselines. Modular learning with skill-type discrimination significantly enhances the sample efficiency of multi-task learning. Furthermore, the freezing of a substantial number of base model parameters greatly improves parameter efficiency, leading to boosted training efficiency.
Primary Area: generative models
Keywords: Deep Learning Diffusion Model Video Generation
Scores: [ 6 8 8 6 ]
With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity.
Primary Area: generative models
Keywords: diffusion models image editing diffusion inversion
Scores: [ 6 8 6 6 ]
Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt. Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce “Direct Inversion,” a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of speed-up.
Primary Area: generative models
Keywords: diffusion model subject-driven text-to-image generation disentangled finetuning
Scores: [ 8 8 8 6 ]
Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in the generated images heavily dependent on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding can not be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle the problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. We further design the novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.
Primary Area: generative models
Keywords: data selection; large language model; instruction tuning;
Scores: [ 6 6 6 ]
Large language models~(LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and removes low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce Alpagasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. Alpagasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human study. Its 13B variant matches >90% performance of its teacher LLM (i.e., Text-Davinci-003) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes \footnote{We apply IFT for the same number of epochs as Alpaca(7B) but on fewer data, using 4$\times$NVIDIA A100 (80GB) GPUs and following the original Alpaca setting and hyperparameters.}. In the experiment, we also demonstrate that our method can work not only for machine-generated datasets but also for human-written datasets. Overall, Alpagasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models.
Primary Area: generative models
Keywords: Image Generation 3D Generation Diffusion Model Multi-view consistency
Scores: [ 8 6 6 6 ]
We introduce MVDream, a multi-view diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view prior can serve as a generalizable 3D prior that is agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation.
Primary Area: generative models
Keywords: Large Language Model Multi-Agent Debate LLM evaluators
Scores: [ 5 6 5 6 6 ]
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality.Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies.In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of different texts. Our experiments on two benchmarks illustrate that ChatEval delivers superior accuracy and correlation in alignment with human assessment. Furthermore, we find that the diverse role prompts (different personas) are essential in the multi-agent debate process; that is, utilizing the same role description in the prompts can lead to a degradation in performance. Our qualitative analysis also shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
Primary Area: generative models
Keywords: generative model; video generation; diffusion model
Scores: [ 6 6 5 5 ]
Recently video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level'') depicting a single scene. To deliver a coherent long video ("story-level''), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos. Specifically, we propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three assessing criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos.
Primary Area: generative models
Keywords: Generative Modeling Diffusion Models
Scores: [ 6 8 6 8 ]
Primary Area: generative models
Keywords: Sequence Modelling Imitiation Learning Language Modelling
Scores: [ 6 6 6 6 ]
In many domains, autoregressive models can attain high likelihood on the task of predicting the next observation. However, this maximum-likelihood (MLE) objective does not necessarily match a downstream use-case of autoregressively generating high-quality sequences. The MLE objective weights sequences proportionally to their frequency under the data distribution, with no guidance for the model's behaviour out of distribution (OOD): leading to compounding error during autoregressive generation. In order to address this compounding error problem, we formulate sequence generation as an imitation learning (IL) problem. This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset, including divergences with weight on OOD generated sequences. The IL framework also allows us to incorporate backtracking by introducing a backspace action into the generation process. This further mitigates the compounding error problem by allowing the model to revert a sampled token if it takes the sequence OOD. Our resulting method, SequenceMatch, can be implemented without adversarial training or major architectural changes. We identify the SequenceMatch-χ2 divergence as a more suitable training objective for autoregressive models which are used for generation. We show that empirically, SequenceMatch training leads to improvements over MLE on text generation with language models and arithmetic
Primary Area: generative models
Keywords: multi-marginal unsupervised learning generative modeling measure transport optimal transport
Scores: [ 5 5 8 ]
Primary Area: generative models
Keywords: Few shot; Domain Adaptation; Image Generator
Scores: [ 5 8 8 ]
Can a pre-trained generator be adapted to the hybrid of multiple target domains and generate images with integrated attributes of them? In this work, we introduce a new task -- Few-shot Hybrid Domain Adaptation (HDA). Given a source generator and several target domains, HDA aims to acquire an adapted generator that preserves the integrated attributes of all target domains, without overriding the source domain's characteristics. Compared with Domain Adaptation (DA), HDA offers greater flexibility and versatility to adapt generators to more composite and expansive domains. Simultaneously, HDA also presents more challenges than DA as we have access only to images from individual target domains and lack authentic images from the hybrid domain. To address this issue, we introduce a discriminator-free framework that directly encodes different domains' images into well-separable subspaces. To achieve HDA, we propose a novel directional subspace loss comprised of a distance loss and a direction loss. Concretely, the distance loss blends the attributes of all target domains by reducing the distances from generated images to all target subspaces. The direction loss preserves the characteristics from the source domain by guiding the adaptation along the perpendicular to subspaces. Experiments show that our method can obtain numerous domain-specific attributes in a single adapted generator, which surpasses the baseline methods in semantic similarity, image fidelity, and cross-domain consistency.
Primary Area: generative models
Keywords: diffusion models layout-to-image domain generalization
Scores: [ 6 6 6 6 ]
Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).
Primary Area: generative models
Keywords: large language mode agent multi-agent
Scores: [ 6 6 6 6 ]
Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose a multi-agent framework AgentVerse that can effectively orchestrate a collaborative group of expert agents as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that AgentVerse can proficiently deploy multi-agent groups that outperform a single agent. Extensive experiments on text understanding, reasoning, coding, tool utilization, and embodied AI confirm the effectiveness of AgentVerse. Moreover, our analysis of agent interactions within AgentVerse reveals the emergence of specific collaborative behaviors, contributing to heightened group efficiency. We will release our codebase, AgentVerse, to further facilitate multi-agent research.
Primary Area: generative models
Keywords: Generative model diffusion model score-based prior conditional diffusion model text-to-image alignment score inversion process image quality assessment T2I alignment score
Scores: [ 8 8 8 ]
Recent conditional diffusion models have shown remarkable advancements and have been widely applied in fascinating real-world applications. However, samples generated by these models often do not strictly comply with user-provided conditions. Due to this, there have been few attempts to evaluate this alignment via pre-trained scoring models to select well-generated samples. Nonetheless, current studies are confined to the text-to-image domain and require large training datasets. This suggests that crafting alignment scores for various conditions will demand considerable resources in the future. In this context, we introduce a universal condition alignment score that leverages the conditional probability measurable through the diffusion process. Our technique operates across all conditions and requires no additional models beyond the diffusion model used for generation, effectively enabling self-rejection. Our experiments validate that our met- ric effectively applies in diverse conditional generations, such as text-to-image, {instruction, image}-to-image, edge-/scribble-to-image, and text-to-audio.
Primary Area: generative models
Keywords: Diffusion models; Inverse problems; Krylov subspace
Scores: [ 8 6 6 6 ]
Krylov subspace, which is generated by multiplying a given vector by the matrix of a linear transformation and its successive powers, has been extensively studied in classical optimization literature to design algorithms that converge quickly for large linear inverse problems. For example, the conjugate gradient method (CG), one of the most popular Krylov subspace methods, is based on the idea of minimizingthe residual error in the Krylov subspace. However, with the recent advancement of high-performance diffusion solvers for inverse problems, it is not clear how classical wisdom can be synergistically combined with modern diffusion models. In this study, we propose a novel and efficient diffusion sampling strategy that synergistically combine the diffusion sampling and Krylov subspace methods. Specifically, we prove that if the tangent space at a denoised sample by Tweedie’s formula forms a Krylov subspace, then the CG initialized with the denoised data ensures the data consistency update to remain in the tangent space. This negates the need to compute the manifold-constrained gradient (MCG), leading to a more efficient diffusion sampling method. Our method is applicable regardless of the parametrization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art reconstruction quality on challenging real-world medical inverse imaging problems, including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover, our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method.
Primary Area: generative models
Keywords: Deep Learning Motion synthesis Animation Single Instance Learning Generative models
Scores: [ 8 8 8 6 ]
Synthesizing realistic animations of humans, animals, and even imaginary creatures, has long been a goal for artists and computer graphics professionals. Compared to the imaging domain, which is rich with large available datasets, the number of data instances for the motion domain is limited, particularly for the animation of animals and exotic creatures (e.g., dragons), which have unique skeletons and motion patterns. In this work, we introduce SinMDM, a Single Motion Diffusion Model. It is designed to learn the internal motifs of a single motion sequence with arbitrary topology and synthesize a variety of motions of arbitrary length that remain faithful to the learned motifs. We harness the power of diffusion models and present a denoising network explicitly designed for the task of learning from a single input motion. SinMDM is crafted as a lightweight architecture, which avoids overfitting by using a shallow network with local attention layers that narrow the receptive field and encourage motion diversity. Our work applies to multiple contexts, including spatial and temporal in-betweening, motion expansion, style transfer, and crowd animation. Our results show that SinMDM outperforms existing methods both qualitatively and quantitatively. Moreover, while prior network-based approaches require additional training for different applications, SinMDM supports these applications during inference. Our code is included as supplementary material and will be published.
Primary Area: generative models
Keywords: Large Language Models alignment group preference alignment few-shot learning in-context learning fine-tuning
Scores: [ 6 5 6 ]
Many applications of large language models (LLMs), ranging from chatbots tocreative writing, require nuanced subjective judgments that can differ significantlyacross different groups. Existing alignment algorithms can be expensive to alignfor each group, requiring prohibitive amounts of group-specific preference dataand computation for real-world use cases. We introduce Group Preference Optimization (GPO), an alignment framework that steers language models to preferences of individual groups in a few-shot manner. In GPO, we augment the baseLLM with an independent transformer module trained to predict the preferencesof a group for the LLM generations. For few-shot learning, we parameterize thismodule as an in-context autoregressive transformer and train it via meta-learningon several groups. We empirically validate the efficacy of GPO through rigorous evaluations using LLMs with varied sizes on three human opinion adaptation tasks. These tasks involve adapting to the preferences of US demographicgroups, global countries, and individual users. Our results demonstrate that GPOnot only aligns models more accurately but also requires fewer group-specificpreferences and less training and inference computing resources, outperformingexisting strategies such as in-context steering and fine-tuning methods.
Primary Area: generative models
Keywords: Diffusion models Stochastic differential equations Image dragging Image editing Score-based models
Scores: [ 8 6 5 ]
We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose \emph{SDE-Drag} -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed \emph{DragBench}) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.
Primary Area: generative models
Keywords: large language model LLM compression pruning fisher
Scores: [ 5 5 5 ]
State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to large language models. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of large language models. We will open source our code on GitHub upon acceptance.
Primary Area: generative models
Keywords: text to 3D diffusion models NeRF neural rendering 3d synthesis
Scores: [ 6 5 6 8 ]
We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose bootstrapped score distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.
Primary Area: generative models
Keywords: perceptual image compression neural image compression
Scores: [ 6 8 8 8 ]
Primary Area: generative models
Keywords: Large Language Model Instruction Fine-tuning
Scores: [ 6 6 6 6 ]
Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Both automatic and human evaluations consistently indicate that WizardLM outperforms baselines such as Alpaca (trained from Self-Instruct) and Vicuna (trained from human-created instructions). The experimental results demonstrate that the quality of instruction-following dataset crafted by Evol-Instruct can significantly improve the performance of LLMs.
Primary Area: generative models
Keywords: generative model time series pattern recognition diffusion model financial time series
Scores: [ 8 8 6 ]
Primary Area: generative models
Keywords: LMs language models vision models GPT4 ChatGPT Midjourney generative models understanding models
Scores: [ 6 8 6 8 ]
Primary Area: generative models
Keywords: diffusion models diffusion ODE image super-resolution
Scores: [ 6 8 6 ]
Primary Area: generative models
Keywords: Large Language Model Network Compression
Scores: [ 8 6 5 8 ]
This paper explores network binarization, a radical form of quantization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression. Due to previous binarization methods collapsing LLMs, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs. Specifically, our exploration first uncovers the ineffectiveness of naïve applications of existing binarization algorithms and highlights the imperative role of salient weights in achieving low-bit quantization. Thus, PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage, i.e., partially-binarization. PB-LLM is extended to recover the capacities of quantized LMMs, by analyzing from the perspective of post-training quantization (PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian matrix and successfully recover the reasoning capacity of PB-LLM in low-bit. Under QAT, we freeze the salient weights during training, explore the derivation of optimal scaling factors crucial for minimizing the quantization error, and propose a scaling mechanism based on this derived scaling strategy for residual binarized weights. Those explorations and the developed methodologies significantly contribute to rejuvenating the performance of low-bit quantized LLMs and present substantial advancements in the field of network binarization for LLMs.
Primary Area: generative models
Keywords: Diffusion models riemannian manifolds graphs
Scores: [ 6 6 8 ]
Primary Area: generative models
Keywords: Energy-based model recovery-likelihood cooperative learning
Scores: [ 8 6 6 6 8 ]
Primary Area: generative models
Keywords: Talking Avatar Generation Video Generation Disentanglement Diffusion Models
Scores: [ 6 6 6 8 ]
Previous methods for zero-shot talking avatar generation have relied on domain-specific heuristics such as warping-based motion representation and 3D Morphable Models (3DMMs), etc. However, these heuristics limit the diversity and naturalness of the generated avatars due to the absence of end-to-end learning. In this work, we instead provide a data-driven solution, named GAIA (Generative AI for Avatar), that eliminates the domain priors in talking avatar generation. In light of the observation that the speech only drives the motion of the avatar while the appearance of the target avatar and the background typically remain the same throughout the entire video, we divide our approach into two stages: 1) motion and appearance disentanglement and 2) speech-to-motion generation. The first stage disentangles each frame into motion and appearance representations, while the second stage generates motion sequences conditioned on the speech and reference avatar image. To evaluate our approach, we collect a large-scale high-quality talking avatar dataset and train the model on it with different scales (up to 2B parameters). Experimental results verify the superiority, scalability, and flexibility of GAIA as 1) the resulting model beats previous baseline models in terms of naturalness, diversity, lip-sync quality, and visual quality; 2) the framework is scalable and larger models yield better results; 3) it is general and enables different applications like controllable talking avatar generation and text-driven video generation.
Primary Area: generative models
Keywords: diffusion models memorization generalization inductive bias curse of dimensionality denoising geometry-adaptive harmonic basis
Scores: [ 8 8 10 8 ]
Primary Area: generative models
Keywords: language model for code editing refactoring
Scores: [ 5 6 8 6 ]
Developers often dedicate significant time to maintaining and refactoring existing code. However, most prior work on generative models for code focuses solely on creating new code, overlooking the distinctive needs of editing existing code. In this work, we explore a multi-round code auto-editing setting, aiming to predict edits to a code region based on recent changes within the same codebase. Our model, Coeditor, is a fine-tuned language model specifically designed for code editing tasks. We represent code changes using a line diff format and employ static analysis to form large customized model contexts, ensuring the availability of appropriate information for prediction. We collect a code editing dataset from the commit histories of 1650 open-source Python projects for training and evaluation. In a simplified single-round, single-edit task, Coeditor significantly outperforms GPT-3.5 and SOTA open-source code completion models (bringing exact-match accuracy from 34.7 up to 60.4), demonstrating the benefits of incorporating editing history for code completion. In a multi-round, multi-edit setting, we observe substantial gains by iteratively conditioning on additional user edits. We have open-sourced our code, data, and model weights to encourage future research and have released a VSCode extension powered by our model for interactive IDE usage.
Primary Area: generative models
Keywords: novel view synthesis 3D object synthesis diffusion model
Scores: [ 6 6 6 6 ]
In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero123 has demonstrated impressive zero-shot open-set NVS capabilities, it treats NVS as a pure image-to-image translation problem. This approach suffers from the challengingly under-constrained nature of single-view NVS: the process lacks means of explicit user control and often result in implausible NVS generations.To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space.TOSS fine-tunes text-to-image Stable Diffusion pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments are conducted with results showing that our proposed TOSS outperforms Zero123 with higher-quality NVS results and faster convergence. We further support these results with comprehensive ablations that underscore the effectiveness and potential of the introduced semantic guidance and architecture design.
Primary Area: generative models
Keywords: NeRF Diffusion model 3D scene editing
Scores: [ 6 6 5 5 ]
Recently, there has been a significant advancement in text-to-image diffusion models, leading to groundbreaking performance in 2D image generation. These advancements have been extended to 3D models, enabling the generation of novel 3D objects from textual descriptions. This has evolved into NeRF editing methods, which allow the manipulation of existing 3D objects through textual conditioning. However, existing NeRF editing techniques have faced limitations in their performance due to slow training speeds and the use of loss functions that do not adequately consider editing. To address this, here we present a novel 3D NeRF editing approach dubbed ED-NeRF by successfully embedding real-world scenes into the latent space of the latent diffusion model (LDM) through a unique refinement layer. This approach enables us to obtain a NeRF backbone that is not only faster but also more amenable to editing compared to traditional image space NeRF editing. Furthermore, we propose an improved loss function tailored for editing by migrating the delta denoising score (DDS) distillation loss, originally used in 2D image editing to the three-dimensional domain. This novel loss function surpasses the well-known score distillation sampling (SDS) loss in terms of suitability for editing purposes. Our experimental results demonstrate that ED-NeRF achieves faster editing speed while producing improved output quality compared to state-of-the-art 3D editing models.
Primary Area: generative models
Keywords: conditional diffusion sampling classifier guidance
Scores: [ 5 6 5 6 8 ]
Guidance in conditional diffusion generation is of great importance for sample quality and controllability. However, existing guidance schemes are to be desired. On one hand, mainstream methods such as classifier guidance and classifier-free guidance both require extra training with labeled data, which is time-consuming and unable to adapt to new conditions.On the other hand, training-free methods such as universal guidance, though more flexible, have yet to demonstrate comparable performance. In this work, through a comprehensive investigation into the design space, we show that it is possible to achieve significant performance improvements over existing guidance schemes by leveraging off-the-shelf classifiers in a training-free fashion, enjoying the best of both worlds. Employing calibration as a general guideline, we propose several pre-conditioning techniques to better exploit pretrained off-the-shelf classifiers for guiding diffusion generation. Extensive experiments on ImageNet validate our proposed method, showing that state-of-the-art (SOTA) diffusion models (DDPM, EDM, DiT) can be further improved (up to 20%) using off-the-shelf classifiers with barely any extra computational cost.With the proliferation of publicly available pretrained classifiers, our proposed approach has great potential and can be readily scaled up to text-to-image generation tasks.
Primary Area: generative models
Keywords: gauge freedom conservativitym intrinsic dimensionality estimation diffusion models explainable AI theory
Scores: [ 6 5 8 8 ]
Primary Area: generative models
Keywords: DDPM diffusion image generation encoder
Scores: [ 6 6 6 5 ]
Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditionals in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while potentially adding model flexibility. Firstly, we introduce a data and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves state of the art results on CIFAR-10. Secondly, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to one. This leads to theoretical insights: For a finite depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be one to have a well-defined ELBO.
Primary Area: generative models
Keywords: Stable Diffusion training-free grounded text-to-image generation controllable generation
Scores: [ 6 6 6 6 ]
Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images given text-prompts as input. However, these models fail to convey appropriate spatial composition specified by a layout instruction. In this work, we probe into zero-shot grounded T2I generation with diffusion models, that is, generating images corresponding to the input layout information without training auxiliary modules or finetuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of diffusion model during generative process, and assists the model to synthesize images (1) with high fidelity, (2) highly compatible with textual input, and (3) interpreting layout instructions accurately. Specifically, we leverage the discrete sampling to bridge the gap between consecutive attention maps and discrete layout constraints, and design a region-aware loss to refine the generative layout during diffusion process. We further propose a boundary-aware loss to strengthen object discriminability within the corresponding regions. Experimental results show that our method outperforms existing state-of-the-art zero-shot grounded T2I generation methods by a large margin both qualitatively and quantitatively on several benchmarks.
Primary Area: generative models
Keywords: Tabular data imbalanced learning language models
Scores: [ 6 6 6 ]
Primary Area: generative models
Keywords: Diffusion Model Memorization
Scores: [ 8 8 8 8 ]
Recent breakthroughs in diffusion models have exhibited exceptional image-generation capabilities. However, studies show that some outputs are merely replications of training data. Such replications present potential legal challenges for model owners, especially when the generated content contains proprietary information. In this work, we introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions. Our proposed method seamlessly integrates without disrupting sampling algorithms, and delivers high accuracy even at the first generation step, with a single generation per prompt. Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization. This offers an interactive medium for users to adjust their prompts. Moreover, we propose two strategies i.e., to mitigate memorization by leveraging the magnitude of text-conditional predictions, either through minimization during inference or filtering during training. These proposed strategies effectively counteract memorization while maintaining high-generation quality.
Primary Area: generative models
Keywords: Diffusion models image outpainting
Scores: [ 6 6 8 ]
Image outpainting aims to generate the content of an input sub-image beyond its original boundaries. It is an important task in content generation yet remains an open problem for generative models. This paper pushes the technical frontier of image outpainting in two directions that have not been resolved in literature: 1) outpainting with arbitrary and continuous multiples (without restriction), and 2) outpainting in a single step (even for large expansion multiples). Moreover, we develop a method that does not depend on a pre-trained backbone network, which is in contrast commonly required by the previous SOTA outpainting methods. The arbitrary multiple outpainting is achieved by utilizing randomly cropped views from the same image during training to capture arbitrary relative positional information. Specifically, by feeding one view and positional embeddings as queries, we can reconstruct another view. At inference, we generate images with arbitrary expansion multiples by inputting an anchor image and its corresponding positional embeddings. The one-step outpainting ability here is particularly noteworthy in contrast to previous methods that need to be performed for times to obtain a final multiple which is times of its basic and fixed multiple. We evaluate the proposed approach (called PQDiff as we adopt a diffusion-based generator as our embodiment, under our proposed \textbf{P}ositional \textbf{Q}uery scheme) on public benchmarks, demonstrating its superior performance over state-of-the-art approaches. Specifically, PQDiff achieves state-of-the-art FID scores on the Scenery (\textbf{21.512}), Building Facades (\textbf{25.310}), and WikiArts (\textbf{36.212}) datasets. Furthermore, under the 2.25x, 5x, and 11.7x outpainting settings, PQDiff only takes \textbf{51.3%}, \textbf{25.6%}, and \textbf{17.1%} of the time of the benchmark state-of-the-art (SOTA) method.
Primary Area: generative models
Keywords: human evaluation large language models evaluation natural language generation
Scores: [ 8 6 6 6 ]
Primary Area: generative models
Keywords: Generative Models Video Generative Models Evaluation Fidelity Diversity Assessment
Scores: [ 6 3 6 6 ]
Primary Area: generative models
Keywords: Neural Radiance Field One-shot Talking Face Generation
Scores: [ 8 10 8 8 ]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video. The existing methods fail to simultaneously achieve the goals of accurate 3D avatar reconstruction and stable talking face animation. Besides, while the existing works mainly focus on synthesizing the head part, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Potrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and switchable background using a head-torso-background super-resolution model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos compared to previous methods. Video samples are available at https://real3dportrait.github.io.
Primary Area: generative models
Keywords: GFlowNet molecule optimization biological sequence design local search reinforcement learning
Scores: [ 8 6 6 ]
Generative Flow Networks (GFlowNets) are amortized sampling methods that learn a distribution over discrete objects proportional to their rewards. GFlowNets exhibit a remarkable ability to generate diverse samples, yet occasionally struggle to consistently produce samples with high rewards due to over-exploration on wide sample space. This paper proposes to train GFlowNets with local search which focuses on exploiting high rewarded sample space to resolve this issue. Our main idea is to explore the local neighborhood via destruction and reconstruction guided by backward and forward policies, respectively. This allows biasing the samples toward high-reward solutions, which is not possible for a typical GFlowNet solution generation scheme which uses the forward policy to generate the solution from scratch. Extensive experiments demonstrate a remarkable performance improvement in several biochemical tasks. Source code is available: \url{https://anonymous.4open.science/r/ls-gfn-FC9A/README.md}.
Primary Area: generative models
Keywords: Efficient Transformers Inference-time Efficiency CPU
Scores: [ 6 6 3 6 6 ]
Primary Area: generative models
Keywords: Image Interpolation;Diffusion Models
Scores: [ 8 8 8 ]
Image interpolation based on diffusion models is promising in creating fresh and interesting images. Advanced interpolation methods mainly focus on linear spherical interpolation, delivering remarkable success for images generated by diffusion models.However, existing methods struggle with natural images (not generated by diffusion models), limiting practical applications.Our investigation into the interpolation process has unveiled that its shortcomings are rooted in the introduction of inappropriate noise, which may either exceed or fall below the denoising threshold, leading to issues such as image artifacts and information loss in the interpolated images. To address this issue, we initially investigated a direct noise addition method, which improved image quality but introduced unwanted information. Drawing from these findings, we subsequently developed a novel interpolation approach that harnesses the advantages of both techniques. This approach retains the valuable noise with information from the original images while introducing a subtle Gaussian noise to enhance interpolation quality. Moreover, we introduced an innovative constraint on the noise component responsible for generating artifacts and incorporated original image to supplement missing information.These enhancements not only improved the interpolation results for images within the training domain but also extended the capability to interpolate with natural images beyond the training domain, achieving in the best interpolation results to date.
Primary Area: generative models
Keywords: Image Restoration Diffusion Models Image Prior Blind Restoration Super-Resolution Image Denoising Colorization JPEG Artifacts Correction
Scores: [ 8 6 8 6 ]
Primary Area: generative models
Keywords: Generative model Classifier-based Diffusion Model Sketch
Scores: [ 6 6 6 5 ]
While diffusion models have revolutionized generative AI, their application to human sketch generation, especially in the creation of complex yet concise and recognizable sketches, remains largely unexplored. Existing efforts have primarily focused on vector-based sketches, limiting their ability to handle intricate sketch data. This paper introduces an innovative extension of diffusion models to pixellevel sketch generation, addressing the challenge of dynamically optimizing the guidance scale for classifier-guided diffusion. Our approach achieves a delicate balance between recognizability and complexity in generated sketches through scale-adaptive classifier-guided diffusion models, a scaling indicator, and the concept of a residual sketch. We also propose a three-phase sampling strategy to enhance sketch diversity and quality. Experiments on the QuickDraw dataset showcase the potential of diffusion models to push the boundaries of sketch generation, particularly in complex scenarios unattainable by vector-based methods.
Primary Area: generative models
Keywords: Large Multimodal Models; Joint Training; Interleaved Image-Text Generation; Autoregressive Models
Scores: [ 6 5 6 5 ]
In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
Primary Area: generative models
Keywords: Large Lanauge Models Knowledge Distillation
Scores: [ 6 8 5 6 ]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge from white-box generative LLMs is still under-explored, which becomes more and more important with the prosperity of LLMs. In this work, we propose MiniLLM that distills smaller language models from generative larger language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. Extensive experiments in the instruction-following setting show that the MiniLLM models generate more precise responses with the higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance. Our method is also scalable for different model families with 120M to 13B parameters. Our code can be found in the supplementary material.
Primary Area: generative models
Keywords: Diffusion models video generation
Scores: [ 6 5 6 5 ]
Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart lags behind due to the excessive training cost.To avert the training burden, we propose a training-free ControlVideo to produce high-quality videos based on the provided text prompts and motion sequences.Specifically, ControlVideo adapts a pre-trained text-to-image model (i.e., ControlNet) for controllable text-to-video generation.To generate continuous videos without flicker effect, we propose an interleaved-frame smoother to smooth the intermediate frames.In particular, interleaved-frame smoother splits the whole videos with successive three-frame clips, and stabilizes each clip by updating the middle frame with the interpolation among other two frames in latent space.Furthermore, a fully cross-frame interaction mechanism have been exploited to further enhance the frame consistency, while a hierarchical sampler is employed to produce long videos efficiently.Extensive experiments demonstrate that our ControlVideo outperforms the state-of-the-arts both quantitatively and qualitatively. It is worthy noting that, thanks to the efficient designs, ControlVideo could generate both short and long videos within several minutes using one NVIDIA 2080Ti. All videos are shown in this anonymous link.
Primary Area: generative models
Keywords: variational autoencoders posterior collapse
Scores: [ 6 6 6 6 8 ]
The posterior collapse phenomenon in variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables. As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations as input to the reconstruction process in the decoder. While this phenomenon has been an actively addressed topic related to VAE performance, the theory for posterior collapse remains underdeveloped, especially beyond the standard VAE. In this work, we advance the theoretical understanding of posterior collapse to two important and prevalent yet less studied classes of VAE: conditional VAE and hierarchical VAE. Specifically, via a non-trivial theoretical analysis of linear conditional VAE and hierarchical VAE with two levels of latent, we prove that the cause of posterior collapses in these models includes the correlation between the input and output of the conditional VAE and the effect of learnable encoder variance in the hierarchical VAE. We empirically validate our theoretical findings for linear conditional and hierarchical VAE and demonstrate that these results are also predictive for non-linear cases with extensive experiments.
Primary Area: generative models
Keywords: Diffusion models Time series
Scores: [ 6 5 8 ]
Primary Area: generative models
Keywords: reasoning language models pretraining
Scores: [ 8 6 6 ]
Primary Area: generative models
Keywords: Table Understanding In-context Learning Large Language Model
Scores: [ 5 6 5 6 ]
Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context samples to iteratively generate operations and update the table to represent a complex reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.
Primary Area: generative models
Keywords: Tabular data tabular generation diffusion models
Scores: [ 8 6 8 5 ]
Primary Area: generative models
Keywords: novel view synthesis object-centric scene representations camera control scene editing 3D diffusion generative models
Scores: [ 6 6 6 5 ]
Primary Area: generative models
Keywords: Language models Distillation RLHF
Scores: [ 8 6 6 6 ]
Primary Area: generative models
Keywords: Novel View Synthesis 3D from Single Image Efficient Training
Scores: [ 8 8 8 5 ]
The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Nevertheless, synthesizing novel views from a single image still remains a significant challenge in the realm of computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plain image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model that is specifically designed for 2D image synthesis has demonstrated its capability in producing photorealistic novel views, if sufficiently optimized on a 3D finetuning task. Although the fidelity and generalizability are greatly improved, training such a powerful diffusion model requires a vast volume of training data and model parameters, resulting in a notoriously long time and high computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework to learn single-image novel-view generative models. Motivated by our in-depth visual analysis of the inference process of diffusion models, we propose several pragmatic strategies to reduce the training overhead to a manageable scale, including a crafted timestep sampling strategy, a superior 3D feature extractor, and an enhanced training scheme. When combined, our framework is able to reduce the total training time from 10 days to less than 1 day, significantly accelerating the training process under the same computational platform (one instance with 8 Nvidia A100 GPUs). Comprehensive experiments are conducted to demonstrate the efficiency and generalizability of our proposed method on several common benchmarks.
Primary Area: generative models
Keywords: Prompting Large Language Models Reasoning Abstraction
Scores: [ 8 8 8 ]
Primary Area: generative models
Keywords: diffusion models fairness generative models minority generation
Scores: [ 6 6 6 3 ]
Primary Area: generative models
Keywords: diffusion model density ratio estimation dataset bias
Scores: [ 6 5 8 5 ]
Primary Area: generative models
Keywords: Implicit neural representation generative model domain agnostic
Scores: [ 6 6 6 6 ]
Recent studies have introduced a new class of generative models for synthesizing implicit neural representations (INRs) that capture arbitrary continuous signals in various domains.These models opened the door for domain-agnostic generative models, but they often fail to achieve high-quality generation.We observed that the existing methods generate the weights of neural networks to parameterize INRs and evaluate the network with fixed positional embeddings (PEs).Arguably, this architecture limits the expressive power of generative models and results in low-quality INR generation.To address this limitation, we propose Domain-agnostic Latent Diffusion Model for INRs (DDMI) that generates adaptive positional embeddings instead of neural networks' weights.Specifically, we develop a Discrete-to-continuous space Variational AutoEncoder (D2C-VAE), which seamlessly connects discrete data and the continuous signal functions in the shared latent space. Additionally, we introduce a novel conditioning mechanism for evaluating INRs with the generated hierarchically decomposed basis fields to further enhance expressive power.Extensive experiments across four modalities, \eg, 2D images, 3D shapes, Neural Radiance Fields, and videos, with seven benchmark datasets, demonstrate the versatility of DDMI and its superior performance compared to the existing INR generative models.
Primary Area: generative models
Keywords: text-to-3d generative models diffusion models 3D reconstruction 3D generation sparse-view reconstruction
Scores: [ 8 8 6 ]
Text-to-3D with diffusion models have achieved remarkable progress in recent years. However, existing methods either rely on score distillation-based optimization which suffer from slow inference, low diversity and Janus problems, or are feed-forward methods that generate low quality results due to the scarcity of 3D training data. In this paper, we propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. We adopt a two-stage paradigm, which first generates a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model, and then directly regresses the NeRF from the generated images with a novel transformer-based sparse-view reconstructor. Through extensive experiments, we demonstrate that our method can generate high-quality, diverse and Janus-free 3D assets within 20 seconds, which is two order of magnitude faster than previous optimization-based methods that can take 1 to 10 hours. Our project webpage: https://instant-3d.github.io/.
Primary Area: generative models
Keywords: Graph Generation Denoising Diffusion Spectral Graph Theory
Scores: [ 6 5 8 5 ]
In the realm of generative models for graphs, a substantial body of research has been undertaken. However, most existing methods encounter difficulties when applied to large graphs. These issues stem from the need to express the entire joint distribution across all node pairs, the difficulty of accurately representing both the global structure and local details of the graph at the same time, and the computational complexity of the neural networks used.To overcome these issues, we introduce a method that generates a graph by progressively expanding a single node to a target graph. In each step, nodes and edges are added in a localized manner through denoising diffusion, building first the global structure, and then refining the local details. Since each generation step is local, our method does not need to model the complete joint distribution over all node pairs at once, allowing for significant computational efficiency, that is, an empirical subquadratic runtime in relation to the number of nodes. At the same time, multiscale generation allows us to retain high expressivity. Our experiments show that our model achieves state-of-the-art performance on well-established benchmark datasets while successfully scaling to graphs with at least 5000 nodes. Our method is also the first to successfully extrapolate to graphs outside of the training distribution, showcasing a much better generalization capability over existing methods.
Primary Area: generative models
Keywords: optimal transport domain translation image translation flow matching
Scores: [ 6 6 6 6 ]
In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, whichmakes it prone to outliers and limits its applicability in real-world scenarios. The latter can be particularly harmful in OT domain translation tasks, where the relative position of a sample within a distribution is explicitly taken into account. While unbalanced OT tackles this challenge in the discrete setting, its integration into neural Monge map estimators has received limited attention. We propose a theoreticallygrounded method to incorporate unbalancedness into any Monge map estimator. We improve existing estimators to model cell trajectories over time and to predict cellular responses to perturbations. Moreover, our approach seamlessly integrates with the OT flow matching (OT-FM) framework. While we show that OT-FM performs competitively in image translation, we further improve performance byincorporating unbalancedness (UOT-FM), which better preserves relevant features. We hence establish UOT-FM as a principled method for unpaired image translation.
Primary Area: generative models
Keywords: Computer Vision Diffusion Models Video Editing
Scores: [ 6 6 8 6 ]
This paper introduces a novel grounding-guided video-to-video translation framework called Ground-A-Video for multi-attribute video editing.Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training T2V models on text-video data or adopting training-free methods.However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact.Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without aforementioned shortcomings.Central to our method is the introduction of cross-frame gated attention which incorporates groundings information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing.Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.Further results and codes are provided at our anonymous project page (http://ground-a-video.github.io).
Primary Area: generative models
Keywords: retrieval-augmented language model large language model knowledge intensive NLP
Scores: [ 5 6 6 8 ]
Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in 0-shot setting and +1.4% in 5-shot setting on average.
Primary Area: generative models
Keywords: diffusion models preference-based learning
Scores: [ 5 8 8 3 ]
Primary Area: generative models
Keywords: large language models self-supervised learning data augmentation
Scores: [ 8 8 8 8 ]
We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.
Primary Area: generative models
Keywords: Diffusion Models Generative Models Acceleration
Scores: [ 8 8 6 6 ]
Diffusion models have revolutionized text-to-image generation with its exceptional quality and creativity. However, its multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve its sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model.In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fréchet Inception Distance) of 23.3 on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin (37.2 → 23.3 in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to 22.4. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of 13.1 in just 0.09 second, the best in ≤0.1 second regime, outperforming the recent StyleGAN-T (13.9 in 0.1 second). Notably, the training of InstaFlow only costs 199 A100 GPU days.
Primary Area: generative models
Keywords: Score Distillation 3D Content Creation Diffusion Model
Scores: [ 8 6 3 6 ]
Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled 3D content creation by optimizing a randomly initialized differentiable 3D representation with score distillation. However, the optimization process suffers slow convergence and the resultant 3D models often exhibit two limitations: (a) quality concerns such as missing attributes and distorted shape and textures; (b) extremely low diversity comparing to text-guided image synthesis. In this paper, we show that the conflict between the 3D optimization process and uniform timestep sampling in score distillation is the main reason for these limitations. To resolve this conflict, we propose to prioritize timestep sampling with monotonically non-increasing functions, which aligns the 3D optimization process with the sampling process of diffusion model. Extensive experiments show that our simple redesign significantly improves 3D content creation with faster convergence, better quality and diversity.
Primary Area: generative models
Keywords: diffusion models score matching variational approximation regularization by denoising inverse problems
Scores: [ 5 5 6 6 ]
Diffusion models have emerged as a key pillar of foundation models in visual domains. One of their critical applications is to universally solve different downstream inverse tasks via a single diffusion prior without re-training for each task. Most inverse tasks can be formulated as inferring a posterior distribution over data (e.g., a full image) given a measurement (e.g., a masked image). This is however challenging in diffusion models since the nonlinear and iterative nature of the diffusion process renders the posterior intractable. To cope with this challenge, we propose a variational approach that by design seeks to approximate the true posterior distribution. We show that our approach naturally leads to regularization by denoising diffusion process (RED-diff) where denoisers at different timesteps concurrently impose different structural constraints over the image. To gauge the contribution of denoisers from different timesteps, we propose a weighting mechanism based on signal-to-noise-ratio (SNR). Our approach provides a new variational perspective for solving inverse problems with diffusion models, allowing us to formulate sampling as stochastic optimization, where one can simply apply off-the-shelf solvers with lightweight iterates. Our experiments for image restoration tasks such as inpainting and superresolution demonstrate the strengths of our method compared with state-of-the-art sampling-based diffusion models.
Primary Area: generative models
Keywords: generative models flow matching diffusion models normalizing flows ode solver fast sampling distillation
Scores: [ 6 8 8 8 6 ]
Diffusion or flow-based models are powerful generative paradigms that are notoriously hard to sample as samples are defined as solutions to high-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs) which require a large Number of Function Evaluations (NFE) to approximate well. Existing methods to alleviate the costly sampling process include model distillation and designing dedicated ODE solvers. However, distillation is costly to train and sometimes can deteriorate quality, while dedicated solvers still require relatively large NFE to produce high quality samples. In this paper we introduce Bespoke solvers'', a novel framework for constructing custom ODE solvers tailored to the ODE of a given pre-trained flow model. Our approach optimizes an order consistent and parameter-efficient solver (e.g., with 80 learnable parameters), is trained for roughly 1% of the GPU time required for training the pre-trained model, and significantly improves approximation and generation quality compared to dedicated solvers. For example, a Bespoke solver for a CIFAR10 model produces samples with Fréchet Inception Distance (FID) of 2.73 with 10 NFE, and gets to 1% of the Ground Truth (GT) FID (2.59) for this model with only 20 NFE. On the more challenging ImageNet-64$\times$64, Bespoke samples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20 NFE.
Primary Area: generative models
Keywords: Neural Scene Representations Generative Models Denoising Diffusion 3D Reconstruction
Scores: [ 6 8 6 6 ]
Generating and reconstructing 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks, depths, or regularizers, required by prior works. Third, we introduce a fast and unified architecture that supports conditioning on varying numbers of images, including a priori generation of 3D scenes and reconstruction from one or several images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction.
Primary Area: generative models
Keywords: Neural Radiance Fields 3D generation generative 3D models
Scores: [ 5 8 5 8 ]
We present Magic123'', a two-stage coarse-to-fine approach for high-quality, textured 3D mesh generation from a single image in the wild using both 2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference-view supervision and novel-view guidance by a joint 2D and 3D diffusion prior. We introduce a trade-off parameter between the 2D and 3D priors to control the details and 3D consistencies of the generation. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on diverse synthetic and real-world images.
Primary Area: generative models
Keywords: Instruction Finetuning
Scores: [ 6 5 6 6 ]
We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training.Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings.NEFTune also improves over strong baselines on modern instruction datasets.Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.
Primary Area: generative models
Keywords: diffusion models classifier-free guidance
Scores: [ 5 6 8 5 ]
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: Pipeline Parallelism Zero Bubble
Scores: [ 8 6 6 8 ]
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: BatteryML Battery life prediction Machine learning Open-source platform Unified standards Collaborative research
Scores: [ 8 6 8 6 ]
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: programming models prompting techniques in-context learning few-shot learning chain of thought multi-hop reasoning language agents
Scores: [ 6 8 8 ]
The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded “prompt templates”, i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, or imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies, showing that succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, DSPy can automatically produce prompt pipelines and finetune pipelines that outperform out-of-the-box few-shot prompting as well as expert-created demonstrations for GPT-3.5 and Llama2-13b-chat. On top of that, DSPy programs compiled to relatively small LMs like 770M parameter T5 and Llama2- 13b-chat are competitive with many approaches that rely on large and proprietary LMs like GPT-3.5 and on expert-written prompt chains.
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: data debugging data valuation shapley value machine learning pipelines
Scores: [ 8 8 6 6 ]
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: Deep Neural Networks Quantized Neural Networks Network Quantization Accumulators Accelerators Inference Computer Vision Language Models
Scores: [ 6 6 5 6 ]
The majority of the research on the quantization of Deep Neural Networks (DNNs) is focused on reducing the precision of tensors visible by high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still relies on high-accuracy core operations. Most significant is the operation of accumulating products. This high-precision accumulation operation is gradually becoming the main computational bottleneck. This is because, so far, the usage of low-precision accumulators led to a significant degradation in performance. In this work, we present a simple method to train and fine-tune DNNs, to allow, for the first time, utilization of cheaper, 12-bits accumulators, with no significant degradation in accuracy. Lastly, we show that as we decrease the accumulation precision further, using fine-grained gradient approximations can improve the DNN accuracy.
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: Attention GPUs LLM training LLM inference
Scores: [ 5 8 10 6 ]
Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention [5] exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2× speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: Reinforcement Learning Distributed Systems Large Scale Training
Scores: [ 8 8 6 8 ]
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: JAX automatic differentiation calculus of variation higher-order functions functionals operators JVP rules transpose rules
Scores: [ 5 6 8 8 ]
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: low-precision LLM pretraining 2 bits auto compression low memory pretraining
Scores: [ 8 5 5 6 6 ]
Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, and/or when small batch size per GPU is used, ZeRO’s effective throughput is limited due to communication overheads. To alleviate this limitation, this paper introduces ZeRO++ composing of three communication volume reduction techniques (lowprecision all-gather, data remapping, and low-precision gradient averaging) to significantly reduce the communication volume up to 4x that enables up to 2.16x better throughput at 384 GPU scale. Our results also show ZeRO++ can speedup the RLHF by 3.3x compared to vanilla ZeRO. To verify the convergence of ZeRO++, we test up to 13B model for pretraining with 8/6-bits all gather and up to 30B model for finetuning with 4/2-bits all gather, and demonstrate on-par accuracy as original ZeRO (aka standard training). As a byproduct, the model trained with ZeRO++ is naturally weight-quantized, which can be directly used for inference without post-training quantization or quantization-aware training.
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: convolutions GPUs hardware-efficient algorithms long context fast fourier transform I/O awareness
Scores: [ 6 8 8 ]
Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time.A major bottleneck is the Fast Fourier Transform (FFT)---which allows long convolutions to run in O(NlogN) time in sequence length N but has poor hardware utilization.In this paper, we study how to optimize the FFT convolution.We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy.In response, we propose FlashFFTConv.FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O.We also present two sparse convolution algorithms---1) partial convolutions and 2) frequency-sparse convolutions---which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings.FlashFFTConv speeds up exact FFT convolutions by up to 8.7$\times$ over PyTorch and achieves up to 4.4$\times$ speedup end-to-end.Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity and M2-BERT-base to achieve 3.3 points higher GLUE score---matching models with twice the parameter count.FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%.Furthermore, partial convolutions enable longer-sequence models---yielding the first DNA model that can process the longest human genes (2.3M base pairs)---and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality.
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: Neural Network Quantization Model Compression Generative Language Model Transformer Deep Learning
Scores: [ 8 6 6 8 ]
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: Long Sequence Inference Inference Compiler Activation Memory Machine Learning Infrastructure Low Resource Inference
Scores: [ 5 8 6 ]
Large deep learning models have achieved impressive performance across a range of applications. However, their large memory requirements, including parameter memory and activation memory, have become a significant challenge for their practical serving. While existing methods mainly address parameter memory, the importance of activation memory has been overlooked. Especially for long input sequences, activation memory is expected to experience a significant exponential growth as the length of sequences increases. In this approach, we propose AutoChunk, an automatic and adaptive compiler system that efficiently reduces activation memory for long sequence inference by chunk strategies. The proposed system generates chunk plans by optimizing through multiple stages. In each stage, the chunk search pass explores all possible chunk candidates and the chunk selection pass identifies the optimal one. At runtime, AutoChunk employs code generation to automatically apply chunk strategies. The experiments demonstrate that AutoChunk can reduce over 80% of activation memory while maintaining speed loss within 10%, extend max sequence length by 3.2x to 11.7x, and outperform state-of-the-art methods by a large margin.
Primary Area: infrastructure, software libraries, hardware, etc.
Keywords: Electronics Design Automation (EDA) Logic Synthesis Reinforcement Learning Hardware design Circuits
Scores: [ 8 6 6 5 6 ]
Logic synthesis, a pivotal stage in chip design, entails optimizing chip specifications encoded in hardware description languages like Verilog into highly efficient implementations using Boolean logic gates. The process involves a sequential application of logic minimization heuristics (synthesis recipe"), with their arrangement significantly impacting crucial metrics such as area and delay. Addressing the challenge posed by the broad spectrum of hardware design complexities — from variations of past designs (e.g., adders and multipliers) to entirely novel configurations (e.g., innovative processor instructions) — requires a nuanced 'synthesis recipe' guided by human expertise and intuition. This study conducts a thorough examination of learning and search techniques for logic synthesis, unearthing a surprising revelation: pre-trained agents, when confronted with entirely novel designs, may veer off course, detrimentally affecting the search trajectory. We present RGLS, a meticulously tuned α parameter that adeptly adjusts recommendations from pre-trained agents during the search process. Computed based on similarity scores through nearest neighbor retrieval from the training dataset, RGLS yields superior synthesis recipes tailored for a wide array of hardware designs. Our findings showcase substantial enhancements in the Quality of Result (QoR) of synthesized circuits, boasting improvements of up to 24.8% compared to state-of-the-art techniques. Furthermore, RGLS achieves an impressive up to 9x reduction in runtime (iso-QoR) when compared to current state-of-the-art methodologies.
Primary Area: learning on graphs and other geometries & topologies
Keywords: graph neural network
Scores: [ 8 8 6 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graphs kernels random walks Laplacian adjacency matrix kernel learning ordinary differential equation neural network
Scores: [ 8 8 8 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Temporal Dynamic Graphs Spectral Transform GNN
Scores: [ 6 6 6 6 ]
We present the Evolving Graph Fourier Transform (EFT), the first invertible spectral transform that captures evolving representations on temporal graphs. We motivate our work by the inadequacy of existing methods for capturing the evolving graph spectra, which are also computationally expensive due to the temporal aspect along with the graph vertex domain. We view the problem as an optimization over the Laplacian of the continuous time dynamic graph. Additionally, we propose pseudo-spectrum relaxations that decompose the transformation process, making it highly computationally efficient. The EFT method adeptly captures the evolving graph's structural and positional properties, making it effective for downstream tasks on evolving graphs. Hence, as a reference implementation, we develop a simple neural model induced with \eft for capturing evolving graph spectra. We empirically validate our theoretical findings on a number of large-scale and standard temporal graph benchmarks and demonstrate that our model achieves state-of-the-art performance.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Fraud Detection
Scores: [ 5 5 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Network Link Prediction
Scores: [ 3 8 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: graph neural networks expressive power representation learning subgraph isomorphism cliques cycles motifs substructures count message-passing
Scores: [ 6 6 6 6 ]
Graph Neural Networks (GNNs) are powerful representation learning tools that have achieved remarkable performance in various important tasks. However, their ability to count substructures, which play a crucial role in biological and social networks, remains uncertain. In this work, we fill this gap and characterize the representation power of GNNs in terms of their ability to produce powerful representations that count graph substructures. In particular, we study the message-passing operations of GNNs with random stationary input and show that they can produce permutation equivariant representations that are associated with high-order statistical moments. Using these representations, we prove that GNNs can learn how to count cycles, quasi-cliques, and the number of connected components in a graph. We also provide new insights into the generalization capacity of GNNs. Our analysis is constructive and enables the design of a generic GNN architecture that shows remarkable performance in four distinct tasks: cycle detection, cycle counting, graph classification, and molecular property prediction.
Primary Area: learning on graphs and other geometries & topologies
Keywords: graph learning nature-powered computing dynamic physical system
Scores: [ 6 6 5 ]
Nature performs complex computations constantly at clearly lower cost and higher performance than digital computers. It is crucial to understand how to harness the unique computational power of nature in Machine Learning (ML). In the past decade, besides the development of Neural Networks (NNs), the community has also relentlessly explored nature-powered ML paradigms. Although most of them are still predominantly theoretical, a new practical paradigm enabled by the recent advent of CMOS-compatible room-temperature nature-based computers has emerged. By harnessing the nature's power of entropy increase, this paradigm can solve binary learning problems delivering immense speedup and energy savings compared with NNs, while maintaining comparable accuracy. Regrettably, its values to the real world are highly constrained by its binary nature. A clear pathway to its extension to real-valued problems remains elusive. This paper aims to unleash this pathway by proposing a novel end-to-end Nature-Powered Graph Learning (NP-GL) framework. Specifically, through a three-dimensional co-design, NP-GL can leverage the nature's power of entropy increase to efficiently solve real-valued graph learning problems. Experimental results across 4 real-world applications with 6 datasets demonstrate that NP-GL delivers, on average, 6.97×103 speedup and 105 energy consumption reduction with comparable or even higher accuracy than Graph Neural Networks (GNNs).
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Generalization bounds PAC-Bayesian Structural imbalance Graph Global Workspace
Scores: [ 6 6 6 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Lie Groups Riemannian Batch Normalization SPD Neural Networks
Scores: [ 6 3 8 1 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Deep weight space Graph neural networks Transformers Permutation equivariance Implicit neural representations Networks for networks Neural graphs
Scores: [ 8 8 6 ]
Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors.However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself.In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry.Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures.We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Lottery Ticket Hypothesis Graph Neural Networks Graph Neural Tangent Kernel
Scores: [ 6 6 5 ]
Graph Neural Networks (GNNs) have emerged as the leading deep learning models for graph-based representation learning. However, the training and inference of GNNs on large graphs remain resource-intensive, impeding their utility in real-world scenarios and curtailing their applicability in deeper and more sophisticated GNN architectures. To address this issue, the Graph Lottery Ticket (GLT) hypothesis assumes that GNN with random initialization harbors a pair of core subgraph and sparse subnetwork, which can yield comparable performance and higher efficiency to that of the original dense network and complete graph. Despite that GLT offers a new paradigm for GNN training and inference, existing GLT algorithms heavily rely on trial-and-error pruning rate tuning and scheduling, and adhere to an irreversible pruning paradigm that lacks elasticity. Worse still, current methods suffer scalability issues when applied to deep GNNs, as they maintain the same topology structure across all layers. These challenges hinder the integration of GLT into deeper and larger-scale GNN contexts. To bridge this critical gap, this paper introduces an $\textbf{A}$daptive, $\textbf{D}$ynamic, and $\textbf{A}$utomated framework for identifying $\textbf{G}$raph $\textbf{L}$ottery $\textbf{T}ickets(\textbf{AdaGLT}$). Our proposed method derives its key advantages and addresses the above limitations through the following three aspects: 1) tailoring layer-adaptive sparse structures for various datasets and GNNs, thus endowing it with the capability to facilitate deeper GNNs; 2) integrating the pruning and training processes, thereby achieving a dynamic workflow encompassing both pruning and restoration; 3) automatically capturing graph lottery tickets across diverse sparsity levels, obviating the necessity for extensive pruning parameter tuning. More importantly, we rigorously provide theoretical proofs to guarantee AdaGLT to mitigate over-smoothing issues and obtain improved sparse structures in deep GNN scenarios. Extensive experiments demonstrate that AdaGLT outperforms state-of-the-art competitors across multiple graph datasets of various scales and types, particularly in scenarios involving deep GNNs.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph neural networks Graph canonization Stability
Scores: [ 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Equivariant Operations; Tensor Product; Change of basis; Spherical Harmonics; Fourier Basis;
Scores: [ 8 6 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Metanetwork graph equivariance expressivity
Scores: [ 6 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: GNN graph pooling parsing
Scores: [ 5 8 8 3 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Conformal Prediction Graph Neural Networks Inductive Node Classification
Scores: [ 5 8 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Node classification Graph Neural Networks Large Language Models
Scores: [ 8 6 6 6 ]
In recent years, there have been remarkable advancements in node classification achieved by Graph Neural Networks (GNNs). However, they necessitate abundant high-quality labels to ensure promising performance. In contrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency on text-attributed graphs. Yet, they face challenges in efficiently processing structural data and suffer from high inference costs. In light of these observations, this work introduces a label-free node classification on graphs with LLMs pipeline, LLM-GNN. It amalgamates the strengths of both GNNs and LLMs while mitigating their limitations. Specifically, LLMs are leveraged to annotate a small portion of nodes and then GNNs are trained on LLMs' annotations to make predictions for the remaining large portion of nodes. The implementation of LLM-GNN faces a unique challenge: how can we actively select nodes for LLMs to annotate and consequently enhance the GNN training? How can we leverage LLMs to obtain annotations of high quality, representativeness, and diversity, thereby enhancing GNN performance with less cost?To tackle this challenge, we develop an annotation quality heuristic and leverage the confidence scores derived from LLMs to advanced node selection. Comprehensive experimental results validate the effectiveness of LLM-GNN. In particular, LLM-GNN can achieve an accuracy of 74.9% on a vast-scale dataset \products with a cost less than 1 dollar.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Deep Graph Networks Graph Neural Networks Bayesian Networks Sum-Product Networks Graph Representation Learning Missing Data Graph Classification Weak Supervision
Scores: [ 6 6 5 ]
We introduce Graph-Induced Sum-Product Networks (GSPNs), a new probabilistic framework for graph representation learning that can tractably answer probabilistic queries. Inspired by the computational trees induced by vertices in the context of message-passing neural networks, we build hierarchies of sum-product networks (SPNs) where the parameters of a parent SPN are learnable transformations of the a-posterior mixing probabilities of its children's sum units. Due to weight sharing and the tree-shaped computation graphs of GSPNs, we obtain the efficiency and efficacy of deep graph networks with the additional advantages of a probabilistic model. We show the model's competitiveness on scarce supervision scenarios, under missing data, and for graph classification in comparison to popular neural models. We complement the experiments with qualitative analyses on hyper-parameters and the model's ability to answer probabilistic queries.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Knowledge Distillation Efficient Graph Learning
Scores: [ 5 8 8 6 5 ]
GNN-to-MLP distillation aims to utilize knowledge distillation (KD) to learn computationally-efficient multi-layer perceptron (student MLP) on graph data by mimicking the output representations of teacher GNN. Existing methods mainly make the MLP to mimic the GNN predictions over a few class labels. However, the class space may not be expressive enough for covering numerous diverse local graph structures, thus limiting the performance of knowledge transfer from GNN to MLP. To address this issue, we propose to learn a new powerful graph representation space by directly labeling nodes' diverse local structures for GNN-to-MLP distillation. Specifically, we propose a variant of VQ-VAE to learn a structure-aware tokenizer on graph data that can encode each node's local substructure as a discrete code. The discrete codes constitute a codebook as a new graph representation space that is able to identify different local graph structures of nodes with the corresponding code indices. Then, based on the learned codebook, we propose a new distillation target, namely soft code assignments, to directly transfer the structural knowledge of each node from GNN to MLP. The resulting framework VQGraph achieves new state-of-the-art performance on GNN-to-MLP distillation in both transductive and inductive settings across seven graph datasets. We show that VQGraph with better performance infers faster than GNNs by 828×, and also achieves accuracy improvement over GNNs and stand-alone MLPs by 3.90% and 28.05% on average, respectively.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Subgraphs Expressive power Sampling
Scores: [ 6 6 8 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Dynamic Graph Neural Network
Scores: [ 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: general manifolds diffusion models continuous normalizing flow
Scores: [ 8 8 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph anomaly detection consistency training learnable data augmentation
Scores: [ 5 8 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Higher-order Representation Learning Simplicial Complexes Simplicial Neural Networks Weisfeiler-Lehman Isomorphism Test
Scores: [ 6 6 8 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Active Learning Graph Neural Networks Structural Fairness
Scores: [ 5 6 6 6 ]
Graph Neural Networks (GNNs) have seen significant achievements in semi-supervised node classification. Yet, their efficacy often hinges on access to high-quality labeled node samples, which may not always be available in real-world scenarios. While active learning is commonly employed across various domains to pinpoint and label high-quality samples based on data features, graph data present unique challenges due to their intrinsic structures that render nodes non-i.i.d. Furthermore, biases emerge from the positioning of labeled nodes; for instance, nodes closer to the labeled counterparts often yield better performance. To better leverage graph structure and mitigate structural bias in active learning, we present a unified optimization framework (SCARCE), which is also easily incorporated with node features. Extensive experiments demonstrate that the proposed method not only improves the GNNs performance but also paves the way for more fair results.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Link prediction; Node Local Topology; Graph Neural Network ; Cold-start Issues
Scores: [ 6 6 6 5 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: E(3) equivariant networks
Scores: [ 6 8 6 ]
Integrating a notion of symmetry into point cloud neural networks is a provably effective way to improve their generalization capability. Of particular interest are E(3) equivariant point cloud networks where Euclidean transformations applied to the inputs are preserved in the outputs. Recent efforts aim to extend networks that are equivariant with respect to a single global E(3) transformation, to accommodate inputs made of multiple parts, each of which exhibits local E(3) symmetry.In practical settings, however, the partitioning into individually transforming regions is unknown a priori.Errors in the partition prediction would unavoidably map to errors in respecting the true input symmetry. Past works have proposed different ways to predict the partition, which may exhibit uncontrolled errors in their ability to maintain equivariance to the actual partition. To this end, we introduce APEN: a general framework for constructing approximate piecewise-E(3) equivariant point networks. Our framework offers an adaptable design to guaranteed bounds on the resulting piecewise E(3) equivariance approximation errors.Our primary insight is that functions which are equivariant with respect to a finer partition (compared to the unknown true partition) will also maintain equivariance in relation to the true partition. Leveraging this observation, we propose a compositional design for a partition prediction model. It initiates with a fine partition and incrementally transitions towards a coarser subpartition of the true one, consistently maintaining piecewise equivariance in relation to the current partition.As a result, the equivariance approximation error can be bounded solely in terms of (i) uncertainty quantification of the partition prediction, and (ii) bounds on the probability of failing to suggest a proper subpartition of the ground truth one.We demonstrate the practical effectiveness of APEN using two data types exemplifying part-based symmetry: (i) real-world scans of room scenes containing multiple furniture-type objects; and, (ii) human motions, characterized by articulated parts exhibiting rigid movement. Our empirical results demonstrate the advantage of integrating piecewise E(3) symmetry into network design, showing a distinct improvement in generalization over prior works in terms of generalization accuracy for both classification and segmentation tasks.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Clifford Algebra Geometric Algebra Graph Neural Networks Simplicial Message Passing Topological Deep Learning Geometric Deep Learning Equivariance
Scores: [ 6 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: graph neural networks Datasets molecules molecular graphs Quantum Multi-task foundation model
Scores: [ 8 6 8 6 3 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: GNN MPNN graph rewiring probabilistic weisfeiler leman lehman k-subset sampling expressivity
Scores: [ 8 6 6 ]
Message-passing graph neural networks (MPNNs) emerged as powerful tools for processing graph-structured input. However, they operate on a fixed input graph structure, ignoring potential noise and missing information. Furthermore, their local aggregation mechanism can lead to problems such as over-squashing and limited expressive power in capturing relevant graph structures. Existing solutions to these challenges have primarily relied on heuristic methods, often disregarding the underlying data distribution. Hence, devising principled approaches for learning to infer graph structures relevant to the given prediction task remains an open challenge. In this work, leveraging recent progress in exact and differentiable k-subset sampling, we devise probabilistically rewired MPNNs (PR-MPNNs), which learn to add relevant edges while omitting less beneficial ones. For the first time, our theoretical analysis explores how PR-MPNNs enhance expressive power, and we identify precise conditions under which they outperform purely randomized approaches. Empirically, we demonstrate that our approach effectively mitigates issues like over-squashing and under-reaching. In addition, on established real-world datasets, our method exhibits competitive or superior predictive performance compared to traditional MPNN models and recent graph transformer architectures.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Representation Learning Graph Transformer
Scores: [ 5 6 6 5 ]
Graph transformer has been proven as an effective graph learning method for its adoption of attention mechanism that is capable of capturing expressive representations from complex topological and feature information of graphs. Graph transformer conventionally performs dense attention (or global attention) for every pair of nodes to learn node representation vectors, resulting in quadratic computational costs that are unaffordable for large-scale graph data. Therefore, mini-batch training for graph transformers is a promising direction, but limited samples in each mini-batch can not support effective dense attention to encode informative representations.Facing this bottleneck, (1) we start by assigning each node a token list that is sampled by personalized PageRank (PPR) and then apply standard multi-head self-attention only on this list to compute its node representations. This PPR tokenization method decouples model training from complex graph topological information and makes heavy feature engineering offline and independent, such that mini-batch training of graph transformers is possible by loading each node's token list in batches. We further prove this PPR tokenization is viable as a graph convolution network with a fixed polynomial filter and jumping knowledge. However, only using personalized PageRank may limit information carried by a token list, which could not support different graph inductive biases for model training. To this end, (2) we rewire graphs by introducing multiple types of virtual connections through structure- and content-based super nodes that enable PPR tokenization to encode local and global contexts, long-range interaction, and heterophilous information into each node's token list, and then formalize our Virtual Connection Ranking based Graph Transformer (VCR-Graphormer). Overall, VCR-Graphormer only needs O(m+klogk) complexity for graph tokenization as compared to O(n3) of previous works. We also show that VCR-Graphormer outperforms the state-of-the-arts on node classification in 12 datasets.
Primary Area: learning on graphs and other geometries & topologies
Keywords: large language models (LLM) feature learning text attributed graphs (TAG) graph neural networks (GNN)
Scores: [ 6 5 6 ]
Representation learning on text-attributed graphs (TAGs) has become a critical research problem in recent years. A typical example of a TAG is a paper citation graph, where the text of each paper serves as node attributes. Initial graph neural network (GNN) pipelines handled these text attributes by transforming them into shallow or hand-crafted features, such as skip-gram or bag-of-words features. Recent efforts have focused on enhancing these pipelines with language models (LMs), which typically demand intricate designs and substantial computational resources. With the advent of powerful large language models (LLMs) such as GPT or Llama2, which demonstrate an ability to reason and to utilize general knowledge, there is a growing need for techniques which combine the textual modelling abilities of LLMs with the structural learning capabilities of GNNs. Hence, in this work, we focus on leveraging LLMs to capture textual information as features, which can be used to boost GNN performance on downstream tasks. A key innovation is our use of \emph{explanations as features}: we prompt an LLM to perform zero-shot classification, request textual explanations for its decision-making process, and design an \emph{LLM-to-LM interpreter} to translate these explanations into informative features that enhance downstream GNNs. Our experiments demonstrate that our method achieves state-of-the-art results on well-established TAG datasets, including \texttt{Cora}, \texttt{PubMed}, \texttt{ogbn-arxiv}, as well as our newly introduced dataset, \texttt{arXiv-2023}. Furthermore, our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on \texttt{ogbn-arxiv}. Lastly, we believe the versatility of the proposed method extends beyond TAGs and holds the potential to enhance other tasks involving graph-text data~\footnote{Our codes and datasets are available at: \url{https://anonymous.4open.science/r/TAPE-dev}}.
Primary Area: learning on graphs and other geometries & topologies
Keywords: simplicial complex simplicial neural network random walks
Scores: [ 8 5 8 5 ]
Triggered by limitations of graph-based deep learning methods in terms of computational expressivity and model flexibility, recent years have seen a surge of interest in computational models that operate on higher-order topological domains such as hypergraphs and simplicial complexes. While the increased expressivity of these models can indeed lead to a better classification performance and a more faithful representation of the underlying system, the computational cost of these higher-order models can increase dramatically. To this end, we here explore a simplicial complex neural network learning architecture based on random walks and fast 1D convolutions (SCRaWl), in which we can adjust the increase in computational cost by varying the length and number of random walks considered while accounting for higher-order relationships. Importantly, due to the random walk-based design, the expressivity of the proposed architecture is provably incomparable to that of existing message-passing simplicial neural networks. We empirically evaluate SCRaWl on real-world datasets and show that it outperforms other simplicial neural networks.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Differentiable Euler Characteristic Transforms for Shape Classification
Scores: [ 6 5 6 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph neural network
Scores: [ 6 6 6 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Neural Algorithmic Reasoning
Scores: [ 6 8 8 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Information Theory Heterophily Graphs
Scores: [ 6 8 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Label Poisoning Attacks Pitfalls
Scores: [ 6 8 5 8 ]
Node labels for graphs are usually generated using an automated process or crowd-sourced from human users. This opens up avenues for malicious users to compromise the training labels, making it unwise to blindly rely on them. While robustness against noisy labels is an active area of research, there are only a handful of papers in the literature that address this for graph-based data. Even more so, the effects of adversarial label perturbations is sparsely studied. More critically, we reveal that the entire literature on label poisoning for GNNs is plagued by serious evaluation pitfalls. Thus making it hard to conclude how robust GNNs are against label perturbations. After course correcting the state of label poisoning attacks with our faithful evaluation, we identify a discrepancy in attack efficiency of ∼9% on average. Additionally, we introduce two new simple yet effective attacks that are significantly stronger (up to ∼8%) than the previous strongest attack. Our strongest proposed attack can be efficiently computed and is theoretically backed.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Generalization Estimation Evolving Graph Graph Neural Network
Scores: [ 5 8 6 5 6 ]
Graph Neural Networks (GNNs) are widely deployed in vast fields, but they often struggle to maintain accurate representations as graphs evolve. We theoretically establish a lower bound, proving that under mild conditions, representation distortion inevitably occurs over time. To estimate the temporal distortion without human annotation after deployment, one naive approach is to pre-train a recurrent model (e.g., RNN) before deployment and use this model afterwards, but the estimation is far from satisfactory. In this paper, we analyze the representation distortion from an information theory perspective, and attribute it primarily to inaccurate feature extraction during evolution. Consequently, we introduce Smart, a straightforward and effective baseline enhanced by an adaptive feature extractor through self-supervised graph reconstruction. In synthetic random graphs, we further refine the former lower bound to show the inevitable distortion over time and empirically observe that Smart achieves good estimation performance. Moreover, we observe that Smart consistently shows outstanding generalization estimation on four real-world evolving graphs. The ablation studies underscore the necessity of graph reconstruction. For example, on OGB-arXiv dataset, the estimation metric MAPE deteriorates from 2.19% to 8.00% without reconstruction.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Simplex vertex hunting successive projection pseudo-points pruning hyper-spectral unmixing archetypal analysis network analysis.
Scores: [ 3 6 6 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: graph neural networks equivariance expressivity graph orbits
Scores: [ 6 8 6 8 ]
Equivariance is an important structural property that is captured by architectures such as graph neural networks (GNNs). However, equivariant graph functions cannot produce different outputs for similar nodes, which may be undesirable when the function is trying to optimize some global graph property. In this paper, we define orbit-equivariance, a relaxation of equivariance which allows for such functions whilst retaining important structural inductive biases. We situate the property in the hierarchy of graph functions, define a taxonomy of orbit-equivariant functions, and provide four different ways to achieve non-equivariant GNNs. For each, we analyze their expressivity with respect to orbit-equivariance and evaluate them on two novel datasets, one of which stems from a real-world use-case of designing optimal bioisosteres.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Discrete Curvature Rewiring Positional Encodings Structural Encodings
Scores: [ 3 6 8 6 ]
Structural and Positional Encodings can significantly improve the performance of Graph Neural Networks in downstream tasks. Recent literature has begun to systematically investigate differences in the structural properties that these approaches encode, as well as performance trade-offs between them. However, the question of which structural properties yield the most effective encoding remains open. In this paper, we investigate this question from a geometric perspective. We propose a novel structural encoding based on discrete Ricci curvature (Local Curvature Profiles, short LCP) and show that it significantly outperforms existing encoding approaches. We further show that combining local structural encodings, such as LCP, with global positional encodings improves downstream performance, suggesting that they capture complementary geometric information. Finally, we compare different encoding types with (curvature-based) rewiring techniques. Rewiring has recently received a surge of interest due to its ability to improve the performance of Graph Neural Networks by mitigating over-smoothing and over-squashing effects. Our results suggest that utilizing curvature information for structural encodings delivers significantly larger performance increases than rewiring.
Primary Area: learning on graphs and other geometries & topologies
Keywords: WL test graph neural networks graph motif parameters subgraph counting
Scores: [ 6 6 8 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Spectral Graph Convolutions Directed Graphs Weighted Graphs Complex Analysis Spectral Graph Theory Heterophily Node Classification Graph Regression pectral Graph Theory Rigorous Proofs
Scores: [ 6 8 6 6 ]
Within the graph learning community, conventional wisdom dictates that spectral convolutional networks may only be deployed on undirected graphs: Only there could the existence of a well-defined graph Fourier transform be guaranteed, so that information may be translated between spatial- and spectral domains. Here we show this traditional reliance on the graph Fourier transform to be superfluous: Making use of certain advanced tools from complex analysis and spectral theory, we extend spectral convolutions to directed graphs. We provide a frequency- response interpretation of newly developed filters, investigate the influence of the basis’ used to express filters and discuss the interplay with characteristic operators on which networks are based. In order to thoroughly test the developed general theory, we conduct experiments in real world settings, showcasing that directed spectral convolutional networks provide new state of the art results for heterophilic node classification and – as opposed to baselines – may be rendered stable to resolution-scale varying topological perturbations.
Primary Area: learning on graphs and other geometries & topologies
Keywords: geometric deep learning differential forms representation learning graph learning geometry topology
Scores: [ 8 6 5 6 5 ]
\emph{Geometric deep learning} extends deep learning to incorporate information about the geometry and topology data, especially in complex domains like graphs. Despite the popularity of message passing in this field, it has limitations such as the need for graph rewiring, ambiguity in interpreting data, and over-smoothing. In this paper, we take a different approach, focusing on leveraging geometric information from simplicial complexes embedded in Rn using node coordinates. We use differential k-forms in Rn to create representations of simplices, offering interpretability and geometric consistency without message passing. This approach also enables us to apply differential geometry tools and achieve universal approximation. Our method is efficient, versatile, and applicable to various input complexes, including graphs, simplicial complexes, and cell complexes. It outperforms existing message passing neural networks in harnessing information from geometrical graphs with node features serving as coordinates.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Adversarial Robustness
Scores: [ 5 8 6 8 ]
Graph Neural Networks (GNNs) have demonstrated state-of-the-art performance in various graph representation learning tasks. Recently, studies revealed their vulnerability to adversarial attacks. In this work, we theoretically define the concept of expected robustness in the context of attributed graphs and relate it to the classical definition of adversarial robustness in the graph representation learning literature. Our definition allows us to derive an upper bound of the expected robustness of Graph Convolutional Networks (GCNs) and Graph Isomorphism Networks subject to node feature attacks. Building on these findings, we connect the expected robustness of GNNs to the orthogonality of their weight matrices and consequently propose an attack-independent, more robust variant of the GCN, called the Graph Convolutional Orthogonal Robust Networks (GCORNs). We further introduce a probabilistic method to estimate the expected robustness, which allows us to evaluate the effectiveness of GCORN on several real-world datasets. Experimental experiments showed that GCORN outperforms available defense methods.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks; Equivariance; Molecular model
Scores: [ 6 8 8 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: graph neural networks forward learning forward-forward algorithm
Scores: [ 6 6 6 8 ]
Graph neural networks (GNNs) have achieved remarkable success across a wide range of applications, such as recommendation, drug discovery, and question answering. Behind the success of GNNs lies the backpropagation (BP) algorithm, which is the de facto standard for training deep neural networks. However, despite its effectiveness, BP imposes several constraints, which are not only biologically implausible, but also limit the scalability, parallelism, and flexibility in learning neural networks. Examples of such constraints include the storage of neural activities computed in the forward pass for use in the subsequent backward pass, and the dependence of parameter updates on non-local signals. To address these limitations, the forward-forward algorithm (FF) was recently proposed as an alternative to BP in the image classification domain, which trains neural networks by performing two forward passes over positive and negative data. Inspired by this advance, we propose ForwardGNN in this work, a new forward learning procedure for GNNs, which avoids the constraints imposed by BP via an effective layer-wise local forward training. ForwardGNN extends the original FF to deal with graph data and GNNs, and makes it possible to operate without generating negative inputs (hence no longer forward-forward). Further, ForwardGNN enables each layer to learn from both the bottom-up and top-down signals without relying on the backpropagation of errors. Extensive experiments involving five real-world datasets and three representative GNNs show the effectiveness and generality of the proposed forward graph learning framework.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Dynamic Graph Graph Explanation Graph Neural Network Causal Inference
Scores: [ 5 5 8 6 ]
Dynamic Graph Neural Networks (DyGNNs) have gained significant popularity in the research of dynamic graphs, but are limited by the low transparency, such that human-understandable insights can hardly be drawn from their predictions. Although a number of existing research have been devoted to investigating the interpretability of graph neural networks (GNNs), achieving the interpretability of DyGNNs is pivotally challenging due to the complex spatial-temporal correlations in dynamic graphs. To this end, we propose an innovative causality-inspired generative model based on structural causal model (SCM), which explores the underlying philosophies of DyGNN predictions by identifying the trivial, static, and dynamic causal relationships. To reach this goal, two critical tasks need to be accomplished including (1) disentangling the complex causal relationships, and (2) fitting the spatial-temporal explanations of DyGNNs in the SCM architecture. To tackle these challenges, the proposed method incorporates a contrastive learning module to disentangle trivial and causal relationships, and a dynamic correlating module to disentangle dynamic and static causal relationships, respectively. A dynamic VGAE-based framework is further developed, which generates causal-and-dynamic masks for spatial interpretability, and recognizes dynamic relationships along the time horizon through causal invention for temporal interpretability. Comprehensive experiments have been conducted on both synthetic and real-world datasets, where our approach yields substantial improvements, thereby demonstrating significant superiority.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Link Prediction;Graph Neural Network
Scores: [ 6 6 8 5 ]
Link prediction, a fundamental task on graphs, has proven indispensable in various applications, e.g., friend recommendation, protein analysis, and drug interaction prediction. However, since datasets span a multitude of domains, they could have distinct underlying mechanisms of link formation. Evidence in existing literature underscores the absence of a universally best algorithm suitable for all datasets. In this paper, we endeavor to explore principles of link prediction across diverse datasets from a data-centric perspective. We recognize three fundamental factors critical to link prediction: local structural proximity, global structural proximity, and feature proximity. We then unearth relationships among those factors where (i) global structural proximity only shows effectiveness when local structural proximity is deficient. (ii) The incompatibility can be found between feature and structural proximity. Such incompatibility leads to GNNs for Link Prediction (GNN4LP) consistently underperforming on edges where the feature proximity factor dominates. Inspired by these new insights from a data perspective, we offer practical instruction for GNN4LP model design and guidelines for selecting appropriate benchmark datasets for more comprehensive evaluations.
Primary Area: learning on graphs and other geometries & topologies
Keywords: local graph clustering graph diffusion attributed graphs noisy labels
Scores: [ 6 6 8 3 ]
The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been made to the development of fast local methods (i.e. without accessing the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.
Primary Area: learning on graphs and other geometries & topologies
Keywords: graph neural networks expressiveness approximate inference
Scores: [ 5 6 8 6 8 ]
Designing expressive Graph Neural Networks (GNNs) is an important topic in graph machine learning fields. Despite the existence of numerous approaches proposed to enhance GNNs based on Weisfeiler-Lehman (WL) tests, what GNNs \emph{can and cannot} learn still lacks a deeper understanding. This paper adopts a fundamentally different approach to examine the expressive power of GNNs from a probabilistic perspective. By establishing connections between GNNs' predictions and the central inference problems of probabilistic graphical models (PGMs), we can analyze previous GNN variants with a novel hierarchical framework and gain new insights into their node-level and link-level behaviors. Additionally, we introduce novel methods that can provably enhance GNNs' ability to capture complex dependencies and make complex predictions. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approaches.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Geometric Deep Learning Equivariant Neural Networks Characterization Theorem Point-wise Activations
Scores: [ 8 6 6 8 ]
Equivariant neural networks have shown improved performance, expressiveness and sample complexity on symmetrical domains. But for some specific symmetries, representations, and choice of coordinates, the most common point-wise activations, such as ReLU, are not equivariant, hence they cannot be employed in the design of equivariant neural networks. The theorem we present in this paper describes all possibile combinations of representations, choice of coordinates and point-wise activations to obtain an equivariant layer, generalizing and strengthening existing characterizations.Notable cases of practical relevance are discussed as corollaries. Indeed, we prove that rotation-equivariant networks can only be invariant, as it happens for any network which is equivariant with respect to connected compact groups. Then, we discuss implications of our findings when applied to important instances of equivariant networks. First, we completely characterize permutation equivariant networks such as Invariant Graph Networks with point-wise nonlinearities and their geometric counterparts, highlighting a plethora of models whose expressive power and performance are still unknown. Second, we show that feature spaces of disentangled steerable convolutional neural networks are trivial representations.
Primary Area: learning on graphs and other geometries & topologies
Keywords: graph distillation graph classification frequent pattern mining
Scores: [ 6 6 6 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: group synchronization angular synchronization neural networks directed graphs deep learning cycle consistency
Scores: [ 8 6 6 5 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph neural networks graph multiresolution analysis
Scores: [ 6 8 8 6 ]
Graph Neural Networks are popular tools in graph representation learning that capture the graph structural properties. However, most GNNs employ single-resolution graph feature extraction, thereby failing to capture micro-level local patterns (high resolution) and macro-level graph cluster and community patterns (low resolution) simultaneously. Many multiresolution methods have been developed to capture graph patterns at multiple scales, but most of them depend on predefined and handcrafted multiresolution transforms that remain fixed throughout the training process once formulated. Due to variations in graph instances and distributions, fixed handcrafted transforms can not effectively tailor multiresolution representations to each graph instance. To acquire multiresolution representation suited to different graph instances and distributions, we introduce the Multiresolution Meta-Framelet-based Graph Convolutional Network (MM-FGCN), facilitating comprehensive and adaptive multiresolution analysis across diverse graphs. Extensive experiments demonstrate that our MM-FGCN achieves SOTA performance on various graph learning tasks.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks Message Passing Neural Networks Over-squashing Graph Rewiring
Scores: [ 8 5 3 6 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph-level anomaly detection Spectral GNN Rayleigh Quotient
Scores: [ 6 6 6 ]
Graph-level anomaly detection has gained significant attention as it finds many applications in various domains, such as cancer diagnosis and enzyme prediction. However, existing methods fail to capture the underlying properties of graph anomalies, resulting in unexplainable framework design and unsatisfying performance. In this paper, we take a step back and re-investigate the spectral differences between anomalous and normal graphs. Our main observation shows a significant disparity in the accumulated spectral energy between these two classes. Moreover, we prove that the accumulated spectral energy of the graph signal can be represented by its Rayleigh Quotient, indicating that the Rayleigh Quotient is a driving factor behind the anomalous properties of graphs. Motivated by this, we propose Rayleigh Quotient Graph Neural Network (RQGNN), the first spectral GNN for graph-level anomaly detection, providing a new perspective on exploring the inherent spectral features of anomalous graphs. Specifically, we introduce a novel framework that consists of two components: the Rayleigh Quotient learning component (RQL) and Chebyshev Wavelet GNN with RQ-pooling (CWGNN-RQ). RQL explicitly captures the Rayleigh Quotient of graphs and CWGNN-RQ implicitly explores the spectral space of graphs. Extensive experiments on 10 real-world datasets show that RQGNN outperforms the best rival by 6.74% in Macro-F1 score and 1.44% in AUC, demonstrating the effectiveness of our framework.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Network Large Language Model In-context Learning
Scores: [ 10 6 6 6 ]
Designing a single model that addresses multiple tasks has been a long-standing objective in artificial intelligence. Recently, large language models have demonstrated exceptional capability in integrating and solving different tasks within the language domain. However, a unified model for various tasks on graphs remains underexplored, primarily due to the challenges unique to the graph learning domain. First, graph data from different areas carry distinct attributes and follow different distributions. Such discrepancy makes it difficult to represent graphs in a single representation space. Second, tasks on graphs diversify into node, link, and graph tasks, requiring distinct embedding strategies. Finally, an appropriate graph prompting paradigm for in-context learning is unclear. Striving to handle all the aforementioned challenges, we propose One for All (OFA), the first general framework that can use a single graph model to address the above challenges. Specifically, OFA proposes text-attributed graphs to unify different graph data by describing nodes and edges with natural language and uses language models to encode the diverse and possibly cross-domain text attributes to feature vectors in the same embedding space. Furthermore, OFA introduces the concept of nodes-of-interest to standardize different tasks with a single task representation. For in-context learning on graphs, OFA introduces a novel graph prompting paradigm that appends prompting substructures to the input graph, which enables it to address varied tasks without fine-tuning. We train the OFA model using graph data from multiple domains (including citation networks, molecular graphs, knowledge graphs, etc.) simultaneously and evaluate its ability in supervised, few-shot, and zero-shot learning scenarios. OFA performs well across different tasks, making it the first general-purpose graph classification model across domains.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graph Neural Networks KG reasoning Link prediction Rule structure learning Expressivity
Scores: [ 8 6 6 5 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Equivariant Graph Neural Network Graph Neural Network
Scores: [ 6 8 6 ]
Graph Neural Networks (GNNs) with equivariant properties have emerged as powerful tools for modeling complex dynamics of multi-object physical systems. However, their generalization ability is limited by the inadequate consideration of physical inductive biases: (1) Existing studies overlook the continuity of transitions among system states, opting to employ several discrete transformation layers to learn the direct mapping between two adjacent states; (2) Most models only account for first-order velocity information, despite the fact that many physical systems are governed by second-order motion laws. To incorporate these inductive biases, we propose the Second-order Equivariant Graph Neural Ordinary Differential Equation (SEGNO). Specifically, we show how the second-order continuity can be incorporated into GNNs while maintaining the equivariant property. Furthermore, we offer theoretical insights into SEGNO, highlighting that it can learn a unique trajectory between adjacent states, which is crucial for model generalization. Additionally, we prove that the discrepancy between this learned trajectory of SEGNO and the true trajectory is bounded. Extensive experiments on complex dynamical systems including molecular dynamics and motion capture demonstrate that our model yields a significant improvement over the state-of-the-art baselines.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Set Representation; Permutation Invariance; Permutation Equivariance
Scores: [ 8 8 8 5 ]
Primary Area: learning on graphs and other geometries & topologies
Keywords: Graphs random walkers quasi-Monte Carlo kernel PageRank graphlets scalable mixing
Scores: [ 6 6 6 ]
We present a novel quasi-Monte Carlo mechanism to improve graph-based sampling, coined repelling random walks. By inducing correlations between the trajectories of an interacting ensemble such that their marginal transition probabilities are unmodified, we are able to explore the graph more efficiently, improving the concentration of statistical estimators whilst leaving them unbiased. The mechanism has a trivial drop-in implementation. We showcase the effectiveness of repelling random walks in a range of settings including estimation of graph kernels, the PageRank vector and graphlet concentrations. We provide detailed experimental evaluation and robust theoretical guarantees. To our knowledge, repelling random walks constitute the first rigorously studied quasi-Monte Carlo scheme correlating the directions of walkers on a graph, inviting new research in this exciting nascent domain.
Primary Area: learning on graphs and other geometries & topologies
Keywords: Topological Deep Learning Geometric Deep Learning Latent Topology Inference Latent Graph Inference Cell Complexes
Scores: [ 8 8 8 ]
Latent Graph Inference (LGI) relaxed the reliance of Graph Neural Networks (GNNs) on a given graph topology by dynamically learning it. However, most of LGI methods assume to have a (noisy, incomplete, improvable, ...) input graph to rewire and can solely learn regular graph topologies. In the wake of the success of Topological Deep Learning (TDL), we study Latent Topology Inference (LTI) for learning higher-order cell complexes (with sparse and not regular topology) describing multi-way interactions between data points. To this aim, we introduce the Differentiable Cell Complex Module (DCM), a novel learnable function that computes cell probabilities in the complex to improve the downstream task. We show how to integrate DCM with cell complex message-passing networks layers and train it in an end-to-end fashion, thanks to a two-step inference procedure that avoids an exhaustive search across all possible cells in the input, thus maintaining scalability. Our model is tested on several homophilic and heterophilic graph datasets and it is shown to outperform other state-of-the-art techniques, offering significant improvements especially in cases where an input graph is not provided.
Primary Area: learning theory
Keywords: calibration temperature scaling mixup label noise
Scores: [ 6 6 8 ]
Despite the impressive generalization capabilities of deep neural networks, they have been repeatedly shown to be overconfident when they are wrong. Fixing this issue is known as model calibration, and has consequently received much attention in the form of modified training schemes and post-training calibration procedures such as temperature scaling. While temperature scaling is frequently used because of its simplicity, it is often outperformed by modified training schemes. In this work, we identify a specific bottleneck for the performance of temperature scaling. We show that for empirical risk minimizers for a general set of distributions in which the supports of classes have overlaps, the performance of temperature scaling degrades with the amount of overlap between classes, and asymptotically becomes no better than random when there are a large number of classes. On the other hand, we prove that optimizing a modified form of the empirical risk induced by the Mixup data augmentation technique can in fact lead to reasonably good calibration performance, showing that training-time calibration may be necessary in some situations. We also verify that our theoretical results reflect practice by showing that Mixup significantly outperforms empirical risk minimization (with respect to multiple calibration metrics) on image classification benchmarks with class overlaps introduced in the form of label noise.
Primary Area: learning theory
Keywords: testable learning pac learning agnostic learning Massart label noise adversarial label noise distribution testing
Scores: [ 6 5 8 8 ]
Primary Area: learning theory
Keywords: Fairness Multi-Agent Reinforcement Learning Markov Decision Process
Scores: [ 8 8 8 3 ]
Fairness plays a crucial role in various multi-agent systems (e.g., communication networks, financial markets, etc.). Many multi-agent dynamical interactions can be cast as Markov Decision Processes (MDPs). While existing research has focused on studying fairness in known environments, the exploration of fairness in such systems for unknown environments remains open. In this paper, we propose a Reinforcement Learning (RL) approach to achieve fairness in multi-agent finite-horizon episodic MDPs. Instead of maximizing the sum of individual agents' value functions, we introduce a fairness function that ensures equitable rewards across agents. Since the classical Bellman's equation does not hold when the sum of individual value functions is not maximized, we cannot use traditional approaches. Instead, in order to explore, we maintain a confidence bound of the unknown environment and then propose an online convex optimization based approach to obtain a policy constrained to this confidence region. We show that such an approach achieves sub-linear regret in terms of the number of episodes. Additionally, we provide a probably approximately correct (PAC) guarantee based on the obtained regret bound. We also propose an offline RL algorithm and bound the optimality gap with respect to the optimal fair solution. To mitigate computational complexity, we introduce a policy-gradient type method for the fair objective. Simulation experiments also demonstrate the efficacy of our approach.
Primary Area: learning theory
Keywords: Contextual bandit Federated learning Truthful mechanism design
Scores: [ 6 8 6 8 ]
To enhance the efficiency and practicality of federated bandit learning, recent advances have introduced incentives to motivate communication among clients, where a client participates only when the incentive offered by the server outweighs its participation cost.However, existing incentive mechanisms naively assume the clients are truthful: they all report their true coat and thus the higher cost one participating client claims, the more the server has to pay. Such mechanisms are therefore vulnerable to strategic clients aiming to optimize their own utility by misreporting. To address this issue, we propose an incentive compatible (i.e., truthful) communication protocol, named Truth-FedBan, where the incentive for each participant is independent of its self-reported cost, and reporting the true costs is the only way to achieve the best utility for each client. Truth-FedBan also provides near-optimal theoretical guarantees on regret and communication cost. Extensive numerical studies further validate the effectiveness of our proposed solution for incentivized federated bandits.
Primary Area: learning theory
Keywords: transformers in-context learning reinforcement learning learning theory
Scores: [ 6 8 6 5 ]
Large transformer models pretrained on offline reinforcement learning datasets have demonstrated remarkable in-context reinforcement learning (ICRL) capabilities, where they can make good decisions when prompted with interaction trajectories from unseen environments. However, when and how transformers can be trained to perform ICRL have not been theoretically well-understood. In particular, it is unclear which reinforcement-learning algorithms transformers can perform in context, and how distribution mismatch in offline training data affects the learned algorithms. This paper provides a theoretical framework that analyzes supervised pretraining for ICRL. This includes two recently proposed training methods --- algorithm distillation and decision-pretrained transformers. First, assuming model realizability, we prove the supervised-pretrained transformer will imitate the conditional expectation of the expert algorithm given the observed trajectory. The generalization error will scale with model capacity and a distribution divergence factor between the expert and offline algorithms. Second, we show transformers with ReLU attention can efficiently approximate near-optimal online reinforcement learning algorithms like LinUCB and Thompson sampling for stochastic linear bandits, and UCB-VI for tabular Markov decision processes. This provides the first quantitative analysis of the ICRL capabilities of transformers pretrained from offline trajectories.
Primary Area: learning theory
Keywords: Recurrent neural networks sequence modelling approximation theory
Scores: [ 8 8 8 5 ]
We prove an inverse approximation theorem for the approximation of nonlinear sequence-to-sequence relationships using recurrent neural networks (RNNs). This is a so-called Bernstein-type result in approximation theory, which deduces properties of a target function under the assumption that it can be effectively approximated by a hypothesis space. In particular, we show that nonlinear sequence relationships that can be stably approximated by nonlinear RNNs must have an exponential decaying memory structure - a notion that can be made precise. This extends the previously identified curse of memory in linear RNNs into the general nonlinear setting, and quantifies the essential limitations of the RNN architecture for learning sequential relationships with long-term memory. Based on the analysis, we propose a principled reparameterization method to overcome the limitations. Our theoretical results are confirmed by numerical experiments.
Primary Area: learning theory
Keywords: non-convex optimization random initialization global convergence matrix recovery matrix sensing
Scores: [ 8 8 8 8 ]
This paper rigorously shows how over-parameterization dramatically changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements.First, we consider the symmetric setting with the symmetric parameterization where M∗∈Rn×n is a positive semi-definite unknown matrix of rank r≪n, and one uses a symmetric parameterization XX⊤ to learn M∗. Here X∈Rn×k with k>r is the factor matrix. We give a novel Ω(1/T2) lower bound of randomly initialized GD for the over-parameterized case (k>r) where T is the number of iterations. This is in stark contrast to the exact-parameterization scenario (k=r) where the convergence rate is exp(−Ω(T)). Next, we study asymmetric setting where M∗∈Rn1×n2 is the unknown matrix of rank r≪min{n1,n2}, and one uses an asymmetric parameterization FG⊤ to learn M∗ where F∈Rn1×k and G∈Rn2×k. We give the first global exact convergence result of randomly initialized GD for the exact-parameterization case (k=r) with an exp(−Ω(T)) rate. Furthermore, we give the first global exact convergence result for the over-parameterization case (k>r) with an exp(−Ω(α2T)) rate where α is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from Ω(1/T2) to linear convergence. Therefore, we identify a surprising phenomenon: asymmetric parameterization can exponentially speed up convergence. Equally surprising is our analysis that highlights the importance of imbalance between F and G. This is in sharp contrast to prior works which emphasize balance. We further give an example showing the dependency on α in the convergence rate is unavoidable in the worst case. On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of α, recovering the rate in the exact-parameterization case. We provide empirical studies to verify our theoretical findings.
Primary Area: learning theory
Keywords: Knowledge distillation Teacher-student training Empirical risk minimization
Scores: [ 5 6 5 5 ]
Primary Area: learning theory
Keywords: deep learning theory; large learning rate; oscillation of stochastic gradient descent;
Scores: [ 3 6 8 ]
In this work, we theoretically investigate the generalization property of neural networks (NN) trained by stochastic gradient descent (SGD) with \emph{large learning rate}. Under such a training regime, our finding is that, the oscillation of the NN weights caused by SGD with large learning rates turns out to be beneficial to generalization, potentially improving over the same NN trained by SGD with small learning rates that converges more smoothly. In view of the findings, we call such a phenomenon “benign oscillation”. Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small ℓ2-norm and appear in each data point; (ii) strong features which have a large ℓ2-norm but appear only in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate only learn the strong features but make little progress in learning the weak features. Consequently, when it comes to the new testing data points that consist of only weak features, the NN trained by oscillating SGD with large learning rates can still make correct predictions, while the NN trained by SGD with small learning rates could not. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our findings on the phenomenon of “benign oscillation".
Primary Area: learning theory
Keywords: in-context learning linear regression ridge regression Bayes optimality
Scores: [ 6 8 8 5 ]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters. In this paper, we study ICL in one of its simplest setups: pretraining a single-layer linear attention model for linear regression with a Gaussian prior. We establish a statistical task complexity bound for the attention model pretraining, showing that effective pretraining only requires a small number of independent tasks. Furthermore, we prove that the pretrained model closely matches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, by achieving nearly Bayes optimal risk on unseen tasks under a fixed context length. These theoretical findings complement prior experimental research and shed light on the statistical foundations of ICL.
Primary Area: learning theory
Keywords: model-free RL self-play memory efficiency Q-learning Nash equilibrium Markov policy
Scores: [ 8 8 6 6 ]
Primary Area: learning theory
Keywords: Attention computation kronecker computation
Scores: [ 8 8 8 ]
In the classical transformer attention scheme, we are given three n×d size matrices Q,K,V (the query, key, and value tokens), and the goal is to compute a new n×d size matrix D−1exp(QK⊤)V where D=diag(exp(QK⊤)1n). Here, exp() is applied entry-wise and 1n denotes a length-n vector whose entries are all ones.Intuitively, attention computation captures pairwise information between words in a sentence, but not higher-order information. Indeed, recent work \cite{sht23} has shown that attention units cannot solve simple problems about detecting triples of connected words.In this work, we study a generalization of attention which captures triple-wise correlations. The generalization is based on computations involving tensors defined by tuples of words. More formally, given five n×d size matrices Q,K1,K2,V1 and V2 (generalized query, key, and value tokens), our new goal is to compute an n×d size matrix $D^{-1} \exp( Q ( K_1 \oslash K_2)^\top ) (V_1 \oslash V_2) $ where D=diag(exp(Q(K1⊘K2)⊤)1n2) and K1⊘K2∈Rn2×d denotes the column-wise Kronecker product of K1 and K2. This generalization is indeed able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers.The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in n. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations:∙ On the positive side, if all entries of the input matrices are bounded above by o(3√logn) then we show how to approximate the tensor-type'' attention matrix in n1+o(1) time.∙ On the negative side, we show that if the entries of the input matrices may be as large as Ω(3√logn), then there is no algorithm that runs faster than n3−o(1) (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory).We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries, and order of the tensor one may use for more expressive, efficient attention computation.Our constructions make use of a novel connection with a higher-order variant on the kernel density estimation problem. They combine a number of technical tools, including the polynomial method, algebraic geometry codes, and multiparty Merlin-Arthur communication protocols.
Primary Area: learning theory
Keywords: Learning to reject balanced error evaluation metrics selective classification plug-in approach long-tail learning class imbalance non-decomposable metrics
Scores: [ 8 8 8 8 ]
Primary Area: learning theory
Keywords: Invariance Learning Domain Adaptation
Scores: [ 5 8 6 ]
Invariance learning algorithms that conditionally filter out domain-specific random variables as distractors, do so based only on the data semantics, and not the target domain under evaluation. We show that a provably optimal and sample-efficient way of learning conditional invariances is by relaxing the invariance criterion to be non-commutatively directed towards the target domain. Under domain asymmetry, i.e., when the target domain contains semantically relevant information absent in the source, the risk of the encoder φ∗ that is optimal on average across domains is strictly lower-bounded by the risk of the target-specific optimal encoder Φ∗τ. We prove that non-commutativity steers the optimization towards Φ∗τ instead of φ∗, bringing the H-divergence between domains down to zero, leading to a stricter bound on the target risk. Both our theory and experiments demonstrate that non-commutative invariance (NCI) can leverage source domain samples to meet the sample complexity needs of learning Φ∗τ, surpassing SOTA invariance learning algorithms for domain adaptation, at times by over 2%, approaching the performance of an oracle. Implementation will be made public upon acceptance.
Primary Area: learning theory
Keywords: ResNet mean field generalization Rademacher complexity
Scores: [ 5 8 8 ]
Primary Area: learning theory
Keywords: Image to Image translation Noise robustness f-divergence Score matching GAN
Scores: [ 6 6 3 ]
Image-to-image (I2I) translation is vital in computer vision tasks like style transfer and domain adaptation. While recent advances in GAN have enabled high-quality sample generation, real-world challenges such as noise and distortion remain significant obstacles. Although Gaussian noise injection during training has been utilized, its theoretical underpinnings have been unclear. This work provides a robust theoretical framework elucidating the role of Gaussian noise injection in I2I translation models. We address critical questions on the influence of noise variance on distribution divergence, resilience to unseen noise types, and optimal noise intensity selection. Our contributions include connecting f-divergence and score matching, unveiling insights into the impact of Gaussian noise on aligning probability distributions, and demonstrating generalized robustness implications. We also explore choosing an optimal training noise level for consistent performance in noisy environments. Extensive experiments validate our theoretical findings, showing substantial improvements over various I2I baseline models in noisy settings. Our research rigorously grounds Gaussian noise injection for I2I translation, offering a sophisticated theoretical understanding beyond heuristic applications.
Primary Area: learning theory
Keywords: active learning strategic learning
Scores: [ 6 6 6 6 ]
Primary Area: learning theory
Keywords: unsupervised pretraining; representation learning; sample complexity
Scores: [ 6 8 8 6 ]
Unsupervised pretraining, which learns a useful representation using a large amount of unlabeled data to facilitate the learning of downstream tasks, is a critical component of modern large-scale machine learning systems. Despite its tremendous empirical success, the rigorous theoretical understanding of why unsupervised pretraining generally helps remains rather limited---most existing results are restricted to particular methods or approaches for unsupervised pretraining with specialized structural assumptions. This paper studies a generic framework,where the unsupervised representation learning task is specified by an abstract class of latent variable models Φ and the downstream task is specified by a class of prediction functions Ψ. We consider a natural approach of using Maximum Likelihood Estimation (MLE) for unsupervised pretraining and Empirical Risk Minimization (ERM) for learning downstream tasks. We prove that, under a mild informative'' condition, our algorithm achieves an excess risk of tildemathcalO(√C_Φ/m+√C_Ψ/n) for downstream tasks, where C_Φ,C_Ψ are complexity measures of function classes Φ,Ψ, and m,n are the number of unlabeled and labeled data respectively. Comparing to the baseline of ~O(√C_Φ∘Ψ/n) achieved by performing supervised learning using only the labeled data, our result rigorously shows the benefit of unsupervised pretraining when m≫n and C_Φ∘Ψ>C_Ψ. This paper further shows that our generic framework covers a wide range of approaches for unsupervised pretraining, including factor models, Gaussian mixture models, and contrastive learning.
Primary Area: learning theory
Keywords: Neural networks theory of deep learning convergence guarantees random graphs algebraic topology
Scores: [ 8 5 5 8 ]
Primary Area: learning theory
Keywords: diffusion models score-based generative models convergence bounds stochastic localization
Scores: [ 8 8 8 6 6 8 ]
Denoising diffusion models are a powerful method for generating approximate samples from high-dimensional data distributions. Recent results have provided polynomial bounds on their convergence rate, assuming L2-accurate scores. Until now, the tightest bounds were either superlinear in the data dimension or required strong smoothness assumptions. We provide the first convergence bounds which are linear in the data dimension (up to logarithmic factors) assuming only finite second moments of the data distribution. We show that diffusion models require at most ~O(dlog2(1/δ)ε2) steps to approximate an arbitrary distribution on Rd corrupted with Gaussian noise of variance δ to within ε2 in KL divergence. Our proof extends the Girsanov-based methods of previous works. We introduce a refined treatment of the error from discretizing the reverse SDE inspired by stochastic localization theory.
Primary Area: learning theory
Keywords: Threshold Latent Value Censored Feedback Query Complexity
Scores: [ 6 5 5 8 ]
In this paper, we investigate a problem of actively learning threshold in latent space, where the unknown reward g(γ,v) depends on the proposed threshold γ and latent value v and it can be only achieved if the threshold is lower than or equal to the unknown latent value. This problem has broad applications in practical scenarios, e.g., reserve price optimization in online auctions, online task assignments in crowdsourcing, setting recruiting bars in hiring, etc. We first characterize the query complexity of learning a threshold with the expected reward at most ϵ smaller than the optimum and prove that the number of queries needed can be infinitely large even when g(γ,v) is monotone with respect to both γ and v. On the positive side, we provide a tight query complexity ~Θ(1/ϵ3) when g is monotone and the CDF of value distribution is Lipschitz. Moreover, we show a tight ~Θ(1/ϵ3) query complexity can be achieved as long as g satisfies one-sided Lipschitzness, which provides a complete characterization for this problem. Finally, we extend this model to an online learning setting and demonstrate a tight Θ(T2/3) regret bound using continuous-arm bandit techniques and the aforementioned query complexity results.
Primary Area: learning theory
Keywords: inverse reinforcement learning reward learing misspecification
Scores: [ 6 6 6 8 ]
Inverse reinforcement learning (IRL) aims to infer an agent's preferences (represented as a reward function R) from their behaviour (represented as a policy π). To do this, we need a behavioural model of how π relates to R. In the current literature, the most common behavioural models are optimality, Boltzmann-rationality, and causal entropy maximisation. However, the true relationship between a human's preferences and their behaviour is much more complex than any of these behavioural models. This means that the behavioural models are misspecified, which raises the concern that they may lead to systematic errors if applied to real data. In this paper, we analyse how sensitive the IRL problem is to misspecification of the behavioural model. Specifically, we provide necessary and sufficient conditions that completely characterise how the observed data may differ from the assumed behavioural model without incurring an error above a given threshold. In addition to this, we also characterise the conditions under which a behavioural model is robust to small perturbations of the observed policy, and we analyse how robust many behavioural models are to misspecification of their parameter values (such as e.g. the discount rate). Our analysis suggests that the IRL problem is highly sensitive to misspecification, in the sense that very mild misspecification can lead to very large errors in the inferred reward function.
Primary Area: learning theory
Keywords: sampling score matching contrastive divergence langevin dynamics
Scores: [ 8 3 8 6 ]
There is a long history, as well as a recent explosion of interest, in statistical and generative modeling approaches based on \emph{score functions} --- derivatives of the log-likelihood of a distribution. In seminal works, Hyv"arinen proposed vanilla score matching as a way to learn distributions from data by computing an estimate of the score function of the underlying ground truth, and established connections between this method and established techniques like Contrastive Divergence and Pseudolikelihood estimation. It is by now well-known that vanilla score matching has significant difficulties learning multimodal distributions. Although there are various ways to overcome this difficulty, the following question has remained unanswered --- is there a natural way to sample multimodal distributions using just the vanilla score? Inspired by a long line of related experimental works, we prove that the Langevin diffusion with early stopping, initialized at the empirical distribution, and run on a score function estimated from data successfully generates natural multimodal distributions (mixtures of log-concave distributions).
Primary Area: learning theory
Keywords: multi-agent reinforcement learning theory general function approximation general-sum Markov Games
Scores: [ 6 6 6 6 6 ]
Primary Area: learning theory
Keywords: regularization regression optimization
Scores: [ 6 5 6 ]
Primary Area: learning theory
Keywords: statistical physics flow-based generative model stochastic interpolation gaussian mixture auto-encoder
Scores: [ 8 8 3 ]
Primary Area: learning theory
Keywords: Minimax rate outlier detection transfer learning Neyman-Pearson unbalanced classification
Scores: [ 5 6 8 ]
A critical barrier to learning an accurate decision rule for outlier detection is the scarcity of outlier data. As such, practitioners often turn to the use of similar but imperfect outlier data from which they might \emph{transfer} information to the target outlier detection task. Despite the recent empirical success of transfer learning in outlier detection, a fundamental understanding of when and how knowledge can be transferred from a source to a target in outlier detection remains elusive. In this work, we adopt the traditional framework of Neyman-Pearson classification---which formalizes \emph{supervised outlier detection}, i.e., unbalanced classification---with the added assumption that we have access to both source and (some or no) target outlier data. Our main results are then as follows:We first determine the information-theoretic limits of the problem under a measure of discrepancy that extends some existing notions from traditional balanced classification; interestingly, unlike in balanced classification, seemingly very dissimilar sources can provide much information about a target, thus resulting in fast transfer.We then show that, in principle, these information-theoretic limits are achievable by \emph{adaptive} procedures, i.e., procedures with no a priori information on the discrepancy between source and target distributions.
Primary Area: learning theory
Keywords: bandits mechanism design incentive-aware learning nash equilibrium
Scores: [ 6 8 6 8 ]
We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit. This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards. Like in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-rate of each arm is chosen strategically by the arm (e.g., a host on Airbnb) in order to maximize the number of times it gets clicked. The algorithm designer does not know the post-click rewards nor the arms' actions (i.e., strategically chosen click-rates) in advance, and must learn both values over time. To solve this problem, we design an incentive-aware learning algorithm, UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm behavior under uncertainty; (b) minimizing regret by learning unknown parameters. We characterize all approximate Nash equilibria among arms under UCB-S and show a ~O(√KT) regret bound uniformly in every equilibrium. We also show that incentive-unaware algorithms generally fail to achieve low regret in the strategic click-bandit. Finally, we support our theoretical results by simulations of strategic arm behavior which confirm the effectiveness and robustness of our proposed incentive design.
Primary Area: learning theory
Keywords: LoRA expressive power parameter-efficient fine-tuning adaptation neural networks transformer
Scores: [ 6 8 6 6 ]
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models.Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model f to accurately represent any smaller target model ¯f if LoRA-rank ≥(width of f)×depth of ¯fdepth of f, under a mild assumption. We also quantify the approximation error when the LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-(embedding size2) LoRA adapters.All our theoretical insights are validated by numerical experiments.
Primary Area: learning theory
Keywords: Equilibrium propagation local learning weight transport jacobian regularization
Scores: [ 5 6 8 6 ]
Primary Area: learning theory
Keywords: Mixture of Experts Maximum Likelihood Estimation
Scores: [ 6 6 6 ]
Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts k∗ is known, we demonstrate that the convergence rates of density and parameter estimations are both parametric on the sample size. However, when k∗ becomes unknown and the true model is over-specified by a Gaussian mixture of k experts where k>k∗, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slow due to an intrinsic interaction between the softmax gating and expert functions.
Primary Area: learning theory
Keywords: inductive bias margin maximization feature learning mechanistic interpretability
Scores: [ 6 8 8 6 ]
Understanding the internal representations learned by neural networks is a cornerstone challenge in the science of machine learning. While there have been significant recent strides in some cases towards understanding how neural networks implement specific target functions, this paper explores a complementary question -- why do networks arrive at particular computational strategies? Our inquiry focuses on the algebraic learning tasks of modular addition, sparse parities, and finite group operations. Our primary theoretical findings analytically characterize the features learned by stylized neural networks for these algebraic tasks. Notably, our main technique demonstrates how the principle of margin maximization alone can be used to fully specify the features learned by the network. Specifically, we prove that the trained networks utilize Fourier features to perform modular addition and employ features corresponding to irreducible group-theoretic representations to perform compositions in general groups, aligning closely with the empirical observations of Nanda et al. (2023) and Chughtai et al. (2023). More generally, we hope our techniques can help to foster a deeper understanding of why neural networks adopt specific computational strategies.
Primary Area: learning theory
Keywords: asymptotic freeness sketching ensembles ridge regression generalized cross-validation tuning
Scores: [ 6 8 8 8 ]
Primary Area: learning theory
Keywords: deep learning information bottleneck principle stochastic neural networks lossy compression
Scores: [ 6 5 6 6 ]
Primary Area: learning theory
Keywords: ReLU neural networks path-norm generalization contraction lemma peeling
Scores: [ 6 8 8 ]
Primary Area: learning theory
Keywords: deep learning theory representation learning neural collapse
Scores: [ 6 8 6 5 ]
Primary Area: learning theory
Keywords: Hyperparameter optimization Bayesian optimization Convergence rate Bilevel optimization Learning theory
Scores: [ 8 6 6 6 ]
Primary Area: learning theory
Keywords: Covariate shift; Maximum Likelihood Estimation; Out-of-Distribution generalization;
Scores: [ 6 8 8 3 ]
A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization --- generalizing to target data whose distribution differs from those of source data. Despite its significant importance, the fundamental question of what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift.This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. This result holds for a very large class of parametric models, including but not limited to linear regression, logistic regression, and phase retrieval, and does not require any boundedness condition on the density ratio. This paper further complement the study by proving that for the misspecified setting, MLE can perform poorly, and the Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in specific scenarios, outperforming MLE.
Primary Area: learning theory
Keywords: Combinatorial multi-armed bandit k-MAX bandit value-index feedback maximum reward function
Scores: [ 6 6 6 6 ]
Primary Area: learning theory
Keywords: single-index symmetric permutation invariant gradient flow learning
Scores: [ 6 8 5 6 ]
Few neural architectures lend themselves to provable learning with gradient based methods. One popular model is the single-index model, in which labels are produced by composing an unknown linear projection with a possibly unknown scalar link function. Learning this model with SGD is relatively well-understood, whereby the so-called information exponent of the link function governs a polynomial sample complexity rate. However, extending this analysis to deeper or more complicated architectures remains challenging.In this work, we consider single index learning in the setting of symmetric neural networks. Under analytic assumptions on the activation and maximum degree assumptions on the link function, we prove that gradient flow recovers the hidden planted direction, represented as a finitely supported vector in the feature space of power sum polynomials. We characterize a notion of information exponent adapted to our setting that controls the efficiency of learning.
Primary Area: learning theory
Keywords: universal approximation neural networks
Scores: [ 8 8 5 ]
It has been shown that deep neural networks of a large enough width are universal approximators but they are not if the width is too small.There were several attempts to characterize the minimum width wmin enabling the universal approximation property; however, only a few of them found the exact values.In this work, we show that the minimum width for universal approximation of Lp functions from [0,1]dx to Rdy is exactly maxdx,dy,2 if an activation function is ReLU-Like (e.g., ReLU, GELU, Softplus).Compared to the known result for ReLU networks, wmin=maxdx+1,dy when the domain is Rdx, our result first shows that approximation on a compact domain requires smaller width than on Rdx.We next prove a lower bound on wmin for uniform approximation using general activation functions including ReLU: wmin≥dy+1 if dx<dy≤2dx. Together with our first result, this shows a dichotomy between Lp and uniform approximations for general activation functions and input/output dimensions.
Primary Area: learning theory
Keywords: dynamics modeling training analysis low dimensional model for training
Scores: [ 10 5 3 8 ]
As neural networks grow in scale, their training becomes both computationally demanding and rich in dynamics. Amidst the flourishing interest in these training dynamics, we present a novel observation: Parameters during training exhibit intrinsic correlations over time. Capitalizing on this, we introduce \emph{correlation mode decomposition} (CMD). This algorithm clusters the parameter space into groups, termed modes, that display synchronized behavior across epochs. This enables CMD to efficiently represent the training dynamics of complex networks, like ResNets and Transformers, using only a few modes. Moreover, test set generalization is enhanced. We introduce an efficient CMD variant, designed to run concurrently with training. Our experiments indicate that CMD surpasses the state-of-the-art method for compactly modeled dynamics on image classification. Our modeling can improve training efficiency and lower communication overhead, as shown by our preliminary experiments in the context of federated learning.
Primary Area: learning theory
Keywords: knowledge transfer classification minimax optimality density estimation knowledge distillation
Scores: [ 6 6 8 6 ]
Primary Area: learning theory
Keywords: kernel ridge regression cost of overfitting benign overfitting tempered overfitting
Scores: [ 8 6 6 6 ]
We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an agnostic'' view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or the target is outside the RKHS. We analyze the cost of overfitting under a Gaussian universality ansatz using recently derived (non-rigorous) risk estimates in terms of the task eigenstructure. Our analysis provides a more refined characterization of benign, tempered and catastrophic overfitting (cf. Mallinar et al. 2022).
Primary Area: learning theory
Keywords: grokking implicit bias margin kernel training dynamics generalization
Scores: [ 6 6 6 ]
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped at a solution corresponding to a kernel predictor for a long time, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy. Even in the absence of weight decay, we show that grokking can still happen when the late phase implicit bias is driven by other regularization mechanisms, such as implicit margin maximization or sharpness reduction.
Primary Area: learning theory
Keywords: Adversarial Perturbations Adversarial Examples Adversarial Attacks Non-Robust Features
Scores: [ 6 5 6 5 ]
Primary Area: learning theory
Keywords: Wasserstein Autoencoders Statistical Analysis Error rates Intrinsic Dimension
Scores: [ 6 6 6 6 ]
Primary Area: learning theory
Keywords: scaling law associative memory mechanistic interpretability Hopfield network
Scores: [ 8 8 6 8 8 ]
Primary Area: learning theory
Keywords: Representation learning meta learning multi-task learning
Scores: [ 8 8 6 8 ]
Primary Area: learning theory
Keywords: reinforcement learning theory online RL offline RL hybrid RL density ratio marginalized importance weight weight function general function approximation
Scores: [ 8 6 6 ]
Primary Area: learning theory
Keywords: Linear MDP; horizon-free regret analysis
Scores: [ 6 5 6 8 ]
A recent line of works showed regret bounds in reinforcement learning (RL) can be (nearly) independent of planning horizon, a.k.a. the horizon-free bounds. However, these regret bounds only apply to settings where a polynomial dependency on the size of transition model is allowed, such as tabular Markov Decision Process (MDP) and linear mixture MDP. We give the first horizon-free bound for the popular linear MDP setting where the size of the transition model can be exponentially large or even uncountable. In contrast to prior works which explicitly estimate the transition model and compute the inhomogeneous value functions at different time steps, we directly estimate the value functions and confidence sets. We obtain the horizon-free bound by: (1) maintaining multiple weighted least square estimators for the value functions; and (2) a structural lemma which shows the maximal total variation of the inhomogeneous value functions is bounded by a polynomial factor of the feature dimension.
Primary Area: learning theory
Keywords: transformers language models reasoning theoretical analysis variable binding
Scores: [ 8 6 8 8 8 ]
Primary Area: learning theory
Keywords: ML-Augmented Algorithms Caching Metrical Task Systems
Scores: [ 8 8 8 8 ]
Primary Area: learning theory
Keywords: Neural Networks Small Initialization Gradient Flow Neuron Alignment
Scores: [ 5 8 5 8 ]
Primary Area: learning theory
Keywords: Hierarchical polynomials feature learning three-layer networks sample complexity gradient descent
Scores: [ 5 8 5 5 ]
We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form h=g∘p where p:Rd→R is a degree k polynomial and g:R→R is a degree q polynomial. This function class generalizes the single-index model, which corresponds to k=1, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree k polynomials p, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target h up to vanishing test error in ˜O(dk) samples and polynomial time. This is a strict improvement over kernel methods, which require ˜Θ(dkq) samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of p being a quadratic. When p is indeed a quadratic, we achieve the information-theoretically optimal sample complexity ˜O(d2), which is an improvement over prior work (Nichani et al., 2023) requiring a sample size of ˜Θ(d4). Our proof proceeds by showing that during the first stage of training the network performs feature learning to recover the feature p with ˜O(dk) samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result learn a broad class of hierarchical functions.
Primary Area: learning theory
Keywords: implicit bias SGD low-rank linear networks
Scores: [ 8 8 5 10 ]
The L2-regularized loss of Deep Linear Networks (DLNs) withmore than one hidden layers has multiple local minima, correspondingto matrices with different ranks. In tasks such as matrix completion,the goal is to converge to the local minimum with the smallest rankthat still fits the training data. While rank-underestimating minimacan be avoided since they do not fit the data, GD might getstuck at rank-overestimating minima. We show that with SGD, there is always a probability to jumpfrom a higher rank minimum to a lower rank one, but the probabilityof jumping back is zero. More precisely, we define a sequence of sets$B_{1}\subset B_{2}\subset\cdots\subset B_{R}$ so that $B_{r}$contains all minima of rank r or less (and not more) that are absorbingfor small enough ridge parameters λ and learning rates η:SGD has prob. 0 of leaving Br, and from any starting point thereis a non-zero prob. for SGD to go in Br.
Primary Area: learning theory
Keywords: neural network optimization feature learning mean-field Langevin dynamics
Scores: [ 6 6 8 ]
Primary Area: learning theory
Keywords: Learning Theory Unlabeled Data Kernel Methods Semi-supervised Learning Representation Learning Label Propagation
Scores: [ 8 8 8 8 8 ]
Primary Area: learning theory
Keywords: overparameterization interpolation random feature regression kernel regression generalization overfitting
Scores: [ 6 8 6 ]
In our era of enormous neural networks, empirical progress has been driven by the philosophy that *more is better.*Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) optimizing to near-interpolation improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained.Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization and overfitting in random feature models.
Primary Area: learning theory
Keywords: overparametrization generalization
Scores: [ 8 6 8 6 ]
Primary Area: learning theory
Keywords: learning theory sample complexity vc dimension contrastive learning metric learning
Scores: [ 8 6 8 8 ]
Contrastive learning is a highly successful technique for learning representations of data from labeled tuples, specifying the distance relations within the tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, ℓp-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning ℓp-distances for integer p. For any p≥1, we show that ~Θ(nd) labeled tuples are necessary and sufficient for learning d-dimensional representations of n-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief about a substantial gap between the statistical learning theory and the practice of deep learning.
Primary Area: learning theory
Keywords: Adversarial training
Scores: [ 3 6 6 8 ]
Primary Area: learning theory
Keywords: Learning Theory Expressivity Multi-Head Attention Transformers
Scores: [ 8 8 8 6 ]
Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with H heads, dimension d, and context size n<d, featuring Θ(Hd2) parameters, can memorize Ω(Hn) examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator’s saturation property. We validate our findings through experiments on synthetic data.
Primary Area: learning theory
Keywords: stochastic gradient descent Hessian multi-layer neural networks high-dimensional classification Gaussian mixture model XOR problem
Scores: [ 10 8 8 6 8 6 ]
Primary Area: learning theory
Keywords: regret analysis learning in non-stationary games bandit feedback
Scores: [ 6 6 6 6 ]
Primary Area: learning theory
Keywords: target shift regularization distributional shift integral equation importance weight
Scores: [ 8 8 6 6 ]
The presence of distribution shifts poses a significant challenge for deploying modern machine learning models in real-world applications. This work focuses on the target shift problem in a regression setting (Zhang et al., 2013; Nguyen et al., 2016). More specifically, the target variable y (also known as the response variable), which is continuous, has different marginal distributions in the training source and testing domain, while the conditional distribution of features x given y remains the same. While most literature focuses on classification tasks with finite target space, the regression problem has an infinite dimensional target space, which makes many of the existing methods inapplicable. In this work, we show that the continuous target shift problem can be addressed by estimating the importance weight function from an ill-posed integral equation. We propose a nonparametric regularized approach named ReTaSA to solve the ill-posed integral equation and provide theoretical justification for the estimated importance weight function. The effectiveness of the proposed method has been demonstrated with extensive numerical studies on synthetic and real-world datasets.
Primary Area: learning theory
Keywords: kernel classification minimax optimality neural network classifiers reproducing kernel Hilbert Space
Scores: [ 5 6 5 6 ]
Primary Area: learning theory
Keywords: two-layer linear neural network feature learning optimal criterion multi-output linear reggression high-dimensional setting
Scores: [ 5 8 6 6 ]
Primary Area: learning theory
Keywords: generalization bound full-rank weight matrix Koopman operator
Scores: [ 6 8 5 ]
We propose a new bound for generalization of neural networks using Koopman operators. Whereas most of existing works focus on low-rank weight matrices, we focus on full-rank weight matrices. Our bound is tighter than existing norm-based bounds when the condition numbers of weight matrices are small. Especially, it is completely independent of the width of the network if the weight matrices are orthogonal. Our bound does not contradict to the existing bounds but is a complement to the existing bounds. As supported by several existing empirical results, low-rankness is not the only reason for generalization. Furthermore, our bound can be combined with the existing bounds to obtain a tighter bound. Our result sheds new light on understanding generalization of neural networks with full-rank weight matrices, and it provides a connection between operator-theoretic analysis and generalization of neural networks.
Primary Area: learning theory
Keywords: Oracle efficiency online learning groupwise regret sleeping experts
Scores: [ 6 8 8 3 ]
Primary Area: learning theory
Keywords: optimization stochastic gradient descent two-layer neural network sample complexity
Scores: [ 8 8 8 6 ]
In this work, we consider the optimization process of minibatch stochastic gradient descent (SGD) on a 2-layer neural network with data separated by a quadratic ground truth function. We prove that with data drawn from the Boolean hypercube labeled by the quadratic XOR'' function y=−xixj , it is possible to train to a population error o(1) with Θ(dpolylog(d)) samples. Our result considers simultaneously training both layers of the two-layer-neural network with ReLU activations via standard minibatch SGD on the logistic loss. To our knowledge, this work is the first to give a sample complexity of for efficiently learning the XOR function on isotropic data on a standard neural network with standard training. Our main technique is showing that the network evolves in two phases: a \em signal-finding \em phase where the network is small and many of the neurons evolve independently to find features, and a \em signal-heavy \em phase, where SGD maintains and balances the features. We leverage the simultaneous training of the layers to show that it is sufficient for only a small fraction of the neurons to learn features, since those neurons will be amplified by the simultaneous growth of their second layer weights.
Primary Area: learning theory
Keywords: information theory generalization bound learning theory minimum error entropy
Scores: [ 8 6 6 6 ]
Primary Area: learning theory
Keywords: Interpolation Learning Benign Overfitting ReLU Networks
Scores: [ 8 8 8 ]
Understanding how overparameterized neural networks generalize despite perfect interpolation of noisy training data is a fundamental question. Mallinar et. al. (2022) noted that neural networks seem to often exhibit tempered overfitting'', wherein the population risk does not converge to the Bayes optimal error, but neither does it approach infinity, yielding non-trivial generalization. However, this has not been studied rigorously. We provide the first rigorous analysis of the overfiting behaviour of regression with minimum norm (ℓ2 of weights), focusing on univariate two-layer ReLU networks. We show overfitting is tempered (with high probability) when measured with respect to the L1 loss, but also show that the situation is more complex than suggested by Mallinar et. al., and overfitting is catastrophic with respect to the L2 loss, or when taking an expectation over the training set.
Primary Area: learning theory
Keywords: Grokking Random Matrix Theory Linear Regression Representation Learning
Scores: [ 3 8 5 6 ]
Primary Area: learning theory
Keywords: neural network nonlinear operator learning convolutional network estimation error analysis
Scores: [ 8 6 8 ]
Recent deep learning applications, exemplified by text-to-image tasks, often involve high-dimensional inputs and outputs. While several studies have investigated the function estimation capabilities of deep learning, research on dilated convolutional neural networks (CNNs) has mainly focused on cases where input dimensions are infinite but output dimensions are one-dimensional, similar to many other studies. However, many practical deep learning tasks involve high-dimensional (or even infinite dimensional) inputs and outputs.In this paper, we investigate the optimality of dilated CNNs for estimating a map between infinite-dimensional input and output spaces by analyzing their approximation and estimation abilities. For that purpose, we first show that approximation and estimation errors depend only on the smoothness and decay rate with respect to the infinity norm of the output, and their estimation accuracy actually achieve the {\it minimax optimal} rate of convergence.Second, we demonstrate that the dilated CNNs outperform {\it any} linear estimators including kernel ridge regression and k-NN estimators in a minimax error sense, highlighting the usefulness of feature learning realized by deep neural networks.Our theoretical analysis particularly explains the success of deep learning in recent high-dimensional input-output tasks.
Primary Area: learning theory
Keywords: aggregate learning asymptotic analysis bias variance privacy regularization
Scores: [ 6 6 6 6 6 ]
Primary Area: learning theory
Keywords: Prompt;Pre-Trained;Few-shot;Debiased;Domain Abstraction
Scores: [ 8 6 6 5 ]
As a novel and effective fine-tuning paradigm based on large-scale pre-trained language models (PLMs), prompt-tuning aims to reduce the gap between downstream tasks and pre-training objectives. While prompt-tuning has yielded continuous advancements in various tasks, such an approach still remains a persistent defect: prompt-tuning methods fail to generalize to specific few-shot patterns. From the perspective of distribution analyses, we disclose that the intrinsic issues behind the phenomenon are the over-multitudinous conceptual knowledge contained in PLMs and the abridged knowledge for target downstream domains, which jointly result in that PLMs mis-locate the knowledge distributions corresponding to the target domains in the universal knowledge embedding space. To this end, we intuitively explore to approximate the unabridged target domains of downstream tasks in a debiased manner, and then abstract such domains to generate discriminative prompts, thereby providing the de-ambiguous guidance for PLMs. Guided by such an intuition, we propose a simple yet effective approach, namely BayesPrompt, to learn prompts that contain the domain discriminative information against the interference from domain-irrelevant knowledge. BayesPrompt primitively leverages known distributions to approximate the debiased factual distributions of target domains and further uniformly samples certain representative features from the approximated distributions to generate the ultimate prompts for PLMs. We provide solid theoretical analyses on classification error bounds to prove that BayesPrompt tightens the upper bound of the classification error on the downstream inference of PLMs. Empirically, our method achieves state-of-the-art performance on benchmarks.
Primary Area: learning theory
Keywords: Selective classification Learning to reject Abstention OOD detection SCOD Loss functions Plug-in estimators Statistical consistency
Scores: [ 6 8 8 ]
Primary Area: learning theory
Keywords: neural network double descent classification interpretability
Scores: [ 5 5 3 5 ]
Double descent presents a counter-intuitive aspect within the machine learning domain, and researchers have observed its manifestation in various models and tasks. While some theoretical explanations have been proposed for this phenomenon in specific contexts, an accepted theory to account for its occurrence in deep learning remains yet to be established. In this study, we revisit the phenomenon of double descent and demonstrate that its occurrence is strongly influenced by the presence of noisy data. Through conducting a comprehensive analysis of the feature space of learned representations, we unveil that double descent arises in imperfect models trained with noisy data. We argue that double descent is a consequence of the model first learning the noisy data until interpolation and then adding implicit regularization via over-parameterization acquiring therefore capability to separate the information from the noise. We postulate that double descent should never occur in well-regularized models.
Primary Area: learning theory
Keywords: benign overfitting grokking neural networks feature learning interpolation theory
Scores: [ 5 6 6 ]
Primary Area: learning theory
Keywords: multimodal learning multimodal interactions information theory self-supervised learning multimodal fusion
Scores: [ 8 6 6 ]
Primary Area: learning theory
Keywords: Stochastic modified equations dropout noise structure flatness
Scores: [ 6 6 8 6 ]
Dropout is a widely utilized regularization technique in the training of neural networks, nevertheless, its underlying mechanism and impact on achieving good generalization abilities remain to be understood. In this work, we start by undertaking a rigorous theoretical derivation of the stochastic modified equations, with the primary aim of providing an effective approximation for the discrete iterative process of dropout. Subsequently, we empirically delve into the intricate mechanisms by which dropout facilitates the identification of flatter minima. This exploration is conducted through intuitive approximations, exploiting the structural analogies inherent in the Hessian of loss landscape and the covariance of dropout. Our empirical findings substantiate the ubiquitous presence of the Hessian-variance alignment relation throughout the training process of dropout. These dual facets of our research, encompassing both theoretical derivations and empirical observations, collectively constitute a substantial contribution towards a deeper understanding of the inherent tendency of dropout to locate flatter minima.
Primary Area: learning theory
Keywords: Out-of-domain data Semi-supervised learing learning theory generalization bound adversarial robustness
Scores: [ 6 8 8 6 ]
We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, where scenarios involving the minimization of either i) adversarially robust or ii) non-robust loss functions have been considered. Notably, we allow the unlabeled samples to deviate slightly (in total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework on the classification problem of a mixture of two Gaussians in Rd, where in addition to the m independent and labeled samples from the true distribution, a set of n (usually with n≫m) out of domain and unlabeled samples are gievn as well. Using only the labeled data, it is known that the generalization error can be bounded by ∝(d/m)1/2. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds which show substantial improvement on the generalization error compared ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.
Primary Area: learning theory
Keywords: Deep Learning Theory Sample Complexity Convolutional Neural Networks
Scores: [ 8 8 6 8 ]
Primary Area: learning theory
Keywords: Stability Analysis; Zeroth-Order Optimization; Black-Box Learning
Scores: [ 8 8 6 6 ]
Zeroth-order optimization algorithms are widely used for black-box optimization problems, such as those in machine learning and prompt engineering, where the gradients are approximated using function evaluations. Recently, a generalization result was provided for zeroth-order stochastic gradient descent (SGD) algorithms through stability analysis. However, this result was limited to the vanilla 2-point zeroth-order estimate of Gaussian distribution used in SGD algorithms. To address these limitations, we propose a general proof framework for stability analysis that applies to convex, strongly convex, and non-convex conditions, and yields results for popular zeroth-order optimization algorithms, including SGD, GD, and SVRG, as well as various zeroth-order estimates, such as 1-point and 2-point with different distributions and coordinate estimates. Our general analysis shows that coordinate estimation can lead to tighter generalization bounds for SGD, GD, and SVRG versions of zeroth-order optimization algorithms, due to the smaller expansion brought by coordinate estimates to stability analysis.
Primary Area: learning theory
Keywords: Accelerated stochastic gradient descent excess risk linear regression overparameterization
Scores: [ 6 6 6 ]
Primary Area: learning theory
Keywords: Graph Neural Networks Message Passing Graph Transformers Virtual Nodes Expressivity Uniform Expressivity
Scores: [ 6 8 6 6 ]
Graph Transformers (GTs) such as SAN and GPS have been shown to be universal function approximators. We show that when extending MPGNNs and even 2-layer MLPs with the same positional encodings that GTs use, they also become universal function approximators on graphs. All these results hold in the non-uniform case where a different network may be used for every graph size. In order to show meaningful differences between GTs and MPGNNs we then consider the uniform setting where a single network needs to work for all graph sizes. First, we show that none of the above models is universal in that setting. Then, our main technical result is that there are functions that GTs can express while MPGNNs with virtual nodes cannot and vice versa, making their uniform expressivity provably different. We show this difference empirically on synthetic data and observe that on real-world data global information exchange through graph transformers and conceptually simpler MPGNNs with virtual nodes achieve similar performance gains over message passing on various datasets.
Primary Area: learning theory
Keywords: deep learning theory residual networks neural ODEs optimization implicit regularization gradient flow
Scores: [ 8 6 8 6 ]
Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Łojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.
Primary Area: learning theory
Keywords: adaptive regret multi arm bandit
Scores: [ 8 5 8 8 ]
Primary Area: learning theory
Keywords: Equivariance statistical query lower bound computational hardness invariance symmetry neural networks
Scores: [ 8 6 8 ]
We study the problem of learning equivariant neural networks via gradient descent. The incorporation of known symmetries ("equivariance") into neural nets has empirically improved the performance of learning pipelines, in domains ranging from biology to computer vision. However, a rich yet separate line of learning theoretic research has demonstrated that actually learning shallow, fully-connected (i.e. non-symmetric) networks has exponential complexity in the correlational statistical query (CSQ) model, a framework encompassing gradient descent. In this work, we ask: are known problem symmetries sufficient to alleviate the fundamental hardness of learning neural nets with gradient descent? We answer this question in the negative. In particular, we give lower bounds for shallow graph neural networks, convolutional networks, invariant polynomials, and frame-averaged networks for permutation subgroups, which all scale either superpolynomially or exponentially in the relevant input dimension. Therefore, in spite of the significant inductive bias imparted via symmetry, actually learning the complete classes of functions represented by equivariant neural networks via gradient descent remains hard.
Primary Area: learning theory
Keywords: learning in games rank games Nash equilibria Minty optimistic gradient
Scores: [ 5 6 8 5 ]
Learning Nash equilibria (NE) in games has garnered significant attention, particularly in the context of training Generative Adversarial Networks (GANs) and multi-agent Reinforcement Learning. The current state-of-the-art in efficiently learning games focuses on landscapes that meet the (weak) Minty property or games characterized by a unique function, often referred to as potential games. A significant challenge in this domain is that computing Nash equilibria is a computationally intractable task [Daskalakis et al. 2009]. In this paper we focus on bimatrix games (A,B) called rank-1. These are games in which the sum of the payoff matrices A+B is a rank 1 matrix; note that standard zero-sum games are rank 0. We show that optimistic gradient descent/ascent converges to an \epsilon-approximate NE after 1/\epsilon^2 log(1/\epsilon) iterates in rank-1 games. We achieve this by leveraging structural results about the NE landscape of rank-1 games Adsul et al. 2021. Notably, our approach bypasses the fact that these games do not satisfy the MVI property.
Primary Area: learning theory
Keywords: Linear Self-Attention In-context learning Gradient Descent Theoretical Understanding
Scores: [ 5 8 5 6 ]
Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity (Akyurek et al., 2023), while one-layer transformers with linear self-attention and no MLP layer will learn to implement one step of gradient descent (GD) on a least-squares linear regression objective (von Oswald et al., 2022). However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pre-training loss will implement a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pre-training loss now implements a single step of pre-conditioned GD. However, if only the distribution of the responses is changed, then this does not have a large effect on the learned algorithm: even when the response comes from a more general family of nonlinear functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.
Primary Area: learning theory
Keywords: generalisation data selection neural tangent kernel sample interaction learning dynamics
Scores: [ 8 6 6 ]
Primary Area: learning theory
Keywords: Offline Reinforcement Learning Confounded POMDP Policy Gradient Statistical Guarantee Function Approximation
Scores: [ 8 8 8 8 ]
In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentratability coefficient and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in the gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work studying the policy gradient method for POMDPs under the offline setting.
Primary Area: learning theory
Keywords: Learned Indexing Learned Cardinality Estimation Machine learning for Data Management
Scores: [ 3 8 5 8 ]
Primary Area: learning theory
Keywords: scientific computing data-driven algorithm design online learning multi-armed bandits contextual bandits numerical analysis learning-augmented algorithms algorithms with predictions
Scores: [ 8 8 8 8 ]
Solving a linear system Ax=b is a fundamental scientific computing primitive, and numerous solvers and preconditioners have been developed. These come with parameters whose optimal values depend on the system being solved but are often impossible or too expensive to identify; thus in practice sub-optimal heuristics are used instead. We consider the common setting in which many related linear systems are solved, e.g. during a single numerical simulation. In this scenario, can we sequentially choose parameters that attain a near-optimal overall number of iterations, without extra matrix computations? We answer in the affirmative for Successive Over-Relaxation~(SOR), a standard solver whose parameter ω has a strong impact on its runtime. For this method, we prove that a bandit algorithm—using only the number of iterations as feedback—can select parameters for a sequence of instances such that the overall cost is almost as good as that the best fixed ω would have obtained. Furthermore, when given additional structural information, we show that a {\em contextual} bandit method approaches the performance of the {\em instance-optimal} policy, which selects the best ω for each instance. Our work provides the first learning-theoretic treatment of high-precision linear system solvers and the first end-to-end guarantees for data-driven scientific computing, demonstrating theoretically the potential to speed up numerical methods using well-understood learning algorithms.
Primary Area: learning theory
Keywords: Lipschitz continuity Double Descent Label Noise Generalization
Scores: [ 3 6 6 8 ]
Lipschitz continuity is a crucial functional property of any predictive model, that naturally governs its robustness, generalisation, as well as adversarial vulnerability. Contrary to other works that focus on obtaining tighter bounds and developing different practical strategies to enforce certain Lipschitz properties, we aim to thoroughly examine and characterise the Lipschitz behaviour of Neural Networks. Thus, we carry out an empirical investigation in a range of different settings (namely, architectures, datasets, label noise, and more) by exhausting the limits of the simplest and the most general lower and upper bounds. As a highlight of this investigation, we showcase a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz and explain the intriguing effects of label noise on function smoothness and generalisation.
Primary Area: learning theory
Keywords: diffusion models score-based generative modeling non-asymptotic theory reverse SDE probability flow ODE denoising diffusion probabilistic model
Scores: [ 8 8 8 6 ]
Diffusion models, which convert noise into new data instances by learning to reverse a Markov diffusion process, have become a cornerstone in contemporary generative modeling. While their practical power has now been widely recognized, the theoretical underpinnings remain far from mature. In this work, we develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models in discrete time, assuming access to ℓ2-accurate estimates of the (Stein) score functions. For a popular deterministic sampler (based on the probability flow ODE), we establish a convergence rate proportional to 1/T (with T the total number of steps), improving upon past results; for another mainstream stochastic sampler (i.e., a type of the denoising diffusion probabilistic model), we derive a convergence rate proportional to 1/√T, matching the state-of-the-art theory. Imposing only minimal assumptions on the target data distribution (e.g., no smoothness assumption is imposed), our results characterize how ℓ2 score estimation errors affect the quality of the data generation process. In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach without resorting to toolboxes for SDEs and ODEs.
Primary Area: learning theory
Keywords: Memorization expressive power of network optimal robust memorization computation complexity Lipschitz constant
Scores: [ 5 5 8 ]
Primary Area: learning theory
Keywords: gradient descent kernel ridge regression optimal algorithm generalization asymptotic error rates power-laws
Scores: [ 8 8 8 ]
The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of kernels with a family of \emph{spectral algorithms} specified by profile h(λ), and including KRR and GD as special cases. Then, we derive the generalization error as a functional of learning profile h(λ) for two data models: high-dimensional Gaussian and low-dimensional translation-invariant model. Under power-law assumptions on the spectrum of the kernel and target, we use our framework to (i) give full loss asymptotics for both noisy and noiseless observations (ii) show that the loss localizes on certain spectral scales, giving a new perspective on the KRR saturation phenomenon (iii) conjecture, and demonstrate for the considered data models, the universality of the loss w.r.t. non-spectral details of the problem, but only in case of noisy observation.
Primary Area: learning theory
Keywords: model-based and model-free RL representation complexity circuit complexity approximation error
Scores: [ 5 6 8 ]
We study the representation complexity of model-based and model-free reinforcement learning (RL) in the context of circuit complexity. We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as Q-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.
Primary Area: learning theory
Keywords: Chain of thought language modeling circuit complexity deep learning theory
Scores: [ 8 6 8 3 8 5 ]
Generating a sequence of intermediate steps, \emph{a.k.a.}, a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length n, previous works have constant-depth transformers with finite precision poly(n) embedding size can only solve problems in TC0 without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in AC0, a proper subset of $ \mathsf{TC}^0$. However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(logn) embedding size can solve any problem solvable by boolean circuits of size T. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
Primary Area: learning theory
Keywords: q-replicator dynamics potential games average price of anarchy learning
Scores: [ 8 6 ]
Primary Area: metric learning, kernel learning, and sparse coding
Keywords: Optimal transport sparsity sparsistency metric learning
Scores: [ 5 8 8 6 ]
Primary Area: metric learning, kernel learning, and sparse coding
Keywords: Deep metric learning Open-world visual recognition Threshold consistency
Scores: [ 5 8 6 ]
Primary Area: metric learning, kernel learning, and sparse coding
Keywords: Metric Learning Triplet Loss Hard Negative Mining
Scores: [ 8 3 5 6 ]
Primary Area: metric learning, kernel learning, and sparse coding
Keywords: Latent Graph Inference Representation Learning Metric Embedding Geometric Deep Learning Graph Neural Networks
Scores: [ 6 8 5 8 ]
The inductive bias of a graph neural network (GNN) is largely encoded in its specified graph. Latent graph inference relies on latent geometric representations to dynamically rewire or infer a GNN's graph to maximize the GNN's predictive downstream performance, but it lacks solid theoretical foundations in terms of embedding-based representation guarantees. This paper addresses this issue by introducing a trainable deep learning architecture, coined \textit{neural snowflake}, that can adaptively implement fractal-like metrics on Rd. We prove that any given finite weights graph can be isometrically embedded by a standard MLP encoder. Furthermore, when the latent graph can be represented in the feature space of a sufficiently regular kernel, we show that the combined neural snowflake and MLP encoder do not succumb to the curse of dimensionality by using only a low-degree polynomial number of parameters in the number of nodes. This implementation enables a low-dimensional isometric embedding of the latent graph. We conduct synthetic experiments to demonstrate the superior metric learning capabilities of neural snowflakes when compared to more familiar spaces like Euclidean space. Additionally, we carry out latent graph inference experiments on graph benchmarks. Consistently, the neural snowflake model achieves predictive performance that either matches or surpasses that of the state-of-the-art latent graph inference models. Importantly, this performance improvement is achieved without requiring random search for optimal latent geometry. Instead, the neural snowflake model achieves this enhancement in a differentiable manner.
Primary Area: metric learning, kernel learning, and sparse coding
Keywords: graph kernel graph metric learning maximum mean discrepancy
Scores: [ 8 8 6 8 ]
Primary Area: metric learning, kernel learning, and sparse coding
Keywords: scalable kernel methods random features deep neural networks
Scores: [ 5 5 8 8 ]
We introduce the concept of scalable neural network kernels (SNNKs), the replacements of regular feedforward layers (FFLs), capable of approximating the latter, but with favorable computational properties. SNNKs effectively disentangle the inputs from the parameters of the neural network in the FFL, only to connect them in the final computation via the dot-product kernel. They are also strictly more expressive, as allowing to model complicated relationships beyond the functions of the dot-products of parameter-input vectors. We also introduce the neural network bundling process that applies SNNKs to compactify deep neural network architectures, resulting in additional compression gains. In its extreme version, it leads to the fully bundled network whose optimal parameters can be expressed via explicit formulae for several loss functions (e.g. mean squared error), opening a possibility to bypass backpropagation. As a by-product of our analysis, we introduce the mechanism of the universal random features (or URFs), applied to instantiate several SNNK variants, and interesting on its own in the context of scalable kernel methods. We provide rigorous theoretical analysis of all these concepts as well as an extensive empirical evaluation, ranging from point-wise kernel estimation to Transformers' fine-tuning with novel adapter layers inspired by SNNKs. Our mechanism provides up to 5x reduction in the number of trainable parameters, while maintaining competitive accuracy.
Primary Area: metric learning, kernel learning, and sparse coding
Keywords: cross-encoder kNN retrieval nearest-neighbor search
Scores: [ 5 6 8 6 ]
Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than using dot-product with embedding-based models (dual-encoders) at estimating query-item relevance. Existing approaches perform k-NN search with cross-encoders by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall as DE generalize poorly to new domains and the test-time retrieval with DE is decoupled from the CE. While CUR-based approaches can be more accurate than the DE-based retrieve-and-rerank approach, such approaches require prohibitively large number of CE calls to compute item embeddings, thus making it impractical for deployment as scale. In this paper, we address these shortcomings with our proposed sparse-matrix factorization based method that efficiently computes latent query and item representations to approximate CE scores and performs k-NN search with the approximate CE similarity. In an offline indexing stage, we compute item embeddings by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high quality approximation while requiring only a fraction of CE similarity calls as compared to CUR-based methods, and allows for leveraging DE models to initialize the embedding space while avoiding compute- and resource-intensive finetuning of DE via distillation. At test-time, we keep item embeddings fixed and perform retrieval over multiple rounds, alternating between a) estimating the test-query embedding by minimizing error in approximating CE scores of items retrieved thus far, and b) using the updated test-query embedding for retrieving more items in the next round. Our proposed k-NN search method can achieve up to 5% and 54% improvement in k-NN recall for k=1 and 100 respectively over the widely-used DE-based retrieve-and-rerank approach. Furthermore, our proposed approach to index the items by aligning item embeddings with the CE achieves up to 100$\times$ and 5$\times$ speedup over CUR-based and dual-encoder distillation based approaches respectively while matching or improving k-NN search recall over baselines.
Primary Area: metric learning, kernel learning, and sparse coding
Keywords: Perceptual similarity metric certified defense deep learning
Scores: [ 6 5 5 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: program synthesis pragmatics self-play
Scores: [ 6 8 6 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Image translation Diffusion model
Scores: [ 6 6 6 6 ]
We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, which span RoI identification, style transfer, and position manipulation, facilitating transparent and controllable image translation processes. Extensive experiments demonstrate DVP’s remarkable performance, surpassing concurrent arts. This success can be attributed to several key features of DVP: First, DVP achieves condition-flexible translation via instance normalization, enabling the model to eliminate sensitivity caused by the manual guidance and optimally focus on textual descriptions for high-quality content generation. Second, the frame work enhances in-context reasoning by deciphering intricate high-dimensional concepts in feature spaces into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]), allowing for localized, context-free editing while maintaining overall coherence. Last but not least, DVP improves systemic controllability and explainability by offering explicit symbolic representations at each programming stage, empowering users to intuitively interpret and modify results. Our research marks a substantial step towards harmonizing artificial image translation processes with cognitive intelligence, promising broader applications.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Synthetic Data Generation
Scores: [ 6 5 8 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Propositional satisfiability Graph Neural Networks CDCL SAT Solving Backbone Phase Prediction
Scores: [ 5 8 5 6 ]
Propositional satisfiability (SAT) is an NP-complete problem that impacts manyresearch fields, such as planning, verification, and security. Mainstream modernSAT solvers are based on the Conflict-Driven Clause Learning (CDCL) algorithm.Recent work aimed to enhance CDCL SAT solvers using Graph Neural Networks(GNNs). However, so far this approach either has not made solving more effective,or required substantial GPU resources for frequent online model inferences. Aimingto make GNN improvements practical, this paper proposes an approach calledNeuroBack, which builds on two insights: (1) predicting phases (i.e., values) ofvariables appearing in the majority (or even all) of the satisfying assignments areessential for CDCL SAT solving, and (2) it is sufficient to query the neural modelonly once for the predictions before the SAT solving starts. Once trained, theoffline model inference allows NeuroBack to execute exclusively on the CPU,removing its reliance on GPU resources. To train NeuroBack, a new dataset calledDataBack containing 120,286 data samples is created. Finally, NeuroBack is implementedas an enhancement to a state-of-the-art SAT solver called Kissat. As a result,it allowed Kissat to solve 5.2% more problems on the recent SAT competitionproblem set, SATCOMP-2022. NeuroBack therefore shows how machine learningcan be harnessed to improve SAT solving in an effective and practical manner.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: transpilation program translation assembly code language model neurosymbolic machine learning
Scores: [ 8 8 5 ]
Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze.Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: planning abstractions hierarchical planning library learning learning from language
Scores: [ 6 8 6 5 6 ]
Long-horizon planning is dauntingly hard -- it requires modeling relevant aspects of the environment and searching over large, complex action spaces. \textit{Hierarchical planning} approaches make complex problems more tractable using temporal \textit{action abstractions}, decomposing hard tasks into smaller abstract subproblems that can be solved modularly. However, actually learning useful action abstractions has long posed significant challenges without human expert knowledge. Here, we introduce a system that leverages background information in language to learn a \textit{library of symbolic action abstractions and accompanying low-level policies} that can be composed to solve increasingly complex tasks. Our approach queries large language models (LLMs) as a prior for proposing useful symbolic action definitions, but integrates these proposals into a formal hierarchical planning system to ground and verify proposed actions. On two language-guided interactive planning domains (\textit{Mini Minecraft} and the \textit{ALFRED Household Tasks} benchmark), our approach far outperforms other baseline approaches that use LLMs in planning, enabling far more accurate planning and enable better generalization to more complex tasks.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: transformers interactive theorem proving automated reasoning contrastive learning premise selection
Scores: [ 8 8 8 8 ]
This paper presents a novel approach to premise selection, a crucial reasoning task in automated theorem proving. Traditionally, symbolic methods that rely on extensive domain knowledge and engineering effort are applied to this task. In contrast, this work demonstrates that contrastive training with the transformer architecture can achieve higher-quality retrieval of relevant premises, without the knowledge or feature engineering overhead. Our method, Magnushammer, outperforms the most advanced and widely used automation tool in interactive theorem proving called Sledgehammer. On the PISA and miniF2f benchmarks Magnushammer achieves 59.5% (against 38.3%) and 34.0% (against 20.9%) success rates, respectively. By combining Magnushammer with a language-model-based automated theorem prover, we further improve the state-of-the-art proof success rate from 57.0% to 71.0% on the PISA benchmark using $4$x fewer parameters. Moreover, we develop and open source a novel dataset for premise selection, containing textual representations of (proof state, relevant premise) pairs. To the best of our knowledge, this is the largest available premise selection dataset, and the first dataset of this kind for the Isabelle proof assistant.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Knowledge Graph Chain-of-Thought Large Language Models
Scores: [ 8 8 6 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Faithful Rule Extraction Rule Learning Knowledge Graph Datalog
Scores: [ 6 8 8 5 ]
There is increasing interest in methods for extracting interpretable rules from ML models trained to solve a wide range of tasks over knowledge graphs (KGs), such as KG completion, node classification, question answering and recommendation. Many such approaches, however, lack formal guarantees establishing the precise relationship between the model and the extracted rules, and this lack of assurance becomes especially problematic when the extracted rules are applied in safety-critical contexts or to ensure compliance with legal requirements. Recent research has examined whether the rules derived from the influential Neural-LP model exhibit soundness (or completeness), which means that the results obtained by applying the model to any dataset always contain (or are contained in) the results obtained by applying the rules to the same dataset. In this paper, we extend this analysis to the context of DRUM, an approach that has demonstrated superior practical performance. After observing that the rules currently extracted from a DRUM model can be unsound and/or incomplete, we propose a novel algorithm where the output rules, expressed in an extension of Datalog, ensure both soundness and completeness. This algorithm, however, can be inefficient in practice and hence we propose additional constraints to DRUM models facilitating rule extraction, albeit at the expense of reduced expressive power.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: transformer encoders languages AC0 first order logic linear temporal logic counting terms parity majority
Scores: [ 8 8 3 3 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: theorem extraction mathematical reasoning theorem proving
Scores: [ 8 5 8 8 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: neurosymbolic planning rationality
Scores: [ 5 6 8 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: mathematical reasoning autoformalization automated theorem proving quantitative reasoning
Scores: [ 3 8 8 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: program synthesis language models library learning compression refactoring abstraction documentation neurosymbolic
Scores: [ 6 8 6 8 ]
While large language models (LLMs) now excel at code generation, a key aspect of software development is the art of refactoring: consolidating code into libraries of reusable and readable programs. In this paper, we introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code to build libraries tailored to particular problem domains. Computationally, library learning presents a challenging optimization problem that requires formal reasoning about program structure at scale. LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch: a symbolic compression system that efficiently identifies optimal lambda abstractions across large code corpora. To make these abstractions interpretable, we introduce an auto-documentation (AutoDoc) procedure that infers natural language names and docstrings based on contextual examples of usage. In addition to improving human readability, we find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions. We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing neural and symbolic methods—including the state-of-the-art library learning algorithm DreamCoder—LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge. In sum, LILO provides a general design pattern for human-interpretable systems that build up shared libraries of program abstractions to solve complex software problems.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Large Language Models In-context Learning Self-Verification Self-Correction Truthfulness Tool-use Interaction
Scores: [ 5 8 8 5 ]
Recent developments in large language models (LLMs) have been impressive. However, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. Unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. Inspired by this observation, we introduce a framework called CRITIC that allows LLMs, which are essentially “black boxes” to validate and progressively amend their own outputs in a manner similar to human interaction with tools. More specifically, starting with an initial output, CRITIC interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Verification Certification Approximation Robustness
Scores: [ 6 6 6 6 ]
Randomized smoothing-based certification is an effective approach for obtaining robustness certificates of deep neural networks (DNNs) against adversarial attacks. This method constructs a smoothed DNN model and certifies its robustness through statistical sampling, but it is computationally expensive, especially when certifying with a large number of samples. Furthermore, when the smoothed model is modified (e.g., quantized or pruned), certification guarantees may not hold for the modified DNN, and recertifying from scratch can be prohibitively expensive.We present the first approach for incremental robustness certification for randomized smoothing, IRS. We show how to reuse the certification guarantees for the original smoothed model to certify an approximated model with very few samples. IRS significantly reduces the computational cost of certifying modified DNNs while maintaining strong robustness guarantees. We experimentally demonstrate the effectiveness of our approach, showing up to 4.1x certification speedup over the certification that applies randomized smoothing of the approximate model from scratch.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Bioinformatics
Scores: [ 3 3 6 ]
While artificial intelligence has made remarkable strides in revealing the relationship between biological macromolecules' primary sequence and tertiary structure, designing RNA sequences based on specified tertiary structures remains challenging. Though existing approaches in protein design have thoroughly explored structure-to-sequence dependencies in proteins, RNA design still confronts difficulties due to structural complexity and data scarcity. Adding to the problem, direct transplantation of protein design methodologies into RNA design fails to achieve satisfactory outcomes although sharing similar structural components. In this study, we aim to systematically construct a data-driven RNA design pipeline. We crafted a large, well-curated benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure. More importantly, we proposed a hierarchical data-efficient representation learning framework that learns structural representations through contrastive learning at both cluster-level and sample-level to fully leverage the limited data. By constraining data representations within a limited hyperspherical space, the intrinsic relationships between data points could be explicitly imposed. Moreover, we incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process. Extensive experiments demonstrate the effectiveness of our proposed method, providing a reliable baseline for future RNA design tasks. The source code and benchmark dataset will be released publicly.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: knowledge graph reasoning graph sampling
Scores: [ 5 6 3 ]
To deduce new facts on knowledge graph (KG), a reasoning system learns from the graph structure and collects local evidence to find the answer. However, existing methods suffer from a severe scalability problem due to the utilization of the whole KG for reasoning, which hinders their promise on large-scale KG and cannot be directly addressed by vanilla sampling methods. In this work, we propose the one-shot subgraph reasoning to achieve efficient as well as adaptive KG reasoning. The design principle is that, instead of directly acting on the whole KG, the reasoning procedure is decoupled into two steps, i.e., (i) extracting only one query-dependent subgraph and (ii) reasoning on this single subgraph. We reveal that the non-parametric and computation-efficient heuristics Personalized PageRank (PPR) can effectively identify the potential answers and supports to the reasoning. With the promoted efficiency, we further introduce the subgraph-based searching of optimal configurations in both data and model spaces. Empirically, our method achieves promoted efficiency and also leading performances on five large-scale benchmarks.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Large Language Models Formal verification
Scores: [ 8 5 5 5 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: symbolic regression reinforcement learning equation discovery
Scores: [ 6 8 6 6 ]
In nature, the behaviors of many complex systems can be described by parsimonious math equations. Automatically distilling these equations from limited data is cast as a symbolic regression (SR) process which hitherto remains a grand challenge. Keen efforts have been placed on tackling this issue and demonstrated success in { SR}. However, there still exist bottlenecks that current methods struggle to break when the discrete search space tends toward infinity and especially when the underlying math formula is intricate. To this end, we propose a novel Reinforcement Symbolic Regression Machine (RSRM) that masters the capability of uncovering complex math equations from only scarce data. The RSRM model is composed of three key modules: (1) a Monte Carlo tree search (MCTS) agent, { designed for exploration}, that explores optimal math expression trees consisting of pre-defined math operators and variables, (2) a Double Q-learning block, { designed for exploitation}, that helps reduce the feasible search space of MCTS via properly understanding the distribution of reward, and (3) a modulated sub-tree discovery block that heuristically learns and defines new math operators to improve representation ability of math expression trees. Binding of these modules yields the SOTA performance of RSRM in SR as demonstrated by multiple benchmark datasets. The RSRM shows clear superiority over several representative baseline models.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: programmatic policy reinforcement learning
Scores: [ 5 3 3 ]
Recent works have introduced LEAPS and HPRL, systems that learn latent spaces of domain-specific languages, which are used to define programmatic policies for partially observable Markov decision processes (POMDPs). These systems induce a latent space while optimizing losses such as the behavior loss, which aim to achieve locality in program behavior, meaning that vectors close in the latent space should correspond to similarly behaving programs. In this paper, we show that the programmatic space, induced by the domain-specific language and requiring no training, presents values for the behavior loss similar to those observed in latent spaces presented in previous work. Moreover, algorithms searching in the programmatic space significantly outperform those in LEAPS and HPRL. To explain our results, we measured the friendliness'' of the two spaces to local search algorithms. We discovered that algorithms are more likely to stop at local maxima when searching in the latent space than when searching in the programmatic space. This implies that the optimization topology of the programmatic space, induced by the reward function in conjunction with the neighborhood function, is more conducive to search than that of the latent space. This result provides an explanation for the superior performance in the programmatic space.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Large language models Neuro-symbolic Visual Reasoning
Scores: [ 6 6 5 6 ]
Recent works have shown that Large Language Models (LLMs) could empower traditional neuro-symbolic models via programming capabilities to translate lan- guages into module descriptions, thus achieving strong visual reasoning results while maintaining the model’s transparency and efficiency. However, these mod- els usually exhaustively generate the entire code snippet given each new instance of a task, which is extremely ineffective. On the contrary, human beings grad- ually acquire knowledge that can be reused and grow into more profound skills for fast generalization to new tasks since we are an infant. Inspired by this, we propose generative neuro-symbolic visual reasoning by growing and reusing mod- ules. Specifically, our model consists of three unique stages, module initialization, module generation, and module execution. First, given a vision-language task, we adopt LLMs to examine whether we could reuse and grow over established mod- ules to handle this new task. If not, we initialize a new module needed by the task and specify the inputs and outputs of this new module. After that, the new module is created by querying LLMs to generate corresponding code snippets that match the requirements. In order to get a better sense of the new module’s ability, we treat few-shot training examples as test cases to see if our new module could pass these cases. If yes, the new module is added to the module library for future reuse. Finally, we evaluate the performance of our model on the testing set by executing the parsed programs with the newly made visual modules to get the results. We find the proposed GNSVR model possesses several advantages. First, it performs competitively on standard tasks like visual question answering and referring ex- pression comprehension; Second, the visual modules learned from one task can be seamlessly transferred to new tasks; Last but not least, it is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: physics-informed machine learning OOD robustness meta learning causal structure discovery
Scores: [ 6 8 6 8 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: air quality prediction physics-informed spatiotemporal-learning interpretability
Scores: [ 6 8 6 6 ]
Air quality prediction and modelling plays a pivotal role in public health and environment management, for individuals and authorities to make informed decisions. Although traditional data-driven models have shown promise in this domain,their long-term prediction accuracy can be limited, especially in scenarios with sparse or incomplete data and they often rely on "black-box" deep learning structures that lack solid physical foundation leading to reduced transparency and interpretability in predictions. To address these limitations, this paper presents a novel approach named Physics guided Neural Network for Air Quality Prediction (AirPhyNet). Specifically, we leverage two well-established physics principles of air particle movement (diffusion and advection) by representing them as differential equation networks. Then, we utilize a graph structure to integrate physics knowledge into a neural network architecture and exploit latent representations to capture spatio-temporal relationships within the air quality data. Experiments on two real-world benchmark datasets demonstrate that AirPhyNet outperforms state-of-the-art models for different testing scenarios including different lead time (24h, 48h, 72h), sparse data and sudden change prediction, achieving reduction in prediction errors up to 10%. Moreover, a case study further validates that our model captures underlying physical processes of particle movement and generates accurate predictions with real physical meaning. The code is available at:: https://anonymous.4open.science/r/AirPhyNet-230F/
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Physics-informed Neural Networks PINNs adaptive training points selection
Scores: [ 8 8 8 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Inductive reasoning large language models
Scores: [ 6 3 6 5 3 6 ]
Inductive reasoning is a core problem-solving capacity: humans can identify underlying principles from a few examples, which can then be robustly generalized to novel scenarios. Recent work has evaluated large language models (LLMs) on inductive reasoning tasks by directly prompting them yielding "in context learning." This can work well for straightforward inductive tasks, but performs very poorly on more complex tasks such as the Abstraction and Reasoning Corpus (ARC). In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction: we prompt the LLM to propose multiple abstract hypotheses about the problem, in natural language, then implement the natural language hypotheses as concrete Python programs. These programs can be directly verified by running on the observed examples and generalized to novel inputs. To reduce the hypothesis search space, we explore steps to filter the set of hypotheses to be implemented as programs: we either ask the LLM to summarize them into a smaller set of hypotheses, or ask human annotators to select a subset. We verify our pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its variant 1D-ARC, and string transformation dataset SyGuS. On a random 40-problem subset of ARC, our automated pipeline using LLM summaries achieves 27.5% accuracy, significantly outperforming the direct prompting baseline (accuracy of 12.5%). With the minimal human input of selecting from LLM-generated candidates, the performance is boosted to 37.5%. Our ablation studies show that abstract hypothesis generation and concrete program representations are both beneficial for LLMs to perform inductive reasoning tasks.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Neuro-symbolic AI Systematic Generalization Compositional Generalization
Scores: [ 8 3 8 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Neural Network Verification Robustness Certification
Scores: [ 6 6 6 6 ]
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Large Language Models Mathematical Reasoning Tool Learning
Scores: [ 6 5 8 8 ]
Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: neurosymbolic learning machine learning world modeling compositional generalization
Scores: [ 6 6 6 ]
We introduce Cosmos, a framework for object-centric world modeling that is designed for compositional generalization (CG), i.e., high performance on unseen input scenes obtained through the composition of known visual "atoms." The central insight behind Cosmos is the use of a novel form of neurosymbolic grounding. Specifically, the framework introduces two new tools: (i) neurosymbolic scene encodings, which represent each entity in a scene using a real vector computed using a neural encoder, as well as a vector of composable symbols describing attributes of the entity, and (ii) a neurosymbolic attention mechanism that binds these entities to learned rules of interaction. Cosmos is end-to-end differentiable; also, unlike traditional neurosymbolic methods that require representations to be manually mapped to symbols, it computes an entity's symbolic attributes using vision-language foundation models. Through an evaluation that considers two different forms of CG on an established blocks-pushing domain, we show that the framework establishes a new state-of-the-art for CG in world modeling.
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Keywords: Theorem proving Large language model Autoformalization
Scores: [ 8 6 8 8 ]
Despite the success of large language models (LLMs), the task of theorem proving still remains one of the hardest reasoning tasks that is far from being fully solved. Prior methods using language models have demonstrated promising results, but they still struggle to prove even middle school level theorems. One common limitation of these methods is that they assume a fixed theorem library during the whole theorem proving process. However, as we all know, creating new useful theorems or even new theories is not only helpful but crucial and necessary for advancing mathematics and proving harder and deeper results.In this work, we present LEGO-Prover, which employs a growing skill library containing verified lemmas as skills to augment the capability of LLMs used in theorem proving. By constructing the proof modularly, LEGO-Prover enables LLMs to utilize existing skills retrieved from the library and to create new skills during the proving process. These skills are further evolved (by prompting an LLM) to enrich the library on another scale. Modular and reusable skills are constantly added to the library to enable tackling increasingly intricate mathematical problems. Moreover, the learned library further bridges the gap between human proofs and formal proofs by making it easier to impute missing steps. LEGO-Prover advances the state-of-the-art pass rate on miniF2F-valid (48.0% to 57.0%) and miniF2F-test (45.5% to 47.1%). During the proving process, LEGO-Prover also manages to generate over 20,000 skills (theorems/lemmas) and adds them to the growing library. Our ablation study indicates that these newly added skills are indeed helpful for proving theorems, resulting in an improvement from a success rate of 47.1% to 50.4%. We also release our code and all the generated skills.
Primary Area: optimization
Keywords: Neural Optimizer Differential Dynamic Programming Adversarial Defense Game Theory
Scores: [ 6 6 6 ]
Primary Area: optimization
Keywords: distributed training Local SGD local gradient methods generalization implicit bias sharpness
Scores: [ 8 5 6 8 ]
In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for H steps without synchronizing with others, hence reducing communication frequency. While H has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper H value can lead to generalization improvement. Yet, selecting a proper H is elusive. This work proposes a theory-grounded method for determining H, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting H in proportion to 1η2 as the learning rate η decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared to the standard data parallel training, QSR enables Local AdamW to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves 1.16% or 0.84% higher top-1 validation accuracy.
Primary Area: optimization
Keywords: Convex Optimization Stochastic Optimization Last Iterates
Scores: [ 8 6 6 ]
In the past several years, the convergence of the last iterate of the Stochastic Gradient Descent (SGD) algorithm has triggered people's great interest due to its good performance in practice but lack of theoretical understanding. For Lipschtiz and convex functions, different works have established the optimal O(log(1/δ)logT/√T) or O(√log(1/δ)/T) high-probability convergence rates for the final iterate, where T is the time horizon and δ is the failure probability. However, to prove these bounds, all the existing works are limited to compact domains, and almost all of them also require almost surely bounded noises. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate but without these two restrictive assumptions. Besides this important question, there are still lots of theoretical problems lacking an answer. For example, compared with the last iterate convergence of SGD for non-smooth problems, only very few results for smooth optimization have yet been developed. Additionally, the existing results are all limited to a single objective and the standard Euclidean norm. It still remains unclear whether the last-iterative convergence can be provably extended to wider composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterative convergence of stochastic gradient methods and provide the first unified way to prove the convergence rates both in expectation and in high probability to accommodate general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness and (strong) convexity simultaneously.
Primary Area: optimization
Keywords: mlp batch-normalization optimization depth calculus theory deep-learning non-asymptotic
Scores: [ 6 6 6 6 ]
Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves the signal propagation, by avoiding the representations from becoming collinear across the layers. However, results on mean-field theory of batch normalization also conclude that this benefit comes at the expense of exploding gradients in depth. Motivated by these two aspects of batch normalization, in this study we pose the following question: Can a batch-normalized network keep the optimal signal propagation properties, but avoid exploding gradients? We answer this question in the affirmative by giving a particular construction of an MLP with linear activations and batch-normalization that provably has bounded gradients at any depth. Based on Weingarten calculus, we develop a rigorous and non-asymptotic theory for this constructed MLP that gives a precise characterization of forward signal propagation, while proving that gradients remain bounded for linearly independent input samples, which holds in most practical settings. Inspired by our theory, we also design an activation shaping scheme that empirically achieves the same properties for non-linear activations.
Primary Area: optimization
Keywords: Bi-level Optimization Constrained Optimization Hessian-free Single-loop Value Function Convergence Analysis
Scores: [ 5 8 8 6 5 ]
Primary Area: optimization
Keywords: differentially private optimization stochastic gradient descent linear regression theory private deep learning
Scores: [ 1 8 8 ]
Primary Area: optimization
Keywords: computational imaging inverse problems deep learning plug-and-play priors
Scores: [ 5 6 8 6 ]
Primary Area: optimization
Keywords: distributed learning Byzantine robustness batch size
Scores: [ 6 6 6 ]
Byzantine-robust distributed learning (BRDL), in which computing devices are likely to behave abnormally due to accidental failures or malicious attacks, has recently become a hot research topic. However, even in the independent and identically distributed (i.i.d.) case, existing BRDL methods will suffer a significant drop on model accuracy due to the large variance of stochastic gradients. Increasing batch sizes is a simple yet effective way to reduce the variance. However, when the total number of gradient computation is fixed, a too-large batch size will lead to a too-small iteration number (update number), which may also degrade the model accuracy. In view of this challenge, we mainly study the effect of batch size when the total number of gradient computation is fixed in this work. In particular, we show that when the total number of gradient computation is fixed, the optimal batch size corresponding to the tightest theoretical upper bound in BRDL increases with the fraction of Byzantine workers. Therefore, compared to the case without attacks, a larger batch size is preferred when under Byzantine attacks. Motivated by the theoretical finding, we propose a novel method called Byzantine-robust stochastic gradient descent with normalized momentum (ByzSGDnm) in order to further increase model accuracy in BRDL. We theoretically prove the convergence of ByzSGDnm for general non-convex cases under Byzantine attacks. Empirical results show that when under Byzantine attacks, compared to the cases of small batch sizes, setting a relatively large batch size can significantly increase the model accuracy, which is consistent with our theoretical results. Moreover, ByzSGDnm can achieve higher model accuracy than existing BRDL methods when under deliberately crafted attacks. In addition, we empirically show that increasing batch sizes has the bonus of training acceleration.
Primary Area: optimization
Keywords: Grokking Feature Learning Neural Tangent Kernel Generalization Training Dynamics
Scores: [ 8 8 3 5 ]
Primary Area: optimization
Keywords: Black-Box Optimization Meta-Black-Box Optimization Deep Reinforcement Learning Symbolic Equation Learning
Scores: [ 8 6 6 6 ]
Primary Area: optimization
Keywords: clustering Burer-Monteiro semidefinite programming
Scores: [ 8 3 8 8 ]
Primary Area: optimization
Keywords: Robust Optimization Stochastic Optimization End-to-End learning
Scores: [ 8 6 8 3 ]
We study contextual stochastic optimization problems. Optimization problems have uncertain parameters stemming from unknown, context-dependent, distributions. Due to the inherent uncertainty in these problems, one is often interested not only in minimizing expected cost, but also to be robust and protect against worst case scenarios. We propose a novel method that combines the learning stage with knowledge of the downstream optimization task. The method prescribes decisions which aim to maximize the likelihood that the cost is below a (user-controlled) threshold. The key idea is (1) to discretize the feasible region into subsets so that the uncertain objective function can be well approximated deterministically within each subset, and (2) devise a secondary optimization problem to prescribe decisions by integrating the individual approximations determined in step (1). We provide theoretical guarantees bounding the underlying regret of decisions proposed by our method. In addition, experimental results demonstrate that our approach is competitive in terms of average regret and yields more robust solutions than other methods proposed in the literature, including up to 20 times lower worst-case cost on a real-world electricity generation problem.
Primary Area: optimization
Keywords: optimization fast mixing fast equilibrium SDE SGD large deviation principle
Scores: [ 6 8 6 6 ]
Normalization layers are ubiquitous in deep learning, greatly accelerating optimization. However, they also introduce many unexpected phenomena during training, for example, the Fast Equilibrium conjecture proposed by (Li et al.,2020), which states that the scale-invariant normalized network, when trained by SGD with η learning rate and λ weight decay, mixes to an equilibrium in ~O(1/ηλ) steps, as opposed to classical eO(η−1) mixing time. Recent works by Wang & Wang (2022); Li et al. (2022c) proved this conjecture under different sets of assumptions. This paper aims to answer the fast equilibrium conjecture in full generality by removing the non-generic assumptions of Wang & Wang (2022); Li et al. (2022c) that the minima are isolated, that the region near minima forms a unique basin, and that the set of minima is an analytic set. Our main technical contribution is to show that with probability close to 1, in exponential time trajectories will not escape the attracting basin containing its initial position.
Primary Area: optimization
Keywords: gradient-free learning zeroth-order optimization gradient sparsity
Scores: [ 5 3 6 8 8 6 ]
Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To our best knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled and practical ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinate-wise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsity-induced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing deep learning.
Primary Area: optimization
Keywords: Pure Differential Privacy Monte Carlo sampling Gaussian Differential Privacy Exponential Mechanism
Scores: [ 8 6 6 ]
Posterior sampling, i.e., exponential mechanism to sample from the posterior distribution, provides ε-pure differential privacy (DP) guarantees and does not suffer from potentially unbounded privacy breach introduced by (ε,δ)-approximate DP. In practice, however, one needs to apply approximate sampling methods such as Markov chain Monte Carlo (MCMC), thus re-introducing the unappealing δ-approximation error into the privacy guarantees. To bridge this gap, we propose the Approximate SAample Perturbation (abbr. ASAP) algorithm which perturbs an MCMC sample with noise proportional to its Wasserstein-infinity (W∞) distance from a reference distribution that satisfies pure DP or pure Gaussian DP (i.e., δ=0). We then leverage a Metropolis-Hastings algorithm to generate the sample and prove that the algorithm converges in W$_\infty$ distance. We show that by combining our new techniques with a careful localization step, we obtain the first nearly linear-time algorithm that achieves the optimal rates in the DP-ERM problem with strongly convex and smooth losses.
Primary Area: optimization
Keywords: Quantum Algorithms Quantum Query Complexity Convex Optimization Minimizing Loss
Scores: [ 6 8 5 5 ]
The problem of minimizing the maximum of N convex, Lipschitz functions plays significant roles in optimization and machine learning. It has a series of results, with the most recent one requiring O(Nϵ−2/3+ϵ−8/3) queries to a first-order oracle to compute an ϵ-suboptimal point. On the other hand, quantum algorithms for optimization are rapidly advancing with speedups shown on many important optimization problems. In this paper, we conduct a systematic study of quantum algorithms and lower bounds for minimizing the maximum of N convex, Lipschitz functions. On one hand, we develop quantum algorithms with an improved complexity bound of ~O(√Nϵ−5/3+ϵ−8/3). On the other hand, we prove that quantum algorithms must take ~Ω(√Nϵ−2/3) queries to a first-order quantum oracle, showing that our dependence on N is optimal up to poly-logarithmic factors.
Primary Area: optimization
Keywords: Second-order methods convex optimization stochastic optimization Cubic Newton Method high-order methods tensor methods lower bounds
Scores: [ 6 6 1 8 ]
We present a new accelerated stochastic second-order method that is robust to both gradient and Hessian inexactness, typical in machine learning. We establish theoretical lower bounds and prove that our algorithm achieves optimal convergence in both gradient and Hessian inexactness in this key setting. We further introduce a tensor generalization for stochastic higher-order derivatives. When the oracles are non-stochastic, the proposed tensor algorithm matches the global convergence of Nesterov Accelerated Tensor method. Both algorithms allow for approximate solutions of their auxiliary subproblems with verifiable conditions on the accuracy of the solution.
Primary Area: optimization
Keywords: federated optimization communication cost non-linear bandit bandit optimization cumulative regret
Scores: [ 6 6 6 6 ]
Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function class with bounded RKHS norm, which severely hinders its practical usage. In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are distributed regression oracle and individual confidence set construction, which can be of independent interests. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm.
Primary Area: optimization
Keywords: feature-metric image alignment
Scores: [ 8 6 8 ]
Primary Area: optimization
Keywords: Optimization Federated Learning Contractive Compression
Scores: [ 6 6 6 8 ]
Modern distributed training relies heavily on communication compression to reduce the communication overhead. In this work, we study algorithms employing a popular class of contractive compressors in order to reduce communication overhead. However, the naive implementation often leads to unstable convergence or even exponential divergence due to the compression bias. Error Compensation (EC) is an extremely popular mechanism to mitigate the aforementioned issues during the training of models enhanced by contractive compression operators. Compared to the effectiveness of EC in the data homogeneous regime, the understanding of the practicality and theoretical foundations of EC in the data heterogeneous regime is limited. Existing convergence analyses typically rely on strong assumptions such as bounded gradients, bounded data heterogeneity, or large batch accesses, which are often infeasible in modern Machine Learning Applications. We resolve the majority of current issues by proposing EControl, a novel mechanism that can regulate error compensation by controlling the strength of the feedback signal. We prove fast convergence for EControl in standard strongly convex, general convex, and nonconvex settings without any additional assumptions on the problem or data heterogeneity. We conduct extensive numerical evaluations to illustrate the efficacy of our method and support our theoretical findings.
Primary Area: optimization
Keywords: sharpness-aware minimization sam trust region optimization cross-lingual transfer language modeling
Scores: [ 5 6 8 8 ]
By reducing the curvature of the loss surface in the parameter space,Sharpness-aware minimization (SAM) yields widespread robustness improvementunder domain transfer. Instead of focusing on parameters, however, this workconsiders the transferability of representations as the optimizationtarget for out-of-domain generalization in a fine-tuning setup. To encourage theretention of transferable representations, we consider trust region-basedfine-tuning methods, which exploit task-specific skills without forgettingtask-agnostic representations from pre-training. We unify parameter- andrepresentation-space smoothing approaches by using trust region bounds to informSAM-style regularizers on both of these optimization surfaces. We proposeTrust Region Aware Minimization (TRAM), a fine-tuning algorithm thatoptimizes for flat minima and smooth, informative representations withoutforgetting pre-trained structure. We find that TRAM outperforms bothsharpness-aware and trust region-based optimization methods on cross-domainlanguage modeling and cross-lingual transfer, where robustness to domaintransfer and representation generality are critical for success. TRAMestablishes a new standard in training generalizable models with minimaladditional computation.
Primary Area: optimization
Keywords: Saddle Point Optimization Distributed Optimization Composite Optimization Dual Extrapolation Mirror Prox Convex Optimization Bregman Divergence
Scores: [ 6 5 6 ]
Distributed optimization (DO) approaches for saddle point problems (SPP) have recently gained in popularity due to the critical role they play in machine learning (ML). Existing works mostly target smooth unconstrained objectives in Euclidean space, whereas ML problems often involve constraints or non-smooth regularization, which results in a need for composite optimization. Moreover, although non-smooth regularization often serves to induce structure (e.g., sparsity), standard aggregation schemes in distributed optimization break this structure. Addressing these issues, we propose Federated Dual Extrapolation (FeDualEx), an extra-step primal-dual algorithm with local updates, which is the first of its kind to encompass both saddle point optimization and composite objectives under the distributed paradigm. Using a generalized notion of Bregman divergence, we analyze its convergence and communication complexity in the homogeneous setting. Furthermore, the empirical evaluation demonstrates the effectiveness of FeDualEx for inducing structure in these challenging settings.
Primary Area: optimization
Keywords: neural network optimization progressive sharpening edge of stability adaptive gradient methods batch normalization
Scores: [ 8 5 3 6 ]
Primary Area: optimization
Keywords: Deep Reinforcement Learning Graph Neural Network Job Shop Scheduling Combinatorial Optimization
Scores: [ 8 8 8 6 ]
Recent studies in using deep reinforcement learning (DRL) to solve Job-shop scheduling problems (JSSP) focus on construction heuristics. However, their performance is still far from optimality, mainly because the underlying graph representation scheme is unsuitable for modelling partial solutions at each construction step. This paper proposes a novel DRL-guided improvement heuristic for solving JSSP, where graph representation is employed to encode complete solutions. We design a Graph-Neural-Network-based representation scheme, consisting of two modules to effectively capture the information of dynamic topology and different types of nodes in graphs encountered during the improvement process. To speed up solution evaluation during improvement, we present a novel message-passing mechanism that can evaluate multiple solutions simultaneously. We prove that the computational complexity of our method scales linearly with problem size. Experiments on classic benchmarks show that the improvement policy learned by our method outperforms state-of-the-art DRL-based methods by a large margin.
Primary Area: optimization
Keywords: federated learning communication compression data heterogeneity controlled averaging
Scores: [ 8 8 8 ]
Primary Area: optimization
Keywords: burer-monteiro convex optimization neural networks stationary points global optima relu activation
Scores: [ 8 5 6 6 8 ]
It has been demonstrated that the training problem for a variety of (non) linear two-layer neural networks (such as two-layer perceptrons, convolutional networks, and self-attention) can be posed as equivalent convex optimization problems, with an induced regularizer which encourages low rank. However, this regularizer becomes prohibitively expensive to compute at moderate scales, impeding training convex neural networks. To this end, we propose applying the Burer-Monteiro factorization to convex neural networks, which for the first time enables a Burer-Monteiro perspective on neural networks with non-linearities. This factorization leads to an equivalent yet computationally tractable non-convex alternative with no spurious local minima. We develop a novel relative optimality bound of stationary points of the Burer-Monteiro factorization, providing verifiable conditions under which any stationary point is a global optimum. Further, for the first time, we show that linear self-attention with sufficiently many heads has no spurious local minima. Our experiments validate the novel relative optimality bound and the utility of the Burer-Monteiro factorization for scaling convex neural networks.
Primary Area: optimization
Keywords: online nonconvex optimization local regret optimal rate analysis
Scores: [ 6 6 8 6 ]
Primary Area: optimization
Keywords: Network Pruning Federated Learning Model Heterogeneity
Scores: [ 8 3 6 5 8 6 ]
Primary Area: optimization
Keywords: Federated Learning Variational Inequality Minimax Optimization Communication Complexity
Scores: [ 6 5 6 8 ]
Primary Area: optimization
Keywords: Deep learning Second-order optimization K-FAC Feature learning Infinite width Maximum update parameterization
Scores: [ 6 8 8 ]
Primary Area: optimization
Keywords: federated learning adaptive optimization distributed optimization
Scores: [ 6 8 6 6 ]
Federated learning (FL) is a distributed machine learning framework where the global model of a central server is trained via multiple collaborative steps by participating clients without sharing their data. While being a flexible framework, where the distribution of local data, participation rate, and computing power of each client can greatly vary, such flexibility gives rise to many new challenges, especially in the hyperparameter tuning on the client side. We propose Δ-SGD, a simple step size rule for SGD that enables each client to use its own step size by adapting to the local smoothness of the function each client is optimizing. We provide theoretical and empirical results where the benefit of the client adaptivity is shown in various FL scenarios.
Primary Area: optimization
Keywords: Constrained optimization Invexity Quasi-invexity Global optima
Scores: [ 6 3 5 ]
Signal restoration is an important constrained optimization problem with significant applications in various domains. Although non-convex constrained optimization problems have been shown to perform better than convex counterparts in terms of reconstruction quality, convex constrained optimization problems have been preferably for its global optima guarantees. Despite the success of non-convex methods in a large number of applications, it is not an overstatement to say that there is little or no hope for non-convex problems to ensure global optima. In this paper, for the first time, we develop invex constrained optimization theory to mitigate the loss of guarantees for global optima in non-convex constrained inverse problems, where the invex function is a mapping where any critical point is a global minimizer. We also develop relevant theories to extend the global optima guarantee to a set of quasi-invex functions - the largest optimizable mappings. More specifically, we propose a family of invex/quasi-invex of functions for handling constrained inverse problems using the non-convex setting along with guarantees for their global optima. Our experimental evaluation shows that the proposed approach is very promising and can aid in extending existing convex optimization algorithms, such as the alternating direction method of multipliers, and accelerated proximal gradient methods.
Primary Area: optimization
Keywords: markov chains diffusion processes
Scores: [ 3 6 8 ]
Primary Area: optimization
Keywords: zeroth-order gradient hard-thresholding variance reduction
Scores: [ 6 6 3 6 5 6 ]
Hard-thresholding is an important type of algorithm in machine learning that is used to solve ℓ0 constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, which normally can be approximated by zeroth-order (ZO) methods. SZOHT algorithm is the only algorithm tackling ℓ0 sparsity constraints with zeroth-order gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a portfolio optimization problem as well as black-box adversarial attacks.
Primary Area: optimization
Keywords: Constrained Learning Convex Optimization Duality Constrained Optimization Fairness
Scores: [ 6 8 6 3 6 ]
With the widespread adoption of machine learning systems, the need to curtail their behavior has become increasingly apparent. This is evidenced by recent advancements towards developing models that satisfy robustness, safety and fairness requirements. Imposing these requirements leads to constrained learning problems, which can be tackled with dual ascent methods. However, convergence guarantees for dual ascent algorithms typically involve a randomized or averaged sequence of primal iterates. These solutions are impractical, since they require storing an ever growing sequence of models. Although it has been observed that final iterates perform well in practice, theoretical guarantees for their optimality and feasibility have remained elusive. In this work, we characterize the infeasibility of Lagrangian minimizers associated with optimal dual variables, which leads to a sub-optimality bound for best primal iterates. To do this, we leverage the fact that constrained learning problems are parametrized versions of convex functional programs. This bound sheds light on how the richness of the parametrization and the curvature of the objective impact the convergence of primal iterates. We empirically validate this finding in learning problems with fairness constraints.
Primary Area: optimization
Keywords: Lion Optimization Lyapunov Analysis
Scores: [ 8 6 8 8 ]
Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It achieves results comparable to AdamW but with greater memory efficiency. As what we can expect from the result of the random search, Lion blends a number of elements from existing algorithms, including signed momentum, decoupled weight decay, Polayk and Nesterov momentum, but doesn't fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This absence of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Using both continuous-time and discrete-time analysis, we demonstrate that Lion is a novel and theoretically grounded approach for minimizing a general loss function f(x) while enforcing a bound constraint ||x||∞≤1/λ. Lion achieves this through the incorporation of decoupled weight decay, where λ represents the weight decay coefficient. Our analysis is facilitated by the development of a new Lyapunov function for the Lion updates. It applies to a wide range of Lion-ϕ algorithms, where the sign(⋅) operator in Lion is replaced by the subgradient of a convex function ϕ, leading to the solution of the general composite optimization problem minxf(x)+ϕ∗(x). Our findings provide valuable insights into the dynamics of Lion and pave the way for further enhancements and extensions of Lion-related algorithms.
Primary Area: optimization
Keywords: Semidefinite programming Lipschitz constant Deep learning
Scores: [ 5 6 6 8 ]
Primary Area: optimization
Keywords: Bayesian Optimization Online Learning Black-box Optimization
Scores: [ 5 8 5 5 ]
Bayesian Optimization (BO) is typically used to optimize an unknown function f that is noisy and costly to evaluate, by exploiting an acquisition function that must be maximized at each optimization step. Even if provably asymptotically optimal BO algorithms are efficient at optimizing low-dimensional functions, scaling them to high-dimensional spaces remains an open problem, often tackled by assuming an additive structure for f. By doing so, BO algorithms typically introduce additional restrictive assumptions on the additive structure that reduce their applicability domain. This paper contains two main contributions: (i) we relax the restrictive assumptions on the additive structure of f, at the expense of weakening the maximization guarantees of the acquisition function, and (ii) we address the over-exploration problem for decentralized BO algorithms. To these ends, we propose DumBO, an asymptotically optimal decentralized BO algorithm that achieves very competitive performance against state-of-the-art BO algorithms, especially when the additive structure of f comprises high-dimensional factors.
Primary Area: optimization
Keywords: Small Transformers Training Stability
Scores: [ 8 8 8 8 ]
Primary Area: optimization
Keywords: Multi-Objective Optimization Black-Box Optimization Black-Box Multi-Objective Optimization
Scores: [ 6 6 6 6 ]
Primary Area: optimization
Keywords: Distributed Training Data Parallelism Local Updating Asynchronous Communication
Scores: [ 8 6 6 8 6 8 ]
Primary Area: optimization
Keywords: federated learning partial client participation adaptation aggregation weights
Scores: [ 6 5 6 10 ]
In federated learning (FL), clients usually have diverse participation statistics that are unknown a priori, which can significantly harm the performance of FL if not handled properly. Existing works aiming at addressing this problem are usually based on global variance reduction, which requires a substantial amount of additional memory in a multiplicative factor equal to the total number of clients. An important open problem is to find a lightweight method for FL in the presence of clients with unknown participation rates. In this paper, we address this problem by adapting the aggregation weights in federated averaging (FedAvg) based on the participation history of each client. We first show that, with heterogeneous participation statistics, FedAvg with non-optimal aggregation weights can diverge from the optimal solution of the original FL objective, indicating the need of finding optimal aggregation weights. However, it is difficult to compute the optimal weights when the participation statistics are unknown. To address this problem, we present a new algorithm called FedAU, which improves FedAvg by adaptively weighting the client updates based on online estimates of the optimal weights without knowing the statistics of client participation. We provide a theoretical convergence analysis of FedAU using a novel methodology to connect the estimation error and convergence. Our theoretical results reveal important and interesting insights, while showing that FedAU converges to an optimal solution of the original objective and has desirable properties such as linear speedup. Our experimental results also verify the advantage of FedAU over baseline methods with various participation patterns.
Primary Area: optimization
Keywords: Optimization First-order optimization Non-convex optimization Distributed optimization
Scores: [ 3 6 6 ]
This paper introduces a new method for minimizing matrix-smooth non-convex objectives through the use of novel Compressed Gradient Descent (CGD) algorithms enhanced with a matrix-valued stepsize. The proposed algorithms are theoretically analyzed first in the single-node and subsequently in the distributed settings. Our theoretical results reveal that the matrix stepsize in CGD can capture the objective’s structure and lead to faster convergence compared to a scalar stepsize. As a byproduct of our general results, we emphasize the importance of selecting the compression mechanism and the matrix stepsize in a layer-wise manner, taking advantage of model structure. Moreover, we provide theoretical guarantees for free compression, by designing specific layer-wise compressors for the non-convex matrix smooth objectives. Our findings are supported with empirical evidence.
Primary Area: optimization
Keywords: mean-field Langevin dynamics minimax optimization zero-sum games Markov games
Scores: [ 8 6 6 8 6 ]
In this paper, we extend mean-field Langevin dynamics to minimax optimization over probability distributions for the first time with symmetric and provably convergent updates. We propose \emph{mean-field Langevin averaged gradient} (MFL-AG), a single-loop algorithm that implements gradient descent ascent in the distribution space with a novel weighted averaging, and establish average-iterate convergence to the mixed Nash equilibrium. We also study both time and particle discretization regimes and prove a new uniform-in-time propagation of chaos result which accounts for the dependency of the particle interactions on all previous distributions. Furthermore, we propose \emph{mean-field Langevin anchored best response} (MFL-ABR), a symmetric double-loop algorithm based on best response dynamics with linear last-iterate convergence. Finally, we study applications to zero-sum Markov games and conduct simulations to demonstrate the long-term optimality of MFL-AG and MFL-ABR.
Primary Area: optimization
Keywords: Machine Learning Optimization Differentiable Optimization Optimization layers
Scores: [ 6 8 8 6 ]
Optimization layers within neural network architectures have become increasingly popular for their ability to solve a wide range of machine learning tasks and to model domain-specific knowledge. However, designing optimization layers requires careful consideration as the underlying optimization problems might be infeasible during training. Motivated by applications in learning, control and robotics, this work focuses on convex quadratic programming (QP) layers. The specific structure of this type of optimization layer can be efficiently exploited for faster computations while still allowing rich modeling capabilities. We leverage primal-dual augmented Lagrangian techniques for computing derivatives of both feasible and infeasible QP solutions. More precisely, we propose a unified approach which tackles the differentiability of the closest feasible QP solutions in a classical ℓ2 sense. The obtained Jacobian covers for feasible QPs the traditional implicit differentiation when it is valid and a weaker notion (i.e., conservative Jacobian) when it is infeasible. We then harness this approach to enrich the expressive capabilities of existing QP layers. More precisely, we show how differentiating through infeasible QPs during training enables to drive towards feasibility at test time a new range of QP layers. These layers notably demonstrate superior predictive performance in some conventional learning tasks. Additionally, we present alternative formulations that enhance numerical robustness, speed, and accuracy for training such layers. Along with these contributions, we provide an open-source C++ software package called QPLayer for differentiating feasible and infeasible convex QPs and which can be interfaced with modern learning frameworks.
Primary Area: optimization
Keywords: SGD linear dynamics sharpness implicit regularization
Scores: [ 6 8 3 6 ]
Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding. Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates. Several studies have investigated the linear stability property of SGD in the vicinity of a stationary point as a predictive proxy for sharpness and generalization error in overparameterized neural networks (Wu et al., 2022; Jastrzebski et al., 2019; Cohen et al., 2021). In this paper, we delve deeper into the relationship between linear stability and sharpness. More specifically, we meticulously delineate the necessary and sufficient conditions for linear stability, contingent on hyperparameters of SGD and the sharpness at the optimum. Towards this end, we introduce a novel coherence measure of the loss Hessian that encapsulates pertinent geometric properties of the loss function that are relevant to the linear stability of SGD. It enables us to provide a simplified sufficient condition for identifying linear instability at an optimum. Notably, compared to previous works, our analysis relies on significantly milder assumptions and is applicable for a broader class of loss functions than known before, encompassing not only mean-squared error but also cross-entropy loss.
Primary Area: optimization
Keywords: federated learning momentum data heterogeneity non-convex optimization
Scores: [ 5 5 8 5 ]
Federated learning is a powerful paradigm for large-scale machine learning, but itfaces significant challenges due to unreliable network connections, slow commu-nication, and substantial data heterogeneity across clients. FedAvg and SCAFFOLD are two prominent algorithms to address these challenges. In particular,FedAvg employs multiple local updates before communicating with a centralserver, while SCAFFOLD maintains a control variable on each client to compen-sate for “client drift” in its local updates. Various methods have been proposedto enhance the convergence of these two algorithms, but they either make imprac-tical adjustments to algorithmic structure, or rely on the assumption of boundeddata heterogeneity. This paper explores the utilization of momentum to enhancethe performance of FedAvg and SCAFFOLD. When all clients participate in thetraining process, we demonstrate that incorporating momentum allows FedAvgto converge without relying on the assumption of bounded data heterogeneity evenusing a constant local learning rate. This is novel and fairly suprising as existinganalyses for FedAvg require bounded data heterogeneity even with diminishinglocal learning rates. In partial client participation, we show that momentum en-ables SCAFFOLD to converge provably faster without imposing any additionalassumptions. Furthermore, we use momentum to develop new variance-reducedextensions of FedAvg and SCAFFOLD, which exhibit state-of-the-art conver-gence rates. Our experimental results support all theoretical findings.
Primary Area: optimization
Keywords: Learng from human feedback zeroth-order optimization Stable Diffusion ranking and preferences
Scores: [ 6 6 6 ]
Primary Area: optimization
Keywords: stochastic optimization convex optimization distributionally robust learning spectral risk measures incremental optimization
Scores: [ 8 8 8 8 ]
We consider the distributionally robust (DR) optimization problem with spectral risk-based uncertainty set and f-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized condition value-at-risk (CVaR) and average top-k loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3x faster than baselines such as SGD and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.
Primary Area: optimization
Keywords: robust optimization deep learning discrete optimization
Scores: [ 8 6 6 ]
Robust optimization provides a mathematical framework for modeling and solving decision-making problems under worst-case uncertainty. This work addresses two-stage robust optimization (2RO) problems (also called adjustable robust optimization), wherein first-stage and second-stage decisions are made before and after uncertainty is realized, respectively. This results in a nested min-max-min optimization problem which is extremely challenging computationally, especially when the decisions are discrete. We propose Neur2RO, an efficient machine learning-driven instantiation of column-and-constraint generation (CCG), a classical iterative algorithm for 2RO. Specifically, we learn to estimate the value function of the second-stage problem via a novel neural network architecture that is easy to optimize over by design. Embedding our neural network into CCG yields high-quality solutions quickly as evidenced by experiments on two 2RO benchmarks, knapsack and capital budgeting. On small or easy instances, Neur2RO recovers solutions of nearly the same quality as state-of-the-art methods but is most advantageous on large-scale instances, where it finds better solutions on average.
Primary Area: optimization
Keywords: variational inequalities optimization general constraints primal-dual interior-point method projection-free
Scores: [ 6 6 8 6 ]
Yang et al. (2023) recently showed how to use first-order gradient methods to solve general variational inequalities (VIs) under a limiting assumption that analytic solutions of specific subproblems are available. In this paper, we circumvent this assumption via a warm-starting technique where we solve subproblems approximately and initialize variables with the approximate solution found at the previous iteration. We prove the convergence of this method and show that the gap function of the last iterate of the method decreases at a rate of O(1√K) when the operator is L-Lipschitz and monotone. In numerical experiments, we show that this technique can converge much faster than its exact counterpart. Furthermore, for the cases when the inequality constraints are simple, we introduce an alternative variant of ACVI and establish its convergence under the same conditions.Finally, we relax the smoothness assumptions in Yang et al., yielding, to our knowledge, the first convergence result for VIs with general constraints that does not rely on the assumption that the operator is L-Lipschitz.
Primary Area: optimization
Keywords: Sketching Residual error Low-rank approximation sparse recovery
Scores: [ 5 8 8 6 ]
Primary Area: optimization
Keywords: Finite Sum Minimization Variance Reduction Optimization
Scores: [ 6 8 6 5 ]
Given a sequence of functions f1,…,fn with fi:D↦R, finite-sum minimization seeks a point x⋆∈D minimizing ∑nj=1fj(x)/n. In this work, we propose a key twist into the finite-sum minimization, dubbed as instance-optimal finite-sum minimization, that asks for a sequence of points x⋆1,…,x⋆n∈D such that each x⋆i∈D minimizes the prefix-sum ∑ij=1fj(x)/i. Assuming that each prefix-sum is strongly convex, we develop a first-order stochastic instance optimal gradient method SIOPT−Grad producing an ϵ-optimal sequence with ~O(n/ϵ1/3+1/√ϵ) overall first-order oracles (FO). An FO corresponds to the computation of a single gradient ∇fj(x) at a given x∈D for some j∈[n]. Our approach significantly improves upon the O(n/ϵ) FOs that StochasticGradientDescent requires and the O(n2log(1/ϵ)) FOs that state-of-the-art variance reduction methods such as Katyusha require. We also prove that there is no natural first-order method with O(n/ϵα) gradient complexity for α<1/4, establishing that the first-order complexity of our method is nearly tight.
Primary Area: optimization
Keywords: extragradient minimax nonconvex-nonconcave variational inequilities saddle point problem
Scores: [ 8 6 5 5 ]
Primary Area: optimization
Keywords: quadratic models wide neural networks catapult phase optimization dynamics
Scores: [ 5 8 5 8 6 ]
Primary Area: optimization
Keywords: Deep Learning Bilevel Program Binary Tender Enhanced Sampling Input Supermodular Neural Network
Scores: [ 3 6 6 ]
Primary Area: optimization
Keywords: Decentralized optimization Stiefel manifold Conjugate gradient method
Scores: [ 8 6 5 ]
The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than the second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there is little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold. The numerical experiments reveal that the advantages of our DRCGD.
Primary Area: optimization
Keywords: min-max optimization adaptive batch size policy evaluation.
Scores: [ 8 6 8 6 ]
Learning an accurate value function for a given policy is a critical step in solving reinforcement learning (RL) problems. So far, however, the convergence speed and sample complexity performances of most existing policy evaluation algorithms remain unsatisfactory, particularly with non-linear function approximation. This challenge motivates us to develop a new path-integrated primal-dual stochastic gradient (PILOT) method, that is able to achieve a fast convergence speed for RL policy evaluation with nonlinear function approximation. To further alleviate the periodic full gradient evaluation requirement, we further propose an enhanced method with an adaptive-batch adjustment called PILOT$+$. The main advantages of our methods include: i) PILOT allows the use of {\em{constant}} step sizes and achieves the O(1/K) convergence rate to first-order stationary points of non-convex policy evaluation problems; ii) PILOT is a generic {\em{single}}-timescale algorithm that is also applicable for solving a large class of non-convex strongly-concave minimax optimization problems; iii) By adaptively adjusting the batch size via historical stochastic gradient information, PILOT$+$ is more sample-efficient empirically without loss of theoretical convergence rate. Our extensive numerical experiments verify our theoretical findings and showcase the high efficiency of the proposed PILOT and PILOT$^+$ algorithms compared with the state-of-the-art methods.
Primary Area: optimization
Keywords: adversarial training distributionally robust optimization bilevel optimization instance reweighting
Scores: [ 8 6 6 ]
Primary Area: optimization
Keywords: convex optimization nonconvex optimization first order optimization second order optimization deep learning
Scores: [ 6 8 5 6 ]
Primary Area: optimization
Keywords: hybrid RL Sample Augmentation Learning to branch Imitation learning
Scores: [ 6 6 5 ]
Primary Area: optimization
Keywords: matrix completion
Scores: [ 6 6 6 ]
Primary Area: optimization
Keywords: quantization compression large language models NLP machine learning low rank
Scores: [ 8 8 6 ]
Primary Area: optimization
Keywords: Bilevel Optimization Unbounded Smoothness Deep Learning
Scores: [ 6 6 6 8 8 ]
Bilevel optimization is an important formulation for many machine learning problems, such as meta-learning and hyperparameter optimization. Current bilevel optimization algorithms assume that the gradient of the upper-level function is Lipschitz (i.e., the upper-level function has a bounded smoothness parameter). However, recent studies reveal that certain neural networks such as recurrent neural networks (RNNs) and long-short-term memory networks (LSTMs) exhibit potential unbounded smoothness, rendering conventional bilevel optimization algorithms unsuitable for these neural networks. In this paper, we design a new bilevel optimization algorithm, namely BO-REP, to address this challenge. This algorithm updates the upper-level variable using normalized momentum and incorporates two novel techniques for updating the lower-level variable: \textit{initialization refinement} and \textit{periodic updates}. Specifically, once the upper-level variable is initialized, a subroutine is invoked to obtain a refined estimate of the corresponding optimal lower-level variable, and the lower-level variable is updated only after every specific period instead of each iteration. When the upper-level problem is nonconvex and unbounded smooth, and the lower-level problem is strongly convex, we prove that our algorithm requires ˜O(1/ϵ4) \footnote{Here ˜O(⋅) compresses logarithmic factors of 1/ϵ and 1/δ, where δ∈(0,1) denotes the failure probability.} iterations to find an ϵ-stationary point in the stochastic setting, where each iteration involves calling a stochastic gradient or Hessian-vector product oracle. Notably, this result matches the state-of-the-art complexity results under the bounded smoothness setting up to logarithmic factors. Our proof relies on novel technical lemmas for the periodically updated lower-level variable, which are of independent interest. Our experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification tasks demonstrate the effectiveness of our proposed algorithm.
Primary Area: optimization
Keywords: multilayer transformer training dynamics theoretical analysis self-attention interpretability neural network understanding
Scores: [ 5 6 6 6 ]
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of MLP layers only. JoMA removes unrealistic assumptions in previous analysis (e.g., lack of residual connection), and predicts that the attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) in the presence of nonlinear activations, while in the linear case, it is consistent with existing works. We leverage JoMA to qualitatively explains how tokens are combined to form hierarchies in multilayer Transformers, when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained from real-world dataset (Wikitext2/Wikitext103) and various pre- trained models (OPT, Pythia) verify our theoretical findings.
Primary Area: optimization
Keywords: Symmetry optimization generalization
Scores: [ 6 8 8 8 ]
Primary Area: optimization
Keywords: Optimal transport Convex optimization Quasi-Newton methods Non-asymptotic analysis Extremal combinatorics
Scores: [ 8 8 6 6 ]
Primary Area: optimization
Keywords: Deep Learning Theory Hyperparameter Transfer Training Dynamics Residual Networks
Scores: [ 8 6 8 8 ]
Primary Area: optimization
Keywords: optimization differential privacy non-convex non-smooth
Scores: [ 8 5 6 6 ]
We introduce a new zeroth-order algorithm for private stochastic optimization on nonconvex and nonsmooth objectives.Given a dataset of size M, our algorithm ensures (α,αρ2/2)-Renyi differential privacy and finds a (δ,ϵ)-stationary point so long as M=~Ω(dδϵ3+d3/2ρδϵ2).This matches the optimal complexity found in its non-private zeroth-order analog. Notably, although the objective is not smooth, we have privacy for free'' when ρ≥√dϵ.
Primary Area: optimization
Keywords: differential privacy distance estimation sublinear algorithms high dimensional data analysis kernels
Scores: [ 8 8 6 8 ]
Primary Area: optimization
Keywords: Bilevel-Optimization Penalty Methods Landscape Analysis Non-Asymptotic Analysis First-Order Methods
Scores: [ 6 6 6 5 8 ]
Primary Area: optimization
Keywords: error feedback greedy sparsification distributed optimization communication complexity machine cloning weighted error feedback quadratic mean arithmetic mean large stepsizes
Scores: [ 6 8 3 6 ]
Primary Area: optimization
Keywords: optimization theory convergence analysis heavy-ball momentum learning rate schedule near-optimal convergence rate
Scores: [ 8 6 6 5 ]
Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. In contrast to its empirical popularity, the understanding of its theoretical property is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with step decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide ~O(√κ) accelerated convergence of the bias term of SGD while still achieving near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in the large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice.
Primary Area: optimization
Keywords: Sparse Phase Retrieval Nonconvex Optimization Quadratic Convergence
Scores: [ 6 8 6 8 ]
We study the sparse phase retrieval problem, which aims to recover a sparse signal from a limited number of phaseless measurements. Existing algorithms for sparse phase retrieval primarily rely on first-order methods with linear convergence rate. In this paper, we propose an efficient second-order algorithm based on Newton projection, which maintains the same per-iteration computational complexity as popular first-order methods. The proposed algorithm is theoretically guaranteed to converge to the ground truth (up to a global sign) at a quadratic convergence rate after at most O(log(∥x♮∥/x♮min)) iterations, provided a sample complexity of O(s2logn), where x♮∈Rn represents an s-sparse ground truth signal. Numerical experiments demonstrate that our algorithm outperforms state-of-the-art methods in terms of achieving a significantly faster convergence rate.
Primary Area: optimization
Keywords: optimization clipping federated learning decentralized learning distributed optimization
Scores: [ 6 6 8 5 5 ]
Gradient clipping is key mechanism that is essential to differentially private training techniques in Federated learning. Two popular strategies are per-sample clipping, which clips the mini-batch gradient, and per-update clipping, which clips each user's model update. However, there has not been a thorough theoretical analysis of these two clipping methods.In this work, we rigorously analyze the impact of these two clipping techniques on the convergence of a popular federated learning algorithm FedAvg under standard stochastic noise and gradient dissimilarity assumptions. We provide a convergence guarantee given any arbitrary clipping threshold. Specifically, we show that per-sample clipping is guaranteed to converge to the neighborhood of the stationary point, with the size dependent on the stochastic noise, gradient dissimilarity, and clipping threshold. In contrast, the convergence to the stationary point can be guaranteed with a sufficiently small stepsize in per-update clipping at the cost of more communication rounds. We further provide insights into understanding the impact of the improved convergence analysis in the differentially private setting.
Primary Area: optimization
Keywords: Distributed Learning Self-Repellent Random Walk Token Algorithm Central Limit Theorem Asymptotic Analysis
Scores: [ 8 6 8 6 ]
We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random-walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token by one which follows a non-linear Markov chain - namely the Self-Repellent Radom Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar alpha, is less likely to transition to states that were highly visited in the past, thus the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves O(1/alpha) decrease in the asymptotic variance for sampling. We propose the use of a `generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate O(1/alpha2) - the performance benefit of using SRRW thereby amplified in the stochastic optimization context. Empirical results support our theoretical findings.
Primary Area: optimization
Keywords: Large Model Transformer Multi-Level Training Acceleration
Scores: [ 6 6 5 6 ]
The fast growing capabilities of large-scale deep learning models, such as Bert, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model that can be trained for fast convergence and the trained parameters provides high-qualities intermediate solutions for the next level larger network. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g. Bert, GPT) as well as a vision model (e.g. DeiT) prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.
Primary Area: optimization
Keywords: stochastic optimization gradient estimation
Scores: [ 6 6 8 8 ]
While backpropagation (BP) is the mainstream approach for gradient computation in neural network training, its heavy reliance on the chain rule of differentiation constrains the designing flexibility of network architecture and training pipelines. We avoid the recursive computation in BP and develop a unified likelihood ratio (ULR) method for gradient estimation with just one forward propagation. Not only can ULR be extended to train a wide variety of neural network architectures, but the computation flow in BP can also be rearranged by ULR for better device adaptation. Moreover, we propose several variance reduction techniques to further accelerate the training process. Our experiments offer numerical results across diverse aspects, including various neural network training scenarios, computation flow rearrangement, and fine-tuning of pre-trained models. All findings demonstrate that ULR effectively enhances the flexibility of neural network training by permitting localized module training without compromising the global objective and significantly boosts the network robustness.
Primary Area: optimization
Keywords: fairness online knapsack learning-augmented algorithm Pareto-optimality robustness consistency
Scores: [ 6 6 8 ]
Primary Area: optimization
Keywords: Optimization
Scores: [ 8 6 8 ]
Primary Area: optimization
Keywords: Robust Aggregation Distributed Training Failure Augmented Byzantine Resilience
Scores: [ 6 6 6 ]
Primary Area: optimization
Keywords: Minimax optimization Nonconvex-nonconcave optimization Extragradient method Dynamical systems
Scores: [ 6 8 8 6 ]
Primary Area: optimization
Keywords: Inverse problems plug-and-play priors posterior sampling unadjusted Langevin algorithm
Scores: [ 6 6 5 6 ]
Posterior sampling has been shown to be a powerful Bayesian approach for solving imaging inverse problems. The recent plug-and-play unadjusted Langevin algorithm (PnP-ULA) has emerged as a promising method for Monte Carlo sampling and minimum mean squared error (MMSE) estimation by combining physical measurement models with deep-learning priors specified using image denoisers. However, the intricate relationship between the sampling distribution of PnP-ULA and the mismatched data-fidelity and denoiser has not been theoretically analyzed. We address this gap by proposing a posterior-L2 pseudometric and using it to quantify an explicit error bound for PnP-ULA under mismatched posterior distribution. We numerically validate our theory on several inverse problems such as sampling from Gaussian mixture models and image deblurring. Our results suggest that the sensitivity of the sampling distribution of PnP-ULA to a mismatch in the measurement model and the denoiser can be precisely characterized.
Primary Area: optimization
Keywords: Contextual Bandits Regret Bounds Deep Learning Neural Bandits
Scores: [ 8 6 5 5 6 ]
Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption \citep{foster2020beyond,foster2021efficient}. In this work, we investigate the use of neural networks for such online regression and associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a O(√T) regret for online regression with square loss, which via the reduction implies a O(√KT3/4) regret for NeuCBs. Departing from this standard approach, we first show a O(logT) regret for online regression with almost convex losses that satisfy QG (Quadratic Growth) condition, a generalization of the PL (Polyak-\L ojasiewicz) condition, and that have a unique minima. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with unique minima. Based on such a perturbed prediction, we show a O(logT) regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to ~O(√KT) and ~O(√KL∗+K) regret for NeuCB, where L∗ is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are Ω(T) or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms.
Primary Area: optimization
Keywords: Federated learning Nonconvex optimization
Scores: [ 6 8 8 ]
Primary Area: optimization
Keywords: large language models pretraining optimization in deep learning
Scores: [ 8 8 6 8 ]
Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time.
Primary Area: optimization
Keywords: generalization sharpness robustness SAM
Scores: [ 6 6 6 6 ]
Primary Area: optimization
Keywords: momentum SGD dynamics
Scores: [ 5 6 6 5 ]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.
Primary Area: optimization
Keywords: Large-scale MILP Learning for Optimization Lightweight Optimization Framework
Scores: [ 3 5 6 6 ]
Primary Area: optimization
Keywords: Byzantine Robustness Nonconvex Optimization Federated Learning
Scores: [ 6 6 3 ]
Primary Area: optimization
Keywords: dataset distillation dataset condensation
Scores: [ 8 5 6 ]
Primary Area: optimization
Keywords: Tensor Programs mup deep learning optimization optimal hyperparameter transfer
Scores: [ 8 8 5 6 8 ]
Primary Area: optimization
Keywords: Transformer optimization adam clipping heavy-tailed noise directional smoothness
Scores: [ 6 8 6 6 ]
Primary Area: optimization
Keywords: finite-sum problems monotone inclusion operator norm residual stochastic Halpern iteration last iterate convergence variance reduction min-max optimization
Scores: [ 6 6 6 ]
Machine learning approaches relying on such criteria as adversarial robustness or multi-agent settings have raised the need for solving game-theoretic equilibrium problems. Of particular relevance to these applications are methods targeting finite-sum structure, which generically arises in empirical variants of learning problems in these contexts. Further, methods with computable approximation errors are highly desirable, as they provide verifiable exit criteria. Motivated by these applications, we study finite-sum monotone inclusion problems, which model broad classes of equilibrium problems. Our main contributions are variants of the classical Halpern iteration that employ variance reduction to obtain improved complexity guarantees in which n component operators in the finite sum are on average'' either cocoercive or Lipschitz continuous and monotone, with parameter L. The resulting oracle complexity of our methods, which provide guarantees for the last iterate and for a (computable) operator norm residual, is ˜O(n+√nLε−1), which improves upon existing methods by a factor up to √n. This constitutes the first variance reduction-type result for general finite-sum monotone inclusions and for more specific problems such as convex-concave optimization when operator norm residual is the optimality measure. We further argue that, up to poly-logarithmic factors, this complexity is unimprovable in the monotone Lipschitz setting; i.e., the provided result is near-optimal.
Primary Area: optimization
Keywords: Bilevel Optimization Overfitting
Scores: [ 6 8 8 6 ]
Bilevel optimization problems appear in many widely used machine learning tasks. Bilevel optimization models are sensitive to small changes, and bilevel training tasks typically involve limited datasets. Therefore, overfitting is a common challenge in bilevel training tasks. This paper considers the use of dropout to address this problem. We propose a bilevel optimization model that depends on the distribution of dropout masks. We investigate how the dropout rate affects the hypergradient of this model. We propose a dropout bilevel method to solve the dropout bilevel optimization model. Subsequently, we analyze the resulting dropout bilevel method from an optimization perspective. Analyzing the optimization properties of methods with dropout is essential because it provides convergence guarantees for methods using dropout. However, there has been limited investigation in this research direction. We provide the complexity of the resulting dropout bilevel method in terms of reaching an ϵ stationary point of the proposed stochastic bilevel model. Empirically, we demonstrate that overfitting occurs in data cleaning problems, and the method proposed in this work mitigates this issue.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Normalizing Flows Gradient Estimators Lattice Field Theory Variational Infernce
Scores: [ 8 6 8 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Generative Flow Network (GFlowNets) Pre-train Goal-conditioned
Scores: [ 8 6 8 6 ]
Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects from a given unnormalized reward distribution.They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, as they are typically trained from a given extrinsic reward function, it remains an important open challenge about how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks.Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. By framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcomes, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model can allow for a direct extraction of a policy capable of sampling from any new reward functions in downstream tasks.Nonetheless, adapting OC-GFN on a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor enabling efficient fine-tuning.Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN, and its ability to swiftly adapt to downstream tasks and discover modes more efficiently.This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Langevin diffusion log-concave distribution kernel smoothing
Scores: [ 8 6 8 6 ]
We introduce a theoretical framework for sampling from unnormalized densities based on a smoothing scheme that uses an isotropic Gaussian kernel with a single fixed noise scale. We prove one can decompose sampling from a density (minimal assumptions made on the density) into a sequence of sampling from log-concave conditional densities via accumulation of noisy measurements with equal noise levels. Our construction is unique in that it keeps track of a history of samples, making it non-Markovian as a whole, but it is lightweight algorithmically as the history only shows up in the form of a running empirical mean of samples. Our sampling algorithm generalizes walk-jump sampling (Saremi & Hyvärinen, 2019). The "walk" phase becomes a (non-Markovian) chain of (log-concave) Markov chains. The "jump" from the accumulated measurements is obtained by empirical Bayes. We study our sampling algorithm quantitatively using the 2-Wasserstein metric and compare it with various Langevin MCMC algorithms. We also report a remarkable capacity of our algorithm to "tunnel" between modes of a distribution.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: time series forecasting probabilistic multivariate copula transformer density estimation
Scores: [ 8 5 6 5 ]
We introduce a new model for multivariate probabilistic time series prediction, designed to flexibly address a range of tasks including forecasting, interpolation, and their combinations. Building on copula theory, we propose a simplified objective for the recently-introduced transformer-based attentional copulas (TACTiS), wherein the number of distributional parameters now scales linearly with the number of variables instead of factorially. The new objective requires the introduction of a training curriculum, which goes hand-in-hand with necessary changes to the original architecture. We show that the resulting model has significantly better training dynamics and achieves state-of-the-art performance across diverse real-world forecasting tasks, while maintaining the flexibility of prior work, such as seamless handling of unaligned and unevenly-sampled time series.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: tractable inference distribution estimation probabilistic circuits tensor networks
Scores: [ 8 8 6 8 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Transfer Learning Nonparametric Learning Bayesian Neural Network
Scores: [ 6 6 6 6 ]
Transfer learning has recently shown significant performance across various tasks involving deep neural networks. In these transfer learning scenarios, the prior distribution for downstream data becomes crucial in Bayesian model averaging (BMA). While previous works proposed the prior over the neural network parameters centered around the pre-trained solution, such strategies have limitations when dealing with distribution shifts between upstream and downstream data. This paper introduces nonparametric transfer learning (NPTL), a flexible posterior sampling method to address the distribution shift issue within the context of nonparametric learning. The nonparametric learning (NPL) method is a recent approach that employs a nonparametric prior for posterior sampling, efficiently accounting for model misspecification scenarios, which is suitable for transfer learning scenarios that may involve the distribution shift between upstream and downstream tasks. Through extensive empirical validations, we demonstrate that our approach surpasses other baselines in BMA performance.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Logconcave sampling Dikin walk Markov chain Monte Carlo interior point methods
Scores: [ 3 8 8 8 ]
We consider the problem of sampling from a logconcave distribution π(θ)∝e−f(θ) constrained to a polytope K:={θ∈Rd:Aθ≤b}, where A∈Rm×d and b∈Rm. The fastest-known algorithm for the setting when f is O(1)-Lipschitz or O(1)-smooth runs in roughly O(md×mdω−1) arithmetic operations, where the mdω−1 term arises because each Markov chain step requires computing a matrix inversion and determinant (ω≈2.37 is the matrix multiplication constant). We present a nearly-optimal implementation of this Markov chain with per-step complexity that is roughly the number of non-zero entries of A while the number of Markov chain steps remains the same. The key technical ingredients are 1) to show that the matrices that arise in this Dikin walk change slowly, 2) to deploy efficient linear solvers which can leverage this slow change to speed up matrix inversion by using information computed in previous steps, and 3) to speed up the computation of the determinantal term in the Metropolis filter step via a randomized Taylor series-based estimator. This result directly improves the runtime for applications that involve sampling from Gibbs distributions constrained to polytopes that arise in Bayesian statistics and private optimization.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: generative flow networks GFlowNets protein design game theory self-play adversarial learning stochastic environments quantal response equilibrium Luce agents sequential decision making
Scores: [ 6 3 8 5 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Stochastic Methods
Scores: [ 6 8 6 ]
Integrals with discontinuous integrands are ubiquitous, arising from discrete structure in applications like topology optimization, graphics, and computational geometry. These integrals are often part of a forward model in an inverse problem where it is necessary to reason backwards about the parameters, ideally using gradient-based optimization. Monte Carlo methods are widely used to estimate the value of integrals, but this results in a non-differentiable approximation that is amenable to neither conventional automatic differentiation nor reparameterization-based gradient methods. This significantly disrupts efforts to integrate machine learning methods in areas that exhibit these discontinuities: physical simulation and robotics, design, graphics, and computational geometry. Although bespoke domain-specific techniques can handle special cases, a general methodology to wield automatic differentiation in these discrete contexts is wanting. We introduce a differentiable variant of the simple Monte Carlo estimator which samples line segments rather than points from the domain. We justify our estimator analytically as conditional Monte Carlo and demonstrate the diverse functionality of the method as applied to image stylization, topology optimization, and computational geometry.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Conformal Prediction Probability Function Regression Neural Networks
Scores: [ 3 8 8 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: variational inference neural sdes stochastic differential equations brownian motion fractional noise fractional brownian motion markov approximation markov representation
Scores: [ 8 8 5 8 ]
We present a novel variational framework for performing inference in (neural) stochastic differential equations (SDEs) driven by Markov-approximate fractional Brownian motion (fBM). SDEs offer a versatile tool for modeling real-world continuous-time dynamic systems with inherent noise and randomness. Combining SDEs with the powerful inference capabilities of variational methods, enables the learning of representative function distributions through stochastic gradient descent. However, conventional SDEs typically assume the underlying noise to follow a Brownian motion (BM), which hinders their ability to capture long-term dependencies. In contrast, fractional Brownian motion (fBM) extends BM to encompass non-Markovian dynamics, but existing methods for inferring fBM parameters are either computationally demanding or statistically inefficient. In this paper, building upon the Markov approximation of fBM, we derive the evidence lower bound essential for efficient variational inference of posterior path measures, drawing from the well-established field of stochastic analysis. Additionally, we provide a closed-form expression to determine optimal approximation coefficients. Furthermore, we propose the use of neural networks to learn the drift, diffusion and control terms within our variational posterior, leading to the variational training of neural-SDEs. In this framework, we also optimize the Hurst index, governing the nature of our fractional noise. Beyond validation on synthetic data, we contribute a novel architecture for variational latent video prediction,—an approach that, to the best of our knowledge, enables the first variational neural-SDE application to video perception.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: conformal prediction uncertainty estimation language models generative models confidence prediction sets sampling
Scores: [ 8 8 3 6 ]
In this paper, we propose a novel approach to conformal prediction for language models (LMs) in which we produce prediction sets with performance guarantees. LM responses are typically sampled from a predicted distribution over the large, combinatorial output space of language. Translating this to conformal prediction, we calibrate a stopping rule for sampling LM outputs that get added to a growing set of candidates until we are confident that the set covers at least one acceptable response. Since some samples may be low-quality, we also simultaneously calibrate a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we can prove that the final output set obeys certain desirable distribution-free guarantees. Within these sets of candidate responses, we also show that we can also identify subsets of individual components---such as phrases or sentences---that are each independently correct (e.g., that are not hallucinations''), again with guarantees. Our method can be applied to any LM API that supports sampling. Furthermore, we empirically demonstrate that we can achieve many desired coverage levels within a limited number of total samples when applying our method to multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Periodicity Identification Probabilistic Models Amortized Inference Attention Time Series Analysis
Scores: [ 6 6 6 5 ]
Periodic patterns are a fundamental characteristic of time series in natural world, with significant implications for a range of disciplines, from economics to cloud systems. However, the current literature on periodicity detection faces two key challenges: limited robustness in real-world scenarios and a lack of memory to leverage previously observed time series to accelerate and improve inference on new data. To overcome these obstacles, this paper presents AmortizedPeriod, an innovative approach to periodicity identification based on amortized variational inference that integrates Bayesian statistics and deep learning. Through the Bayesian generative process, our method flexibly captures the dependencies of the periods, trends, noise, and outliers in time series, while also considering missing data and irregular periods in a robust manner. In addition, it utilizes the evidence lower bound of the log-likelihood of the observed time series as the loss function to train a deep attention inference network, facilitating knowledge transfer from the seen time series (and their labels) to unseen ones. Experimental results show that AmortizedPeriod surpasses the state-of-the-art methods by a large margin of 28.5% on average in terms of micro F1-score, with at least 55% less inference time.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Conformal Prediction time series uncertainty quantification calibration RNN
Scores: [ 5 8 6 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian optimization Gaussian processes
Scores: [ 8 5 5 6 ]
Bayesian optimization under contextual uncertainty (BOCU) is a family of BO problems in which the learner makes a decision prior to observing the context and must manage the risks involved. Distributionally robust BO (DRBO) is a subset of BOCU that affords robustness against context distribution shift, and includes the optimization of expected values and worst-case values as special cases. By considering the first derivatives of the DRBO objective, we generalize DRBO to one that includes several other uncertainty objectives studied in the BOCU literature such as worst-case sensitivity (and thus notions of risk such as variance, range, and conditional value-at-risk) and mean-risk tradeoffs. We develop a general Thompson sampling algorithm that is able to optimize any objective within the BOCU framework, analyze its theoretical properties, and compare it to suitable baselines across different experimental settings and uncertainty objectives.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: MCMC Bayesian Deep Learning Flatness-aware Learning
Scores: [ 5 6 6 8 6 ]
Bayesian deep learning counts on the quality of posterior distribution estimation. However, the posterior of deep neural networks is highly multi-modal in nature, with local modes exhibiting varying generalization performances. Given a practical budget, sampling from the original posterior can lead to suboptimal performances, as some samples may become trapped in "bad" modes and suffer from overfitting. Leveraging the observation that "good" modes with low generalization error often reside in flat basins of the energy landscape, we propose to bias the sampling on the posterior toward these flat regions. Specifically, we introduce an auxiliary guiding variable, the stationary distribution of which resembles a smoothed posterior free from sharp modes, to lead the MCMC sampler to flat basins. We prove the convergence of our method and further show that it converges faster than several existing flatness-aware methods in the strongly convex setting. Empirical results demonstrate that our method can successfully sample from flat basins of the posterior, and outperforms all compared baselines on multiple benchmarks including classification, calibration and out-of-distribution detection.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Uncertainty estimation binary classification imbalanced classification score recalibration uncertainty based decision making classification decision boundary bin packing estimation bias posterior networks
Scores: [ 3 6 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: deep ensembles diversity input gradient robustness covariate shift particle variational inference
Scores: [ 6 8 6 8 ]
Deep Ensembles (DEs) demonstrate improved accuracy, calibration and robustness to perturbations over single neural networks partly due to their functional diversity. Particle-based variational inference (ParVI) methods enhance diversity by formalizing a repulsion term based on a network similarity kernel. However, weight-space repulsion is inefficient due to over-parameterization, while direct function-space repulsion has been found to produce little improvement over DEs. To sidestep these difficulties, we propose First-order Repulsive Deep Ensemble (FoRDE), an ensemble learning method based on ParVI, which performs repulsion in the space of first-order input gradients. As input gradients uniquely characterize a function up to translation and are much smaller in dimension than the weights, this method guarantees that ensemble members are functionally different. Intuitively, diversifying the input gradients encourages each network to learn different features, which is expected to improve the robustness of an ensemble. Experiments on image classification datasets and transfer learning tasks show that FoRDE significantly outperforms the gold-standard DEs and other ensemble methods in accuracy and calibration under covariate shift due to input perturbations.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Monte Carlo Denoising Diffusion model score-based generative models Sequential Monte Carlo Bayesian Inverse Problems Generative Models.
Scores: [ 6 10 8 10 ]
Ill-posed linear inverse problems arise frequently in various applications, from computational photography to medical imaging.A recent line of research exploits Bayesian inference with informative priors to handle the ill-posedness of such problems.Amongst such priors, score-based generative models (SGM) have recently been successfully applied to several different inverse problems.In this study, we exploit the particular structure of the prior defined by the SGM to define a sequence of intermediate linear inverse problems. As the noise level decreases, the posteriors of these inverse problems get closer to the target posterior of the original inverse problem. To sample from this sequence of posteriors, we propose the use of Sequential Monte Carlo (SMC) methods.The proposed algorithm, \algo, is shown to be theoretically grounded and we provide numerical simulations showing that it outperforms competing baselines when dealing with ill-posed inverse problems in a Bayesian setting.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian models Meta learning Few-shot learning
Scores: [ 6 6 8 6 6 8 ]
We propose a novel hierarchical Bayesian model for the few-shot meta learning problem. We consider episode-wise random variables to model episode-specific generative processes, where these local random variables are governed by a higher-level global random variable. The global variable captures information shared across episodes, while controlling how much the model needs to be adapted to new episodes in a principled Bayesian manner. Within our framework, prediction on a novel episode/task can be seen as a Bayesian inference problem. For tractable training, we need to be able to relate each local episode-specific solution to the global higher-level parameters. We propose a Normal-Inverse-Wishart model, for which establishing this local-global relationship becomes feasible due to the approximate closed-form solutions for the local posterior distributions. The resulting algorithm is more attractive than the MAML in that it does not maintain a costly computational graph for the sequence of gradient descent steps in an episode. Our approach is also different from existing Bayesian meta learning methods in that rather than modeling a single random variable for all episodes, it leverages a hierarchical structure that exploits the local-global relationships desirable for principled Bayesian learning with many related tasks.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Diffusion based models Anomaly detection Probabilistic Inference
Scores: [ 6 8 6 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Posterior Sampling Multi-modal sampling
Scores: [ 8 6 8 6 ]
The efficacy of modern generative models is commonly contingent upon the precision of score estimation along the diffusion path, with a focus on diffusion models and their ability to generate high-quality data samples. This study delves into the application of reverse diffusion to Monte Carlo sampling. It is shown that score estimation can be transformed into a mean estimation problem via the decomposition of the transition kernel. By estimating the mean of the posterior distribution, we derive a novel Monte Carlo sampling algorithm from the reverse diffusion process, which is distinct from traditional Markov Chain Monte Carlo (MCMC) methods. We calculate the error requirements and sample size for the posterior distribution, and use the result to derive an algorithm that can approximate the target distribution to any desired accuracy. Additionally, by estimating the log-Sobolev constant of the posterior distribution, we show under suitable conditions the problem of sampling from the posterior can be easier than direct sampling from the target distribution using traditional MCMC techniques. For Gaussian mixture models, we demonstrate that the new algorithm achieves significant improvement over the traditional Langevin-style MCMC sampling methods both theoretically and practically. Our algorithm offers a new perspective and solution beyond classical MCMC algorithms for challenging complex distributions.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: conformal prediction conformal risk control non-exchangeable data uncertainty
Scores: [ 6 5 5 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Machine Learning Maximum Likelihood Density Estimation Statistics Kernels
Scores: [ 8 6 5 8 ]
Normalising Flows are non-parametric statistical models known for their dual capabilities of density estimation and generation. They are distinguished by their inherently invertible architecture. However, the requirement of invertibility imposes constraints on their expressiveness, necessitating a large number of parameters and innovative architectural designs to achieve satisfactory outcomes. Whilst flow-based models predominantly rely on neural-network-based transformations for expressive designs, alternative transformation methods have received limited attention. In this work, we present Ferumal flow, a novel kernelised normalising flow paradigm that integrates kernels into the framework. Our results demonstrate that a kernelised flow can yield competitive or superior results compared to neural network-based flows whilst maintaining parameter efficiency. Kernelised flows excel especially in the low-data regime, enabling flexible non-parametric density estimation in applications with sparse data availability.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Markov Chain Monte Carlo Kinetic Langevin Langevin algorithm Midpoint method Mixing rate
Scores: [ 6 5 8 6 ]
We revisit the problem of sampling from a target distribution that has a smooth strongly log-concave density everywhere in Rp. In this context, if no additional density information is available, the randomized midpoint discretization for the kinetic Langevin diffusion is known to be the most scalable method in high dimensions with large condition numbers. Our main result is a nonasymptotic and easy to compute upper bound on the W2-error of this method. To provide a more thorough explanation of our method for establishing the computable upper bound, we conduct an analysis of the midpoint discretization for the vanilla Langevin process. This analysis helps to clarify the underlying principles and provides valuable insights that we use to establish an improved upper bound for the kinetic Langevin process with the midpoint discretization. Furthermore, by applying these techniques we establish new guarantees for the kinetic Langevin process with Euler discretization, which have a better dependence on the condition number than existing upper bounds
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Schrödinger bridge sampling from densities stochastic optimal control diffusion-based generative modeling
Scores: [ 6 6 6 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: efficient training algorithms stochastic gradient descent importance sampling variance reduction
Scores: [ 6 6 6 8 ]
Sampling-based algorithms, which eliminate "unimportant" computations during forward and/or backpropagation (BP), offer potential solutions to accelerate neural network training. However, since sampling introduces approximations to training, such algorithms may not consistently maintain accuracy across various tasks. In this work, we introduce a variance-controlled adaptive sampling (VCAS) method designed to minimize the computational load of BP. VCAS computes an unbiased stochastic gradient with fine-grained layerwise importance sampling in data dimension for activation gradient calculation and leverage score sampling in token dimension for weight gradient calculation. To preserve accuracy, we control the additional variance introduced by learning the sample ratio jointly with model parameters during training. We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy with an up to 73.87% FLOPs reduction of BP and 49.58% FLOPs reduction of the whole training process.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Gaussian process stochastic gradient descent
Scores: [ 8 6 8 5 5 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: softmax bottleneck truncation sampling nucleus top-k theory linear programming linear algebra decoding generation autoregressive language model NLP open-ended generation
Scores: [ 8 8 6 8 ]
Despite their ubiquity in language generation, it remains unknown why truncation sampling heuristics like nucleus sampling are so effective. We provide a theoretical explanation for the effectiveness of the truncation sampling by proving that truncation methods that discard tokens below some probability threshold (the most common type of truncation) can guarantee that all sampled tokens have nonzero true probability. However, thresholds are a coarse heuristic, and necessarily discard some tokens with nonzero true probability as well. In pursuit of a more precise sampling strategy, we show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability, without relying on a threshold. Based on our findings, we develop an experimental truncation strategy and the present pilot studies demonstrating the promise of this type of algorithm. Our evaluations show that our method outperforms its threshold-based counterparts under automatic and human evaluation metrics for low-entropy (i.e., close to greedy) open-ended text generation. Our theoretical findings and pilot experiments provide both insight into why truncation sampling works, and make progress toward more expressive sampling algorithms that better surface the generative capabilities of large language models.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: propagating distributions uncertainty stable distributions input uncertainty
Scores: [ 8 6 6 8 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian filtering optimization neural networks
Scores: [ 8 6 6 ]
Bayesian filtering approximates the true underlying behavior of a time-varying system by inverting an explicit generative model to convert noisy measurements into state estimates. This process typically requires matrix storage, inversion, and multiplication or Monte Carlo estimation, none of which are practical in high-dimensional state spaces such as the weight spaces of artificial neural networks. Here, we consider the standard Bayesian filtering problem as optimization over a time-varying objective. Instead of maintaining matrices for the filtering equations or simulating particles, we specify an optimizer that defines the Bayesian filter implicitly. In the linear-Gaussian setting, we show that every Kalman filter has an equivalent formulation using K steps of gradient descent. In the nonlinear setting, our experiments demonstrate that our framework results in filters that are effective, robust, and scalable to high-dimensional systems, comparing well against the standard toolbox of Bayesian filtering solutions. We suggest that it is easier to fine-tune an optimizer than it is to specify the correct filtering equations, making our framework an attractive option for high-dimensional filtering problems.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: graph compression entropy coding neural compression bits-back coding lossless compression generative models information theory probabilistic models graph neural networks multiset compression asymmetric numeral systems compression entropy shuffle coding
Scores: [ 8 6 6 5 ]
We present shuffle coding, a general method for optimal compression of sequences of unordered objects using bits-back coding. Data structures that can be compressed using shuffle coding include multisets, graphs, hypergraphs, and others. We release an implementation that can easily be adapted to different data types and statistical models, and demonstrate that our implementation achieves state-of-the-art compression rates on a range of graph datasets including molecular data.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Gaussian Denoising Posterior Moments Estimation Uncertainty Quantification Uncertainty Visualization
Scores: [ 6 3 6 8 ]
Denoisers play a central role in many applications, from noise suppression in low-grade imaging sensors, to empowering score-based generative models. The latter category of methods makes use of Tweedie's formula, which links the posterior mean in Gaussian denoising (i.e., the minimum MSE denoiser) with the score of the data distribution. Here, we derive a fundamental relation between the higher-order central moments of the posterior distribution, and the higher-order derivatives of the posterior mean. We harness this result for uncertainty quantification of pre-trained denoisers. Particularly, we show how to efficiently compute the principal components of the posterior distribution for any desired region of an image, as well as to approximate the full marginal distribution along those (or any other) one-dimensional directions. Our method is fast and memory efficient, as it does not explicitly compute or store the high-order moment tensors and it requires no training or fine tuning of the denoiser. Anonymized code is available here.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian model Tensor decomposition Gaussian Processes
Scores: [ 5 6 8 5 ]
Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there were finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, many real-world data are not naturally posed in the setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose FunBaT: Functional Bayesian Tucker Decomposition. We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions, and then convert the GPs into state-space priors by constructing equivalent stochastic differential equations (SDE) to reduce computational cost. An efficient inference algorithm is further developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian inverse Problems MMD Gradient Flows Deep Learning
Scores: [ 6 5 8 5 ]
We propose conditional flows of the maximum mean discrepancy (MMD) with the negative distance kernel for posterior sampling and conditional generative modelling. This MMD, which is also known as energy distance, has several advantageous properties like efficient computation via slicing and sorting. We approximate the joint distribution of the ground truth and the observations using discrete Wasserstein gradient flows and establish an error bound for the posterior distributions. Further, we prove that our particle flow is indeed a Wasserstein gradient flow of an appropriate functional. The power of our method is demonstrated by numerical examples including conditional image generation and inverse problems like superresolution, inpainting and computed tomography in low-dose and limited-angle settings.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Gaussian processes neuroscience vector field tangent bundle connection Laplacian EEG Alzheimer's disease
Scores: [ 6 8 8 ]
Gaussian processes (GPs) are popular nonparametric statistical models for learning unknown functions and quantifying the spatiotemporal uncertainty in data. Recent works have extended GPs to model scalar and vector quantities distributed over non-Euclidean domains, including smooth manifolds, appearing in numerous fields such as computer vision, dynamical systems, and neuroscience. However, these approaches assume that the manifold underlying the data is known, limiting their practical utility. We introduce RVGP, a generalisation of GPs for learning vector signals over latent Riemannian manifolds. Our method uses positional encoding with eigenfunctions of the connection Laplacian, associated with the tangent bundle, readily derived from common graph-based approximation of data. We demonstrate that RVGP possesses global regularity over the manifold, which allows it to super-resolve and inpaint vector fields while preserving singularities. Furthermore, we use RVGP to reconstruct high-density neural dynamics derived from low-density EEG recordings in healthy individuals and Alzheimer's patients. We show that vector field singularities are important disease markers and that their reconstruction leads to a comparable classification accuracy of disease states to high-density recordings. Thus, our method overcomes a significant practical limitation in experimental and clinical applications.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: sliced maximum mean discrepancy energy distance gradient flows Riesz kernels generative modelling
Scores: [ 6 8 5 5 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: probabilistic sampling multi-objective optimization GFlowNet
Scores: [ 6 6 8 6 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian neural networks sparse Bayesian learning variational inference
Scores: [ 6 8 6 ]
Bayesian neural networks (BNNs) offer uncertainty quantification but come with the downside of substantially increased training and inference costs. Sparse BNNs have been investigated for efficient inference, typically by either slowly introducing sparsity throughout the training or by post-training compression of dense BNNs. The dilemma of how to cut down massive training costs remains, particularly given the requirement to learn about the uncertainty. To solve this challenge, we introduce Sparse Subspace Variational Inference (SSVI), the first fully sparse BNN framework that maintains a consistently sparse Bayesian model throughout the training and inference phases. Starting from a randomly initialized low-dimensional sparse subspace, our approach alternately optimizes the sparse subspace basis selection and its associated parameters. While basis selection is characterized as a non-differentiable problem, we approximate the optimal solution with a removal-and-addition strategy, guided by novel criteria based on weight distribution statistics. Our extensive experiments show that SSVI sets new benchmarks in crafting sparse BNNs, achieving, for instance, a 10-20× compression in model size with under 3% performance drop, and up to 20× FLOPs reduction during training. Remarkably, SSVI also demonstrates enhanced robustness to hyperparameters, reducing the need for intricate tuning in VI and occasionally even surpassing VI-trained dense BNNs.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: bayesian deep learning variational methods bayesian last layers neural linear models
Scores: [ 8 10 6 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: uncertainty quantification evidential deep learning subjective logic single-forward-pass uncertainty method
Scores: [ 6 6 8 8 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian Optimization Gaussian Processes Bayesian Neural Networks
Scores: [ 6 8 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Variational autoencoder Information geometry Heavy-tail learning Generative model
Scores: [ 8 6 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: SDEs Diffusion Models Optimal Transport Annealed Importance Sampling Schroedinger Bridges Variational Inference
Scores: [ 6 8 6 8 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: large language models LLMs Bayesian inference chain-of-thought reasoning latent variable models generative flow networks GFlowNets
Scores: [ 8 8 10 5 ]
Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest---including sequence continuation, infilling, and other forms of constrained generation---involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: dynamic feature selection adaptive feature selection mutual information information theory
Scores: [ 8 6 8 ]
Dynamic feature selection, where we sequentially query features to make accurate predictions with a minimal budget, is a promising paradigm to reduce feature acquisition costs and provide transparency into the prediction process. The problem is challenging, however, as it requires both making predictions with arbitrary feature sets and learning a policy to identify the most valuable selections. Here, we take an information-theoretic perspective and prioritize features based on their mutual information with the response variable. The main challenge is implementing this policy, and we design a new approach that estimates the mutual information in a discriminative rather than a generative fashion. Building on our learning approach, we introduce several further improvements: allowing variable feature budgets across samples, enabling non-uniform costs between features, incorporating prior information, and exploring modern architectures to handle partial input information. We find that our method provides consistent gains over recent state-of-the-art methods across a variety of datasets.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: probabilistic inference sampling stochastic optimal control gflownets
Scores: [ 8 8 8 6 ]
We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment issues due to use of entire trajectories and a weak learning signal.In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional flow function''.Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals and benefit from off-policy exploration capabilities.Through a variety of challenging experiments, we demonstrate that DGFS results in more accurate estimates of the normalization constant than closely-related prior methods.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian Optimization Hyperparameter Optimization Gaussian Processes
Scores: [ 6 8 6 8 ]
The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO's ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Probabilistic Methods Tractable Probabilistic Models Diffusion Models
Scores: [ 5 6 6 5 ]
Diffusion models are the current state of the art for generating photorealistic images. Controlling the sampling process for constrained image generation tasks such as inpainting, however, remains challenging since exact conditioning on such constraints is intractable. While existing methods use various techniques to approximate the constrained posterior, this paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior, and to leverage this signal to steer the denoising process of diffusion models. Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs). Building upon prior advances, we further scale up PCs and make them capable of guiding the image generation process of diffusion models. Empirical results suggest that our approach can consistently improve the overall quality and semantic coherence of inpainted images across three natural image datasets (i.e., CelebA-HQ, ImageNet, and LSUN) with only ~10% additional computational overhead brought by the TPM. Further, with the help of an image encoder and decoder, our method can readily accept semantic constraints on specific regions of the image, which opens up the potential for more controlled image generation tasks. In addition to proposing a new framework for constrained image generation, this paper highlights the benefit of more tractable models and motivates the development of expressive TPMs.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Deep Ensemble Diffusion Schr"odinger Bridge Bayesian
Scores: [ 6 8 6 8 5 ]
Deep Ensemble approach is a straightforward technique used to enhance the performance of deep neural networks by training them from different initial points, converging towards various local optima. However, a limitation of this methodology lies in its high computational overhead for inference, arising from the necessity to store numerous learned parameters and execute individual forward passes for each parameter during the inference stage.We propose a novel approach called Diffusion Bridge Network to address this challenge. Based on the theory of Schr"odinger bridge, this method directly learns to simulate an Stochastic Differential Equation (SDE) that connects the output distribution of a single ensemble member to the output distribution of the ensembled model, allowing us to obtain ensemble prediction without having to invoke forward pass through all the ensemble models. By substituting the heavy ensembles with this lightweight neural network constructing DBN, we achieved inference with reduced computational cost while maintaining accuracy and uncertainty scores on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImageNet.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: amortized inference variational inference graphical models Markov random fields generative flow networks GFlowNets
Scores: [ 8 6 6 ]
We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call Δ-amortized inference (Δ-AI). Our approach is based on the observation that when the sampling of variables in a PGM is seen as a sequence of actions taken by an agent, sparsity of the PGM enables local credit assignment in the agent's policy learning objective. This yields a local constraint that can be turned into a local loss in the style of generative flow networks (GFlowNets) that enables off-policy training but avoids the need to instantiate all the random variables for each parameter update, thus speeding up training considerably. The Δ-AI objective matches the conditional distribution of a variable given its Markov blanket in a tractable learned sampler, which has the structure of a Bayesian network, with the same conditional distribution under the target PGM. As such, the trained sampler recovers marginals and conditional distributions of interest and enables inference of partial subsets of variables. We illustrate Δ-AI's effectiveness for sampling from synthetic PGMs and training latent variable models with sparse factor structure.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Importance sampling χ2 divergency latent variable models
Scores: [ 8 8 6 5 ]
Maximizing the log-likelihood is a crucial aspect of learning latent variable models, and variational inference (VI) stands as the commonly adopted method. However, VI can encounter challenges in achieving a high log-likelihood when dealing with complicated posterior distributions. In response to this limitation, we introduce a novel variational importance sampling (VIS) approach that directly estimates and maximizes the log-likelihood. VIS leverages the optimal proposal distribution, achieved by minimizing the χ2 divergence, to enhance log-likelihood estimation. We apply VIS to various popular latent variable models, including mixture models, variational auto-encoders, and partially observable generalized linear models. Results demonstrate that our approach consistently outperforms state-of-the-art baselines, both in terms of log-likelihood and model parameter estimation.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Large language models Bayesian deep learning Laplace approximation uncertainty calibration
Scores: [ 8 6 6 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: risk measure value-at-risk conditional value-at-risk
Scores: [ 8 6 6 ]
Research on optimizing the risk measure of a blackbox function using Gaussian processes, especially Bayesian optimization (BO) of risk measures, has become increasingly important due to the inevitable presence of uncontrollable variables in real-world applications. Nevertheless, existing works on BO of risk measures start the optimization from scratch for every new task without considering the results of previous tasks. In contrast, its vanilla BO counterpart has received a thorough investigation on utilizing previous tasks to speed up the current task through the body of works on meta-BO which, however, have not considered risk measures. To bridge this gap, this paper presents the first algorithm for meta-BO of risk measures (i.e., value-at-risk (VaR) and the conditional VaR) by introducing a novel adjustment to the upper confidence bound acquisition function. Our proposed algorithm exhibits two desirable properties: (i) invariance to scaling and vertical shifting of the blackbox function and (ii) robustness to previous harmful tasks. We provide a theoretical performance guarantee for our algorithm and empirically demonstrate its performance using several synthetic function benchmarks and real-world objective functions.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian optimization black box constraint decoupled query
Scores: [ 6 8 5 8 ]
Though some research efforts have been dedicated to constrained Bayesian optimization (BO), there remains a notable absence of a principled approach with a theoretical performance guarantee in the decoupled setting. Such a setting involves independent evaluations of the objective function and constraints at different inputs, and is hence a relaxation of the commonly-studied coupled setting where functions must be evaluated together. As a result, the decoupled setting requires an adaptive selection between evaluating either the objective function or a constraint, in addition to selecting an input (in the coupled setting). This paper presents a novel constrained BO algorithm with a provable performance guarantee that can address the above relaxed setting. Specifically, it considers the fundamental trade-off between exploration and exploitation in constrained BO, and, interestingly, affords a noteworthy connection to active learning. The performance of our proposed algorithms is also empirically evaluated using several synthetic and real-world optimization problems.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Marked Temporal Point Process Neural ODE
Scores: [ 8 5 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Variational Inference
Scores: [ 6 5 8 5 ]
Integrating first-order logic constraints (FOLCs) with neural networks is a crucial but challenging problem since it involves modeling intricate correlations to satisfy the constraints. This paper proposes a novel neural layer, LogicMP, which performs mean-field variational inference over an MLN. It can be plugged into any off-the-shelf neural network to encode FOLCs while retaining modularity and efficiency.By exploiting the structure and symmetries in MLNs, we theoretically demonstrate that our well-designed, efficient mean-field iterations greatly mitigate the difficulty of MLN inference, reducing the inference from sequential calculation to a series of parallel tensor operations.Empirical results in three kinds of tasks over images, graphs, and text show that LogicMP outperforms advanced competitors in both performance and efficiency.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian optimization Gaussian Cox process
Scores: [ 8 8 6 ]
Bayesian optimization (BO) has established itself as a leading strategy for efficiently optimizing expensive-to-evaluate functions. Existing BO methods mostly rely on Gaussian process (GP) surrogate models and are not applicable to (doubly-stochastic) Gaussian Cox processes, where the observation process is modulated by a latent intensity function modeled as a GP. In this paper, we propose a novel maximum a posteriori inference of Gaussian Cox processes. It leverages the Laplace approximation and change of kernel technique to transform the problem into a new reproducing kernel Hilbert space, where it becomes more tractable computationally. It enables us to obtain both a functional posterior of the latent intensity function and the covariance of the posterior, thus extending existing works that often focus on specific link functions or estimating the posterior mean. Using the result, we propose a BO framework based on the Gaussian Cox process model and further develop a Nyström approximation for efficient computation. Extensive evaluations on various synthetic and real-world datasets demonstrate significant improvement over state-of-the-art inference solutions for Gaussian Cox processes, as well as effective BO with a wide range of acquisition functions designed through the underlying Gaussian Cox process model.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Uncertainty Quantification Graph Hyperspectral Image Classification
Scores: [ 5 6 6 5 6 ]
Hyperspectral imaging (HSI) technology captures spectral information across a broad wavelength range, providing richer pixel features compared to traditional color images with only three channels. Although pixel classification in HSI has been extensively studied, especially using graph convolution neural networks (GCNs), quantifying epistemic and aleatoric uncertainties associated with the HSI classification (HSIC) results remains an unexplored area. These two uncertainties are effective for out-of-distribution (OOD) and misclassification detection, respectively. In this paper, we adapt two advanced uncertainty quantification models, evidential GCNs (EGCN) and graph posterior networks (GPN), designed for node classifications in graphs, into the realm of HSIC. We first analyze theoretically the limitations of a popular uncertainty cross-entropy (UCE) loss function when learning EGCNs for epistemic uncertainty estimation. To mitigate the limitations, we propose two regularization terms. One leverages the inherent property of HSI data where pixel features can be decomposed into weighted sums of various material features, and the other is the total variation (TV) regularization to enforce the spatial smoothness of the evidence with edge-preserving. We demonstrate the effectiveness of the proposed regularization terms on both EGCN and GPN on three real-world HSIC datasets for OOD and misclassification detection tasks. The code is available at \url{https://anonymous.4open.science/r/HSI_torch-1586/}
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: adversarial adaptive sampling optimal transport neural network approximation of PDEs
Scores: [ 8 10 6 5 ]
Solving partial differential equations (PDEs) is a central task in scientific computing. Recently, neural network approximation of PDEs has received increasing attention due to its flexible meshless discretization and its potential for high-dimensional problems. One fundamental numerical difficulty is that random samples in the training set introduce statistical errors into the discretization of the loss functional which may become the dominant error in the final approximation, and therefore overshadow the modeling capability of the neural network. In this work, we propose a new minmax formulation to optimize simultaneously the approximate solution, given by a neural network model, and the random samples in the training set, provided by a deep generative model. The key idea is to use a deep generative model to adjust the random samples in the training set such that the residual induced by the neural network model can maintain a smooth profile in the training process. Such an idea is achieved by implicitly embedding the Wasserstein distance between the residual-induced distribution and the uniform distribution into the loss, which is then minimized together with the residual. A nearly uniform residual profile means that its variance is small for any normalized weight function such that the Monte Carlo approximation error of the loss functional is reduced significantly for a certain sample size. The adversarial adaptive sampling (AAS) approach proposed in this work is the first attempt to formulate two essential components, minimizing the residual and seeking the optimal training set, into one minmax objective functional for the neural network approximation of PDEs.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Gaussian process regression Multioutput Gaussian process Attention mechanism
Scores: [ 6 8 8 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian Deep Learning Implicit neural representations Probabilistic machine learning Hypernetworks
Scores: [ 8 5 5 5 ]
Bayesian inference is the standard for providing full predictive distributions with well calibrated uncertainty estimates. However, scaling to a modern, overparameterized deep learning setting typically comes at the cost of severe and restrictive approximations, sacrificing model predictive strength. With our approach, we factor model parameters as a function of deterministic and probabilistic components; the model is solved by combining maximum a posteriori estimation of the former, with inference over a low-dimensional, Implicit Neural Representation of the latter. This results in a solution that combines both predictive accuracy and good calibration, as it entails inducing stochasticity over the full set of model weights while being comparatively cheap to compute. Experimentally, our approach compares favorably to the state of the art, including much more expensive methods as well as less expressive posterior approximations over full network parameters.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: prediction set label shift distribution-free uncertainty quantification probably approximately correct Clopper-Pearson binomial interval rejection sampling
Scores: [ 8 6 6 6 6 ]
Prediction sets capture uncertainty by predicting sets of labels rather than individual labels, enabling downstream decisions to conservatively account for all plausible outcomes. Conformal inference algorithms construct prediction sets guaranteed to contain the true label with high probability. These guarantees fail to hold in the face of distribution shift, which is precisely when reliable uncertainty quantification can be most useful. We propose a novel algorithm for constructing prediction sets with PAC guarantees in the label shift setting. It estimates importance weights, then propagates uncertainty in these estimates through a Gaussian elimination algorithm to compute confidence intervals that contain the importance weights, and finally uses these intervals to construct prediction sets. We evaluate our approach on four datasets: the CIFAR-10 and ChestX-Ray image datasets, the tabular CDC Heart Dataset, and the AGNews text dataset. Our algorithm satisfies the PAC guarantee while producing smaller prediction set sizes compared to several baselines.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: conformal prediction uncertainty quantification
Scores: [ 8 6 8 8 6 6 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Neural Networks Bayesian deep learning deep learning Gaussian processes sequential learning
Scores: [ 6 6 8 ]
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Gaussian process infinite-width neural network NNGP Bayesian deep learning
Scores: [ 8 5 5 5 ]
Standard infinite-width limits of neural networks sacrifice the ability for intermediate layers to learn representations from data. Recent work ("A theory of representation learning gives a deep generalisation of kernel methods", Yang et al. 2023) modified the Neural Network Gaussian Process (NNGP) limit of Bayesian neural networks so that representation learning is retained. Furthermore, they found that applying this modified limit to a deep Gaussian process gives a practical learning algorithm which they dubbed the "deep kernel machine" (DKM). However, they only considered the simplest possible setting: regression in small, fully connected networks with e.g. 10 input features. Here, we introduce convolutional deep kernel machines. This required us to develop a novel inter-domain inducing point approximation, as well as introducing and experimentally assessing a number of techniques not previously seen in DKMs, including analogues to batch normalisation, different likelihoods, and different types of top-layer. The resulting model trains in roughly 28 GPU hours, achieving around 99% test accuracy on MNIST, 71% on CIFAR-100, and 92% on CIFAR-10, which is SOTA for kernel methods.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: Bayesian model health selective labels distribution shift domain constraint biomedicine
Scores: [ 8 8 8 5 ]
Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that the human decision censors the outcome data: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Keywords: mutual information score matching diffusion models
Scores: [ 8 6 6 6 ]
In this work we present a new method for the estimation of Mutual Information (MI) between random variables. Our approach is based on an original interpretation of the Girsanov theorem, which allows us to use score-based diffusion models to estimate the KL divergence between two densities as a difference between their score functions. As a by-product, our method also enables the estimation of the entropy of random variables. Armed with such building blocks, we present a general recipe to measure MI, which unfolds in two directions: one uses conditional diffusion process, whereas the other uses joint diffusion processes that allow simultaneous modelling of two random variables. Our results, which derive from a thorough experimental protocol over all the variants of our approach, indicate that our method is more accurate than the main alternatives from the literature, especially for challenging distributions. Furthermore, our methods pass MI self-consistency tests, including data processing and additivity under independence, which instead are a pain-point of existing methods
Primary Area: reinforcement learning
Keywords: unsupervised skill learning reward-free RL downstream task adaptation wasserstein distance theoretical analysis
Scores: [ 8 6 8 8 ]
Primary Area: reinforcement learning
Keywords: Multi-agent reinforcement learning Robustness Game Theory Adversarial Attack
Scores: [ 8 5 6 ]
In this study, we explore the robustness of cooperative multi-agent reinforcement learning (c-MARL) against Byzantine failures, where any agent can enact arbitrary, worst-case actions due to malfunction or adversarial attack. To address the uncertainty that any agent can be adversarial, we propose a Bayesian Adversarial Robust Dec-POMDP (BARDec-POMDP) framework, which views Byzantine adversaries as nature-dictated types, represented by a separate transition. This allows agents to learn policies grounded on their posterior beliefs about the type of other agents, fostering collaboration with identified allies and minimizing vulnerability to adversarial manipulation. We define the optimal solution to the BARDec-POMDP as an ex interim robust Markov perfect Bayesian equilibrium, which we proof to exist and the corresponding policy weakly dominates previous approaches as time goes to infinity. To realize this equilibrium, we put forward a two-timescale actor-critic algorithm with almost sure convergence under specific conditions. Experiments on matrix game, Level-based Foraging and StarCraft II indicate that, our method successfully acquires intricate micromanagement skills and adaptively aligns with allies under worst-case perturbations, showing resilience against non-oblivious adversaries, random allies, observation-based attacks, and transfer-based attacks.
Primary Area: reinforcement learning
Keywords: Inverse reinforcement learning route optimization
Scores: [ 6 6 8 10 ]
Optimizing for humans’ latent preferences remains a grand challenge in route recommendation. Prior research has provided increasingly general methods based on inverse reinforcement learning (IRL), yet no approach has successfully addressed planetary-scale routing problems with hundreds of millions of states and demonstration trajectories. In this paper, we introduce scaling techniques based on graph compression, spatial parallelization, and improved initialization conditions inspired by a connection to eigenvector algorithms. We revisit classic algorithms in the routing context, and make the key observation that there exists a trade-off between the use of cheap, deterministic planners and expensive yet robust stochastic policies. This insight is leveraged in Receding Horizon Inverse Planning (RHIP), a new generalization of classic IRL algorithms that provides fine-grained control over performance trade-offs via its planning horizon. Our contributions culminate in a policy that achieves a 16-24% improvement in route quality at a global scale, and to the best of our knowledge, represents the largest published benchmark of IRL algorithms in a real-world setting to date. We conclude by conducting an ablation study of key components, presenting negative results from alternative eigenvalue solvers, and identifying opportunities to further improve scalability via IRL-specific batching strategies.
Primary Area: reinforcement learning
Keywords: reinforcement learning theory reward-agnostic learning
Scores: [ 8 8 8 6 ]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals. While PbRL has demonstrated practical success in fine-tuning language models, existing theoretical work focuses on regret minimization and fails to capture most of the practical frameworks. In this study, we fill in such a gap between theoretical PbRL and practical algorithms by proposing a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preference-based models with linear parameterization and unknown transitions, compared to the existing theoretical literature. Specifically, our framework can incorporate linear and low-rank MDPs with efficient sample complexity. Additionally, we investigate reward-agnostic RL with action-based comparison feedback and introduce an efficient querying algorithm tailored to this scenario.
Primary Area: reinforcement learning
Keywords: online learning bandits with knapsack best-of-both-worlds
Scores: [ 6 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Robot Learning Offline Imitation Learning Offline Reinforcement Learning Deep Reinforcement Learning
Scores: [ 10 6 8 6 ]
Primary Area: reinforcement learning
Keywords: Integral Reinforcement Learning Bayesian Quadrature Newton's Method
Scores: [ 8 6 6 8 ]
Primary Area: reinforcement learning
Keywords: offline reinforcement learning generative models diffusion models behavior modeling computational efficiency
Scores: [ 6 8 3 8 ]
Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behaviordistribution’s score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Robustness Adversarial Learning
Scores: [ 5 6 5 ]
Deploying reinforcement learning (RL) systems requires robustness to uncertainty and model misspecification, yet prior robust RL methods typically only study noise introduced independently across time. However, practical sources of uncertainty are usually coupled across time.We formally introduce temporally-coupled perturbations, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially-observable two-player zero-sum game. By finding an approximate equilibrium within this game, GRAD optimizes for general robustness against temporally-coupled perturbations. Experiments on continuous control tasks demonstrate that, compared with prior methods, our approach achieves a higher degree of robustness to various types of attacks on different attack domains, both in settings with temporally-coupled perturbations and decoupled perturbations.
Primary Area: reinforcement learning
Keywords: reward functions reward learning metrics evaluations
Scores: [ 8 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Multi-Agent Reinforcement Learning Constrained reinforcement learning Primal-Dual Duality gap Primal algorithm
Scores: [ 5 6 6 6 6 6 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning learning from demonstrations curriculum learning reverse curriculum learning robot learning robotics
Scores: [ 6 8 3 ]
Reinforcement learning (RL) presents a promising framework to learn policies through environment interaction, but often requires an infeasible amount of interaction data to solve complex tasks from sparse rewards. One direction includes augmenting RL with offline data demonstrating desired tasks, but past work often require a lot of high-quality demonstration data that is difficult to obtain, especially for domains such as robotics. Our approach consists of a reverse curriculum followed by a forward curriculum. Unique to our approach compared to past work is the ability to efficiently leverage more than one demonstration via a per-demonstration reverse curriculum generated via state resets. The result of our reverse curriculum is an initial policy that performs well on a narrow initial state distribution and helps overcome difficult exploration problems. A forward curriculum is then used to accelerate the training of the initial policy to perform well on the full initial state distribution of the task and improve demonstration and sample efficiency. We show how the combination of a reverse curriculum and forward curriculum in our method, RFCL, enables significant improvements in demonstration and sample efficiency compared against various state-of-the-art learning-from-demonstration baselines, even solving previously unsolvable tasks that require high precision and control. Website with code and visualizations are here: https://reverseforward-cl.github.io/
Primary Area: reinforcement learning
Keywords: Safe Reinforcement Learning
Scores: [ 5 6 8 8 ]
Primary Area: reinforcement learning
Keywords: Best Arm Identification Differential Privacy
Scores: [ 8 5 8 6 8 ]
We study best arm identification (BAI) in linear bandits in the fixed-budget regime under differential privacy constraints, when the arm rewards are supported on the unit interval. Given a finite budget T and a privacy parameter ε>0, the goal is to minimise the error probability in finding the arm with the largest mean after T sampling rounds, subject to the constraint that the policy of the decision maker satisfies a certain {\em ε-differential privacy} (ε-DP) constraint. We construct a policy satisfying the ε-DP constraint (called {\sc DP-BAI}), based on the principle of {\em maximum absolute determinants}, and derive an upper bound on its error probability. Furthermore, we derive a minimax lower bound on the error probability, and demonstrate that the lower and the upper bounds decay exponentially in T, with exponents in the two bounds matching order-wise in (a) the sub-optimality gaps of the arms, (b) ε, and (c) the problem complexity that is expressible as the sum of two terms, one characterising the complexity of standard fixed-budget BAI (without privacy constraints), and the other accounting for the ε-DP constraint. Additionally, we present some auxiliary results that contribute to the derivation of the lower bound on the error probability. These results, we posit, may be of independent interest and could prove instrumental in proving lower bounds on error probabilities in several other bandit problems.Whereas prior works provide results for BAI in the fixed-budget regime without privacy constraints or in the fixed-confidence regime with privacy constraints, our work fills the gap in the literature by providing the results for BAI in the fixed-budget regime under the ε-DP constraint.
Primary Area: reinforcement learning
Keywords: offline reinforcement learning sequence modeling reinforcement learning
Scores: [ 6 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Multi-Agent Reinforcement Learning Emergent Communication Contrastive Learning
Scores: [ 5 8 6 5 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Iterated CVaR Learning Theory Function Approximation Human Feedback
Scores: [ 6 8 3 ]
Risk-sensitive reinforcement learning (RL) aims to optimize policies that balance the expected reward and risk. In this paper, we present a novel risk-sensitive RL framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximations, enriched by human feedback. These new formulations provide a principled way to guarantee safety in each decision making step throughout the control process. Moreover, integrating human feedback into risk-sensitive RL framework bridges the gap between algorithmic decision-making and human participation, allowing us to also guarantee safety for human-in-the-loop systems. We propose provably sample-efficient algorithms for this Iterated CVaR RL and provide rigorous theoretical analysis. Furthermore, we establish a matching lower bound to corroborate the optimality of our algorithms in a linear context.
Primary Area: reinforcement learning
Keywords: Generative Model Expressiveness Deep Reinforcement Learning
Scores: [ 3 6 6 ]
Primary Area: reinforcement learning
Keywords: PPO PPO-Clip stochastic optimization
Scores: [ 3 6 8 ]
Primary Area: reinforcement learning
Keywords: Open World Minecraft Instruction Following Goal-conditioned Policy
Scores: [ 8 8 5 8 ]
We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis.
Primary Area: reinforcement learning
Keywords: Offline reinforcement learning in-context learning decision transformer
Scores: [ 8 8 5 8 ]
In-context learning is a promising approach for online policy learning of offline reinforcement learning (RL) methods, which can be achieved at inference time without gradient optimization. However, this method is hindered by significant computational costs resulting from the gathering of large training trajectory sets and the need to train large Transformer models. We address this challenge by introducing an In-context Exploration-Exploitation (ICEE) algorithm, designed to optimize the efficiency of in-context policy learning. Unlike existing models, ICEE performs an exploration-exploitation trade-off at inference time within a Transformer model, without the need for explicit Bayesian inference. Consequently, ICEE can solve Bayesian optimization problems as efficiently as Gaussian process biased methods do, but in significantly less time. Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, marking a substantial improvement over the hundreds of episodes needed by the previous in-context learning method.
Primary Area: reinforcement learning
Keywords: deep reinforcement learning visual reinforcement learning object-centric robotic object manipulation compositional generalization
Scores: [ 8 6 8 8 ]
Manipulating objects is a hallmark of human intelligence, and an important task in domains such as robotics. In principle, Reinforcement Learning (RL) offers a general approach to learn object manipulation. In practice, however, domains with more than a few objects are difficult for RL agents due to the curse of dimensionality, especially when learning from raw image observations. In this work we propose a structured approach for visual RL that is suitable for representing multiple objects and their interaction, and use it to learn goal-conditioned manipulation of several objects. Key to our method is the ability to handle goals with dependencies between the objects (e.g., moving objects in a certain order). We further relate our architecture to the generalization capability of the trained agent, and demonstrate agents that learn with 3 objects but generalize to similar tasks with over 10 objects. Rollout videos are available on our website: https://sites.google.com/view/entity-centric-rl
Primary Area: reinforcement learning
Keywords: Safe Reinforcement Learning SafeRL World Model
Scores: [ 6 8 6 6 ]
Primary Area: reinforcement learning
Keywords: cooperative multi-agent reinforcement learning heterogeneous-agent soft actor-critic maximum entropy heterogeneous-agent mirror learning
Scores: [ 8 8 6 8 ]
Multi-agent reinforcement learning (MARL) has been shown effective for cooperative games in recent years. However, existing state-of-the-art methods face challenges related to sample complexity, training instability, and the risk of converging to a suboptimal Nash Equilibrium. In this paper, we propose a unified framework for learning \emph{stochastic} policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove the monotonic improvement and convergence to quantal response equilibrium (QRE) properties of HASAC. Furthermore, we generalize a unified template for MaxEnt algorithmic design named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration.
Primary Area: reinforcement learning
Keywords: intrinsic motivation rlaif large language models exploration open-endedness nethack alignment diversity
Scores: [ 5 8 8 8 ]
Primary Area: reinforcement learning
Keywords: Offline-to-Online Fine-tuning On-policy Learning Robot Learning Reinforcement Learning
Scores: [ 6 6 8 8 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning theory offline reinforcement learning
Scores: [ 8 8 6 8 ]
In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.
Primary Area: reinforcement learning
Keywords: reinforcement learning; large language models; robotics
Scores: [ 6 8 6 8 ]
Designing reward functions is a longstanding challenge in reinforcement learning (RL); it requires specialized knowledge or domain data, leading to high costs for development. To address this, we introduce Text2Reward, a data-free framework that automates the generation of dense reward functions based on large language models (LLMs). Given a goal described in natural language, Text2Reward generates dense reward functions as an executable program grounded in a compact representation of the environment. Unlike inverse RL and recent work that uses LLMs to write sparse reward codes, Text2Reward produces interpretable, free-form dense reward codes that cover a wide range of tasks, utilize existing packages, and allow iterative refinement with human feedback. We evaluate Text2Reward on two robotic manipulation benchmarks (Maniskill2, MetaWorld) and two locomotion environments of MuJoCo. On 13 of the 17 manipulation tasks, policies trained with generated reward codes achieve similar or better task success rates and convergence speed than expert-written reward codes. For locomotion tasks, our method learns six novel locomotion behaviors with a success rate exceeding 94%. Furthermore, we show that the policies trained in the simulator with our method can be deployed in the real world. Finally, Text2Reward further improves the policies by refining their reward functions with human feedback. Video results are available at https://text-to-reward-review.github.io.
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Diffusion Models
Scores: [ 5 8 6 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning data augmentation model-free
Scores: [ 8 8 3 5 ]
Recently, data augmentation (DA) has emerged as a method for leveraging domain knowledge to inexpensively generate additional data in reinforcement learning (RL) tasks, often yielding substantial improvements in data efficiency.While prior work has demonstrated the utility of incorporating augmented data directly into model-free RL updates,it is not well-understood when a particular DA strategy will improve data efficiency.In this paper, we seek to identify general aspects of DA responsible for observed learning improvements.Our study focuses on sparse-reward tasks with dynamics-invariant data augmentation functions, serving as an initial step towards a more general understanding of DA and its integration into RL training.Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio).From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency.In fact, certain tasks in our empirical study are solvable only when the replay ratio is sufficiently low.
Primary Area: reinforcement learning
Keywords: reinforcement learning pre-training inverse dynamics models world models imitation learning representation learning
Scores: [ 6 8 8 8 ]
Pre-training large models on vast amounts of web data has proven to be an effective approach for obtaining powerful, general models in several domains, including language and vision. However, this paradigm has not yet taken hold in deep reinforcement learning (RL). This gap is due to the fact that the most abundant form of embodied behavioral data on the web consists of videos, which do not include the action labels required by existing methods for training policies from offline data. We introduce Latent Action Policies from Observation (LAPO), a method to infer latent actions and, consequently, latent-action policies purely from action-free demonstrations. Our experiments on challenging procedurally-generated environments show that LAPO can act as an effective pre-training method to obtain RL policies that can then be rapidly fine-tuned to expert-level performance. Our approach serves as a key stepping stone to enabling the pre-training of powerful, generalist RL models on the vast amounts of action-free demonstrations readily available on the web.
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Diversity in RL Offline RL
Scores: [ 6 6 6 8 ]
Primary Area: reinforcement learning
Keywords: Dueling Bandit Variance-aware contextual bandit
Scores: [ 5 6 5 8 ]
Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound ~O(d√∑Tt=1σ2t+d), where σt is the variance of the pairwise comparison at round t, d is the dimension of the context vectors, and T is the time horizon. Our regret bound naturally aligns with the intuitive expectation — in scenarios where the comparison is deterministic, the algorithm only suffers from an ~O(d) regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.
Primary Area: reinforcement learning
Keywords: Max-Entropy RL Entropy Energy-Based-Models
Scores: [ 5 6 6 8 6 6 3 ]
Primary Area: reinforcement learning
Keywords: knowledge distillation evolutionary strategies reinforcement learning dataset distillation
Scores: [ 8 6 6 ]
Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of fixed dataset renders most distillation methods unusable.Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks.We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion.Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights.
Primary Area: reinforcement learning
Keywords: reward hypothesis Markov reward multi-objective reinforcement learning linear temporal logic reward machines
Scores: [ 3 6 10 8 ]
Primary Area: reinforcement learning
Keywords: Behavior Foundation Models unsupervised reinforcement learning imitation learning
Scores: [ 8 6 8 8 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Multi task Temporal logic LTL Composition Zero-shot Transfer learning
Scores: [ 5 6 6 8 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning data augmentation stitching
Scores: [ 6 5 5 6 ]
Some reinforcement learning (RL) algorithms have the capability of recombining together pieces of previously seen experience to solve a task never seen before during training. This oft-sought property is one of the few ways in which dynamic programming based RL algorithms are considered different from supervised learning (SL) based RL algorithms. Yet, recent RL methods based on off-the-shelf SL algorithms achieve excellent results without an explicit mechanism for stitching; it remains unclear whether those methods forgo this important stitching property. This paper studies this question in the setting of goal-reaching problems. We show that the desirable stitching property corresponds to a form of generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen together in the training data. Our analysis shows that this sort of generalization is different from i.i.d. generalization. This connection between stitching and generalization reveals why we should not expect existing RL methods based on SL to perform stitching, even in the limit of large datasets and models. We experimentally validate this result on carefully constructed datasets.This connection suggests a simple remedy, the same remedy for improving generalization in supervised learning: data augmentation. We propose a naive temporal data augmentation approach and demonstrate that adding it to RL methods based on SL enables them to successfully stitch together experience, so that they succeed in navigating between states and goals unseen together during training.
Primary Area: reinforcement learning
Keywords: Projections Epistemic Uncertainty Ensembles Exploration Reinforcement Learning Distributional Reinforcement Learning Machine Learning Artificial Intelligence
Scores: [ 3 8 8 8 ]
Primary Area: reinforcement learning
Keywords: Robot Design Hyperbolic Space Coarse-to-Fine Multi-Cellular
Scores: [ 6 8 6 6 ]
Primary Area: reinforcement learning
Keywords: off-policy evaluation reinforcement learning deterministic policy continuous actions metric learning
Scores: [ 6 8 8 ]
We consider off-policy evaluation (OPE) of deterministic target policies for reinforcement learning (RL) in environments with continuous action spaces. While it is common to use importance sampling for OPE, it suffers from high variance when the behavior policy deviates significantly from the target policy. In order to address this issue, some recent works on OPE proposed in-sample learning with importance resampling. Yet, these approaches are not applicable to deterministic target policies for continuous action spaces. To address this limitation, we propose to relax the deterministic target policy using a kernel and learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function, where the action value function is used for policy evaluation. We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric. In empirical studies using various test domains, we show that the OPE with in-sample learning using the kernel with optimized metric achieves significantly improved accuracy than other baselines.
Primary Area: reinforcement learning
Keywords: imitation learning inverse reinforcement learning reinforcement learning currilulum learning optimal transport
Scores: [ 8 5 8 5 ]
Humans often acquire new skills through the observation and imitation of others. For robotic agents, learning from the plethora of unlabeled video demonstration data available on the Internet necessitates imitating the expert without access to its action, presenting a challenge known as Imitation Learning from Observations (ILfO). A common approach to tackle ILfO problems is to convert them into Inverse Reinforcement Learning (RL) problems, utilizing a proxy reward computed from the agent's and the expert's observations. Nonetheless, we identify that tasks characterized by a progress dependency property pose significant challenges for such approaches; in these tasks, the agent needs to initially learn the expert's preceding behaviors before mastering the subsequent ones. Our investigation reveals that the main cause is that the reward signals assigned to later steps hinder the learning of initial behaviors. To address this challenge, we present a novel ILfO framework that enables the agent to master earlier behaviors before advancing to later ones. We introduce an Automatic Discount Scheduling (ADS) mechanism that adaptively alters the discount factor in RL during the training phase, prioritizing earlier rewards initially and gradually engaging later rewards only when the earlier behaviors have been mastered. Our experiments, conducted on nine Meta-World tasks, demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that are unsolvable by them.
Primary Area: reinforcement learning
Keywords: reinforcement learning adversarial bounded rationality curriculum
Scores: [ 6 8 6 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement learning adversarial policies
Scores: [ 6 6 6 8 ]
Most existing works focus on direct perturbations to the victim's state/action or the underlying transition dynamics to demonstrate the vulnerability of reinforcement learning agents to adversarial attacks. However, such direct manipulations may not be always realizable.In this paper, we consider a multi-agent setting where a well-trained victim agent ν is exploited by an attacker controlling another agent α with an \textit{adversarial policy}. Previous models do not account for the possibility that the attacker may only have partial control over α or that the attack may produce easily detectable abnormal'' behaviors. Furthermore, there is a lack of provably efficient defenses against these adversarial policies. To address these limitations, we introduce a generalized attack framework that has the flexibility to model to what extent the adversary is able to control the agent, and allows the attacker to regulate the state distribution shift and produce stealthier adversarial policies. Moreover, we offer a provably efficient defense with polynomial convergence to the most robust victim policy through adversarial training with timescale separation. This stands in sharp contrast to supervised learning, where adversarial training typically provides only \textit{empirical} defenses.Using the Robosumo competition experiments, we show that our generalized attack formulation results in much stealthier adversarial policies when maintaining the same winning rate as baselines. Additionally, our adversarial training approach yields stable learning dynamics and less exploitable victim policies.
Primary Area: reinforcement learning
Keywords: Hierarchical Reinforcement Learning Goal Representation Reachability Analysis
Scores: [ 5 6 8 ]
Primary Area: reinforcement learning
Keywords: Robot Learning Goal-Conditioned Reinforcement Learning Deep Reinforcement Learning
Scores: [ 6 6 6 6 ]
Primary Area: reinforcement learning
Keywords: online learning information acquisition mechanism design
Scores: [ 6 8 5 8 ]
Primary Area: reinforcement learning
Keywords: Reinforcement learning imitation learning online learning
Scores: [ 8 6 6 6 ]
Although reinforcement learning methods offer a powerful framework for automatic skill acquisition, for practical learning-based control problems in domains such as robotics, imitation learning often provides a more convenient and accessible alternative. In particular, an interactive imitation learning method such as DAgger, which queries a near-optimal expert to intervene online to collect correction data for addressing the distributional shift challenges that afflict naïve behavioral cloning, can enjoy good performance both in theory and practice without requiring manually specified reward functions and other components of full reinforcement learning methods. In this paper, we explore how off-policy reinforcement learning can enable improved performance under assumptions that are similar but potentially even more practical than those of interactive imitation learning. Our proposed method uses reinforcement learning with user intervention signals themselves as rewards. This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over the potential suboptimal human expert. We also provide a unified framework to analyze our RL method and DAgger; for which we present the asymptotic analysis of the suboptimal gap for both methods as well as the non-asymptotic sample complexity bound of our method. We then evaluate our method on challenging high-dimensional continuous control simulation benchmarks as well as real-world robotic vision-based manipulation tasks. The results show that it strongly outperforms DAgger-like approaches across the different tasks, especially when the intervening experts are suboptimal. Additional ablations also empirically verify the proposed theoretical justification that the performance of our method is associated with the choice of intervention model and suboptimality of the expert. Code and videos can be found on our project website: https://rlifpaper.github.io.
Primary Area: reinforcement learning
Keywords: reinforcement learning pre-training goal-conditioned models open-world environments
Scores: [ 8 6 8 ]
Pre-training on task-agnostic large datasets is a promising approach for enhancing the sample efficiency of reinforcement learning (RL) in solving complex tasks. We present PTGM, a novel method that pre-trains goal-based models to augment RL by providing temporal abstractions and behavior regularization. PTGM involves pre-training a low-level, goal-conditioned policy and training a high-level policy to generate goals for subsequent RL tasks. To address the challenges posed by the high-dimensional goal space, while simultaneously maintaining the agent's capability to accomplish various skills, we propose clustering goals in the dataset to form a discrete high-level action space. Additionally, we introduce a pre-trained goal prior model to regularize the behavior of the high-level policy in RL, enhancing sample efficiency and learning stability. Experimental results in a robotic simulation environment and the challenging open-world environment of Minecraft demonstrate PTGM’s superiority in sample efficiency and task performance compared to baselines. Moreover, PTGM exemplifies enhanced interpretability and generalization of the acquired low-level skills.
Primary Area: reinforcement learning
Keywords: Exploration Reinforcement Learning Thompson Sampling Langevin Monte Carlo Deep Reinforcement learning
Scores: [ 8 6 6 ]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q-function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q-function, which makes our approach easy to deploy in deep RL. Our theoretical analysis shows that, in the linear Markov decision process (linear MDP) setting, the proposed method has a regret bound of ~O(d3/2H5/2√T), where d is the dimension of the feature mapping, H is the planning horizon, and T is the total number of steps. We apply the proposed approach to deep RL, by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Representation Learning POMDPs Information States Self-supervised Learning
Scores: [ 8 8 8 3 ]
Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, distracted MDPs, and sparse-reward POMDPs.
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Model-based Reinforcement Learning Offline Reinforcement Learning
Scores: [ 6 8 8 8 ]
Human beings can make adaptive decisions in a preparatory manner, i.e., by making preparations in advance, which offers significant advantages in scenarios where both online and offline experiences are expensive and limited. Meanwhile, current reinforcement learning methods commonly rely on numerous environment interactions but hardly obtain generalizable policies. In this paper, we introduce the idea of \textit{rehearsal} into policy optimization, where the agent plans for all possible outcomes in mind and acts adaptively according to actual responses from the environment. To effectively rehearse, we propose ReDM, an algorithm that generates a diverse and eligible set of dynamics models and then rehearse the policy via adaptive training on the generated model set. Rehearsal enables the policy to make decision plans for various hypothetical dynamics and to naturally generalize to previously unseen environments. Our experimental results demonstrate that ReDM is capable of learning a valid policy solely through rehearsal, even with \emph{zero} interaction data. We further extend ReDM to scenarios where limited or mismatched interaction data is available, and our experimental results reveal that ReDM produces high-performing policies compared to other offline RL baselines.
Primary Area: reinforcement learning
Keywords: Reinforcement learning Non-Markovian rewards Submodular optimization Policy gradient Complex objectives in RL
Scores: [ 8 6 8 6 ]
In reinforcement learning (RL), rewards of states are typically considered additive, and following the Markov assumption, they are independent of states visited previously. In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously. To tackle this, we propose Submodular RL (subRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions, which capture diminishing returns. Unfortunately, in general, even in tabular settings, we show that the resulting optimization problem is hard to approximate. On the other hand, motivated by the success of greedy algorithms in classical submodular optimization, we propose subPO, a simple policy gradient-based algorithm for subRL that handles non-additive rewards by greedily maximizing marginal gains. Indeed, under some assumptions on the underlying Markov Decision Process (MDP), subPO recovers optimal constant factor approximations of submodular bandits. Moreover, we derive a natural policy gradient approach for locally optimizing subRL instances even in large state- and action- spaces. We showcase the versatility of our approach by applying subPO to several applications, such as biodiversity monitoring, Bayesian experiment design, informative path planning, and coverage maximization. Our results demonstrate sample efficiency, as well as scalability to high-dimensional state-action spaces.
Primary Area: reinforcement learning
Keywords: offline reinforcement learning hidden confounding uncertainty quantification causal inference healthcare vasopressor and fluid administration
Scores: [ 8 6 8 8 ]
A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice.
Primary Area: reinforcement learning
Keywords: reinforcement learning regularization in reinforcement leaning learning with demonstrations reinforcemenet learning with human feedback
Scores: [ 6 6 8 6 ]
Primary Area: reinforcement learning
Keywords: Hierarchical Offline RL Hierarchical planning Hierarchical Reinforcement Learning Diffusion-Based Planning
Scores: [ 6 6 6 5 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning model-based reinforcement learning maximum entropy planning
Scores: [ 8 8 6 ]
Primary Area: reinforcement learning
Keywords: Embodied-AI Task-conditioned Representations Visual Navigation Reinforcement Learning
Scores: [ 6 8 8 8 ]
Primary Area: reinforcement learning
Keywords: Model-based RL; Model-predictive control; Model rollouts; Model error
Scores: [ 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Adversarial Bandits Online Learning Heavy Tails Follow-the-Perturbed-Leader
Scores: [ 8 8 5 6 ]
We study adversarial bandit problems with potentially heavy-tailed losses. Unlike standard settings with non-negative and bounded losses, managing negative and unbounded losses introduces a unique challenge in controlling the stability'' of the algorithm and hence the regret. To tackle this challenge, we propose a Follow-the-Perturbed-Leader (FTPL) based learning algorithm. Notably, our method achieves (nearly) optimal worst-case regret, eliminating the need for an undesired assumption inherent in the Follow-the-Regularized-Leader (FTRL) based approach. Thanks to this distinctive advantage, our algorithmic framework finds novel applications in two important scenarios with unbounded heavy-tailed losses. For adversarial bandits with heavy-tailed losses and Huber contamination, which we call the robust setting, our algorithm is the first to match the lower bound (up to a \polylog(K) factor, where K is the number of actions). In the private setting, where true losses are in a bounded range (e.g., [0,1]) but with additional Local Differential Privacy (LDP) guarantees, our algorithm achieves an improvement of a \polylog(T) factor in the regret bound compared to the best-known results, where T is the total number of rounds. Furthermore, when compared to state-of-the-art FTRL-based algorithms, our FTPL-based algorithm has a more streamlined design. It eliminates the need for additional explicit exploration and solely maintains the absolute value of loss estimates below a predetermined threshold.
Primary Area: reinforcement learning
Keywords: reinforcement learning Markov decision processes long run average reward sample complexity
Scores: [ 6 6 8 6 ]
We settle the sample complexity of policy learning for the maximization of the long run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of ˜O(|S||At2mixϵ−2) and a lower bound of Ω(|S||A|tmixϵ−2). In these expressions, |S| and |A| denote the cardinalities of the state and action spaces respectively, tmix serves as a uniform upper limit for the total variation mixing times, and ϵ signifies the error tolerance. Therefore, a notable gap of tmix still remains to be bridged. Our primary contribution is to establish an estimator for the optimal policy of average reward MDPs with a sample complexity of ˜O(|S||A|tmixϵ−2), effectively reaching the lower bound in the literature. This is achieved by combining algorithmic ideas in Jin & Sidford (2021) with those of Li et al. (2020).
Primary Area: reinforcement learning
Keywords: Reinforcement learning Sequential decision-making problem Predictive state representation POMDP UCB online and offline
Scores: [ 6 5 6 8 ]
The general sequential decision-making problem, which includes Markov decision processes (MDPs) and partially observable MDPs (POMDPs) as special cases, aims at maximizing a cumulative reward by making a sequence of decisions based on a history of observations and actions over time. Recent studies have shown that the sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs). Despite these advancements, existing approaches typically involve oracles or steps that are computationally intractable. On the other hand, the upper confidence bound (UCB) based approaches, which have served successfully as computationally efficient methods in bandits and MDPs, have not been investigated for more general PSRs, due to the difficulty of optimistic bonus design in these more challenging settings. This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models. We further characterize the sample complexity bounds for our designed UCB-type algorithms for both online and offline PSRs. In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
Primary Area: reinforcement learning
Keywords: Safe offline reinforcement learning Hamilton-Jacobi reachability diffusion model
Scores: [ 8 8 6 8 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning from human feedback preference-based RL human-in-the-loop RL preference learning
Scores: [ 6 8 8 6 ]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically RLHF algorithms operate in two phases: first, use human preferences to learn a reward function and second, align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the regret under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., as in large language models) or limit observation dimensionality (e.g., state-based robotics). We overcome these limitations by introducing a new family of algorithms for optimizing behavior from human feedback using the regret model of human preferences. Using the principle of maximum entropy, we derive Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions, circumventing the need for RL. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs. In contrast to prior work, this enables CPL to elegantly scale to high-dimensional and sequential RLHF problems.
Primary Area: reinforcement learning
Keywords: reinforcement learning policy gradients gradient subspaces
Scores: [ 6 6 6 8 ]
Policy gradient methods hold great potential for solving complex continuous control tasks. Still, their training efficiency can be improved by exploiting structure within the optimization problem. Recent work indicates that supervised learning can be accelerated by leveraging the fact that gradients lie in a low-dimensional and slowly-changing subspace. In this paper, we demonstrate the existence of such gradient subspaces for policy gradient algorithms despite the continuously changing data distribution inherent to reinforcement learning. Our findings reveal promising directions for more efficient reinforcement learning, e.g., through improving parameter-space exploration or enabling second-order optimization.
Primary Area: reinforcement learning
Keywords: Level Generation Video Games Deep Reinforcement Learning Ensemble Learning Regularisation
Scores: [ 6 6 5 6 ]
Deep reinforcement learning has recently been successfully applied to online procedural content generation in which a policy determines promising game-level segments. However, existing methods can hardly discover diverse level patterns, while the lack of diversity makes the gameplay boring. This paper proposes an ensemble reinforcement learning approach that uses multiple negatively correlated sub-policies to generate different alternative level segments, and stochastically selects one of them following a selector model. A novel policy regularisation technique is integrated into the approach to diversify the generated alternatives. In addition, we develop theorems to provide general methodologies for optimising policy regularisation in a Markov decision process. The proposed approach is compared with several state-of-the-art policy ensemble methods and classic methods on a well-known level generation benchmark, with two different reward functions expressing game-design goals from different perspectives. Results show that our approach boosts level diversity notably with competitive performance in terms of the reward. Furthermore, by varying the regularisation coefficient, the trained generators form a well-spread Pareto front, allowing explicit trade-offs between diversity and rewards of generated levels.
Primary Area: reinforcement learning
Keywords: model-based rl online learning incremental learning catastrophic forgetting
Scores: [ 6 8 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement learning interpretability control navigation transparency discrete
Scores: [ 6 8 6 3 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Large Language Models Parameter-Efficient Fine-Tuning
Scores: [ 6 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Multi-Agent Cooperation Credit Assignment Homogeneous and Heterogeneous Cooperative Tasks
Scores: [ 6 8 8 6 ]
Effectively handling both homogeneous and heterogeneous tasks is crucial for the practical application of cooperative agents. However, existing solutions have not been successful in addressing both types of tasks simultaneously. On one hand, value-decomposition-based approaches demonstrate superior performance in homogeneous tasks. Nevertheless, they tend to produce agents with similar policies, which is unsuitable for heterogeneous tasks. On the other hand, solutions based on personalized observation or assigned roles are well-suited for heterogeneous tasks. However, they often lead to a trade-off situation where the agent's performance in homogeneous scenarios is negatively affected due to the aggregation of distinct policies. An alternative approach is to adopt sequential execution policies, which offer a flexible form for learning both types of tasks. However, learning sequential execution policies poses challenges in terms of credit assignment, and the lack of sufficient information about subsequently executed agents can lead to sub-optimal solutions. To tackle these issues, this paper proposes Greedy Sequential Execution (GSE) as a solution to learn the optimal policy that covers both scenarios. In the proposed GSE framework, we introduce an individual utility function into the framework of value decomposition to consider the complex interactions between agents. This function is capable of representing both the homogeneous and heterogeneous optimal policies. Furthermore, we utilize a greedy marginal contribution calculated by the utility function as the credit value of the sequential execution policy to address the credit assignment problem. We evaluated GSE in both homogeneous and heterogeneous scenarios. The results demonstrate that GSE achieves significant improvement in performance across multiple domains, especially in scenarios involving both homogeneous and heterogeneous tasks.
Primary Area: reinforcement learning
Keywords: rlhf overoptimization constrained RL
Scores: [ 6 8 6 8 ]
Primary Area: reinforcement learning
Keywords: Deep Reinforcement Learning Neural Plasticity Activation Functions Rational Functions
Scores: [ 6 8 8 8 ]
Latest insights from biology show that intelligence not only emerges from the connections between neurons, but that individual neurons shoulder more computational responsibility than previously anticipated. Specifically, neural plasticity should be critical in the context of constantly changing reinforcement learning (RL) environments, yet current approaches still primarily employ static activation functions. In this work, we motivate the use of adaptable activation functions in RL and show that rational activation functions are particularly suitable for augmenting plasticity. Inspired by residual networks, we derive a condition under which rational units are closed under residual connections and formulate a naturally regularised version. The proposed joint-rational activation allows for desirable degrees of flexibility, yet regularises plasticity to an extent that avoids overfitting by leveraging a mutual set of activation function parameters across layers. We demonstrate that equipping popular algorithms with (joint) rational activations leads to consistent improvements on different games from the Atari Learning Environment benchmark, notably making DQN competitive to DDQN and Rainbow.
Primary Area: reinforcement learning
Keywords: Reinforcement learning theory General function approximation Infinite-horizon Average-reward MDPs Sample efficiency
Scores: [ 6 6 8 5 ]
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic framework named Fixed-Point Local Optimization (FLOP), which incorporates both model-based and value-based incarnations. In particular, FLOP features a novel construction of confidence sets and a low-switching policy updating scheme, which are tailored to the average-reward and function approximation setting. Moreover, for AMDPs, we propose a novel complexity measure --- average-reward generalized eluder coefficient (AGEC) --- which captures the challenge of exploration in AMDPs with general function approximation. Such a complexity measure encompasses almost all previously known tractable AMDP models, such as linear AMDPs and linear mixture AMDPs, and also includes newly identified cases such as kernel AMDPs and AMDPs with low Bellman eluder dimensions. Using AGEC, we prove that FLOP achieves a sublinear ~O(poly(d,sp(v∗))√Tβ) regret, where d and β correspond to AGEC and the log-covering number of the hypothesis class respectively, sp(v∗) represents the span of the optimal state bias function, T denotes the number of steps, and $\tilde{\mathcal{O}} (\cdot) $ omits logarithmic factors. When specialized to concrete AMDP models, our regret bounds are comparable to those established by the existing algorithms designed specifically for these special cases. To the best of our knowledge, this paper presents the first comprehensive theoretical framework capable of handling nearly all AMDPs.
Primary Area: reinforcement learning
Keywords: Koopman Theory Reinforcement Learning Dynamical System Planning Longe range dynamics prediction models Efficient forward dynamics
Scores: [ 8 5 6 3 ]
The accurate modeling of dynamics in interactive environments is critical for successful long-range prediction. Such a capability could advance Reinforcement Learning (RL) and Planning algorithms, but achieving it is challenging. Inaccuracies in model estimates can compound, resulting in increased errors over long horizons.We approach this problem from the lens of Koopman theory, where the nonlinear dynamics of the environment can be linearized in a high-dimensional latent space. This allows us to efficiently parallelize the sequential problem of long-range prediction using convolution while accounting for the agent's action at every time step.Our approach also enables stability analysis and better control over gradients through time. Taken together, these advantages result in significant improvement over the existing approaches, both in the efficiency and the accuracy of modeling dynamics over extended horizons. We also show that this model can be easily incorporated into dynamics modeling for model-based planning and model-free RL and report promising experimental results.
Primary Area: reinforcement learning
Keywords: Federated Learning Reinforcement Learning Q-Learning Regret Communication Cost
Scores: [ 5 5 6 6 ]
Primary Area: reinforcement learning
Keywords: Program Synthesis Code Generation Reinforcement Learning Value-Based RL
Scores: [ 8 8 6 8 ]
Program synthesis aims to create accurate, executable code from natural language descriptions. This field has leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. This integration focuses on directly optimizing functional correctness, transcending conventional supervised losses. While current literature predominantly favors policy-based algorithms, attributes of program synthesis suggest a natural compatibility with value-based methods. This stems from rich collection of off-policy programs developed by human programmers, and the straightforward verification of generated programs through automated unit testing (i.e. easily obtainable rewards in RL language). Diverging from the predominant use of policy-based algorithms, our work explores the applicability of value-based approaches, leading to the development of our B-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we propose an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated B-Coder's capability in achieving state-of-the-art performance compared with policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs.
Primary Area: reinforcement learning
Keywords: RL Theory Low adaptive RL Linear RL Multi-Batch RL
Scores: [ 8 5 6 ]
We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning. An algorithm is sample-efficient if it uses a number of queries n to the environment that is polynomial in the dimension d of the problem. Adaptivity refers to the frequency at which queries are sent and feedback is processed to update the querying strategy. To investigate this interplay, we employ a learning framework that allows sending queries in K batches, with feedback being processed and queries updated after each batch. This model encompasses the whole adaptivity spectrum, ranging from non-adaptive `offline' (K=1) to fully adaptive (K=n) scenarios, and regimes in between. For the problems of policy evaluation and best-policy identification under d-dimensional linear function approximation, we establish Ω(loglogd) lower bounds on the number of batches K required for sample-efficient algorithms with n=O(poly(d)) queries. Our results show that just having adaptivity (K>1) does not necessarily guarantee sample-efficiency. Notably, the adaptivity-boundary for sample-efficiency is not between offline reinforcement learning (K=1), where sample-efficiency was known to not be possible, and adaptive settings. Instead, the boundary lies between different regimes of adaptivity and depends on the problem dimension.
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Planning Neural Networks Temporal Difference Learning Generalization Deep Reinforcement Learning
Scores: [ 6 6 5 6 ]
Primary Area: reinforcement learning
Keywords: Trajectory augmentation Off-policy evaluation Sub-trajectory mining from offline dataset
Scores: [ 6 6 6 ]
In the realm of reinforcement learning (RL), off-policy evaluation (OPE) holds a pivotal position, especially in high-stake human-involved scenarios such as e-learning and healthcare. Applying OPE to these domains is often challenging with scarce and underrepresentative offline training trajectories. Data augmentation has been a successful technique to enrich training data. However, directly employing existing data augmentation methods to OPE may not be feasible, due to the Markovian nature within the offline trajectories and the desire for generalizability across diverse target policies. In this work, we propose an offline trajectory augmentation approach to specifically facilitate OPE in human-involved scenarios. We propose sub-trajectory mining to extract potentially valuable sub-trajectories from offline data, and diversify the behaviors within those sub-trajectories by varying coverage of the state-action space. Our work was empirically evaluated in a wide array of environments, encompassing both simulated scenarios and real-world domains like robotic control, healthcare, and e-learning, where the training trajectories include varying levels of coverage of the state-action space. By enhancing the performance of a variety of OPE methods, our work offers a promising path forward for tackling OPE challenges in situations where data may be limited or underrepresentative.
Primary Area: reinforcement learning
Keywords: imitation learning reinforcement learning multiple experts
Scores: [ 8 8 8 5 ]
While reinforcement learning (RL) has shown promising performance, its sample complexity continues to be a substantial hurdle, restricting its broader application across a variety of domains. Imitation learning (IL) utilizes oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. To address the demand for robust policy improvement in real-world scenarios, we introduce a novel algorithm, Robust Policy Improvement (RPI), which actively interleaves between IL and RL based on an online estimate of their performance. RPI draws on the strengths of IL, using oracle queries to facilitate exploration—an aspect that is notably challenging in sparse-reward RL—particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. This algorithm is capable of learning from and improving upon a diverse set of black-box oracles. Integral to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG), both of which reason over whether to perform state-wise imitation from the oracles or learn from its own value function when the learner’s performance surpasses that of the oracles in a specific state. Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methodologies, demonstrating superior performance across various benchmark domains.
Primary Area: reinforcement learning
Keywords: curricula model-based reinforcement learning world models reward-free reinforcement learning robustness unsupervised environment design
Scores: [ 6 8 6 5 ]
There has been a recent surge of interest in developing generally-capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, and enables policies to be trained using imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms naı̈ve domain randomisation, resulting in improved robustness, efficiency, and generalisation.
Primary Area: reinforcement learning
Keywords: ensembles overoptimization RLHF reinforcement learning from human feedback language models uncertainty weighted optimization
Scores: [ 6 8 6 6 ]
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the “true” reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger “gold” reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.
Primary Area: reinforcement learning
Keywords: risk-sensitive reinforcement learning robust Markov Decision Processes
Scores: [ 6 8 6 6 ]
Primary Area: reinforcement learning
Keywords: linear contextual bandits federated learning differential privacy
Scores: [ 6 8 8 6 ]
We consider cross-silo federated linear contextual bandit (LCB) problem under differential privacy, where multiple silos (agents) interact with the local users and communicate via a central server to realize collaboration while without sacrificing each user's privacy. We identify three issues in the state-of-the-art: (i) failure of claimed privacy protection and (ii) incorrect regret bound due to noise miscalculation and (iii) ungrounded communication cost. To resolve these issues, we take a two-step principled approach. First, we design an algorithmic framework consisting of a generic federated LCB algorithm and flexible privacy protocols. Then, leveraging the proposed framework, we study federated LCBs under two different privacy constraints. We first establish privacy and regret guarantees under silo-level local differential privacy, which fix the issues present in state-of-the-art algorithm.To further improve the regret performance, we next consider shuffle model of differential privacy, under which we show that our algorithm can achieve nearly optimal'' regret without a trusted server. We accomplish this via two different schemes -- one relies on a new result on privacy amplification via shuffling for DP mechanisms and another one leverages the integration of a shuffle protocol for vector sum into the tree-based mechanism, both of which might be of independent interest. Finally, we support our theoretical results withnumerical evaluations over contextual bandit instances generated from both synthetic and real-life data.
Primary Area: reinforcement learning
Keywords: reinforcement learning
Scores: [ 8 6 8 8 ]
Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite the previous attempts, making unsupervised RL truly scalable still remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the state space, to only cover a compact latent space Z that is metrically connected to the state space S by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, being scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, being the first unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and video are available at https://sites.google.com/view/metra0
Primary Area: reinforcement learning
Keywords: Multi-Objective Reinforcement Learning Lexicographic Task Priorities Constrained RL Transfer RL
Scores: [ 6 5 6 ]
Reinforcement learning (RL) for complex tasks remains a challenge, primarily due to the difficulties of engineering scalar reward functions and the inherent inefficiency of training models from scratch. Instead, it would be better to specify complex tasks in terms of elementary subtasks and to reuse subtask solutions whenever possible. In this work, we address continuous space lexicographic multi-objective RL problems, consisting of prioritized subtasks, which are notoriously difficult to solve. We show that these can be scalarized with a subtask transformation and then solved incrementally using value decomposition. Exploiting this insight, we propose prioritized soft Q-decomposition (PSQD), a novel algorithm for learning and adapting subtask solutions under lexicographic priorities in continuous state-action spaces. PSQD offers the ability to reuse previously learned subtask solutions in a zero-shot composition, followed by an adaptation step. Its ability to use retained subtask training data for offline learning eliminates the need for new environment interaction during adaptation. We demonstrate the efficacy of our approach by presenting successful learning, reuse, and adaptation results for both low- and high-dimensional simulated robot control tasks, as well as offline learning results. In contrast to baseline approaches, PSQD does not trade off between conflicting subtasks or priority constraints and satisfies subtask priorities during learning. PSQD provides an intuitive framework for tackling complex RL problems, offering insights into the inner workings of the subtask composition.
Primary Area: reinforcement learning
Keywords: Reinforcement learning; Diffusion models; RLHF; Preference aligning
Scores: [ 8 8 6 6 ]
Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RLHF to quantify human preferences, covering abstractness, and utilizes them to guide diffusion planning for zero-shot behavior customizing, covering mutability. AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish the multi-perspective human feedback datasets, which contain comparisons for the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling behavioral datasets with relative strengths, we proceed to train an attribute-conditioned diffusion model, which serves as a planner with the attribute strength model as a director for preference aligning at the inference phase. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases the promising potential for human-AI collaboration. More visualization videos are released on https://aligndiff.github.io/.
Primary Area: reinforcement learning
Keywords: Offline Decision-Making Offline Reinforcement Learning Generative Model Diffusion Model
Scores: [ 8 5 8 6 ]
Temporal abstraction and efficient planning pose significant challenges in offline reinforcement learning, mainly when dealing with domains that involve temporally extended tasks and delayed sparse rewards. Existing methods typically plan in the raw action space and can be inefficient and inflexible. Latent action spaces offer a more flexible approach, capturing only possible actions within the behavior policy support and decoupling the temporal structure between planning and modeling. However, current latent-action-based methods are limited to discrete spaces and require expensive planning steps. This paper presents a unified framework for continuous latent action space representation learning and planning by leveraging latent, score-based diffusion models. We establish the theoretical equivalence between planning in the latent action space and energy-guided sampling with a pretrained diffusion model and introduce a novel sequence-level exact sampling method. Our proposed method, LatentDiffuser, demonstrates competitive performance on low-dimensional locomotion control tasks and surpasses existing methods in higher-dimensional tasks.
Primary Area: reinforcement learning
Keywords: extensive-form games correlated equilibria no-regret learning
Scores: [ 6 6 6 6 ]
A recent paper by Farina and Pipis (2023) established the existence of uncoupled no-linear-swap regret dynamics with polynomial-time iterations in extensive-form games. The equilibrium points reached by these dynamics, known as linear correlated equilibria, are currently the tightest known relaxation of correlated equilibrium that can be learned in polynomial time in any finite extensive-form game. However, their properties remain vastly unexplored, and their computation is onerous. In this paper, we provide several contributions shedding light on the fundamental nature of linear-swap regret. First, we show a connection between linear deviations and a generalization of communication deviations in which the player can make queries to a mediator'' who replies with action recommendations, and, critically, the player is not constrained to match the timing of the game as would be the case for communication deviations. We coin this latter set the untimed communication (UTC) deviations. We show that the UTC deviations coincide precisely with the linear deviations, and therefore that any player minimizing UTC regret also minimizes linear-swap regret. We then leverage this connection to develop state-of-the-art no-regret algorithms for computing linear correlated equilibria, both in theory and in practice. In theory, our algorithms achieve polynomially better per-iteration runtimes; in practice, our algorithms represent the state of the art by several orders of magnitude.
Primary Area: reinforcement learning
Keywords: inverse reinforcement learning meta learning
Scores: [ 6 6 6 6 ]
Primary Area: reinforcement learning
Keywords: MetaFormer Convolution Reinforcement Learning Representation Learning
Scores: [ 8 8 5 ]
The recent success of Transformer in natural language processing has sparked its use in various domains. In offline reinforcement learning (RL), Decision Transformer (DT) is emerging as a promising model based on Transformer. However, we discovered that the attention module of DT is not appropriate to capture the inherent local dependence pattern in trajectories of RL modeled as a Markov decision process. To overcome the limitations of DT, we propose a novel action sequence predictor, named Decision ConvFormer (DC), based on the architecture of MetaFormer, which is a general structure to process multiple entities in parallel and understand the interrelationship among the multiple entities. DC employs local convolution filtering as the token mixer and can effectively capture the inherent local associations of the RL dataset. In extensive experiments, DC achieved state-of-the-art performance across various standard RL benchmarks while requiring fewer resources. Furthermore, we show that DC better understands the underlying meaning in data and exhibits enhanced generalization capability.
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Multi-Task Learning Mixture of Experts
Scores: [ 6 6 6 5 5 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning theory reinforcement learning with adversarial reward regret minimization linear function approximation
Scores: [ 6 6 6 6 ]
Recent studies have shown that the regret of reinforcement learning (RL) can be polylogarithmic in the planning horizon H. However, it remains an open question whether such a result holds for adversarial RL. In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen reward over episodes, our algorithm employs (1) a variance-uncertainty-aware weighted least square estimator for the transition kernel; and (2) an occupancy measure-based technique for the online search of a stochastic policy. We show that our algorithm achieves an ~O((d+log|S|)√K+d2) regret with full-information feedback, where d is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, K is the number of episodes, |S| is the cardinality of the state space. We also provide hardness results to justify the near optimality of our algorithm and the inevitability of log|S| in the regret bound.
Primary Area: reinforcement learning
Keywords: reinforcement learning options
Scores: [ 6 6 6 ]
In reinforcement learning, agents often learn policies for specific tasks without the ability to generalize this knowledge to related tasks. This paper introduces an algorithm that attempts to address this limitation by decomposing neural networks encoding policies for Markov Decision Processes into reusable sub-policies, which are used to synthesize temporally extended actions, or options. We consider neural networks with piecewise linear activation functions, so that they can be mapped to an equivalent tree that is similar to oblique decision trees. Since each node in such a tree serves as a function of the input of the tree, each sub-tree is a sub-policy of the main policy. We turn each of these sub-policies into options by wrapping it with while-loops of varied number of iterations. Given the large number of options, we propose a selection mechanism based on minimizing the Levin loss for a uniform policy on these options. Empirical results in two grid-world domains where exploration can be difficult confirm that our method can identify useful options, thereby accelerating the learning process on similar but different tasks.
Primary Area: reinforcement learning
Keywords: Robust Reinforcement Learning Offline Reinforcement Learning Diffusion Models
Scores: [ 8 6 6 8 ]
Offline reinforcement learning (RL), which aims to fully explore offline datasets for training without interaction with environments, has attracted growing recent attention. A major challenge for the real-world application of offline RL stems from the robustness against state observation perturbations, e.g., as a result of sensor errors or adversarial attacks. Unlike online robust RL, agents cannot be adversarially trained in the offline setting. In this work, we propose Diffusion Model-Based Predictor (DMBP) in a new framework that recovers the actual states with conditional diffusion models for state-based RL tasks. To mitigate the error accumulation issue in model-based estimation resulting from the classical training of conventional diffusion models, we propose a non-Markovian training objective to minimize the sum entropy of denoised states in RL trajectory. Experiments on standard benchmark problems demonstrate that DMBP can significantly enhance the robustness of existing offline RL algorithms against different scales of random noise and adversarial attacks on state observations. Further, the proposed framework can effectively deal with incomplete state observations with random combinations of multiple unobserved dimensions in the test.
Primary Area: reinforcement learning
Keywords: reinforcement learning off-policy advantage function
Scores: [ 6 5 8 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Natural Language Generation Offline Policy Gradients
Scores: [ 5 6 8 6 ]
Language Models (LMs) achieve substantial language capabilities when finetuned using Reinforcement Learning with Human Feedback (RLHF). However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By assuming the entire LM output sequence as a single action, A-LoL allows incorporating sequence-level classifiers or human-designed scoring functions as rewards. Subsequently, by using LM’s internal sequence-level value estimate, A-LoL filters negative advantage (low-quality) data points during training, making it resilient to noise. Overall, A-LoL is an easy-to-implement LM training recipe that is sample-efficient and stable.We demonstrate the effectiveness of A-LoL and its variants with a set of four different language generation tasks. We compare against both online RL (PPO) and recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly-used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated more safe and helpful than baselines according to humans. Additionally, in the remaining three tasks, A-LoL could optimize multiple distinct reward functions even when using noisy or suboptimal training data.
Primary Area: reinforcement learning
Keywords: Deep Reinforcement Learning Signal Delay Robotic Control Continuous Control
Scores: [ 3 6 6 8 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Exploration Intrinsic Rewards Stationarity
Scores: [ 5 6 5 8 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning; Multitask Learning; Exploration
Scores: [ 6 6 8 5 ]
Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for its wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on the improved statistical efficiency by assuming a shared structure across tasks, exploration--a crucial aspect of RL--has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with myopic exploration design like ϵ-greedy that are inefficient in general can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic success of the wide applications of myopic exploration in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the task selection by automatic curriculum learning, which is empirically shown to improve sample-efficiency.
Primary Area: reinforcement learning
Keywords: RLHF Diverse Human Feedback Reinforcement Learning
Scores: [ 5 6 8 ]
Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without the need for costly manual reward design by aligning human preferences. It is crucial to consider diverse human feedback types and various learning methods in different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow from real human feedback, fostering progress in the development of practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF develops a user-friendly annotation interface tailored to various feedback types, compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in large-scale annotated datasets comprising more than 15 million steps across 32 popular tasks. Through extensive experiments, the results in the collected datasets demonstrate competitive performance compared to those from well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We wish to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at https://uni-rlhf.github.io/.
Primary Area: reinforcement learning
Keywords: principal-agent problems sample complexity online learning contract design
Scores: [ 6 6 6 ]
Primary Area: reinforcement learning
Keywords: preference-based reinforcement learning human feedback efficiency query-policy misalignment
Scores: [ 6 6 8 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning regularized markov decision process soft actor-critique
Scores: [ 6 5 6 ]
Regularized reinforcement learning (RL), particularly the entropy-regularized kind, has gained traction in optimal control and inverse RL. While standard unregularized RL methods remain unaffected by changes in the number of actions, we show that it can severely impact their regularized counterparts. This paper demonstrates the importance of decoupling the regularizer from the action space: that is, to maintain a consistent level of regularization regardless of how many actions are involved to avoid over-regularization. Whereas the problem can be avoided by introducing a task-specific temperature parameter, it is often undesirable and cannot solve the problem when action spaces are state-dependent. In the state-dependent action context, different states with varying action spaces are regularized inconsistently. We introduce two solutions: a static temperature selection approach and a dynamic counterpart, universally applicable where this problem arises. Implementing these changes improves performance on the DeepMind control suite in static and dynamic temperature regimes and a biological design task.
Primary Area: reinforcement learning
Keywords: Reinforcement learning multi-task learning bracketing number predictive state representation POMDP
Scores: [ 6 6 8 6 8 ]
In multi-task reinforcement learning (RL) under Markov decision processes (MDPs), the presence of shared latent structures among multiple MDPs has been shown to yield significant benefits to the sample efficiency compared to single-task RL. In this paper, we investigate whether such a benefit can extend to more general sequential decision making problems, such as partially observable MDPs (POMDPs) and more general predictive state representations (PSRs). The main challenge here is that the large and complex model space makes it hard to identify what types of common latent structure of multi-task PSRs can reduce the model complexity and improve sample efficiency. To this end, we posit a {\em joint model class} for tasks and use the notion of η-bracketing number to quantify its complexity; this number also serves as a general metric to capture the similarity of tasks and thus determines the benefit of multi-task over single-task RL. We first study upstream multi-task learning over PSRs, in which all tasks share the same observation and action spaces. We propose a provably efficient algorithm UMT-PSR for finding near-optimal policies for all PSRs, and demonstrate that the advantage of multi-task learning manifests if the joint model class of PSRs has a smaller η-bracketing number compared to that of individual single-task learning. We also provide several example multi-task PSRs with small η-bracketing numbers, which reap the benefits of multi-task learning. We further investigate downstream learning, in which the agent needs to learn a new target task that shares some commonalities with the upstream tasks via a similarity constraint. By exploiting the learned PSRs from the upstream, we develop a sample-efficient algorithm that provably finds a near-optimal policy. Upon specialization to the examples used to elucidate the η-bracketing numbers, our downstream results further highlight the benefit compared to directly learning the target PSR without upstream information. Ours is the first theoretical study that quantifies the benefits of multi-task RL with PSRs over its single-task counterpart.
Primary Area: reinforcement learning
Keywords: Safe Reinforcement Learning Reinforcement Learning from Human Feedback Large Language Model AI Safety
Scores: [ 8 6 8 8 ]
Primary Area: reinforcement learning
Keywords: world models temporal abstraction hierarchical learning model-based reinforcement learning hierarchical planning
Scores: [ 8 6 6 ]
Hierarchical world models can significantly improve model-based reinforcement learning (MBRL) and planning by enabling reasoning across multiple time scales. Nonetheless, the majority of state-of-the-art MBRL methods still employ flat, non-hierarchical models. We propose Temporal Hierarchies from Invariant Context Kernels (THICK), an algorithm that learns a world model hierarchy via discrete latent dynamics. The lower level of THICK updates parts of its latent state sparsely in time, forming invariant contexts. The higher level exclusively predicts situations involving context state changes. Our experiments demonstrate that THICK learns categorical, interpretable, temporal abstractions on the high level, while maintaining precise low-level predictions. Furthermore, we show that the emergent hierarchical predictive model seamlessly enhances the abilities of MBRL or planning methods. We believe that THICK contributes to the further development of hierarchical, context-conditioned, event-predictive world models that can enhance planning and reasoning abilities and produce more human-like behavior.
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Quality Diversity Robotics Machine Learning Evolution Strategies
Scores: [ 8 6 6 8 ]
Training generally capable agents that thoroughly explore their environment andlearn new and diverse skills is a long-term goal of robot learning. Quality DiversityReinforcement Learning (QD-RL) is an emerging research area that blends thebest aspects of both fields – Quality Diversity (QD) provides a principled formof exploration and produces collections of behaviorally diverse agents, whileReinforcement Learning (RL) provides a powerful performance improvementoperator enabling generalization across tasks and dynamic environments. ExistingQD-RL approaches have been constrained to sample efficient, deterministic off-policy RL algorithms and/or evolution strategies and struggle with highly stochasticenvironments. In this work, we, for the first time, adapt on-policy RL, specificallyProximal Policy Optimization (PPO), to the Differentiable Quality Diversity (DQD)framework and propose several changes that enable efficient optimization anddiscovery of novel skills on high-dimensional, stochastic robotics tasks. Our newalgorithm, Proximal Policy Gradient Arborescence (PPGA), achieves state-of-the-art results, including a 4x improvement in best reward over baselines on thechallenging humanoid domain.
Primary Area: reinforcement learning
Keywords: policy transfer transfer learning imitation learning reinforcement learning
Scores: [ 5 6 8 5 ]
Primary Area: reinforcement learning
Keywords: Human-AI interaction Reinforcement Learning
Scores: [ 5 6 5 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Delay EfficientZero Tree-search Sample efficiency
Scores: [ 6 6 6 5 ]
Primary Area: reinforcement learning
Keywords: Efficient Adaptation Continual Learning Robot Learning Language-Conditioned Visuomotor Control Few-shot Adaptation Large Pretrained Models Imitation Learning Transfer Learning
Scores: [ 8 6 5 6 6 ]
The full potential of large pretrained models remains largely untapped in control domains like robotics. This is mainly because of the scarcity of data and the computational challenges associated with training or fine-tuning these large models for such applications. Prior work mainly emphasizes effective \emph{pretraining} of large models for decision-making, with little exploration into how to perform data-efficient continual \emph{adaptation} of these models for new tasks. Recognizing these constraints, we introduce TAIL (Task-specific Adapters for Imitation Learning), a framework for efficient adaptation to new control tasks. Inspired by recent advancements in parameter-efficient fine-tuning in language domains, we explore efficient fine-tuning techniques---e.g., Bottleneck Adapters, P-Tuning, and Low-Rank Adaptation (LoRA)---in TAIL to adapt large pretrained models for new tasks with limited demonstration data. Our extensive experiments comparing prevalent parameter-efficient fine-tuning techniques and adaptation baselines suggest that TAIL with LoRA can achieve the best post-adaptation performance with only 1% of the trainable parameters of full fine-tuning, while avoiding catastrophic forgetting and preserving adaptation plasticity in continual learning settings.
Primary Area: reinforcement learning
Keywords: reinforcement learning theory hybrid reinforcement learning actor critic method online reinforcement learning offline reinforcement learning
Scores: [ 6 8 6 8 ]
Hybrid RL is the setting where an RL agent has access to both offline data and online data by interacting with the real-world environment. In this work, we propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. On-policy methods such as policy gradient and natural policy gradient (NPG) have shown to be more robust to model misspecification, though sometimes it may not be as sample efficient as methods that rely on off-policy learning. On the other hand, offline methods that depend on off-policy training often require strong assumptions in theory and are less stable to train in practice. Our new approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework. We show that our approach, in theory, can obtain a best-of-both-worlds type of result --- it achieves the state-of-art theoretical guarantees of offline RL when offline RL-specific assumptions hold, while at the same time maintaining the theoretical guarantees of on-policy NPG regardless of the offline RL assumptions' validity. Experimentally, in challenging rich-observation environments, we show that our approach outperforms a state-of-the-art hybrid RL baseline which only relies on off-policy policy optimization, demonstrating the empirical benefit of combining on-policy and off-policy learning.
Primary Area: reinforcement learning
Keywords: linear MDPs hierarchical reinforcement learning linear function approximation aggregation
Scores: [ 6 6 6 5 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement learning exploration intrinsic motivation knowledge
Scores: [ 6 6 6 8 ]
Human intelligence is adept at absorbing valuable insights from external knowledge.This capability is equally crucial for artificial intelligence. In contrast, classical reinforcement learning agents lack such capabilities and often resort to extensive trial and error to explore the environment. This paper introduces PAE: $\textbf{P}lanner−\textbf{A}ctor−\textbf{E}$valuator, a novel framework for teaching agents to learn to absorb external knowledge. PAE integrates the Planner's knowledge-state alignment mechanism, the Actor's mutual information skill control, and the Evaluator's adaptive intrinsic exploration reward to achieve 1) effective cross-modal information fusion, 2) enhanced linkage between knowledge and state, and 3) hierarchical mastery of complex tasks.Comprehensive experiments in six challenging sparse reward environments demonstrate PAE's superior exploration efficiency with good interpretability compared to existing methods. We provide the source code in the supplementary for further study and application.
Primary Area: reinforcement learning
Keywords: Imitation Learning Behavior Cloning Deep Learning
Scores: [ 6 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning pytorch control robotics
Scores: [ 6 6 8 6 ]
PyTorch has ascended as a premier machine learning framework, yet it lacks a native and comprehensive library for decision and control tasks suitable for large development teams dealing with complex real-world data and environments. To address this issue, we propose TorchRL, a generalistic control library for PyTorch that provides well-integrated, yet standalone components. We introduce a new and flexible PyTorch primitive, the TensorDict, which facilitates streamlined algorithm development across the many branches of Reinforcement Learning (RL) and control. We provide a detailed description of the building blocks and an extensive overview of the library across domains and tasks. Finally, we experimentally demonstrate its reliability and flexibility, and show comparative benchmarks to demonstrate its computational efficiency. TorchRL fosters long-term support and is publicly available on GitHub for greater reproducibility and collaboration within the research community. The code is open-sourced on GitHub.
Primary Area: reinforcement learning
Keywords: communication learning multi-agent reinforcement learning
Scores: [ 6 6 8 8 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning preference-based feedback theory active learning posterior sampling
Scores: [ 6 6 5 8 ]
RL algorithms that learn from human feedback (RLHF) need to be efficient in terms of statistical complexity, computational complexity, and query complexity. In this work, we consider the RLHF setting where the feedback is given in the format of preferences over pairs of trajectories. In the linear MDP model, by using randomization in algorithm design, we present an algorithm that is sample efficient (i.e., has near-optimal worst-case regret bounds) and has polynomial running time (i.e., computational complexity is polynomial with respect to relevant parameters). Our algorithm further minimizes the query complexity through a novel randomized active learning procedure. Particularly, our algorithm demonstrates a near-optimal tradeoff between the regret bound and the query complexity. To extend the results to more general nonlinear function approximation, we design a model-based randomized algorithm inspired by the idea of Thompson sampling. Our algorithm minimizes Bayesian regret bound and query complexity, again achieving a near-optimal tradeoff between these two quantities. Computation-wise, similar to the prior Thompson sampling algorithms under the regular RL setting, the main computation primitives of our algorithm are Bayesian supervised learning oracles which have been heavily investigated on the empirical side when applying Thompson sampling algorithms to RL benchmark problems.
Primary Area: reinforcement learning
Keywords: Deep Reinforcement Learning
Scores: [ 6 8 6 8 ]
Sample efficiency is a crucial problem in deep reinforcement learning. Recent algorithms, such as REDQ and DroQ, found a way to improve the sample efficiency by increasing the update-to-data (UTD) ratio to 20 gradient update steps on the critic per environment sample. However, this comes at the expense of a greatly increased computational cost. To reduce this computational burden, we introduce CrossQ: a lightweight algorithm that makes careful use of Batch Normalization and removes target networks to surpass the state-of-the-art in sample efficiency while maintaining a low UTD ratio of 1. Notably, CrossQ does not rely on advanced bias-reduction schemes used in current methods. CrossQ’s contributions are thus threefold: (1) state-of-the-art sample efficiency, (2) substantial reduction in computational cost compared to REDQ and DroQ, and (3) ease of implementation, requiring just a few lines of code on top of SAC.
Primary Area: reinforcement learning
Keywords: Deep Reinforcement Learning Offline Reinforcement Learning Pretraining
Scores: [ 6 5 6 ]
Primary Area: reinforcement learning
Keywords: Deep Reinforcement Learning Classical Planning Genetic Programming Symbolic AI Learning from Demonstration
Scores: [ 6 6 8 5 ]
Despite the recent advancement in deep reinforcement learning (DRL), it still struggles at learning sparse-reward goal-directed tasks. On the other hand, classical planning approaches excel at addressing tasks with hierarchical structures by employing symbolic knowledge for high-level planning. Yet, most classical planning methods rely on assumptions about pre-defined subtasks, making them inapplicable in domains without domain knowledge or models. To bridge the best of both worlds, we propose a framework that integrates DRL with classical planning by automatically inducing task structures and substructures from a few demonstrations. Specifically, we use symbolic regression for substructure induction by adopting genetic programming where the program model reflects prior domain knowledge of effect rules. We compare our proposed framework to DRL algorithms, imitation learning methods, and an exploration approach in various domains. The experimental results show that our framework outperforms the baselines in terms of sample efficiency and task performance. Moreover, our framework achieves strong generalization performance by inducing the new rules and composing the task structures. Ablation studies justify the design of the induction module and the proposed genetic programming procedure.
Primary Area: reinforcement learning
Keywords: reinforcement learning model-based reinforcement learning world models
Scores: [ 8 8 8 8 ]
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents.Explore videos, models, data, code, and more at https://tdmpc2.com
Primary Area: reinforcement learning
Keywords: Deep RL exploration density estimation representation learning
Scores: [ 8 6 8 ]
Primary Area: reinforcement learning
Keywords: Reinforcement learning Policy search Bayesian optimization Gaussian Processes
Scores: [ 6 6 8 8 6 ]
Deterministic policies are often preferred over stochastic ones when implemented on physical systems. They can prevent erratic and harmful behaviors while being easier to implement and interpret. However, in practice, exploration is largely performed by stochastic policies.First-order Bayesian Optimization (BO) methods offer a principled way of performing exploration using deterministic policies. This is done through a learned probabilistic model of the objective function and its gradient. Nonetheless, such approaches treat policy search as a black-box problem, and thus, neglect the reinforcement learning nature of the problem. In this work, we leverage the performance difference lemma to introduce a novel mean function for the probabilistic model. This results in augmenting BO methods with the action-value function. Hence, we call our method Augmented Bayesian Search (ABS).Interestingly, this new mean function enhances the posterior gradient with the deterministic policy gradient, effectively bridging the gap between BO and policy gradient methods. The resulting algorithm combines the convenience of the direct policy search with the scalability of reinforcement learning.We validate ABS on high-dimensional locomotion problems and demonstrate competitive performance compared to existing direct policy search schemes.
Primary Area: reinforcement learning
Keywords: reinforcement learning multi-agent reinforcement learning reward design alignment
Scores: [ 6 6 8 8 ]
Primary Area: reinforcement learning
Keywords: Reset-Free RL
Scores: [ 8 8 5 6 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning robotics data transfer
Scores: [ 8 6 6 6 ]
Replaying data is a principal mechanism underlying the stability and data efficiency of off-policy reinforcement learning (RL).We present an effective yet simple framework to extend the use of replays across multiple experiments, minimally adapting the RL workflow for sizeable improvements in controller performance and research iteration times.At its core, Replay across Experiments (RaE) involves reusing experience from previous experiments to improve exploration and bootstrap learning while reducing required changes to a minimum in comparison to prior work. We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. Finally, we discuss how our approach can be applied more broadly across research life cycles and can increase resilience by reloading data across random seeds or hyperparameter variations.
Primary Area: reinforcement learning
Keywords: preference based reinforcement learning world models return redistribution
Scores: [ 8 5 6 ]
Primary Area: reinforcement learning
Keywords: deep learning deep reinforcement learning reward learning vision-language models vision transformer transformer CLIP
Scores: [ 6 3 3 8 8 ]
Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sampleefficient alternative: using pretrained vision-language models (VLMs) as zeroshot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/anon-vlmrm. We can improve performance by providing a second “baseline” prompt and projecting out parts of the CLIP embedding space irrelevant to distinguish between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.
Primary Area: reinforcement learning
Keywords: Deep Reinforcement Learning Soft Robot
Scores: [ 5 8 5 6 ]
Primary Area: reinforcement learning
Keywords: Visual RL; Dormant Ratio
Scores: [ 6 6 8 6 ]
Visual reinforcement learning (RL) has shown promise in continuous control tasks.Despite its progress, current algorithms are still unsatisfactory in virtually every aspect of the performance such as sample efficiency, asymptotic performance, and their robustness to the choice of random seeds.In this paper, we identify a major shortcoming in existing visual RL methods that is the agents often exhibit sustained inactivity during early training, thereby limiting their ability to explore effectively. Expanding upon this crucial observation, we additionally unveil a significant correlation between the agents' inclination towards motorically inactive exploration and the absence of neuronal activity within their policy networks.To quantify this inactivity, we adopt dormant ratio as a metric to measure inactivity in the RL agent's network.Empirically, we also recognize that the dormant ratio can act as a standalone indicator of an agent's activity level, regardless of the received reward signals.Leveraging the aforementioned insights, we introduce DrM, a method that uses three core mechanisms to guide agents' exploration-exploitation trade-offs by actively minimizing the dormant ratio. Experiments demonstrate that DrM achieves significant improvements in sample efficiency and asymptotic performance with no broken seeds (76 seeds in total) across three continuous control benchmark environments, including DeepMind Control Suite, MetaWorld, and Adroit.Most importantly, DrM is the first model-free algorithm that consistently solves tasks in both the Dog and Manipulator domains from the DeepMind Control Suite as well as three dexterous hand manipulation tasks without demonstrations in Adroit, all based on pixel observations.
Primary Area: reinforcement learning
Keywords: reinforcement learning abstraction
Scores: [ 6 6 5 6 ]
Reinforcement learning often needs to deal with the exponential growth of states and actions when exploring optimal control in high-dimensional spaces (often known as the curse of dimensionality). In this work, we address this issue by learning the inherent structure of action-wise similar MDP to appropriately balance the performance degradation versus sample/computational complexity. In particular, we partition the action spaces into multiple groups based on the similarity in transition distribution and reward function, and build a linear decomposition model to capture the difference between the intra-group transition kernel and the intra-group rewards. Both our theoretical analysis and experiments reveal a surprising and counter-intuitive result: while a more refined grouping strategy can reduce the approximation error caused by treating actions in the same group as identical, it also leads to increased estimation error when the size of samples or the computation resources is limited. This finding highlights the grouping strategy as a new degree of freedom that can be optimized to minimize the overall performance loss. To address this issue, we formulate a general optimization problem for determining the optimal grouping strategy, which strikes a balance between performance loss and sample/computational complexity. We further propose a computationally efficient method for selecting a nearly-optimal grouping strategy, which maintains its computational complexity independent of the size of the action space.
Primary Area: reinforcement learning
Keywords: offline imitation learning; reward learning; reinforcement learning
Scores: [ 8 6 8 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Meta-Learning Meta-RL Meta-Optimization Policy Meta-Optimization Learned Objective Functions
Scores: [ 8 8 5 ]
Primary Area: reinforcement learning
Keywords: offline RL heuristic RL MDP sequential decision-making
Scores: [ 8 8 8 5 ]
Primary Area: reinforcement learning
Keywords: Meta-RL Generalization Long-Term Memory Transformers
Scores: [ 8 8 8 6 ]
We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is uniquely scalable and applicable to a wide range of problems. We demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems with challenging exploration. When combined with a novel hindsight relabeling scheme, AMAGO can solve a previously difficult category of open-world domains, where agents complete many possible instructions in procedurally generated environments. We evaluate our agent on three goal-conditioned domains and study how its individual improvements connect to create a generalist policy.
Primary Area: reinforcement learning
Keywords: deep reinforcement learning ensemble-based exploration off-policy learning representation learning auxiliary tasks
Scores: [ 8 6 8 8 ]
We uncover a surprising phenomenon in deep reinforcement learning: training a diverse ensemble of data-sharing agents -- a well-established exploration strategy -- can significantly impair the performance of the individual ensemble members when compared to standard single-agent training. Through careful analysis, we attribute the degradation in performance to the low proportion of self-generated data in the shared training data for each ensemble member, as well as the inefficiency of the individual ensemble members to learn from such highly off-policy data. We thus name this phenomenon the curse of diversity. We find that several intuitive solutions -- such as a larger replay buffer or a smaller ensemble size -- either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Finally, we demonstrate the potential of representation learning to counteract the curse of diversity with a novel method named Cross-Ensemble Representation Learning (CERL) in both discrete and continuous control domains. Our work offers valuable insights into an unexpected pitfall in ensemble-based exploration and raises important caveats for future applications of similar approaches.
Primary Area: reinforcement learning
Keywords: Reinforcement learning Posterior sampling Causality
Scores: [ 8 8 6 8 ]
Posterior sampling allows the exploitation of prior knowledge of the environment's transition dynamics to improve the sample efficiency of reinforcement learning. The prior is typically specified as a class of parametric distributions, a task that can be cumbersome in practice, often resulting in the choice of uninformative priors. In this work, we propose a novel posterior sampling approach in which the prior is given as a (partial) causal graph over the environment's variables. The latter is often more natural to design, such as listing known causal dependencies between biometric features in a medical treatment study. Specifically, we propose a hierarchical Bayesian procedure, called C-PSRL, simultaneously learning the full causal graph at the higher level and the parameters of the resulting factored dynamics at the lower level. For this procedure, we provide an analysis of its Bayesian regret, which explicitly connects the regret rate with the degree of prior knowledge. Our numerical evaluation conducted in illustrative domains confirms that C-PSRL strongly improves the efficiency of posterior sampling with an uninformative prior while performing close to posterior sampling with the full causal graph.
Primary Area: reinforcement learning
Keywords: Multi-agent reinforcement learning episodic control episodic incentive state embedding
Scores: [ 6 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement learning SGMCMC uncertainty quantification
Scores: [ 5 6 5 ]
Reinforcement learning tackles sequential decision-making problems by designing an agent that interacts with the environment. However, existing algorithms often treat the problem as static, calculating a point estimator for model parameters to achieve maximal expected reward (also known as value function) for the agent. They tend to overlook the stochastic nature of the agent-environment interaction system and the importance of uncertainty quantification associated with the model parameters. In our research, leveraging the Kalman filtering paradigm, we introduce a novel and scalable sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD) for deep reinforcement learning. This algorithm, grounded in stochastic gradient Markov chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by the LKTD algorithm converge to a stationary distribution. This convergence not only enables us to quantify uncertainties associated with the value function and model parameters, but also allows us to monitor these uncertainties during policy updates throughout the training phase. The LKTD algorithm paves the way for more robust and adaptable reinforcement learning approaches.
Primary Area: reinforcement learning
Keywords: Multi-agent reinforcement learning contrastive learning attention mechanism
Scores: [ 6 6 3 5 ]
Primary Area: reinforcement learning
Keywords: imitation learning data augmentation robotics model-based method
Scores: [ 5 6 6 6 ]
We present a new technique to enhance the robustness of imitation learning methods by generating corrective data to account for compounding error and disturbances. While existing methods rely on interactive expert labeling, additional offline datasets, or domain-specific invariances, our approach requires minimal additional assumptions beyond expert data. The key insight is to leverage local continuity in the environment dynamics. Our method first constructs a dynamics model from the expert demonstration, enforcing local Lipschitz continuity while skipping the discontinuous regions. In the locally continuous regions, this model allows us to generate corrective labels within the neighborhood of the demonstrations but beyond the actual set of states and actions in the dataset. Training on this augmented data enhances the agent's ability to recover from perturbations and deal with compounding error. We demonstrate the effectiveness of our generated labels through experiments in a variety of robotics domains that have distinct forms of continuity and discontinuity, including classic control, drone flying, high-dimensional navigation, locomotion, and tabletop manipulation.
Primary Area: reinforcement learning
Keywords: offline reinforcement learning imitation learning distribution correction estimation
Scores: [ 8 6 8 8 ]
Primary Area: reinforcement learning
Keywords: pomdp guarantees representation learning reinforcement learning
Scores: [ 6 6 5 10 ]
Partially Observable Markov Decision Processes (POMDPs) are used to model environments where the full state cannot be perceived by an agent. As such the agent needs to reason taking into account the past observations and actions. However, simply remembering the full history is generally intractable due to the exponential growth in the history space. Maintaining a probability distribution that models the belief over what the true state is can be used as a sufficient statistic of the history, but its computation requires access to the model of the environment and is often intractable. While SOTA algorithms use Recurrent Neural Networks to compress the observation-action history aiming to learn a sufficient statistic, they lack guarantees of success and can lead to sub-optimal policies. To overcome this, we propose the Wasserstein Belief Updater, an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of our approximation ensuring that our outputted beliefs allow for learning the optimal value function.
Primary Area: reinforcement learning
Keywords: reinforcement learning cascading bandits combinatorial action space computational and sample efficiency
Scores: [ 6 8 6 8 ]
Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. Then, the user examines the list, and clicks the first attractive item (if any), and after that, the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influences of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which considers the impact of user states and state transition into decisions. In cascading RL, we need to select items not only with large attraction probabilities but also leading to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions, and design an oracle BestPerm to efficiently find the optimal item list. Equipped with BestPerm, we develop two algorithms CascadingVI and CascadingBPI, which are both computationally-efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice.
Primary Area: reinforcement learning
Keywords: Inverse Game Theory Inverse Multiagent Reinforcement Learning
Scores: [ 8 10 6 6 ]
Primary Area: reinforcement learning
Keywords: imperfect-information games search decision-time planning
Scores: [ 3 5 3 6 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning theory offline reinforcement learning general function approximation learnability minimax lower bounds
Scores: [ 8 8 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Quality-Diversity Reinforcement Learning Evolutionary Algorithms
Scores: [ 8 8 8 ]
Quality-Diversity (QD) algorithms, as a subset of evolutionary algorithms, have emerged as a powerful optimization paradigm with the aim of generating a set of high-quality and diverse solutions. Although QD has demonstrated competitive performance in reinforcement learning, its low sample efficiency remains a significant impediment for real-world applications. Recent research has primarily focused on augmenting sample efficiency by refining selection and variation operators of QD. However, one of the less considered yet crucial factors is the inherently large-scale issue of the QD optimization problem. In this paper, we propose a novel Cooperative Coevolution QD (CCQD) framework, which decomposes a policy network naturally into two types of layers, corresponding to representation and decision respectively, and thus simplifies the problem significantly. The resulting two (representation and decision) subpopulations are coevolved cooperatively. CCQD can be implemented with different selection and variation operators. Experiments on several popular tasks within the QDAX suite demonstrate that an instantiation of CCQD achieves approximately a 200% improvement in sample efficiency.
Primary Area: reinforcement learning
Keywords: PSRO game theory deep RL Nash population
Scores: [ 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Inverse Constrained Reinforcement Learning Constrained Reinforcement Learning Inverse Reinforcement Learning Uncertainty Modeling
Scores: [ 6 6 6 5 ]
Primary Area: reinforcement learning
Keywords: Large Language Models Reinforcement Learning Dexterous Manipulation Reward Learning Robotics
Scores: [ 8 5 6 6 ]
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed.
Primary Area: reinforcement learning
Keywords: offline reinforcement learning compositional generalization conservatism transduction
Scores: [ 5 8 6 ]
Primary Area: reinforcement learning
Keywords: Stochastic optimization submodular maximization Frank-Wolfe algorithm
Scores: [ 6 5 10 ]
This paper introduces unified projection-free Frank-Wolfe type algorithms for adversarial continuous DR-submodular optimization, spanning scenarios such as full information and (semi-)bandit feedback, monotone and non-monotone functions, different constraints, and types of stochastic queries. For every problem considered in the non-monotone setting, the proposed algorithms are either the first with proven sub-linear α-regret bounds or have better α-regret bounds than the state of the art, where α is a corresponding approximation bound in the offline setting. In the monotone setting, the proposed approach gives state-of-the-art sub-linear α-regret bounds among projection-free algorithms in 7 of the 8 considered cases while matching the result of the remaining case. Additionally, this paper addresses semi-bandit and bandit feedback for adversarial DR-submodular optimization, advancing the understanding of this optimization area.
Primary Area: reinforcement learning
Keywords: Model-based Reinforcement Learning; Reward Shaping; Reward Smoothing
Scores: [ 6 6 5 6 ]
Primary Area: reinforcement learning
Keywords: offline reinforcement learning POMDPs representation learning
Scores: [ 5 5 5 ]
Offline reinforcement learning (RL) can in principle synthesize more optimal behavior from a dataset consisting only of suboptimal trials. One way that this can happen is by "stitching" together the best parts of otherwise suboptimal trajectories that overlap on similar states, to create new behaviors where each individual state is in-distribution, but the overall returns are higher. However, in many interesting and complex applications, such as autonomous navigation and dialogue systems, the state is partially observed. Even worse, the state representation is unknown or not easy to define. In such cases, policies and value functions are often conditioned on observation histories instead of states. In these cases, it is not clear if the same kind of "stitching" is feasible at the level of observation histories, since two different trajectories would always have different histories, and thus "similar states" that might lead to effective stitching cannot be leveraged. Theoretically, we show that standard offline RL algorithms conditioned on observation histories suffer from poor sample complexity, in accordance with the above intuition. We then identify sufficient conditions under which offline RL can still be efficient -- intuitively, it needs to learn a compact representation of history comprising only features relevant for action selection. We introduce a bisimulation loss that captures the extent to which this happens, and propose that offline RL can explicitly optimize this loss to aid worst-case sample complexity. Empirically, we show that across a variety of tasks either our proposed loss improves performance, or the value of this loss is already minimized as a consequence of standard offline RL, indicating that it correlates well with good performance.
Primary Area: reinforcement learning
Keywords: reinforcement learning learning efficiency ensembles value-decomposition
Scores: [ 6 6 5 ]
Primary Area: reinforcement learning
Keywords: Robust control Reinforcement learning
Scores: [ 8 6 6 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Sim-to-Real Transfer Domain Randomization
Scores: [ 5 6 6 6 5 8 ]
Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e., solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters.
Primary Area: reinforcement learning
Keywords: model-based offline reinforcement learning dynamics reward reward-consistent dynamics model learning
Scores: [ 6 8 6 6 ]
Learning a precise dynamics model can be crucial for offline reinforcement learning, which, unfortunately, has been found to be quite challenging. Dynamics models that are learned by fitting historical transitions often struggle to generalize to unseen transitions. In this study, we identify a hidden but pivotal factor termed \emph{dynamics reward} that remains consistent across transitions, offering a pathway to better generalization. Therefore, we propose the idea of reward-consistent dynamics models: any trajectory generated by the dynamics model should maximize the dynamics reward derived from the data. We implement this idea as the MOREC (Model-based Offline reinforcement learning with Reward Consistency) method, which can be seamlessly integrated into previous offline model-based reinforcement learning (MBRL) methods. MOREC learns a generalizable dynamics reward function from offline data, which is subsequently employed as a transition filter in any offline MBRL method: when generating transitions, the dynamics model generates a batch of transitions and selects the one with the highest dynamics reward value. On a synthetic task, we visualize that MOREC has a strong generalization ability and can surprisingly recover some distant unseen transitions. On 21 offline tasks in D4RL and NeoRL benchmarks, MOREC improves the previous state-of-the-art performance by a significant margin, i.e., 4.6% on D4RL tasks and 25.9% on NeoRL tasks. Notably, MOREC is the first method that can achieve above 95% online RL performance in 6 out of 12 D4RL tasks and 3 out of 9 NeoRL tasks.
Primary Area: reinforcement learning
Keywords: multi-agent reinforcement learning social dilemmas deep reinforcement learning
Scores: [ 3 6 3 5 3 ]
Primary Area: reinforcement learning
Keywords: sequential decision making adversarial attacks robust human-AI systems robust mixed-autonomy systems
Scores: [ 8 6 8 ]
Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible.We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of information-theoretic detectability constraints makes them \textit{detectable} using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations.We introduce \textit{ϵ-illusory attacks}, a novel form of adversarial attack on sequential decision-makers that is both effective and of ϵ-bounded statistical detectability. We propose a novel dual ascent algorithm to learn such attacks end-to-end.Compared to existing attacks, we empirically find ϵ-illusory attacks to be significantly harder to detect with automated methods, and a small study with human subjects\footnote{IRB approval under reference XXXXX/XXXXX} suggests they are similarly harder to detect for humans. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
Primary Area: reinforcement learning
Keywords: reinforcement learning model-based reinforcement learning world models robotics multimodality perception sensing
Scores: [ 10 8 8 8 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Data Augmentation
Scores: [ 6 6 6 6 ]
Various data augmentation techniques have been recently proposed in image-based deep reinforcement learning (DRL).Although they empirically demonstrate the effectiveness of data augmentation for improving sample efficiency or generalization, which technique should be preferred is not always clear. To tackle this question, we analyze existing methods to better understand them and to uncover how they are connected.Notably, by expressing the variance of the Q-targets and that of the empirical actor/critic losses of these methods, we can analyze the effects of their different components and compare them.We furthermore formulate an explanation about how these methods may be affected by choosing different data augmentation transformations in calculating the target Q-values.This analysis suggests recommendations on how to exploit data augmentation in a more principled way.In addition, we include a regularization term called tangent prop, previously proposed in computer vision, but whose adaptation to DRL is novel to the best of our knowledge.We evaluate our proposition and validate our analysis in several domains. Compared to different relevant baselines, we demonstrate that it achieves state-of-the-art performance in most environments and shows higher sample efficiency and better generalization ability in some complex environments.
Primary Area: reinforcement learning
Keywords: Preference-based Reinforcement Learning Offline Reinforcement Learning Conditional Generative Modeling Diffusion Models
Scores: [ 6 5 6 ]
Offline preference-based reinforcement learning (PbRL) offers an effective solution to overcome the challenges associated with designing rewards and the high costs of online interactions. In offline PbRL, agents are provided with a fixed dataset containing human preferences between pairs of trajectories. Previous studies mainly focus on recovering the rewards from the preferences, followed by policy optimization with an off-the-shelf offline RL algorithm. However, given that preference label in PbRL is inherently trajectory-based, accurately learning transition-wise rewards from such label can be challenging, potentially leading to misguidance during subsequent offline RL training. To address this issue, we introduce our method named Flow-to-Better (FTB), which leverages the pairwise preference relationship to guide a generative model in producing preferred trajectories, avoiding Temporal Difference (TD) learning with inaccurate rewards. Conditioning on a low-preference trajectory, FTB uses a diffusion model to generate a better one with a higher preference, achieving high-fidelity full-horizon trajectory improvement. During diffusion training, we propose a technique called Preference Augmentation to alleviate the problem of insufficient preference data. As a result, we surprisingly find that the model-generated trajectories not only exhibit increased preference and consistency with the real transition but also introduce elements of novelty and diversity, from which we can derive a desirable policy through imitation learning. Experimental results on D4RL benchmarks demonstrate that FTB achieves a remarkable improvement compared to state-of-the-art offline PbRL methods. Furthermore, we show that FTB can also serve as an effective data augmentation method for offline RL.
Primary Area: reinforcement learning
Keywords: Adversarial Imitation Learning Boosting Reinforcement Learning
Scores: [ 5 6 6 6 ]
Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC’s empirical success, the original AIL objective is on-policy and DAC’s ad-hoc application of off-policy training does not guarantee successful imitation. Follow-up work such as ValueDICE tackles this issue by deriving a fully off-policy AIL objective. Instead in this work, we develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using the entire data collected so far. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. On state-based environments, AILBoost outperforms ValueDICE and IQ-Learn, achieving state-of-the-art performance with as little as one expert trajectory.
Primary Area: reinforcement learning
Keywords: Distributed Deep Reinforcement Learning Distributed Deep Reinforcement Learning Reproducibility
Scores: [ 6 6 3 6 ]
Primary Area: reinforcement learning
Keywords: robust reinforcement learning; beyond worse-case
Scores: [ 6 8 8 6 ]
In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond merely worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic difficulty in achieving sublinear regret when the baseline policy is from a general continuous policy class, Π. This finding prompts us to \textit{refine} the baseline policy class Π prior to test time, aiming for efficient adaptation within a compact, finite policy class ~Π, which can resort to an adversarial bandit subroutine. In light of the importance of a finite and compact ~Π, we propose a novel training-time algorithm to iteratively discover \textit{non-dominated policies}, forming a near-optimal and minimal ~Π, thereby ensuring both robustness and test-time efficiency. Empirical validation on the Mujoco corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios.
Primary Area: reinforcement learning
Keywords: reinforcement learning federated learning temporal difference learning
Scores: [ 6 8 5 6 ]
Primary Area: reinforcement learning
Keywords: Model-based RL MARL Planning MCTS
Scores: [ 6 6 6 8 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Policy optimization Policy alignment Preference based RL RLHF
Scores: [ 8 6 6 8 ]
We present a novel unified bilevel optimization-based framework, \textsf{PARL}, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning using utility or preference-based feedback. We identify a major gap within current algorithmic designs for solving policy alignment due to a lack of precise characterization of the dependence of the alignment objective on the data generated by policy trajectories. This shortfall contributes to the sub-optimal performance observed in contemporary algorithms.Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable (optimal policy for the designed reward). Interestingly, from an optimization perspective, our formulation leads to a new class of stochastic bilevel problems where the stochasticity at the upper objective depends upon the lower-level variable. To demonstrate the efficacy of our formulation in resolving alignment issues in RL, we devised an algorithm named \textsf{A-PARL} to solve PARL problem, establishing sample complexity bounds of order O(1/T). Our empirical results substantiate that the proposed \textsf{PARL} can address the alignment concerns in RL by showing significant improvements (up to 63% in terms of required samples) for policy alignment in large-scale environments of the Deepmind control suite and Meta world tasks.
Primary Area: reinforcement learning
Keywords: human enhancement human-agent collaboration game playing deep reinforcement learning
Scores: [ 6 6 6 ]
Existing game AI research mainly focuses on enhancing agents' abilities to win games, but this does not inherently make humans have a better experience when collaborating with these agents. For example, agents may dominate the collaboration and exhibit unintended or detrimental behaviors, leading to poor experiences for their human partners. In other words, most game AI agents are modeled in a "self-centered" manner. In this paper, we propose a "human-centered" modeling scheme for collaborative agents that aims to enhance the experience of humans. Specifically, we model the experience of humans as the goals they expect to achieve during the task. We expect that agents should learn to enhance the extent to which humans achieve these goals while maintaining agents' original abilities (e.g., winning games). To achieve this, we propose the Reinforcement Learning from Human Gain (RLHG) approach. The RLHG approach introduces a "baseline", which corresponds to the extent to which humans primitively achieve their goals, and encourages agents to learn behaviors that can effectively enhance humans in achieving their goals better. We evaluate the RLHG agent in the popular Multi-player Online Battle Arena (MOBA) game, Honor of Kings, by conducting real-world human-agent tests. Both objective performance and subjective preference results show that the RLHG agent provides participants better gaming experience.
Primary Area: reinforcement learning
Keywords: Open-endedness Auto-Curriculum Learning Reinforcement Learning
Scores: [ 8 6 8 3 ]
Open-ended algorithms aim to learn new, interesting behaviors forever. That requires a vast environment search space, but there are thus infinitely many possible tasks. Even after filtering for tasks the current agent can learn (i.e., learning progress), countless learnable yet uninteresting tasks remain (e.g., minor variations of previously learned tasks). An Achilles Heel of open-endedness research is the inability to quantify (and thus prioritize) tasks that are not just learnable, but also interesting (e.g., worthwhile and novel). We propose solving this problem by Open-endedness via Models of human Notions of Interestingness (OMNI). The insight is that we can utilize large (language) models (LMs) as a model of interestingness (MoI), because they already internalize human concepts of interestingness from training on vast amounts of human-generated data, where humans naturally write about what they find interesting or boring. We show that LM-based MoIs improve open-ended learning by focusing on tasks that are both learnable and interesting, outperforming baselines based on uniform task sampling or learning progress alone. This approach has the potential to dramatically advance the ability to intelligently select which tasks to focus on next (i.e., auto-curricula), and could be seen as AI selecting its own next task to learn, facilitating self-improving AI and AI-Generating Algorithms.
Primary Area: reinforcement learning
Keywords: Offline Reinforcement Learning Decision Transformer Motion Control
Scores: [ 6 6 5 8 ]
Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}tionControl(\textbf{LaMo}$), a general framework based on Decision Transformers to effectively use pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) Initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs and in-domain knowledge effectively, (3) using the non-linear MLP transformation instead of linear projections, to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original abilities on languages. Empirical results indicate LaMo achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples.
Primary Area: reinforcement learning
Keywords: reinforment learning theory risk-sensitive reinforment learning Conditional Value at Risk
Scores: [ 6 6 6 6 ]
We study risk-sensitive Reinforcement Learning (RL), where we aim to maximizethe Conditional Value at Risk (CVaR) with a fixed risk tolerance τ. Prior theoretical work studying risk-sensitive RL focuses on the tabular Markov Decision Processes (MDPs) setting. To extend CVaR RL to settings where state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the underlying transition kernel admits a low-rank decomposition, but unlike prior linear models, low-rank MDPs do not assume the feature or state-action representation is known. We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to carefully balance the interplay between exploration, exploitation, and representation learning in CVaR RL. We prove that our algorithm achieves a sample complexity of ~O(H7A2d4τ2ϵ2) to yield an ϵ-optimal CVaR, where H is the length of each episode, A is the capacity of action space, and d is the dimension of representations.Computational-wise, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find the near-optimal policy in a polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs.
Primary Area: reinforcement learning
Keywords: Regularized Least-Square Temporal Difference double-descent over-parameterization random features
Scores: [ 6 8 10 6 ]
Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning, the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and l2-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime when it is larger than one. Furthermore, we observe a double-descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Square Bellman Error (MSBE) that feature correction terms responsible for the double-descent. Correction terms vanish when the l2-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
Primary Area: reinforcement learning
Keywords: Mean Field Control Multi-Agent Reinforcement Learning Partial Observability Collective Behavior
Scores: [ 6 5 8 ]
Recent reinforcement learning (RL) methods have achieved success in various domains. However, multi-agent RL (MARL) remains a challenge in terms of decentralization, partial observability and scalability to many agents. Meanwhile, collective behavior requires resolution of the aforementioned challenges, and remains of importance to many state-of-the-art applications such as active matter physics, self-organizing systems, opinion dynamics, and biological or robotic swarms. Here, MARL via mean field control (MFC) offers a potential solution to scalability, but fails to consider decentralized and partially observable systems. In this paper, we enable decentralized behavior of agents under partial information by proposing novel models for decentralized partially observable MFC (Dec-POMFC), a broad class of problems with permutation-invariant agents allowing for reduction to tractable single-agent Markov decision processes (MDP) with single-agent RL solution. We provide rigorous theoretical results, including a dynamic programming principle, together with optimality guarantees for Dec-POMFC solutions applied to finite swarms of interest. Algorithmically, we propose Dec-POMFC-based policy gradient methods for MARL via centralized training and decentralized execution, together with policy gradient approximation guarantees. In addition, we improve upon state-of-the-art histogram-based MFC by kernel methods, which is of separate interest also for fully observable MFC. We evaluate numerically on representative collective behavior tasks such as adapted Kuramoto and Vicsek swarming models, being on par with state-of-the-art MARL. Overall, our framework takes a step towards RL-based engineering of artificial collective behavior via MFC.
Primary Area: reinforcement learning
Keywords: Structured large discrete action space Reinforcement learning Neighborhood search
Scores: [ 8 6 8 ]
Large discrete action spaces (LDAS) remain a central challenge in reinforcement learning. Existing solution approaches can handle unstructured LDAS with up to a few million actions. However, many real-world applications in logistics, production, and transportation systems have combinatorial action spaces, whose size grows well beyond millions of actions, even on small instances. Fortunately, such action spaces exhibit structure, e.g., equally spaced discrete resource units. With this work, we focus on handling structured LDAS (SLDAS) with sizes that cannot be handled by current benchmarks: we propose Dynamic Neighborhood Construction (DNC), a novel exploitation paradigm for SLDAS. We present a scalable neighborhood exploration heuristic that utilizes this paradigm and efficiently explores the discrete neighborhood around the continuous proxy action in structured action spaces with up to 1073 actions. We demonstrate the performance of our method by benchmarking it against three state-of-the-art approaches designed for large discrete action spaces across two distinct environments. Our results show that DNC matches or outperforms state-of-the-art approaches while being computationally more efficient. Furthermore, our method scales to action spaces that so far remained computationally intractable for existing methodologies.
Primary Area: reinforcement learning
Keywords: Offline Reinforcement Learning Return-Conditioned Supervised Learning Bellman Completeness Trajectory Stitching
Scores: [ 6 8 5 5 ]
Primary Area: reinforcement learning
Keywords: Reinforcement learning Graph Laplacian Representation learning Augmented Lagrangian optimization Hyperparameter robustness
Scores: [ 6 5 6 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Robustness Adversarial Attack Adversarial Defense
Scores: [ 6 6 5 8 ]
Reinforcement learning (RL) has achieved phenomenal success in various domains. However, its data-driven nature also introduces new vulnerabilities that can be exploited by malicious opponents. Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage. Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternatively train the agent's policy and the attacker's policy. However, the former does not provide sufficient protection against strong attacks, while the latter is computationally prohibitive for large environments. In this work, we propose a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states. This approach is further enhanced with belief state inference and diffusion-based state purification to reduce uncertainty. Empirical results show that our approach obtains superb performance under strong attacks and has a comparable training overhead with regularization-based methods.
Primary Area: reinforcement learning
Keywords: adversarial MDPs policy optimization bandit feedback
Scores: [ 8 8 8 6 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Reinforcement Learning from Human Feedback RLHF
Scores: [ 6 6 8 8 ]
Primary Area: reinforcement learning
Keywords: Reward Learning Multi-stage Task
Scores: [ 6 8 3 8 ]
The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demands substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and demonstrations if given. The learned rewards can be reused in unseen tasks, thus reducing the human effort for reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks. See our project page for videos.
Primary Area: reinforcement learning
Keywords: MARL Inventory Management Whittle Index
Scores: [ 5 3 8 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning goodhart's law misspecification reward learning
Scores: [ 8 6 6 5 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning recurrent neural networks stateful policies backpropagation through time imitation learning
Scores: [ 6 8 5 ]
Stateful policies play an important role in reinforcement learning, such as handling partially observable environments, enhancing robustness, or imposing an inductive bias directly into the policy structure. The conventional method for training stateful policies is Backpropagation Through Time (BPTT), which comes with significant drawbacks, such as slow training due to sequential gradient propagation and the occurrence of vanishing or exploding gradients. The gradient is often truncated to address these issues, resulting in a biased policy update. We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy, jointly optimized by following the stateful policy gradient. We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning and imitation learning algorithms. Furthermore, we provide a theoretical analysis of our new gradient estimator and compare it with BPTT. We evaluate our approach on complex continuous control tasks, e.g. humanoid locomotion, and demonstrate that our gradient estimator scales effectively with task complexity while offering a faster and simpler alternative to BPTT.
Primary Area: reinforcement learning
Keywords: Offline RL robust RL data corruption training-time attack
Scores: [ 8 6 6 8 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning effective horizon RL theory theory of reinforcement learning instance-dependent bounds empirical validation of theory
Scores: [ 6 6 6 5 ]
Primary Area: reinforcement learning
Keywords: Physics-informed deep reinforcement learning Safety-critical autonomous systems
Scores: [ 8 8 8 6 ]
This paper proposes the Phy-DRL: a physics-regulated deep reinforcement learning (DRL) framework for safety-critical autonomous systems. The designs of Phy-DRL are based on three invariant-embedding principles: i) residual action policy (i.e., integrating data-driven-DRL action policy and physics-model-based action policy), ii) safety-embedded reward, and iii) physics-model-guided neural network (NN) editing, including link editing and activation editing. Theoretically, the Phy-DRL exhibits 1) mathematically-provable safety guarantee, and 2) strict compliance of critic and actor networks with physics knowledge about the action-value function and action policy. Finally, we evaluate the Phy-DRL on a cart-pole system and a quadruped robot. The experiments validate our theoretical results and demonstrate that Phy-DRL features guaranteed safety compared to purely data-driven DRL and solely model-based design, while offering remarkably fewer learning parameters, and fast and stable training.
Primary Area: reinforcement learning
Keywords: Offline reinforcement learning instance-dependent least-squares value iteration
Scores: [ 5 6 6 8 ]
Offline reinforcement learning (RL), where the agent aims to learn the optimal policy based on the data collected by a behavior policy, has attracted increasing attention in recent years. While offline RL with linear function approximation has been extensively studied with optimal results achieved under certain assumptions, many works shift their interest to offline RL with non-linear function approximation.However, limited works on offline RL with non-linear function approximation have instance-dependent regret guarantees. In this paper, we propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Square Value Iteration (PNLSVI), for offline RL with non-linear function approximation. Our algorithmic design comprises three innovative components: (1) a variance-based weighted regression scheme that can be applied to a wide range of function classes, (2) a subroutine for variance estimation, and (3) a planning phase that utilizes a pessimistic value iteration approach. Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation. Our work extends the previous instance-dependent results within simpler function classes, such as linear and differentiable function to a more general framework. To the best of our knowledge, this is the first statistically optimal algorithm for nonlinear offline RL.
Primary Area: reinforcement learning
Keywords: offline reinforcement learning stationary distribution correction estimation
Scores: [ 8 5 6 8 ]
Primary Area: reinforcement learning
Keywords: Curiosity-driven exploration Reinforcement learning Language model
Scores: [ 8 8 8 8 ]
Large language models (LLMs) hold great potential for various natural language applications but risk generating incorrect or toxic content. In order to probe when an LLM generates unwanted content, the current paradigm is to recruit human testers to create input prompts (i.e., test cases) designed to elicit unfavorable responses from LLMs. This procedure is called red teaming. However, relying solely on human testers can be both expensive and time-consuming. Recent works automate red teaming by training LLMs (i.e., red team LLMs) with reinforcement learning (RL) to maximize the chance of eliciting undesirable responses (i.e., successful test cases) from the target LLMs being evaluated. However, while effective at provoking undesired responses, current RL methods lack test case diversity as RL-based methods tend to consistently generate the same few successful test cases once found. To overcome this limitation, we introduce curiosity-driven exploration to train red team models. This approach jointly maximizes the test case effectiveness and novelty. Maximizing novelty motivates the red-team model to search for new and diverse test cases. We evaluate our method by performing red teaming against LLMs in text continuation and instruction following tasks. Our experiments show that curiosity-driven exploration achieves greater diversity in all the experiments compared to existing RL-based red team methods while maintaining effectiveness. Remarkably, curiosity-driven exploration also enhances the effectiveness when performing red teaming in instruction following test cases, generating a higher number of successful test cases. We even demonstrate that curiosity-driven exploration successfully provokes toxic responses from the LLaMA2 model that has undergone finetuning based on human preferences.
Primary Area: reinforcement learning
Keywords: Offline RL Dataset Generalization Procgen Webshop
Scores: [ 6 6 6 8 ]
Primary Area: reinforcement learning
Keywords: Reinforcement Learning Representation Learning
Scores: [ 8 5 8 8 ]
We study pre-training representations for decision-making using video data, which is abundantly available for tasks such as game agents and software testing. Even though significant empirical advances have been made on this problem, a theoretical understanding remains absent. We initiate the theoretical investigation into principled approaches for representation learning and focus on learning the latent state representations of the underlying MDP using video data. We study two types of settings: one where there is iid noise in the observation, and a more challenging setting where there is also the presence of exogenous noise, which is non-iid noise that is temporally correlated, such as the motion of people or cars in the background. We study three commonly used approaches: autoencoding, temporal contrastive learning, and forward modeling. We prove upper bounds for temporal contrastive and forward modeling in the presence of only iid noise. We show that these approaches can learn the latent state and use it to do efficient downstream RL with polynomial sample complexity. When exogenous noise is also present, we establish a lower bound result showing that learning from video data can be exponentially worse than learning from action-labeled trajectory data. This partially explains why reinforcement learning with video pre-training is hard. We evaluate these representational learning methods in two visual domains, proving our theoretical findings.
Primary Area: reinforcement learning
Keywords: Reinforcement learning; language model self-improvement; text evaluation
Scores: [ 5 8 3 8 6 ]
Language model self-improvement (LMSI) techniques have recently gained significant attention as they improve language models without requiring external supervision. A common approach is reinforcement learning from AI feedback (RLAIF), which trains a reward model based on AI preference data and employs a reinforcement learning algorithm to train the language model. However, RLAIF relies on the heuristic assumption that an AI model can provide effective feedback and correct wrong answers, requiring a solid capability of the language model. This paper presents a novel LMSI method, Reinforcement Learning Contemplation (RLC). We disclose that it is simpler for language models to evaluate a sentence than to generate it, even for small language models. Leveraging the gap between the evaluation and generation, RLC evaluates generated answers and updates language model parameters using reinforcement learning to maximize evaluation scores. Through testing on various challenging reasoning tasks and text summarization task, our experiments show that RLC effectively improves language model performance without external supervision, resulting in an answering accuracy increase (from 31.23% to 37.09%) for BigBench-hard reasoning tasks, and a rise in BERTScore for CNN/Daily Mail summarization tasks. Furthermore, RLC can be applied to models of different sizes, showcasing its broad applicability.
Primary Area: reinforcement learning
Keywords: Opponent Modeling Offline Transformer
Scores: [ 6 6 6 5 ]
Opponent modeling aims at learning the opponent's behaviors, goals, or beliefs to reduce the uncertainty of the competitive environment and assist decision-making. Existing work has mostly focused on learning opponent models online, which is impractical and inefficient in practical scenarios. To this end, we formalize an Offline Opponent Modeling (OOM) problem with the objective of utilizing pre-collected offline datasets to learn opponent models that characterize the opponent from the viewpoint of the controlled agent, which aids in adapting to the unknown fixed policies of the opponent. Drawing on the promises of the Transformers for decision-making, we introduce a general approach, Transformer Against Opponent (TAO), for OOM. Essentially, TAO tackles the problem by harnessing the full potential of the supervised pre-trained Transformers' in-context learning capabilities. The foundation of TAO lies in three stages: an innovative offline policy embedding learning stage, an offline opponent-aware response policy training stage, and a deployment stage for opponent adaptation with in-context learning. Theoretical analysis establishes TAO's equivalence to Bayesian posterior sampling in opponent modeling and guarantees TAO's convergence in opponent policy recognition. Extensive experiments and ablation studies on competitive environments with sparse and dense rewards demonstrate the impressive performance of TAO. Our approach manifests remarkable prowess for fast adaptation, especially in the face of unseen opponent policies, confirming its in-context learning potency.
Primary Area: reinforcement learning
Keywords: model-based reinforcement learning state space models memory in reinforcement learning
Scores: [ 6 8 10 ]
Primary Area: reinforcement learning
Keywords: reinforcement learning theory POMDPs partially observable reinforcement learning
Scores: [ 6 6 6 6 ]
This paper studies the sample-efficiency of learning in Partially Observable Markov Decision Processes (POMDPs), a challenging problem in reinforcement learning that is known to be exponentially hard in the worst-case. Motivated by real-world settings such as loading in game playing, we propose an enhanced feedback model called multiple observations in hindsight'', where after each episode of interaction with the POMDP, the learner may collect multiple additional observations emitted from the encountered latent states, but may not observe the latent states themselves. We show that sample-efficient learning under this feedback model is possible for two new subclasses of POMDPs: \emph{multi-observation revealing POMDPs} and \emph{distinguishable POMDPs}. Both subclasses generalize and substantially relax \emph{revealing POMDPs}---a widely studied subclass for which sample-efficient learning is possible under standard trajectory feedback. Notably, distinguishable POMDPs only require the emission distributions from different latent states to be \emph{different} instead of \emph{linearly independent} as required in revealing POMDPs.
Primary Area: reinforcement learning
Keywords: reinforcement learning policy gradient stochastic approximation finite-time MDP
Scores: [ 6 8 8 ]
Primary Area: reinforcement learning
Keywords: Meta Reinforcement Learning World Models Model Based Reinforcement Learning
Scores: [ 6 6 6 6 ]
Meta-reinforcement learning (meta-RL) is a promising framework for tackling challenging domains requiring efficient exploration. Existing meta-RL algorithms are characterized by low sample efficiency, and mostly focus on low-dimensional task distributions. In parallel, model-based RL methods have been successful in solving partially observable MDPs, of which meta-RL is a special case.In this work, we leverage this success and propose a new model-based approach to meta-RL, based on elements from existing state-of-the-art model-based and meta-RL methods. We demonstrate the effectiveness of our approach on common meta-RL benchmark domains, attaining greater return with better sample efficiency (up to 15×) while requiring very little hyperparameter tuning. In addition, we validate our approach on a slate of more challenging, higher-dimensional domains, taking a step towards real-world generalizing agents.
Primary Area: reinforcement learning
Keywords: contrastive learning reinforcement learning goal-reaching goal-conditioned RL temporal difference
Scores: [ 6 8 6 8 ]
Predicting and reasoning about the future lies at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the future. While prior methods have used contrastive predictive coding to model time series data, learning representations that encode long-term dependencies usually requires large amounts of data. In this paper, we introduce a temporal difference version of contrastive predictive coding that stitching together pieces of different time series data to decrease the amount of data required to learn to predict future events. We apply this representation learning method to derive an off-policy algorithm for goal-conditioned RL. Experiments demonstrate that, compared with prior RL methods, ours achieves higher success rates with less data, and can better cope with stochastic environments.
Primary Area: reinforcement learning
Keywords: Plasticity Visual Reinforcement Learning Sample Efficiency
Scores: [ 6 6 6 6 ]
Plasticity, the ability of a neural network to evolve with new data, is crucial for high-performance and sample-efficient visual reinforcement learning (VRL). Although methods like resetting and regularization can potentially mitigate plasticity loss, the influences of various components within the VRL framework on the agent's plasticity are still poorly understood. In this work, we conduct a systematic empirical exploration focusing on three primary underexplored facets and derive the following insightful conclusions: (1) data augmentation is essential in maintaining plasticity; (2) the critic's plasticity loss serves as the principal bottleneck impeding efficient training; and (3) without timely intervention to recover critic's plasticity in the early stages, its loss becomes catastrophic. These insights suggest a novel strategy to address the high replay ratio (RR) dilemma, where exacerbated plasticity loss hinders the potential improvements of sample efficiency brought by increased reuse frequency. Rather than setting a static RR for the entire training process, we propose Adaptive RR, which dynamically adjusts the RR based on the critic’s plasticity level. Extensive evaluations indicate that Adaptive RR not only avoids catastrophic plasticity loss in the early stages but also benefits from more frequent reuse in later phases, resulting in superior sample efficiency.
Primary Area: reinforcement learning
Keywords: planning diffusion language RL reinforcement
Scores: [ 6 6 5 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: partial order embedding hyperbolic space shadow cone representation learning entailment cone hierarchy
Scores: [ 6 8 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Open-Vocabulary Action Recognition Action Recognition
Scores: [ 6 6 6 6 ]
In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its robust generalization capability resulting from extensive image-text pretraining. However, applying CLIP directly to open-vocabulary action recognition tasks is challenging due to the absence of temporal information in CLIP's pretraining. Fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its ability to generalize, resulting in unsatisfactory results when dealing with unseen actions.To ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task, FROSTER employs a feature distillation approach, treating the frozen CLIP model as a teacher. This encourages the fine-tuned model to maintain the generalizability exhibited by the original CLIP. Additionally, we supervise the feature learning process specifically for action recognition, enabling the extraction of video-specific features to bridge the gap between images and videos.To strike a balance between the two distinct objectives of generalizability and video-specific learning, we propose a residual feature distillation method that effectively achieves both goals.We extensively evaluate FROSTER on open-vocabulary action recognition under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. The code will be made publicly available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural Fields 3D classification
Scores: [ 3 8 6 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: language generation language modeling machine translation robustness estimating data quality
Scores: [ 6 6 6 6 ]
Text generation models are notoriously vulnerable to errors in the training data. With the wide-spread availability of massive amounts of web-crawled data becoming more commonplace, how can we enhance the robustness of models trained on a massive amount of noisy web-crawled text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement method to the standard training objective that truncates noisy data. Compared to methods that only uses the negative log-likelihood loss to estimate data quality, our method provides a more accurate estimation by considering the distribution of non-target tokens, which is often overlooked by previous work. Through comprehensive experiments across language modeling, machine translation, and text summarization, we show that equipping text generation models with ENT improves generation quality over standard training and previous soft and hard truncation methods. Furthermore, we show that our method improves the robustness of models against two of the most detrimental types of noise in machine translation, resulting in an increase of more than 2 BLEU points over the MLE baseline when up to 50% of noise is added to the data.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision Foundation Models; Segment Anything; Training-Free Generalist; Matcher
Scores: [ 8 6 5 6 ]
Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural fields spatial-temporal NeRFs NeRF for driving
Scores: [ 6 8 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Multi-task learning Multi-label learning mixture-of-experts image classification
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Math Reasoning Instruction Tuning Large Language Model
Scores: [ 6 6 8 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Multimodal Large Language Models Large Language Models Generative Models Vision Language Representation Learning GPT
Scores: [ 6 8 6 ]
This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy. Anonymous project page: https://dreamllmpaper.github.io.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Models Long Context Window Retrieval
Scores: [ 8 6 8 6 8 6 ]
Extending the context window of large language models (LLMs) is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? ii) Can both methods be combined to get the best of both worlds? In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and LLaMA2-70B. Perhaps surprisingly, we find that shorter context window LLM with simple retrieval-augmentation at inference can perform close to longer context LLM finetuned via positional interpolation for question answering and query-based summarization tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their context window sizes. Our study provides general insights on the choice of retrieval-augmentation versus long context extension of LLM for practitioners.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: latent dynamics structured motion representation and generation reinforcement learning
Scores: [ 8 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Attention steering Post-hoc Contextual emphasizing
Scores: [ 3 6 8 6 ]
In human-written articles, we often leverage the subtleties of text style, such as bold and italics, to guide the attention of readers. These textual emphases are vital for the readers to grasp the conveyed information. When interacting with large language models (LLMs), we have a similar need -- steering the model to pay closer attention to user-specified information, e.g., an instruction. Existing methods, however, are constrained to process plain text and do not support such a mechanism. This motivates us to introduce PASTA -- Post-hoc Attention STeering Approach, a method that allows LLMs to read text with user-specified emphasis marks. To this end, PASTA identifies a small subset of attention heads and applies precise attention reweighting on them, directing the model attention to user-specified parts. Like prompting, PASTA is applied at inference time and does not require changing any model parameters. Experiments demonstrate that PASTA can substantially enhance an LLM's ability to follow user instructions or integrate new knowledge from user inputs, leading to a significant performance improvement on a variety of tasks, e.g., an average accuracy improvement of 22% for LLAMA-7B. Code is provided at https://anonymous.4open.science/r/PASTA-10E9.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D molecules Large Language Model 3D-text interpretation
Scores: [ 6 5 6 6 ]
Language Models (LMs) have greatly influenced diverse domains. However, their inherent limitation in comprehending 3D molecular structures has considerably constrained their potential in the biomolecular domain. To bridge this gap, we focus on 3D molecule-text interpretation, and propose 3D-MoLM: 3D-Molecular Language Modeling. Specifically, 3D-MoLM enables an LM to interpret andanalyze 3D molecules by equipping the LM with a 3D molecular encoder. This integration is achieved by a 3D molecule-text projector, bridging the 3D molecular encoder’s representation space and the LM’s input space. Moreover, to enhance 3DMoLM’s ability of cross-modal molecular understanding and instruction following, we meticulously curated a 3D molecule-centric instruction tuning dataset – 3D-MoIT. Through 3D molecule-text alignment and 3D molecule-centric instruction tuning, 3D-MoLM establishes an integration of 3D molecular encoder and LM. It significantly surpasses existing baselines on downstream tasks, including moleculetext retrieval, molecule captioning, and more challenging open-text molecular QA tasks, especially focusing on 3D-dependent properties. We will release our codes and datasets at https://anonymous.4open.science/r/3D-MoLM.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Image Recognition Vision-Language Pretraining Image Tagging
Scores: [ 8 6 3 6 5 ]
This paper presents Tag2Text, a vision language pre-training (VLP) framework, which introduces image tagging into vision-language models to guide the learning of visual-linguistic features. In contrast to prior works which utilize object tags either manually labeled or automatically detected with a limited detector, our approach utilizes tags parsed from its paired text to learn an image tagger and meanwhile provides guidance to vision-language models. Given that, Tag2Text can utilize large-scale annotation-free image tags in accordance with image-text pairs, and provides more diverse tag categories beyond objects. Strikingly, Tag2Text showcases the ability of a foundational image tagging model, with superior zero-shot performance even comparable to full supervision manner. Moreover, by leveraging tagging guidance, Tag2Text effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks. Across a wide range of downstream benchmarks, Tag2Text achieves state-of-the-art results with similar model sizes and data scales, demonstrating the efficacy of the proposed tagging guidance.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: LLM alignment in-context learning instruction tuning
Scores: [ 6 6 5 3 ]
Large language models (LLMs) have shown significant improvements due to alignment tuning, that is, supervised fine-tuning (SFT) on instruction data and reinforcement learning from human feedback (RLHF).This raises questions about what is precisely learned during the alignment tuning process.We investigate the effects of alignment tuning through the lens of token distribution shift between untuned LLMs and their aligned counterparts (e.g., Llama-2 versus Llama-2-Chat).Our findings reveal that most distribution changes lie in stylistic tokens (e.g., transitional words, discourse markers), suggesting that LLMs primarily learn the language style of AI assistants during alignment tuning, while most of useful knowledge has been acquired by untuned LLMs. Thus, we pose the question: Is it necessary to update model weights to attain LLM alignment?Based on these insights, we propose an alternative method, Untuned LLMs with Restyled In-context Alignment (\textsc{Urial}), which achieves effective alignment solely through in-context learning (ICL) with as few as three curated, stylistic examples.Our evaluation on diverse examples from LIMA and AlpacaEval demonstrates that \textsc{Urial} can achieve highly satisfactory performance, sometimes equaling or surpassing SFT+RLHF counterparts, especially when the untuned LLM is sufficiently pre-trained.This implies that fine-tuning may not be as always crucial as previously assumed for LLM alignment, and lightweight alignment methods like \textsc{Urial} hold promise for efficiently tailoring LLM behavior without fine-tuning.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D foundation model universal 3D representation at scale open-world 3D understanding
Scores: [ 6 8 8 ]
Scaling up representations for images or text has been extensively investigated in the past few years and has led to revolutions in learning vision and language. However, scalable representation for 3D objects and scenes is relatively unexplored. In this work, we present Uni3D, a 3D foundation model to explore the unified 3D representation at scale. Uni3D uses a 2D initialized ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features. Via the simple architecture and pretext task, Uni3D can leverage abundant 2D pretrained models as initialization and image-text aligned models as the target, unlocking the great potential of 2D model zoos and scaling-up strategies to the 3D world. We efficiently scale up Uni3D to one billion parameters, and set new records on a broad range of 3D tasks, such as zero-shot classification, few-shot classification, open-world understanding and zero-shot part segmentation. We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild. We believe that Uni3D provides a new direction for exploring both scaling up and efficiency of the representation in 3D domain.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Graph Neural Networks Mixture of experts Node classification
Scores: [ 5 6 8 6 5 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Length Extrapolation Long Context Large Language Model (LLM) Neural Ordinary Differential Equation (ODE)
Scores: [ 6 6 8 6 ]
Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks, however, their exceptional capabilities are restricted within the preset context window of Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, demonstrate either notable limitations in their extrapolation abilities or sacrificing partial performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX facilitates the length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results reveal that CLEX can effectively extend the context window to over 4x training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: remote sensing vision-language models zero-shot foundation models label-efficiency
Scores: [ 6 6 8 8 ]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large scale VLM for remote sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: self-supervised learning music audio language model
Scores: [ 6 8 8 8 ]
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly tonal and pitched characteristics of music.To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.In our exploration, we identified an effective combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters.Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Time Series Analysis Deep Learning
Scores: [ 8 8 8 ]
Recently, Transformer-based and MLP-based models have emerged rapidly and won dominance in time series analysis. In contrast, convolution is losing steam in time series tasks nowadays for inferior performance. This paper studies the open question of how to better use convolution in time series analysis and makes efforts to bring convolution back to the arena of time series analysis. To this end, we modernize the traditional TCN and conduct time series related modifications to make it more suitable for time series tasks. As the outcome, we propose ModernTCN and successfully solve this open question through a seldom-explored way in time series community. As a pure convolution structure, ModernTCN still achieves the consistent state-of-the-art performance on five mainstream time series analysis tasks (long-term and short-term forecasting, imputation, classification and anomaly detection) while maintaining the efficiency advantage of convolution-based models, therefore providing a better balance of efficiency and performance than state-of-the-art Transformer-based and MLP-based models. Our study further reveals that, compared with previous convolution-based models, our ModernTCN has much larger effective receptive fields (ERFs), therefore can better unleash the potential of convolution in time series analysis. The code will be publicly available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: invariance latent space latent comunication zero-shot stitching representation learning relative representation
Scores: [ 6 6 6 6 ]
It has been observed that representations learned by distinct neural networks conceal structural similarities when the models are trained under similar inductive biases. From a geometric perspective, identifying the classes of transformations and the related invariances that connect these representations is fundamental to unlocking applications, such as merging, stitching, and reusing different neural modules. However, estimating task-specific transformations a priori can be challenging and expensive due to several factors (e.g., weights initialization, training hyperparameters, or data modality). To this end, we introduce a versatile method to directly incorporate a set of invariances into the representations, constructing a product space of invariant components on top of the latent representations without requiring prior knowledge about the optimal invariance to infuse. We validate our solution on classification and reconstruction tasks, observing consistent latent similarity and downstream performance improvements in a zero-shot stitching setting. The experimental analysis comprises three modalities (vision, text, and graphs), twelve pretrained foundational models, eight benchmarks, and several architectures trained from scratch.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: code generation language model self-revision program synthesis modularity llm competitive programming clustering
Scores: [ 6 6 8 6 ]
Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in HumanEval or MBPP benchmarks. However, solving more complex and competitive programming tasks is still quite challenging for these models - possibly due to their tendency to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. On the other hand, experienced programmers instinctively write modularized code with abstraction for solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel framework for inference that elicits modularized code generation through a chain of self-revisions, each being guided by some representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized codes through chain-of-thought prompting. Then it applies a chain of self-revisions by iterating the two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and re-usable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module-implementations and instructing the LLM to re-generate new modularized solutions. We find that by naturally encouraging the LLM to reuse the previously developed and verified sub-modules, CodeChain can significantly boost both modularity as well as correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs as well as open-sourced LLMs like WizardCoder. We also conduct comprehensive ablation studies with different methods of prompting, number of clusters, model sizes, program qualities, etc., to provide useful insights that underpin CodeChain's success.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large language models Long context handling Token pruning
Scores: [ 8 3 6 8 ]
Large language models (LLMs) have established new standards in various natural language processing tasks. However, a primary constraint they face is the context limit, i.e., the maximum number of tokens they can process.To relax the constraint, previous works have explored architectural changes and modifications in positional encoding, but they often require expensive training or do not address the computational demands of self-attention.In this paper, we present Hierarchical cOntext MERging (HOMER), a new training-freescheme designed to overcome the limitations. HOMER harnesses a divide-and-conquer methodology, segmenting extensive inputs into manageable units. The segments are then processed collectively, employing a hierarchical strategy that fuses adjacent chunks at progressive Transformer layers. A token reduction technique precedes each fusion, ensuring memory usage efficiency.We also propose an optimized computational order reducing the memory requirement to logarithmically scale with respect to input length, making it especially favorable for environments with tight memory restrictions. Our experimental results demonstrate the superior performance and memory efficiency of the proposed method, opening doors for broader applications of LLMs in scenarios with extended context requirements.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: implicit neural representations algebra multilayer perceptrons wavelet
Scores: [ 8 6 6 6 8 ]
Implicit neural representations (INRs) have arisen as useful methods for representing signals on Euclidean domains. By parameterizing an image as a multilayer perceptron (MLP) on Euclidean space, INRs effectively represent signals in a way that couples spatial and spectral features of the signal that is not obvious in the usual discrete representation, paving the way for continuous signal processing and machine learning approaches that were not previously possible. Although INRs using sinusoidal activation functions have been studied in terms of Fourier theory, recent works have shown the advantage of using wavelets instead of sinusoids as activation functions, due to their ability to simultaneously localize in both frequency and space. In this work, we approach such INRs and demonstrate how they resolve high-frequency features of signals from coarse approximations done in the first layer of the MLP. This leads to multiple prescriptions for the design of INR architectures, including the use of complex wavelets, decoupling of low and band-pass approximations, and initialization schemes based on the singularities of the desired signal.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Low-Rank Adaption Mixture-of-Experts
Scores: [ 3 5 6 6 ]
LoRA has gained widespread acceptance in the fine-tuning of large pre-trained models to cater to a diverse array of downstream tasks, showcasing notable effectiveness and efficiency, thereby solidifying its position as one of the most prevalent fine-tuning techniques. Due to the modular nature of LoRA's plug-and-play plugins, researchers have delved into the amalgamation of multiple LoRAs to empower models to excel across various downstream tasks. Nonetheless, extant approaches for LoRA fusion grapple with inherent challenges. Direct arithmetic merging may result in the loss of the original pre-trained model's generative capabilities or the distinct identity of LoRAs, thereby yielding suboptimal outcomes. On the other hand, Reference tuning-based fusion exhibits limitations concerning the requisite flexibility for the effective combination of multiple LoRAs. In response to these challenges, this paper introduces the Mixture of LoRA Experts (MoLE) approach, which harnesses hierarchical control and unfettered branch selection. The MoLE approach not only achieves superior LoRA fusion performance in comparison to direct arithmetic merging but also retains the crucial flexibility for combining LoRAs effectively. Extensive experimental evaluations conducted in both the Natural Language Processing (NLP) and Vision & Language (V&L) domains substantiate the efficacy of MoLE.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large language model context compression in-context autoencoder pretraining fine-tuning Llama GPT memorization
Scores: [ 6 5 8 8 ]
We propose the In-context Autoencoder (ICAE), leveraging the power of a large language models (LLM) to compress a long context into short compact memory slots that can be directly conditioned on by the LLM for various purposes. ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data, enabling it to generate memory slots that accurately and comprehensively represent the original context; Then, it is fine-tuned on instruction data for producing desirable responses to various prompts. Experiments demonstrate that our lightweight ICAE, introducing fewer than 1% additional parameters, effectively achieves 4× context compression based on Llama, offering advantages in both improved latency and GPU memory cost during inference, and showing an interesting insight in memorization as well as potential for scalability. These promising results imply a novel perspective on the connection between working memory in cognitive science and representation learning in LLMs, revealing ICAE's significant implications in addressing the long context problem and suggesting further research in LLM context management. Our data, code and model will be released.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Model Efficient Inference Generative Inference Key-Value Cache
Scores: [ 8 8 8 8 8 8 ]
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various asks, FastGen demonstrates substantial reduction on GPU memory consumption with negligible generation quality loss. We will release our code and the compatible CUDA kernel for reproducibility.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: transformer attention geometry-aware attention multi-view understanding novel view synthesis group theory
Scores: [ 6 5 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Selective Rationalization Shortcut
Scores: [ 6 5 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Representation learning Computer vision Face recognition Intra-class incoherence Constraint
Scores: [ 8 6 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: NLP; recursive neural network; multi-grain representation; compositional representation;span labeling;relation extraction; grammar induction;language understanding
Scores: [ 6 5 8 6 ]
We present ReCAT, a recursive composition augmented Transformer that is able to explicitly model hierarchical syntactic structures of raw texts without relying on gold trees during both learning and inference. Existing research along this line restricts data to follow a hierarchical tree structure and thus lacks inter-span communications.To overcome the problem, we propose a novel contextual inside-outside (CIO) layer that learns contextualized representations of spans through bottom-up and top-down passes, where a bottom-up pass forms representations of high-level spans by composing low-level spans, while a top-down pass combines information inside and outside a span. By stacking several CIO layers between the embedding layer and the attention layers in Transformer, the ReCAT model can perform both deep intra-span and deep inter-span interactions, and thus generate multi-grained representations fully contextualized with other spans.Moreover, the CIO layers can be jointly pre-trained with Transformers, making ReCAT enjoy scaling ability, strong performance, and interpretability at the same time. We conduct experiments on various sentence-level and span-level tasks. Evaluation results indicate that ReCAT can significantly outperform vanilla Transformer models on all span-level tasks and recursive models on natural language inference tasks. More interestingly, the hierarchical structures induced by ReCAT exhibit strong consistency with human-annotated syntactic trees, indicating good interpretability brought by the CIO layers.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural Architecture Search Network evaluation Training-free metric Deep neural networks
Scores: [ 8 8 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: multi-modal contrastive learning captioning text-to-image generation
Scores: [ 6 8 8 6 ]
Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, C3 (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our C3 method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: federated learning;
Scores: [ 6 6 6 3 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large language models rationalization explanation generation explainability rationale generation
Scores: [ 6 6 8 6 6 ]
Large language models (LMs) are capable of generating free-text rationales to aid question answering. However, prior work 1) suggests that useful self-rationalization is emergent only at significant scales (e.g., 175B parameter GPT-3); and 2) focuses largely on downstream performance, ignoring the semantics of the rationales themselves, e.g., are they faithful, true, and helpful for humans? In this work, we enable small-scale LMs (∼200x smaller than GPT-3) to generate rationales that not only improve downstream task performance, but are also more plausible, consistent, and diverse, assessed both by automatic and human evaluation. Our method, MaRio (Multi-rewArd RatIOnalization), is a multi-reward conditioned self-rationalization algorithm that optimizes multiple distinct properties like plausibility, diversity and consistency. Results on three difficult question-answering datasets StrategyQA, QuaRel and OpenBookQA show that not only does MaRio improve task accuracy, but it also improves the self-rationalization quality of small LMs across the aforementioned axes better than a supervised fine-tuning (SFT) baseline. Extensive human evaluations confirm that MaRio rationales are preferred vs. SFT rationales, as well as qualitative improvements in plausibility and consistency.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Image restoration vision-language model low-level vision
Scores: [ 6 6 3 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: representation learning vision-language models transformer
Scores: [ 6 6 6 8 ]
This paper reveals that large language models (LLMs), despite being trained solely on text data, are \emph{surprisingly} strong encoders for \emph{purely} visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a \emph{frozen} transformer block from \emph{pre-trained} LLMs as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, significantly departing from conventional practices that typically necessitate a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across \emph{a diverse range of tasks}, encompassing pure 2D or 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks. We additionally propose the \emph{information filtering} hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding -- the pre-trained LLM transformer blocks discern informative visual tokens and further amplify their effect. This hypothesis is empirically supported by the observation that the feature activation, after training with LLM transformer blocks, exhibits a stronger focus on relevant regions. We hope that our work inspires new perspectives on utilizing LLMs and deepening our understanding of their underlying mechanisms.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Cascaded generative models Diffusion models Symbolic Music Generation
Scores: [ 8 8 5 8 ]
Recent deep music generation studies have put much emphasis on \textit{music structure} and \textit{long-term} generation. However, we are yet to see high-quality, well-structured whole-song generation. In this paper, we make the first attempt to model a full music piece under the realization of \textit{compositional hierarchy}. With a focus on symbolic representations of pop songs, we define a hierarchical language, in which each level of hierarchy focuses on the context dependency at a certain music scope. The high-level languages reveal whole-song form, phrase, and cadence, whereas the low-level languages focus on notes, chords, and their local patterns. A cascaded diffusion model is trained to model the hierarchical language, where each level is conditioned on its upper levels. Experiments and analysis show that our model is capable of generating full-piece music with recognizable global verse-chorus structure and cadences, and the music quality is higher than the baselines. Additionally, we show that the proposed model is \textit{controllable} in a flexible way. By sampling from the interpretable hierarchical languages or adjusting external model controls, users can control the music flow via various features such as phrase harmonic structures, rhythmic patterns, and accompaniment texture.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D Reconstruction Large-Scale Transformers
Scores: [ 10 8 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Human motion synthesis implicit neural representation diffusion model any-framerate training
Scores: [ 8 5 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Retraining-free Pruning Compression Transformers
Scores: [ 8 6 8 5 ]
Given a pretrained encoder-based language model, how can we accurately compress it without retraining? Retraining-free structured pruning algorithms are crucial in pretrained language model compression due to their significantly reduced pruning cost and capability to prune large language models. However, existing retraining-free algorithms encounter severe accuracy degradation, as they fail to handle pruning errors, especially at high compression rates. In this paper, we propose KPrune (Knowledge-preserving pruning), an accurate retraining-free structured pruning algorithm for pretrained encoder-based language models.KPrune focuses on preserving the useful knowledge of the pretrained model to minimize pruning errors through a carefully designed iterative pruning process composed of knowledge measurement, knowledge-preserving mask search, and knowledge-preserving weight-tuning. As a result, KPrune shows significant accuracy improvements up to 58.02%p higher F1 score compared to existing retraining-free pruning algorithms under a high compression rate of 80% on the SQuAD benchmark without any retraining process.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D reconstruction NeRF dynamic scene
Scores: [ 8 8 8 8 ]
In this paper, we tackle the problem of 3D reconstruction of dynamic scenes from multi-view videos. Previous works attempt to model the motion of 3D points in space, which either constrains them to handle a single articulated object or requires extra efforts to handle topology changes. By contrast, we propose to directly estimate the change of Signed Distance Function (SDF), namely SDF flow, of the dynamic scene. We show that the SDF flow captures the evolution of the scene surface and handles topology changes naturally. We further derive the mathematical relation between the SDF flow and the scene flow, which allows us to calculate the scene flow from the SDF flow analytically by solving linear equations. Our experiments on real-world multi-view video datasets show that our reconstructions are better than those of the state-of-the-art methods.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: zero-shot classification prompting generative classification label descriptions
Scores: [ 6 6 6 6 ]
Language model (LM) prompting—a popular paradigm for solving NLP tasks—has been shown to be susceptible to miscalibration and brittleness to slight prompt variations, caused by its discriminative prompting approach, i.e., predicting the label given the input. To address these issues, we propose Gen-Z—a generative prompting framework for zero-shot text classification. GEN-Z is generative, as it measures the LM likelihood of input text, conditioned on natural language descriptions of labels. The framework is multivariate, as label descriptions allow us to seamlessly integrate additional contextual information about the labels to improve task performance. On various standard classification benchmarks, with six open-source LM families, we show that zero-shot classification with simple contextualization of the data source of the evaluation set consistently outperforms both zero-shot and few-shot baselines while improving robustness to prompt variations. Further, our approach enables personalizing classification in a zero-shot manner by incorporating author, subject, or reader information in the label descriptions.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Model (LLM) Multi-task learning Multi-modal learning Mixture-of-Experts (MoE) PEFT
Scores: [ 5 5 6 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Transformers positional encoding long context length generalization
Scores: [ 6 6 8 ]
Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: named entity recognition large language model
Scores: [ 6 3 8 ]
Large language models (LLMs) have demonstrated remarkable generalizability, such as understanding arbitrary entities and relations. Instruction tuning has proven effective for distilling LLMs into more cost-efficient models such as Alpaca and Vicuna. Yet such student models still trail the original LLMs by large margins in downstream applications. In this paper, we explore targeted distillation with mission-focused instruction tuning to train student models that can excel in a broad application class such as open information extraction. Using named entity recognition (NER) for case study, we show how ChatGPT can be distilled into much smaller UniversalNER models for open NER. For evaluation, we assemble the largest NER benchmark to date, comprising 43 datasets across 9 diverse domains such as biomedicine, programming, social media, law, finance. Without using any direct supervision, UniversalNER attains remarkable NER accuracy across tens of thousands of entity types, outperforming general instruction-tuned models such as Alpaca and Vicuna by over 30 absolute F1 points in average. With a tiny fraction of parameters, UniversalNER not only acquires ChatGPT's capability in recognizing arbitrary entity types, but also outperforms its NER accuracy by 7-9 absolute F1 points in average. Remarkably, UniversalNER even outperforms by a large margin state-of-the-art multi-task instruction-tuned systems such as InstructUIE, which uses supervised NER examples. We also conduct thorough ablation studies to assess the impact of various components in our distillation approach. We will release the distillation recipe, data, and UniversalNER models to facilitate future research on targeted distillation.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Models In-context Learning Re-ranking Chain-of-thought PCA
Scores: [ 5 6 6 6 ]
Recent advances in natural language processing, primarily propelled by Large Language Models (LLMs), have showcased their remarkable capabilities grounded in in-context learning. A promising avenue for guiding LLMs in intricate reasoning tasks involves the utilization of intermediate reasoning steps within the Chain-of-Thought (CoT) paradigm. Nevertheless, the central challenge lies in the effective selection of exemplars for facilitating in-context learning. In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking (DQ-LoRe) to automatically select exemplars for in-context learning. Dual Queries first query LLM to obtain LLM-generated knowledge such as CoT, then query the retriever to obtain the final exemplars via both question and the knowledge. Moreover, for the second query, LoRe employs dimensionality reduction techniques to refine exemplar selection, ensuring close alignment with the input question's knowledge. Through extensive experiments, we demonstrate that DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%. Our comprehensive analysis further reveals that DQ-LoRe consistently outperforms retrieval-based approaches in terms of both performance and adaptability, especially in scenarios characterized by distribution shifts. DQ-LoRe pushes the boundaries of in-context learning and opens up new avenues for addressing complex reasoning challenges.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Data Analysis Supervised Fine-tuning for Human Alignment Large Language Model
Scores: [ 8 6 5 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: LLM training and inference Downstream finetuning
Scores: [ 8 3 3 8 ]
Language models generate responses by producing a series of tokens in immediate succession: the (K+1)th token is an outcome of manipulating K hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, K+10 hidden vectors, before it outputs the (K+1)th token? We operationalize this idea by performing training and inference on language models with a (learnable) pause token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate pause-training on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of 18 EM score on the QA task of SQuAD, 8 on CommonSenseQA and 1 accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Joint rain-/detail-aware representation learning context-based modulation mechanism image de-raining contrastive learning
Scores: [ 8 6 6 6 5 ]
Recent advances in image de-raining have focused on training powerful models on mixed multiple datasets comprising diverse rain types and backgrounds. However, this approach tends to overlook the inherent differences between datasets, resulting in suboptimal optimization and poor generalization. To address this limitation, we propose an approach to learn instance-specific de-raining models by exploring meaningful representations that characterize both the rain and background components in rainy images. Leveraging these representations as instructive guidance, we put forth a Context-based Instance-specific Modulation (CoI-M) mechanism which can modulate CNN- or Transformer-based models. Furthermore, we develop a rain-/detail-aware contrastive learning strategy to help extract joint rain-/detail-aware instance-specific representations. By integrating CoI-M with the rain-/detail-aware Contrastive learning, we develop CoIC, an innovative and effective algorithm for training models on mixed datasets. Moreover, CoIC offers insight into modeling relationships of datasets, quantitatively assessing the impact of rain and details on restoration, and revealing different behaviors of models given diverse inputs. Extensive experiments validate the effectiveness of CoIC in boosting the de-raining ability of CNN- and Transformer-based models, as well as significantly improving their generalization ability.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural 3D Reconstruction Specular Reflection
Scores: [ 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Human detection part detection association detection anchor free
Scores: [ 5 6 6 6 8 6 ]
The detection of human parts (e.g., hands, face) and their correct association with individuals is an essential task, e.g., for ubiquitous human-machine interfaces and action recognition. Traditional methodologies often employ multi-stage processes, rely on cumbersome anchor-based systems, or do not scale well to larger part sets. This paper presents PBADet, a novel one-stage, anchor-free approach for part-body association detection. Building upon the anchor-free object representation across multi-scale feature maps, we introduce a singular part-to-body center offset that effectively encapsulates the relationship between parts and their parent bodies. Our design is inherently versatile and capable of managing multiple parts-to-body associations without compromising on detection accuracy or robustness. Comprehensive experiments on various datasets underscore the efficacy of our approach, which not only outperforms existing state-of-the-art techniques but also offers a more streamlined and efficient solution to the part-body association challenge.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: modular neural network; module selection process; adaptive learning
Scores: [ 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: stochasticity distribution shift randomness deep learning variance random seed
Scores: [ 6 5 8 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: adversarial robustness adversarial training adversarial robust transfer
Scores: [ 6 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: language model Generalization
Scores: [ 8 6 8 5 ]
Pretrained large language models (LLMs) are general purpose problem solvers applicable to a diverse set of tasks with prompts. They can be further improved towards a specific task by fine-tuning on a specialized dataset. However, fine-tuning usually makes the model narrowly specialized on this dataset with reduced general in-context learning performances, which is undesirable whenever the fine-tuned model needs to handle additional tasks where no fine-tuning data is available. In this work, we first demonstrate that fine-tuning on a single task indeed decreases LLMs' general in-context learning performance. We discover one important cause of such forgetting, format specialization, where the model overfits to the format of the fine-tuned task. We further show that format specialization happens at the very beginning of fine-tuning. To solve this problem, we propose Prompt Tuning with MOdel Tuning (ProMoT), a simple yet effective two-stage fine-tuning framework that reduces format specialization and improves generalization. ProMoT offloads task-specific format learning into additional and removable parameters by first doing prompt tuning and then fine-tuning the model itself with this soft prompt attached. With experiments on several fine-tuning tasks and 8 in-context evaluation tasks, we show that ProMoT achieves comparable performance on fine-tuned tasks to standard fine-tuning, but with much less loss of in-context learning performances across a board range of out-of-domain evaluation tasks. More importantly, ProMoT can even enhance generalization on in-context learning tasks that are semantically related to the fine-tuned task, e.g. ProMoT on En-Fr translation significantly improves performance on other language pairs, and ProMoT on NLI improves performance on summarization.Experiments also show that ProMoT can improve the generalization performance of multi-task training.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Soft Contrastive Learning Time Series Analysis Self-supervised Learning
Scores: [ 8 6 6 6 ]
Contrastive learning has shown to be effective to learn representations from time series in a self-supervised way.However, contrasting similar time series instances or values from adjacent timestamps within a time series leads to ignore their inherent correlations, which results in deteriorating the quality of learned representations.To address this issue, we propose \textit{SoftCLT}, a simple yet effective soft contrastive learning strategy for time series.This is achieved by introducing instance-wise and temporal contrastive loss with soft assignments ranging from zero to one.Specifically, we define soft assignments for 1) instance-wise contrastive loss by distance between time series on the data space, warping and 2) temporal contrastive loss by the difference of timestamps.SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations without bells and whistles.In experiments, we demonstrate that SoftCLT consistently improves the performance in various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, showing state-of-the-art performance.The code will be released.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: equivariant neural networks lie groups group convolution geometric deep learning
Scores: [ 8 5 6 6 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: multimodal survival prediction computational pathology
Scores: [ 6 5 8 10 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: text-to-speech speech synthesis neural audio codec
Scores: [ 8 8 5 3 8 ]
With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modelling the multiple sequences. To mitigate these issues, we present CLaM-TTS that employs a probabilistic residual vector quantization to 1) achieve superior compression in the token length, and 2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art zero-shot TTS baselines regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performances.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Hallucination large vision-language model multimodal learning
Scores: [ 6 8 6 5 ]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs and found it outperforms the previous best approach in both general object hallucination evaluation metrics, GPT, and human evaluations.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Artificial Intelligence Natural Language Processing Text Generation
Scores: [ 6 8 8 6 ]
Standard language models generate text by selecting tokens from a fixed, finite, and standalone vocabulary. We introduce a novel method that selects context-aware phrases from a collection of supporting documents. One of the most significant challenges for this paradigm shift is determining the training oracles, because a string of text can be segmented in various ways and each segment can be retrieved from numerous possible documents. To address this, we propose to initialize the training oracles using linguistic heuristics and, more importantly, bootstrap the oracles through iterative self-reinforcement. Extensive experiments show that our model not only outperforms standard language models on a variety of knowledge-intensive tasks but also demonstrates improved generation quality in open-ended text generation. For instance, compared to the standard language model counterpart, our model raises the accuracy from 23.47% to 36.27% on OpenbookQA, and improves the MAUVE score from 42.61% to 81.58% in open-ended text generation. Remarkably, our model also achieves the best performance and the lowest latency among several retrieval-augmented baselines. In conclusion, we assert that retrieval is more accurate generation and hope that our work will encourage further research on this new paradigm shift.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Data Augmentation Mixup Image Classification
Scores: [ 8 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Generalizable Neural Radiance Field; Point-based Rendering;
Scores: [ 6 6 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Pose estimation NeRF 3D Reconstruction Transformer
Scores: [ 8 8 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: audio processing; large language model
Scores: [ 8 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: retrieval augmented language model language modeling question answering summarization distillation
Scores: [ 8 6 8 6 ]
Retrieval-augmented language models improve language models (LMs) by retrieving documents and prepending them in-context.However, these documents, often spanning hundreds of words, make inference substantially less efficient. We propose compressing the retrieved documents into textual summaries prior to in-context integration. This not only reduces the computational costs but also relieve the burden of LMs to identify relevant information in long retrieved documents. We present two compressors -- an extractive compressor which selects useful sentences from retrieved documents and an abstractive compressor which generates summary by synthesizing information from multiple documents. Both are trained to achieve performance gain in LMs when we prepend the generated summary from the compressor to LMs' input, while minimizing the summary length. When retrieved documents are irrelevant to the input or offer no additional information to LM, our compressors output an empty string, enabling selective augmentation. We evaluate our approach on the language modeling task and open domain question answering task. We achieve a compression rate of as low as 6% with minimal loss in performance for both tasks, significantly outperforming the off-the-shelf summarization models. We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide a summary largely faithful to the retrieved documents.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large multimodal models in-context-learning evaluation alignment to human preferences
Scores: [ 8 8 3 3 ]
Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural step towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not provide enough clues to understand their real capabilities, limitations, and to which extent such models are aligned to human expectations. To refine our understanding on those flaws, we deviate from the current evaluation paradigm, and (1) evaluate 8 recent open-source LMMs (based on the Flamingo architecture such as OpenFlamingo and IDEFICS) on 5 different axes; hallucinations, abstention, compositionality, explainability and instruction following. Our evaluation on these axes reveals major flaws in LMMs. To efficiently address these problems, and inspired by the success of In-Context Learning (ICL) in LLMs, (2) we explore ICL as a solution, and study how it affects these limitations. Based on our ICL study, (3) we push ICL further, and propose new multimodal ICL approaches such as; Multitask-ICL, Chain-of-Hindsight-ICL and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs flaws is nuanced; despite its effectiveness for improved explainability, abstention and instruction following, ICL does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code will be made public.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Parametric Knowledge Transfer Large Language Model
Scores: [ 8 6 6 ]
Large Language Models (LLMs) inherently encode a wealth of knowledge within their parameters through pre-training on extensive corpora. While prior research has delved into operations on these parameters to manipulate the underlying implicit knowledge—encompassing detection, editing, and merging—there remains an ambiguous understanding regarding their transferability across models with varying scales. In this paper, we seek to empirically investigate knowledge transfer from larger to smaller models through a parametric perspective. To achieve this, we employ sensitivity-based techniques to extract and align knowledge-specific parameters between different LLMs. Moreover, the LoRA module is used as the intermediary mechanism for injecting the extracted knowledge into smaller models. Evaluations across four benchmarks validate the efficacy of our proposed method. Our findings highlight the critical factors contributing to the process of parametric knowledge transfer, underscoring the transferability of model parameters across LLMs of different scales.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Composed Image Retrieval Vision-Language Pre-trained Models
Scores: [ 6 8 6 ]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption. Most existing CIR models adopt the late-fusion strategy to combine visual and language features. Besides, several approaches have also been suggested to generate a pseudo-word token from the reference image, which is further integrated into the relative caption for CIR. However, these pseudo-word-based prompting methods have limitations when target image encompasses complex changes on reference image, e.g., object removal and attribute modification. In this work, we demonstrate that learning an appropriate sentence-level prompt for the relative caption (SPRC) is sufficient for achieving effective composed image retrieval. Instead of relying on pseudo- word-based prompts, we propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts. By concatenating the learned sentence-level prompt with the relative caption, one can readily use existing text-based image retrieval models to enhance CIR performance. Furthermore, we introduce both image-text contrastive loss and text prompt alignment loss to enforce the learning of suitable sentence-level prompts. Experiments show that our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets. The source code and pretrained model will be publicly available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: high-dimensional network embedding quantum mechanics
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: anomaly detection tabular data self-supervised learning
Scores: [ 6 8 6 ]
This paper addresses the problem of anomaly detection in tabular data, which is usually implemented in an one-class classification setting where the training set only contains normal samples. Inspired by the success of masked image/language modeling in vision and natural language domains, we extend masked modeling methods to address this problem by capturing intrinsic correlations between features in training set. Thus, a sample deviate from such correlations are related to a high possibility of anomaly. To obtain multiple and diverse correlations, we propose a novel masking strategy which generates multiple masks by learning, and design a diversity loss to reduce the similarity of different masks. Extensive experiments show our method achieves state-of-the-art performance. We also discuss the interpretability from the perspective of each individual feature and correlations between features.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Multimodal learning vision-language tasks frozen LLMs optimal transport assignment prediction
Scores: [ 5 5 6 5 8 ]
While pretrained large language models (LLMs) excel in understanding linguistic contexts, it is still an open question: Can LLMs extend their capabilities beyond linguistic contexts to non-linguistic information? This paper introduces VLAP, a novel approach that bridges vision encoders and language models through assignment prediction. Since the LLMs interpret and reason linguistic information from correlations between word embeddings, we harness the well-established word embeddings to map visual representations into language space. Specifically, we simultaneously assign the visual and text representations to a set of word embeddings within LLMs. We propose a new training objective, optimal transport-based assignment prediction, to enforce the consistency of word assignments for paired multimodal data. This allows frozen LLMs to ground their word embedding space in visual data and use their robust semantic taxonomy visually. Moreover, VLAP is memory- and parameter-efficient in that it trains only a single linear layer, and works without extra embedding space (e.g. learnable prototypes) for the assignment prediction. Experimental results show that VLAP achieves substantial improvements over the previous linear transformation-based methods across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Time series Transformer Multi-scale
Scores: [ 6 6 8 ]
Transformer-based models have achieved significant success in time series forecasting. Existing methods mainly model time series from limited or fixed scales, making it challenging to capture different characteristics spanning various scales. In this paper, we propose multi-scale transformers with adaptive pathways (Pathformer). The proposed Transformer integrates both temporal resolution and temporal distance for multi-scale modeling. Multi-scale division divides the time series into different temporal resolutions using patches of various sizes. Based on the division of each scale, dual attention is performed over these patches to capture global correlations and local details as temporal dependencies. We further enrich the multi-scale transformer with adaptive pathways, which adaptively adjust the multi-scale modeling process based on the varying temporal dynamics in the input time series, improving the prediction accuracy and generalization of Pathformer. Extensive experiments on nine real-world datasets demonstrate that Pathformer not only achieves state-of-the-art performance by surpassing all current models but also exhibits stronger generalization abilities under various transfer scenarios.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Visual Language Model Robotics Imitation Learning
Scores: [ 6 6 6 8 ]
Recent progress in vision language foundation models has shown their ability to understand multimodal data and resolve complicated vision language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data.To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control.Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy. Our code will be made public upon acceptance.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Semantic Segmentation Multi-Scale Representations Learning Attention
Scores: [ 6 5 6 6 ]
Learning multi-scale representations is central to semantic segmentation. We visualize the effective receptive field (ERF) of canonical multi-scale representations and point out two risks in learning them: \textit{scale inadequacy} and \textit{field inactivation}. To address these issues, a novel multi-scale learner, \textbf{varying window attention} (VWA), is presented. VWA leverages the local window attention (LWA) and disentangles LWA into the query window and context window, allowing the scale of context to vary for the query to learn representations at specific scales. However, varying the context to large-scale windows (enlarging ratio R) can significantly increase the memory footprint and computation cost (R2 times larger than LWA). We propose a simple but professional re-scaling strategy to zero the extra induced cost without compromising any performance. In consequence, VWA shows great superiority to previous multi-scale learners. Furthermore, building upon VWA and employing various MLPs, we introduce a multi-scale decoder (MSD), \textbf{VWFormer}, to improve learning multi-scale representations in semantic segmentation. VWFormer achieves efficiency competitive with the most compute-friendly MSDs, like FPN and MLP decoder, but performs much better than any MSDs. For instance, at little extra overhead, $\sim 10$G FLOPs, VWFormer improves Mask2Former by 1.0%−1.3% mIoU. Using only half of the computation, VWFormer outperforms the popular UperNet by 1.0%−2.1% mIoU.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Cross-modal Audio-text Retrieval Representation Learning
Scores: [ 8 6 6 ]
Learning-to-match (LTM) is an effective inverse optimal transport framework for learning the underlying ground metric between two sources of data, which can be further used to form the matching between them. Nevertheless, the conventional LTM framework is not scalable since it needs to use the entire dataset each time updating the parametric ground metric. To adapt the LTM framework to the deep learning setting, we propose the mini-batch learning-to-match (m-LTM) framework for audio-text retrieval problems, which is based on mini-batch subsampling and neural networks parameterized ground metric. In addition, we improve further the framework by introducing the Mahalanobis-enhanced family of ground metrics. Moreover, to cope with the noisy data correspondence problem arising from practice, we additionally propose a variant using partial optimal transport to mitigate the pairing uncertainty in training data. We conduct extensive experiments on audio-text matching problems usingthree datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning rich and expressive joint embedding space, which achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap across audio and text embedding, which surpasses both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Finally, our strategy to use partial OT with m-LTM has shown to be more noise tolerance than contrastive loss under a variant of noise ratio of training data in AudioCaps.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Ordinal Regression Contrastive Learning
Scores: [ 5 6 6 6 6 ]
Contrastive learning, while highly effective for a lot of tasks, shows limited improvement in ordinal regression. We find that the limitation comes from the predefined strong data augmentations employed in contrastive learning. Intuitively, for ordinal regression datasets, the discriminative information (ordinal content information) contained in instances is subtle. The strong augmentations can easily overshadow or diminish this ordinal content information. As a result, when contrastive learning is used to extract common features between weakly and strongly augmented images, the derived features often lack this essential ordinal content, rendering them less useful in training models for ordinal regression. To improve contrastive learning's utility for ordinal regression, we propose a novel augmentation method to replace the predefined strong argumentation based on the principle of "minimal change". Our method is designed in a generative manner that can effectively generate images with different styles but contains desired ordinal content information. Extensive experiments validate the effectiveness of our proposed method, which serves as a plug-and-play solution and consistently improves the performance of existing state-of-the-art methods in ordinal regression tasks.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Speech Representation Learning Self-supervised Learning Multi-resolution
Scores: [ 8 8 8 8 ]
Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech signals. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce a SSL model that leverages a hierarchical Transformer architecture, complemented by HuBERT-style masked prediction objectives, to process speech at multiple resolutions. Experimental results indicate that the proposed model not only achieves more efficient inference but also exhibits superior or comparable performance to the original HuBERT model over various tasks. Specifically, significant performance improvements over the original HuBERT have been observed in fine-tuning experiments on the LibriSpeech speech recognition benchmark as well as in evaluations using the Speech Universal PERformance Benchmark (SUPERB) and Multilingual SUPERB (ML-SUPERB).
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Inverse problems Proximal operators Plug-and-play Explicit regularizer Convergent PnP Input convex neural networks
Scores: [ 5 5 8 5 ]
Proximal operators are ubiquitous in inverse problems, commonly appearing as part of algorithmic strategies to regularize problems that are otherwise ill-posed. Modern deep learning models have been brought to bear for these tasks too, as in the framework of plug-and-play or deep unrolling, where they loosely resemble proximal operators. Yet, these do not provide any guarantee that these general functions, implemented by neural networks, provide a proximal operator of some function, nor do they provide any characterization of the function of which they provide some approximate proximal. Herein we provide a framework to develop learned proximal networks (LPN), which provide exact proximal operators for a data-driven regularizer, and show how a new training strategy, dubbed proximal matching, guarantees that the obtained regularizer recovers the log-prior of the true data distribution. Thus, such LPN provide general, unsupervised, proximal operators that can be used for general inverse problems. We illustrate our results in a series of cases of increasing complexity. demonstrating that these models not only result in state-of-the-art restoration results, but provide a window into the resulting priors learned from data.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Implicit Neural Representation Coordinate Network Multi-resolution Efficient Inference
Scores: [ 8 6 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision-Language Models Zero-Shot Generalization Test-Time Adaptation CLIP
Scores: [ 8 6 6 ]
One fascinating aspect of pre-trained vision-language models (VLMs) learning under language supervision is their impressive zero-shot generalization capability.However, this ability is hindered by distribution shifts between the training and testing data.Previous test time adaptation (TTA) methods for VLMs in zero-shot classification rely on minimizing the entropy of model outputs, tending to be stuck in incorrect model predictions.In this work, we propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.Specifically, a CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.Given a single test sample,the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution.The proposed \textit{reinforcement learning with CLIP feedback~(RLCF)} framework is highly flexible and universal.Beyond the classification task, with task-specific sampling strategies and a proper reward baseline choice, RLCF can be easily extended to not only discrimination tasks like retrieval but also generalization tasks like image captioning,improving the zero-shot generalization capacity of VLMs.According to the characteristics of these VL tasks, we build different fully TTA pipelines with RLCF to improve the zero-shot generalization ability of various VLMs.Extensive experiments along with promisingempirical results demonstrate the effectiveness of RLCF.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision Transformers Classification Detection Instance Segmentation Semantic Segmentation
Scores: [ 6 6 5 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: High Dynamic Range Imaging; Self-Supervised Learning; Multi-Exposure Images
Scores: [ 8 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Models Model Fusion
Scores: [ 6 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Grounding Reasoning Language Models
Scores: [ 6 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Models Chain-of-Thought Textual Reasoning
Scores: [ 5 8 6 8 ]
While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our data instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: graph neural networks foundation model knowledge graph link prediction knowledge graph reasoning
Scores: [ 6 8 5 8 ]
Foundation models in language and vision have the ability to run inference on any textual and visual inputs thanks to the transferable representations such as a vocabulary of tokens in language. Knowledge graphs (KGs) have different entity and relation vocabularies that generally do not overlap.The key challenge of designing foundation models on KGs is to learn such transferable representations that enable inference on any graph with arbitrary entity and relation vocabularies.In this work, we make a step towards such foundation models and present ULTRA, an approach for learning universal and transferable graph representations. ULTRA builds relational representations as a function conditioned on their interactions.Such a conditioning strategy allows a pre-trained ULTRA model to inductively generalize to any unseen KG with any relation vocabulary and to be fine-tuned on any graph.Conducting link prediction experiments on 57 different KGs, we find that the zero-shot inductive inference performance of a single pre-trained ULTRA model on unseen graphs of various sizes is often on par or better than strong baselines trained on specific graphs. Fine-tuning further boosts the performance.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: generalization medical images scaling law intrinsic dimension medical image analysis
Scores: [ 6 8 8 6 ]
This paper investigates discrepancies in how neural networks learn from different imaging domains, which are commonly overlooked when adopting computer vision techniques from the domain of natural images to other specialized domains such as medical images. Recent works have found that the generalization error of a trained network typically increases with the intrinsic dimension (ddata) of its training set. Yet, the strength of this relationship varies significantly between medical (radiological) and natural imaging domains: medical images are typically more challenging to generalize to, with no existing theoretical explanation. We address this gap in knowledge by establishing and empirically validating a generalization scaling law with respect to ddata, and propose that the substantial scaling discrepancy between the two considered domains may be at least partially attributed to the higher intrinsic label sharpness'' (KF) of medical imaging datasets, a metric which we propose. Next, we demonstrate an additional benefit of measuring the label sharpness of a training set: it is negatively correlated with the trained model's adversarial robustness, which notably leads to models for medical images having a substantially higher vulnerability to adversarial attack. Finally, we extend our ddata formalism to the related metric of learned representation intrinsic dimension (drepr), derive a generalization scaling law with respect to drepr, and show that ddata serves as an upper bound for drepr. Our theoretical results are supported by thorough experiments with six models and eleven natural and medical imaging datasets over a range of training set sizes. Our findings offer insights into the influence of intrinsic dataset properties on generalization, representation learning, and robustness in deep neural networks.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: EfficientMod Efficient Networks
Scores: [ 6 8 5 5 ]
In this work, we present efficient modulation, a novel design for efficient vision networks. We revisit the modulation mechanism, which operates input through convolutional context modeling and feature projection layers, and fuses features via element-wise multiplication and an MLP block. We demonstrate that the abstracted modulation mechanism is particularly well suited for efficient networks and further tailor the modulation design by proposing the efficient modulation (EfficientMod) block, which is considered the essential building block for our networks. Bene- fiting from the prominent representational ability of modulation mechanism and the efficiency of efficient modulation design, our network can accomplish better accuracy-efficiency trade-offs and set new state-of-the-art performance for efficient networks. When integrating EfficientMod block with the vanilla self-attention block, we obtain the hybrid architecture and further improve the performance without sacrificing the efficiency. We carry out comprehensive experiments to verify EfficientMod’s performance. With fewer parameters, our EfficientMod-s performs 0.6 top-1 accuracy better than the prior state-of-the-art approach EfficientFormerV2-s2 without any training tricks and is 25% faster on GPU. Additionally, our method presents a notable improvement in downstream tasks, outperforming EfficientFormerV2-s by 3.6 mIoU on the ADE20K benchmark. Codes and checkpoints are available in the supplementary material.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: point cloud point cloud pre-training multi-view representation
Scores: [ 6 6 6 6 ]
A promising direction for pre-training 3D point clouds is to leverage the massive amount of data in 2D, whereas the domain gap between 2D and 3D creates a fundamental challenge. This paper proposes a novel approach to point-cloud pre-training that learns 3D representations by leveraging pre-trained 2D networks. Different from the popular practice of predicting 2D features first and then obtaining 3D features through dimensionality lifting, our approach directly uses a 3D network for feature extraction. We train the 3D feature extraction network with the help of the novel 2D knowledge transfer loss, which enforces the 2D projections of the 3D feature to be consistent with the output of pre-trained 2D networks. To prevent the feature from discarding 3D signals, we introduce the multi-view consistency loss that additionally encourages the projected 2D feature representations to capture pixel-wise correspondences across different views. Such correspondences induce 3D geometry and effectively retain 3D features in the projected 2D features. Experimental results demonstrate that our pre-trained model can be successfully transferred to various downstream tasks, including 3D shape classification, part segmentation, 3D object detection, and semantic segmentation, achieving state-of-the-art performance.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural Radiance Fields Bundle Adjustment Rolling Shutter
Scores: [ 3 8 3 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: multi-modal learning model pruning layer-wise pruning
Scores: [ 5 6 6 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Learning disentangled representations Generalization Weak supervised learning Appliance usage Electricity Multi-modal learning
Scores: [ 8 8 1 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: camouflaged object detection adversarial training image segmentation
Scores: [ 8 5 5 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: NeRF Head Avatar Dynamic NeRF
Scores: [ 6 6 6 ]
Head avatar reconstruction, crucial for applications in virtual reality, online meetings, gaming, and film industries, has garnered substantial attention within the computer vision community. The fundamental objective of this field is to faithfully recreate the head avatar and precisely control expressions and postures. Existing methods, categorized into 2D-based warping, mesh-based, and neural rendering approaches, present challenges in maintaining multi-view consistency, incorporating non-facial information, and generalizing to new identities. In this paper, we propose a framework named GPAvatar that reconstructs 3D head avatars from one or several images in a single forward pass. The key idea of this work is to introduce a dynamic point-based expression field driven by a point cloud to precisely and effectively capture expressions. Furthermore, we use a Multi Tri-planes Attention (MTA) fusion module in tri-planes canonical field to leverage information from multiple input images. The proposed method achieves faithful identity reconstruction, precise expression control, and multi-view consistency, demonstrating promising results for free-viewpoint rendering and novel view synthesis.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Local motion deblurring transformer deep learning
Scores: [ 8 6 3 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: RGB-D Semantic Segmentation
Scores: [ 8 6 3 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Blind Face Restoration Vector Quantization Codebook Prior Feature Encoding
Scores: [ 5 5 8 8 8 ]
Restoring facial details from low-quality (LQ) images has remained challenging due to the nature of the problem caused by various degradations in the wild. The codebook prior has been proposed to address the ill-posed problems by leveraging an autoencoder and learned codebook of high-quality (HQ) features, achieving remarkable quality.However, existing approaches in this paradigm frequently depend on a single encoder pre-trained on HQ data for restoring HQ images, disregarding the domain gap and distinct feature representations between LQ and HQ images.As a result, encoding LQ inputs with the same encoder could be insufficient, resulting in imprecise feature representation and leading to suboptimal performance.To tackle this problem, we propose a novel dual-branch framework named DAEFR. Our method introduces an auxiliary LQ branch that extracts domain-specific information from the LQ inputs. Additionally, we incorporate association training to promote effective synergy between the two branches, enhancing code prediction and restoration quality.We evaluate the effectiveness of DAEFR on both synthetic and real-world datasets, demonstrating its superior performance in restoring facial details.Source codes and trained models will be publicly released.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: uncertainty quantification uncertainty estimation calibration failure prediction large language models black-box language models LLM evaluation
Scores: [ 5 8 6 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision Transformer Token Reduction
Scores: [ 8 6 8 8 ]
Vision Transformers (ViTs) are now flourishing in the computer vision area. Despite the remarkable success, ViTs suffer from high computational cost, which greatly hinders their practical usage. Token reduction, which identifies and discards unimportant tokens during forward propagation, has then been proposed to make ViTs more efficient. For token reduction methodologies, a scoring metric is essential to distinguish between important and unimportant tokens. The attention score from the [CLS] token, which takes the responsibility to aggregate useful information and form the final output, has been established by prior works as an advantageous choice. Nevertheless, whereas the task pressure is applied at the end of the whole model, token reduction generally starts from very early blocks. Given the long distance in between, in the early blocks [CLS] token lacks the impetus to gather task-relevant information, causing somewhat arbitrary attention score allocation. This phenomenon, in turn, degrades the reliability of token scoring and substantially compromises the effectiveness of token reduction methods. Inspired by advances in the domain of dynamic neural networks, in this paper, we introduce Multi-Exit Token Reduction (METR), a simple romance between multi-exit architecture and token reduction—two areas previously considered orthogonal. By injecting early task pressure via multi-exit loss, the [CLS] token is spurred to collect task-related information in even early blocks, thus bolstering the credibility of [CLS] attention as a token-scoring metric. Additionally, we employ self-distillation to further refine the quality of early supervision. Extensive experiments substantiate both the existence and effectiveness of the newfound chemistry. Comparative assessments also indicate that METR outperforms state-of-the-art token reduction methods on standard benchmarks, especially under aggressive reduction ratio. Codes will be released.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: representation learning for computer vision image classification vision-language models large language models CLIP GPT-3 LLAMA
Scores: [ 6 8 5 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: alignment language model invariant learning
Scores: [ 8 10 6 6 ]
The success of AI assistants based on language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As universal AI assistants, there's a growing expectation for them to perform consistently across various domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples.This focus on quick reward gains undermines both the stability in training and the model's ability to generalize to new, unseen data.In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains. Given the challenges associated with acquiring group annotations, our method automatically classifies data into different groups, deliberately maximizing performance variance.Then, we optimize the policy to perform well on challenging groups. Lastly, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Probabilistic embedding image-text matching cross-modal retrieval vision-language
Scores: [ 6 8 6 ]
Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from the inherent ambiguity caused by the multiplicity and the imperfect annotations. Deterministic functions are not sufficiently powerful to capture ambiguity, prompting a focus on probabilistic embeddings to tackle the challenge. However, the existing probabilistic ITM approach encounters two drawbacks; heavy computations due to expensive Monte Carlo approximation and the loss saturation under abundant false negatives. To overcome the issues, we propose improved Probabilistic Cross-Modal Embeddings (PCME++) by introducing a new probabilistic distance with a closed-form solution. We propose two optimization techniques to enhance PCME++ further; first, we incorporate pseudo-positives to prevent the loss saturation problem under massive false negatives; second, we apply mixed sample data augmentation for probabilistic matching. Our experiments on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. We also evaluate the robustness of PCME++ under noisy image-text correspondences. In addition, we show the potential applicability of PCME++ in automatic prompt tuning for zero-shot classification. Code is available at Supplementary Materials.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: language model large language model few-shot learning natural language processing in-context learning
Scores: [ 8 6 6 6 ]
Despite the promising few-shot ability of large language models (LLMs), the standard paradigm of In-context Learning (ICL) suffers the disadvantages of susceptibility to selected demonstrations and the intricacy to generate these demonstrations. In this paper, we raise the fundamental question that whether human-generated demonstrations are necessary for ICL. To answer this question, we propose self-contemplation prompting strategy (SEC), a paradigm free from human-crafted demonstrations. The key point of SEC is that, instead of using hand-crafted examples as demonstrations in ICL, SEC asks LLMs to first create demonstrations on their own, based on which the final output is generated. SEC is a flexible framework and can be adapted to both the vanilla ICL and the chain-of-thought (CoT), but with greater ease: as the manual-generation process of both examples and rationale can be saved. Extensive experiments in arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks, show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy, and achieves comparable results to ICL with hand-crafted demonstrations. This demonstrates that, for many tasks, contemporary LLMs possess a sufficient level of competence to exclusively depend on their own capacity for decision making, removing the need for external training data.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: out-of-domain knowledge distillation mixup learning domain shift uncertainty
Scores: [ 6 6 6 6 ]
Due to privacy or patent concerns, a growing number of large models are released without granting access to their training data, making transferring their knowledge inefficient and problematic. In response, Data-Free Knowledge Distillation (DFKD) methods have emerged as direct solutions. However, simply adopting models derived from DFKD for real-world applications suffers significant performance degradation, due to the discrepancy between teachers' training data and real-world scenarios (student domain). The degradation stems from the portions of teachers' knowledge that are not applicable to the student domain. They are specific to the teacher domain and would undermine students' performance. Hence, selectively transferring teachers' appropriate knowledge becomes the primary challenge in DFKD. In this work, we propose a simple but effective method AuG-KD. It utilizes an uncertainty-guided and sample-specific anchor to align student-domain data with the teacher domain and leverages a generative method to progressively trade off the learning process between OOD knowledge distillation and domain-specific information learning via mixup learning. Extensive experiments in 3 datasets and 8 settings demonstrate the stability and superiority of our approach.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Machine Translation Language Language Models Multilingual
Scores: [ 8 5 6 8 ]
Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially those with moderate model sizes (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models. Previous studies have attempted to improve the translation capabilities of these LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on.Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced Language Model-based trAnslator (ALMA). Based on LLaMA-2 as our underlying model, our results show that the model can achieve an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. The performance is significantly better than all prior work and even superior to the NLLB-54B model \citep{nllb} and GPT-3.5-text-davinci-003, with only 7B or 13B parameters. This method establishes the foundation for a novel training paradigm in machine translation.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Computation and Language; Programming Languages; Software Engineering; Machine Learning
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: ASR CTC seq2seq low latency WER
Scores: [ 8 8 6 6 ]
Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It learns the alignments between the input and output sequences, by marginalizing over the perfect alignments (that yield the ground truth), at the expense of the imperfect ones. This dichotomy, and in particular the equal treatment of all perfect alignments, results in a lack of controllability over the predicted alignments. This controllability is essential for capturing properties that hold significance in real-world applications.Here we propose Align With Purpose (AWP), a general Plug-and-Play framework for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC loss with an additional loss term that prioritizes alignments according to a desired property. AWP does not require any intervention in the CTC loss function, and allows to differentiate between both perfect and imperfect alignments for a variety of properties. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: token emission time for latency optimization and word error rate (WER). For the former, we report an improvement of up to 590ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% in WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on this scale of data. Notably, our method can be easily implemented using only a few lines of code and can be extended to other alignment-free loss functions and to domains other than ASR.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D object detection LiDAR point cloud pre-training autonomous driving self-supervised learning
Scores: [ 6 6 8 8 ]
Accurate 3D object detection and understanding for self-driving cars heavily relies on LiDAR point clouds, necessitating large amounts of labeled data to train. In this work, we introduce an innovative pre-training approach, Grounded Point Colorization (GPC), to bridge the gap between data and labels by teaching the model to colorize LiDAR point clouds, equipping it with valuable semantic cues. To tackle challenges arising from color variations and selection bias, we incorporate color as "context" by providing ground-truth colors as hints during colorization.Experimental results on the KITTI and Waymo datasets demonstrate GPC's remarkable effectiveness. Even with limited labeled data, GPC significantly improves fine-tuning performance; notably, on just 20% of the KITTI dataset, GPC outperforms training from scratch with the entire dataset. In sum, we introduce a fresh perspective on pre-training for 3D object detection, aligning the objective with the model's intended role and ultimately advancing the accuracy and efficiency of 3D object detection for autonomous vehicles.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: hyperbolic neural networks hyperbolic image embedding hyperbolic vision models. hyperboloid representation
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural Set Function Hierarchical Structure Invariance Subset Selection
Scores: [ 6 5 6 ]
Learning neural subset selection tasks, such as compound selection in AI-aided drug discovery, have become increasingly pivotal across diverse applications. The existing methodologies in the field primarily concentrate on constructing models that capture the relationship between utility function values and subsets within their respective supersets. However, these approaches tend to overlook the valuable information contained within the superset when utilizing neural networks to model set functions. In this work, we address this oversight by adopting a probabilistic perspective. Our theoretical findings demonstrate that when the target value is conditioned on both the input set and subset, it is essential to incorporate an invariant sufficient statistic of the superset into the subset of interest for effective learning. This ensures that the output value remains invariant to permutations of the subset and its corresponding superset, enabling identification of the specific superset from which the subset originated. Motivated by these insights, we propose a simple yet effective information aggregation module designed to merge the representations of subsets and supersets from a permutation invariance perspective. Comprehensive empirical evaluations across diverse tasks and datasets validate the enhanced efficacy of our approach over conventional methods, underscoring the practicality and potency of our proposed strategies in real-world contexts.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural Surface Reconstruction Camera Pose Estimation Neural Radiance Field
Scores: [ 6 5 8 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: perceptual metrics lpips perceptual embeddings representation learning self-supervised learning FR-IQA image quality assessment compression data-free ssim bapps 2afc jnd
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: isotropy LLMs outlier dimensions
Scores: [ 3 8 8 ]
Given the success of Large Language Models (LLMs), there has been considerable interest in studying the properties of model activations. The literature overwhelmingly agrees that LLM representations are dominated by a few outlier dimensions'' with exceedingly high variance and magnitude. Several studies in Natural Language Processing (NLP) have sought to mitigate the impact of such outlier dimensions and force LLMs to be isotropic (i.e., have uniform variance across all dimensions in embedding space). Isotropy is thought to be a desirable property for LLMs that improves model performance and more closely aligns textual representations with human intuition. However, many claims regarding isotropy in NLP have been based on the average cosine similarity of embeddings, which has recently been shown to be a flawed measure of isotropy. In this paper, we propose I-STAR: IsoScore${\star}−basedSTableAnisotropicRegularization,anovelregularizationmethodthatcanbeusedtoincreaseordecreaselevelsofisotropyinembeddingspaceduringtraining.I−STARusesIsoScore{\star}$, the first accurate measure of isotropy that is both differentiable and stable on mini-batch computations. In contrast to several previous works, we find that \textit{decreasing} isotropy in contextualized embeddings improves performance on the majority of tasks and models considered in this paper.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Human Motion Generation Style Transfer Generative Model
Scores: [ 5 6 6 6 ]
Human motion stylization aims to revise the style of an input motion while keeping its content unaltered. Unlike existing works that operate directly in pose space, we leverage the \textit{latent space} of pretrained autoencoders as a more expressive and robust representation for motion extraction and infusion. Building upon this, we present a novel \textit{generative} model that produces diverse stylization results of a single motion (latent) code. During training, a motion code is decomposed into two coding components: a deterministic content code, and a probabilistic style code adhering to a prior distribution; then a generator massages the random combination of content and style codes to reconstruct the corresponding motion codes. Our approach is versatile, allowing the learning of probabilistic style space from either style labeled or unlabeled motions, providing notable flexibility in stylization as well. In inference, users can opt to stylize a motion using style cues from a reference motion or a label. Even in the absence of explicit style input, our model facilitates novel re-stylization by sampling from the unconditional style prior distribution. Experimental results show that our proposed stylization models, despite their lightweight design, outperform the state-of-the-arts in style reeanactment, content preservation, and generalization across various applications and settings.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Scene Representation Video Analysis Motion Analysis Neural Rendering 3D Vision
Scores: [ 6 6 6 6 ]
We study macro motion analysis, where macro motion refers to the collection of all visually observable motions in a dynamic scene. Traditional filtering-based methods on motion analysis typically focus only on local and tiny motions, yet fail to represent large motions or 3D scenes. Recent dynamic neural representations can faithfully represent motions using correspondences, but they cannot be directly used for motion analysis. In this work, we propose Phase-based neural polynomial Gabor fields (Phase-PGF), which learns to represent scene dynamics with low-dimensional time-varying phases. We theoretically show that Phase-PGF has several properties suitable for macro motion analysis. In our experiments, we collect diverse 2D and 3D dynamic scenes and show that Phase-PGF enables dynamic scene analysis and editing tasks including motion loop detection, motion factorization, motion smoothing, and motion magnification.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Equivariance Point Clouds Message Passing Neural Network Molecules Diffusion Model
Scores: [ 5 6 8 ]
Based on the theory of homogeneous spaces we derive geometrically optimal edge attributes to be used within the flexible message passing framework. We formalize the notion of weight sharing in convolutional networks as the sharing of message functions over point-pairs that should be treated equally. We define equivalence classes of point-pairs that are identical up to a transformation in the group and derive attributes that uniquely identify these classes. Weight sharing is then obtained by conditioning message functions on these attributes. As an application of the theory, we develop an efficient equivariant group convolutional network for processing 3D point clouds. The theory of homogeneous spaces tells us how to do group convolutions with feature maps over the homogeneous space of positions R3, position and orientations R3×S2, and the group SE(3) itself. Among these, R3×S2 is an optimal choice due to the ability to represent directional information, which R3 methods cannot, and it significantly enhances computational efficiency compared to indexing features on the full SE(3) group. We empirically support this claim by reaching state-of-the-art results --in accuracy and speed-- on three different benchmarks: interatomic potential energy prediction, trajectory forecasting in N-body systems, and generating molecules via equivariant diffusion models.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Prompt Learning Quaternion Networks Multimodal Feature Fusion Zero-Shot.
Scores: [ 8 6 6 ]
Multimodal pre-trained models have shown impressive potential in enhancing performance on downstream tasks. However, existing fusion strategies for modalities primarily rely on explicit interaction structures that fail to capture the diverse aspects and patterns inherent in input data. This yields limited performance in zero-shot contexts, especially when fine-grained classifications and abstract interpretations are required. To address this, we propose an effective approach, namely Prompt Learning with Quaternion Networks (QNet), for semantic alignment across diverse modalities. QNet employs a quaternion hidden space where the mutually orthogonal imaginary axes capture rich intermodal semantic spatial correlations from various perspectives. Hierarchical features across multilayers are utilized to encode intricate interdependencies within various modalities with reduced parameters. Our experiments on 11 datasets demonstrate that QNet outperforms state-of-the-art prompt learning techniques in base-to-novel generalization, cross-dataset transfer, and domain transfer scenarios with fewer learnable parameters. Our code and models will be publicly available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: language-based planning procedural/script knowledge distillation large language models decoding-time algorithm
Scores: [ 6 8 6 6 ]
Procedural planning, which entails decomposing a high-level goal into a sequence of temporally ordered steps, is an important yet intricate task for machines. It involves integrating common-sense knowledge to reason about complex and often contextualized situations, e.g. scheduling a doctor's appointment without a phone''. While current approaches show encouraging results using large language models (LLMs), they are hindered by drawbacks such as costly API calls and reproducibility issues. In this paper, we advocate planning using smaller language models. We present PlaSma, a novel two-pronged approach to endow small language models with procedural knowledge and (constrained) language-based planning capabilities. More concretely, we develop symbolic procedural knowledge distillation to enhance the commonsense knowledge in small language models and aninference-time algorithm to facilitate more structured and accurate reasoning. In addition, we introduce a new related task, Replanning, that requires a revision of a plan to cope with a constrained situation. In both the planning and replanning settings, we show that orders-of-magnitude smaller models (770M-11B parameters) can compete and often surpass their larger teacher models' capabilities. Finally, we showcase successful application of PlaSma in an embodied environment, VirtualHome.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Reward Model Large Language Model Tool Learning Augmented Language Model
Scores: [ 8 8 6 ]
Reward modeling (a.k.a. preference modeling) is instrumental for aligning large language models with human preferences, particularly within the context of reinforcement learning from human feedback (RLHF). While conventional reward models (RMs) have exhibited remarkable scalability, they oft struggle with fundamental functionality such as arithmetic computation, code execution, and factual lookup. In this paper, we propose a tool-augmented preference modeling approach, named Themis, to address these limitations by empowering RMs with access to external environments, including calculators and search engines. This approach not only fosters synergy between tool utilization and reward grading but also enhances interpretive capacity and scoring reliability. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources and construct task-specific tool engagement and reasoning traces in an autoregressive manner. We validate our approach across a wide range of domains, incorporating seven distinct external tools. Our experimental results demonstrate a noteworthy overall improvement of 17.7% across eight tasks in preference ranking. Furthermore, our approach outperforms Gopher 280B by 7.3% on TruthfulQA task in zero-shot evaluation. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines across four distinct tasks. Additionally, we provide a comprehensive collection of tool-related RM datasets, incorporating data from seven distinct tool APIs, totaling 15,000 instances. We anticipate that this publicly available dataset will facilitate and inspire further research advancements in the field.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: NLP Transformer Reinforcement Learning
Scores: [ 5 10 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Model (LLM) instruction tuning multi-modality learning
Scores: [ 5 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large language models automatic speech recognition generative error correction noise-robustness
Scores: [ 8 6 10 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: machine text detection large language models AI safety natural language processing stylistic representations deep learning machine learning
Scores: [ 6 5 3 6 ]
The advent of instruction-tuned language models that convincingly mimic human writing poses a significant risk of abuse. For example, such models could be used for plagiarism, disinformation, spam, or phishing. However, such abuse may be counteracted with the ability to detect whether a piece of text was composed by a language model rather than a human. Some previous approaches to this problem have relied on supervised methods trained on corpora of confirmed human and machine-written documents. Unfortunately, model under-specification poses an unavoidable challenge for such detectors, making them brittle in the face of data shifts, such as the release of further language models producing still more fluent text than the models used to train the detectors. Other previous approaches require access to the models that generated the text to be detected at inference or detection time, which is often impractical. In light of these challenge, we pursue a fundamentally different approach not relying on samples from language models of concern at training time. Instead, we propose to leverage representations of writing style estimated from human-authored text. Indeed, we find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors, including state of the art large language models like Llama 2, ChatGPT, and GPT-4. Furthermore, given handfuls of examples composed by each of several specific language models of interest, our approach affords the ability to predict which model specifically generated a given document.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large language model knowledge grounding
Scores: [ 6 6 6 6 ]
We present chain-of-knowledge (CoK), a novel framework that augments large language models (LLMs) by dynamically incorporating grounding information from heterogeneous sources. It results in more factual rationales and reduced hallucination in generation. Specifically, CoK consists of three stages: reasoning preparation, dynamic knowledge adapting, and answer consolidation. Given a knowledge-intensive question, CoK first prepares several preliminary rationales and answers while identifying the relevant knowledge domains.If there is no majority consensus among the answers from samples, CoK corrects the rationales step by step by adapting knowledge from the identified domains.These corrected rationales can plausibly serve as a better foundation for the final answer consolidation.Unlike prior studies that primarily use unstructured data, CoK also leverages structured knowledge sources such as Wikidata and tables that provide more reliable factual information.To access both unstructured and structured knowledge sources in the dynamic knowledge adapting stage, we propose an adaptive query generator that allows the generation of queries for various types of query languages, including SPARQL, SQL, and natural sentences. Moreover, to minimize error propagation between rationales, CoK corrects the rationales progressively using preceding corrected rationales to generate and correct subsequent rationales.Extensive experiments show that CoK consistently improves the performance of LLMs on knowledge-intensive tasks across different domains.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Auxiliary Learning; Neural Architecture Search; Soft Parameter Sharing; Multi-Task Learning; Single Task Inference Cost
Scores: [ 8 8 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Representation Learning Output-Domain Focused Inductive Bias Classification
Scores: [ 6 8 6 ]
Most neural networks for classification primarily learn features differentiated by input-domain related information such as visual similarity of objects in an image. While this focus is natural behavior, it can inadvertently introduce an inductive bias that conflicts with unseen relations in an implicit output-domain determined by human labeling based on their own world knowledge. Such conflicts can limit generalization of models by potential dominance of the input-domain focused bias in inference.To overcome this limitation without external resources, we introduce Output-Domain focused Biasing (ODB) training strategy that constructs inductive biases on features differentiated by only output labels. It has four steps: 1) it learns intermediate latent object features in an unsupervised manner; 2) it decouples their visual dependencies by assigning new independent embedding parameters; 3) it captures structured features optimized for the original classification task; and 4) it integrates the structured features with the original visual features for the final prediction.We implement the ODB on a vision transformer architecture, and achieved significant improvements on image classification benchmarks. This paper offers a straightforward and effective method to obtain and utilize output-domain focused inductive bias for classification mapping two different domains.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision-Language Models Large Language Models Prompting Multimodal Fine-grained Visual Recognition
Scores: [ 6 6 8 ]
Identifying subordinate-level categories from images is a longstanding task in computer vision and is referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications since an average layperson does not excel at differentiating species of birds or mushrooms due to subtle differences among the species. A major bottleneck in developing FGVR systems is caused by the need of high-quality paired expert annotations. To circumvent the need of expert knowledge we propose Fine-grained Semantic Category Reasoning (FineR) that internally leverages the world knowledge of large language models (LLMs) as a proxy in order to reason about fine-grained category names. In detail, to bridge the modality gap between images and LLM, we extract part-level visual attributes from images as text and feed that information to a LLM. Based on the visual attributes and its internal world knowledge the LLM reasons about the subordinate-level category names. Our training-free FineR outperforms several state-of-the-art FGVR and language and vision assistant models and shows promise in working in the wild and in new domains where gathering expert annotation is arduous.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Time-series Transformer Spatiotemporal
Scores: [ 6 6 6 6 6 ]
Analyzing multivariate time series is important in many domains. However, it has been difficult to learn robust and generalizable representations within multivariate datasets due to complex inter-channel relationships and dynamic shifts. In this paper, we introduce a novel approach for learning spatiotemporal structure and using it to improve the application of transformers to timeseries datasets. Our framework learns a set of group tokens, and builds an instance-specific group embedding (GE) layer that assigns input tokens to a small number of group tokens to incorporate structure into learning. We then introduce a novel architecture, Group-Aware transFormer (GAFormer), which incorporates both spatial and temporal group embeddings to achieve state-of-the-art performance on a number of time-series classification and regression tasks. In evaluations on a number of diverse timeseries datasets, we show that GE on its own can provide a nice enhancement to a number of backbones, and that by coupling spatial and temporal group embeddings, the GAFormer can outperform the existing baselines. Finally, we show how our approach discerns latent structures in data even without information about the spatial ordering of channels, and yields a more interpretable decomposition of spatial and temporal structure underlying complex multivariate datasets.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: vision-language open-world region-text object recognition
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Anomaly detection Zero-shot anomaly detection CLIP Industrial defect inspection
Scores: [ 8 5 6 5 8 5 ]
Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliarydata to detect anomalies without any training sample in a target dataset. Itis a crucial task when training data is not accessible due to various concerns, e.g.,data privacy, yet it is challenging since the models need to generalize to anomaliesacross different domains where the appearance of foreground objects, abnormalregions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently large pre-trained vision-languagemodels (VLMs), such as CLIP, have demonstrated strong zero-shot recognitionability in various vision tasks, including anomaly detection. However, their ZSADperformance is weak since the VLMs focus more on modeling the class semanticsof the foreground objects rather than the abnormality/normality in the images. Inthis paper we introduce a novel approach, namely AnomalyCLIP, to adapt CLIPfor accurate ZSAD across different domains. The key insight of AnomalyCLIPis to learn object-agnostic text prompts that capture generic normality and abnormalityin an image regardless of its foreground objects. This allows our model tofocus on the abnormal image regions rather than the object semantics, enablinggeneralized normality and abnormality recognition on diverse types of objects.Large-scale experiments on 17 real-world anomaly detection datasets show thatAnomalyCLIP achieves superior zero-shot performance of detecting and segmentinganomalies in datasets of highly diverse class semantics from various defectinspection and medical imaging domains.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Machine-generated text Detection Trustworthy AI
Scores: [ 6 6 8 6 ]
We find that large language models (LLMs) are more likely to modify human-written text than AI-generated text when tasked with rewriting. This tendency arises because LLMs often perceive AI-generated text as high-quality, leading to fewer modifications. We introduce a method to detect AI-generated content by prompting LLMs to rewrite text and calculating the editing distance of the output. Our approach significantly improves the F1 detection scores of existing AI content detection models -- both academic and commercial -- across various domains, including News, creative writing, student essays, code, Yelp reviews, and arXiv papers, with gains of up to 29 points. Operating solely on word symbols without high-dimensional features, our method is compatible with black box LLMs, and is inherently robust on new content. Our results illustrate the unique imprint of machine-generated text through the lens of the machines themselves.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: video activity localization boundary denoising
Scores: [ 3 8 6 6 ]
Video activity localization aims at understanding the semantic content in long, untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenosieLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then, we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence speed. Experiments show that DenosieLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset.Moreover, DenosieLoc achieves state-of-the-art performance on the MAD dataset but with much fewer predictions than others.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Visual prompt tuning
Scores: [ 8 8 6 8 ]
As the scale of vision models continues to grow, the emergence of Visual Prompt Tuning (VPT) as a parameter-efficient transfer learning technique has gained attention due to its superior performance compared to traditional full-finetuning. However, the conditions favoring VPT (the "when") and the underlying rationale (the "why") remain unclear. In this paper, we conduct a comprehensive analysis across 19 distinct datasets and tasks. To understand the "when" aspect, we identify the scenarios where VPT proves favorable by two dimensions: task objectives and data distributions. We find that VPT is preferrable when there is 1) a substantial disparity between the original and the downstream task objectives (e.g., transitioning from classification to counting), or 2) a notable similarity in data distributions between the two tasks (e.g., both involve natural images). In exploring the "why" dimension, our results indicate VPT's success cannot be attributed solely to overfitting and optimization considerations. The unique way VPT preserves original features and adds parameters appears to be a pivotal factor. Our study provides insights into VPT's mechanisms, and offers guidance for its optimal utilization.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Markov Processes Information Theory Information Bottleneck Latent Simulation
Scores: [ 8 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision-Language Foundation Model Video-Language Pretraining
Scores: [ 8 6 5 6 ]
Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA can jointly model visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of semantic vision-language downstream tasks, including paragraph-to-video retrieval, text-to-video/image retrieval, video/image captioning and video QA. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Models and codes will be released.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Masked Language Modeling
Scores: [ 3 10 8 6 ]
Masked Language Modeling (MLM) has been one of the most prominent approaches for pretraining bidirectional text encoders due to its simplicity and effectiveness. One notable concern about MLM is that the special [MASK] symbol causes a discrepancy between pretraining data and downstream data as it is present only in pretraining but not in fine-tuning. In this work, we offer a new perspective on the consequence of such a discrepancy: We demonstrate empirically and theoretically that MLM pretraining allocates some model dimensions exclusively for representing [MASK] tokens, resulting in a representation deficiency for real tokens and limiting the pretrained model's expressiveness when it is adapted to downstream data without [MASK] tokens. Motivated by the identified issue, we propose MAE-LM, which pretrains the Masked Autoencoder architecture with MLM where [MASK] tokens are excluded from the encoder. Empirically, we show that MAE-LM improves the utilization of model dimensions for real token representations, and MAE-LM consistently outperforms MLM-pretrained models on the GLUE and SQuAD benchmarks.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: vision-language models retrieval external memory knowledge-based approaches
Scores: [ 3 8 6 ]
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Self-Supervised Learning Speech Quality Estimation Speech Enhancement
Scores: [ 8 6 8 6 ]
Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variational autoencoder (VQ-VAE). The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted. To further improve correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement models require only clean speech for training without any label requirements. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines. The code will be released after publication.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: procedure planning; instructional video; state changes; large language models; vision and language alignment
Scores: [ 6 8 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural Radiance Fields Hypernetwork Neural Rendering Generalizability
Scores: [ 10 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Object-centric representations Gaussian Mixture Model Slot Attention Set Prediction Task
Scores: [ 5 8 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Retrieval Augmented Language Models Large Language Models Robustness Question Answering
Scores: [ 6 6 8 6 ]
Retrieval-augmented language models (RALMs) hold promise to produce language understanding systems that are are factual, efficient, and up-to-date. An important desideratum of RALMs, is that retrieved information helps model performance when it is relevant, and does not harm performance when it is not. This is particularly important in multi-hop reasoning scenarios, where misuse of irrelevant evidence can lead to cascading errors. However, recent work has shown that retrieval augmentation can sometimes have a negative effect on performance. In this work, we present a thorough analysis on five open-domain question answering benchmarks, characterizing cases when retrieval reduces accuracy. We then propose two methods to mitigate this issue. First, a simple baseline that filters out retrieved passages that do not entail question-answer pairs according to a natural language inference (NLI) model. This is effective in preventing performance reduction, but at a cost of also discarding relevant passages. Thus, we propose a method for automatically generating data to fine-tune the language model to properly leverage retrieved passages, using a mix of relevant and irrelevant contexts at training time. We empirically show that even 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: multimodal learning information bottleneck sentiment analysis
Scores: [ 6 6 5 6 8 ]
Integrating and processing information from various sources or modalities are critical for obtaining a comprehensive and accurate perception of the real world. Drawing inspiration from neuroscience, we develop the Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the concept of information bottleneck. Different from most traditional fusion models that incorporate all modalities identically in neural networks, our model designates a prime modality and regards the remaining modalities as detectors in the information pathway, serving to distill the flow of information. Our proposed perception model focuses on constructing an effective and compact information flow by achieving a balance between the minimization of mutual information between the latent state and the input modal state, and the maximization of mutual information between the latent states and the remaining modal states. This approach leads to compact latent state representations that retain relevant information while minimizing redundancy, thereby substantially enhancing the performance of multimodal representation learning. Experimental evaluations on the MUStARD, CMU-MOSI, and CMU-MOSEI datasets demonstrate that our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks. Remarkably, on the CMU-MOSI dataset, ITHP-DeBERTa surpasses human-level performance in the multimodal sentiment binary classification task across all evaluation metrics (i.e., Binary Accuracy, F1 Score, Mean Absolute Error, and Pearson Correlation).
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Multi-modal learning;
Scores: [ 8 3 8 ]
We investigate composed image retrieval with text feedback. Users gradually look for the target of interest by moving from coarse to fine-grained feedback. However, existing methods merely focus on the latter, i.e., fine-grained search, by harnessing positive and negative pairs during training. This pair-based paradigm only considers the one-to-one distance between a pair of specific points, which is not aligned with the one-to-many coarse-grained retrieval process and compromises the recall rate. In an attempt to fill this gap, we introduce a unified learning approach to simultaneously modeling the coarse- and fine-grained retrieval by considering the multi-grained uncertainty. The key idea underpinning the proposed method is to integrate fine- and coarse-grained retrieval as matching data points with small and large fluctuations, respectively.Specifically, our method contains two modules: uncertainty modeling and uncertainty regularization. (1) The uncertainty modeling simulates the multi-grained queries by introducing identically distributed fluctuations in the feature space. (2) Based on the uncertainty modeling, we further introduce uncertainty regularization to adapt the matching objective according to the fluctuation range.Compared with existing methods, the proposed strategy explicitly prevents the model from pushing away potential candidates in the early stage and thus improves the recall rate. On the three public datasets, i.e., FashionIQ, Fashion200k, and Shoes, the proposed method has achieved +4.03%, + 3.38%, and + 2.40% Recall@50 accuracy over a strong baseline, respectively.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Forecasting; Time Series; Large Language Model
Scores: [ 5 6 8 ]
The past decade has witnessed significant advances in time series modeling with deep learning. While achieving state-of-the-art results, the best-performing architectures vary highly across applications and domains. On the other hand, for natural language processing, Generative Pre-trained Transformer (GPT) has demonstrated impressive performance via training one general-purpose model across various textual datasets. It is intriguing to explore whether GPT-type architectures can be effective for time series, capturing the intrinsic dynamic attributes and leading to significant accuracy improvements. In this paper, we propose a novel framework, TEMPO, that can effectively learn time series representations. We focus on utilizing two essential inductive biases of the time series task for pre-trained models: (i) decomposition of the complex interaction between trend, seasonal, and residual components; and (ii) introducing the selection-based prompts to facilitate distribution adaptation in non-stationary time series. TEMPO expands the capability for dynamically modeling real-world temporal phenomena from data within diverse domains. Our experiments demonstrate the superior performance of TEMPO, with 20%-60% improvement over state-of-the-art methods on a number of time series benchmark datasets. This performance gain is observed not only in standard supervised learning settings but also in scenarios involving previously unseen datasets. This compelling finding highlights TEMPO’s potential to constitute a foundational model building framework.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Robustness Multi-modal learning Modality preference
Scores: [ 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D-aware editing Human instruction Conditional latent 3D diffusion
Scores: [ 6 8 5 ]
With the success of Neural Radiance Field (NeRF) in 3D-aware portrait editing, a variety of works have achieved promising results regarding both quality and 3D consistency. However, these methods heavily rely on per-prompt optimization when handling natural language as editing instructions. Due to the lack of labeled human face 3D datasets and effective architectures, the area of human-instructed 3D-aware editing for open-world portraits in an end-to-end manner remains under-explored. To solve this problem, we propose an end-to-end diffusion-based framework termed InstructPix2NeRF, which enables instructed 3D-aware portrait editing from a single open-world image with human instructions. At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data. With the help of our proposed token position randomization strategy, we could even achieve multi-semantic editing through one single pass with the portrait identity well-preserved. Besides, we further propose an identity consistency module that directly modulates the extracted identity signals into our diffusion process, which increases the multi-view 3D identity consistency. Extensive experiments verify the effectiveness of our method and show its superiority against strong baselines quantitatively and qualitatively.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Data-Type Understanding Vision-Language Models Scaling
Scores: [ 8 8 8 ]
Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domains pecific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire anunderstanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. We will makeour code available online upon publication.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Scene Graph Generation Scene Understanding Imbalanced Classification Self-training Long-tailed Problem
Scores: [ 5 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Novel-view synthesis for dynamic scenes 4D Gaussian point-based rendering
Scores: [ 6 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large language models black-box language models modular and collaborative knowledge
Scores: [ 8 8 8 ]
By design, large language models (LLMs) are static general-purpose models, expensive to retrain or update frequently. As they are increasingly adopted for knowledge-intensive tasks, it becomes evident that these design choices lead to failures to generate factual, relevant, and up-to-date knowledge. To this end, we propose Knowledge Card, a modular framework to plug in new factual and relevant knowledge into general-purpose LLMs. We first introduce knowledge cards---specialized language models trained on corpora from specific domains and sources. Knowledge cards serve as parametric repositories that are selected at inference time to generate background knowledge for the base LLM. We then propose three content selectors to dynamically select and retain information in documents generated by knowledge cards, specifically controlling for relevance, brevity, and factuality of outputs. Finally, we propose two complementary integration approaches to augment the base LLM with the (relevant, factual) knowledge curated from the specialized LMs. Through extensive experiments, we demonstrate that Knowledge Card achieves state-of-the-art performance on six benchmark datasets. Ultimately, Knowledge Card framework enables dynamic synthesis and updates of knowledge from diverse domains. Its modularity will ensure that relevant knowledge can be continuously updated through the collective efforts of the research community.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: implicit neural representations neural data compression relative entropy coding combiner
Scores: [ 6 6 8 ]
COMpression with Bayesian Implicit NEural Representations (COMBINER) is a recent data compression method that addresses a key inefficiency of previous Implicit Neural Representation (INR)-based approaches: it avoids quantization and enables direct optimization of the rate-distortion performance. However, COMBINER still has significant limitations: 1) it uses factorized priors and posterior approximations that lack flexibility; 2) it cannot effectively adapt to local deviations from global patterns in the data; and 3) its performance can be susceptible to modeling choices and the variational parameters' initializations. Our proposed method, Robust and Enhanced COMBINER (RECOMBINER), addresses these issues by 1) enriching the variational approximation while maintaining its computational cost via a linear reparameterization of the INR weights, 2) augmenting our INRs with learnable positional encodings that enable them to adapt to local details and 3) splitting high-resolution data into patches to increase robustness and utilizing expressive hierarchical priors to capture dependency across patches. We conduct extensive experiments across several data modalities, showcasing that RECOMBINER achieves competitive results with the best INR-based methods and even outperforms autoencoder-based codecs on low-resolution images at low bitrates.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Data Cleaning CLIP Image-caption Hard examples representation learning Web Scale LAION
Scores: [ 8 6 6 6 ]
Large web-crawled multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features---by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image with original captions. Experimentally, T-MARS is the top ranked approach on Imagenet at medium scale'' of DataComp (a data filtering benchmark), and outperforms CLIP filtering by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, we show that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: time series forecasting large language models model reprogramming
Scores: [ 8 8 8 8 3 ]
Time series forecasting holds significant importance in many real-world dynamic systems and has been extensively studied. Unlike natural language process (NLP) and computer vision (CV), where a single large model can tackle multiple tasks, models for time series forecasting are often specialized, necessitating distinct designs for different tasks and applications. While pre-trained foundation models have made impressive strides in NLP and CV, their development in time series domains has been constrained by data sparsity. Recent studies have revealed that large language models (LLMs) possess robust pattern recognition and reasoning abilities over complex sequences of tokens. However, the challenge remains in effectively aligning the modalities of time series data and natural language to leverage these capabilities. In this work, we present Time-LLM, a reprogramming framework to repurpose LLMs for general time series forecasting with the backbone language models kept intact. We begin by reprogramming the input time series with text prototypes before feeding it into the frozen LLM to align the two modalities. To augment the LLM's ability to reason with time series data, we propose Prompt-as-Prefix (PaP), which enriches the input context and directs the transformation of reprogrammed input patches. The transformed time series patches from the LLM are finally projected to obtain the forecasts. Our comprehensive evaluations demonstrate that \method is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models. Moreover, Time-LLM excels in both few-shot and zero-shot learning scenarios.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Object pose estimation; Unseen objects; 3D computer vision
Scores: [ 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: metric learning image retrieval parameter-efficient fine tuning
Scores: [ 6 8 5 ]
Deep Metric Learning (DML) has long attracted the attention of the machine learning community as a key objective. Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets. As a result of the success of recent pre-trained models derived from larger-scale datasets, it is challenging to adapt the model to the DML tasks in the local data domain while retaining the previously gained knowledge. In this paper, we investigate parameter-efficient methods for fine-tuning the pre-trained model for DML tasks. In particular, we propose a novel and effective framework based on learning Visual Prompts (VPT) in the pre-trained Vision Transformers (ViT). Based on the conventional proxy-based DML paradigm, we augment the proxy by incorporating the semantic information from the input image and the ViT, in which we optimize the visual prompts for each class. We demonstrate that our new approximations with semantic information are superior to representative capabilities, thereby improving metric learning performance. We conduct extensive experiments to demonstrate that our proposed framework is superior and efficient by evaluating popular DML benchmarks. In particular, we demonstrate that our fine-tuning method achieves comparable or even better performance than recent state-of-the-art full fine-tuning works of DML while tuning only a small percentage of total parameters.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Visual Place Recognition Parameter-efficient Transfer Learning Global Adaptation Local Adaptation
Scores: [ 8 8 8 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Aliasing; Hard Pixel; Semantic Segmentation
Scores: [ 5 5 8 ]
Despite recent advancements in semantic segmentation, where and what pixels are hard to segment remains largely unexplored.Existing research only separates an image into easy and hard regions and empirically observes the latter are associated with object boundaries.In this paper, we conduct a comprehensive analysis of hard pixel errors, categorizing them into three types: false responses, merging mistakes, and displacements. Our findings reveal a quantitative association between hard pixels and aliasing, which is distortion caused by the overlapping of frequency components in the Fourier domain during downsampling.To identify the frequencies responsible for aliasing, we propose using the equivalent sampling rate to calculate the Nyquist frequency, which marks the threshold for aliasing. Then, we introduce the aliasing score as a metric to quantify the extent of aliasing.While positively correlated with the proposed aliasing score, three types of hard pixels exhibit different patterns.Here, we propose two novel de-aliasing filter (DAF) and frequency mixing (FreqMix) modules to alleviate aliasing degradation by accurately removing or adjusting frequencies higher than the Nyquist frequency.The DAF precisely removes the frequencies responsible for aliasing before downsampling, while the FreqMix dynamically selects high-frequency components within the encoder block.Experimental results demonstrate consistent improvements in semantic segmentation and low-light instance segmentation tasks.The code will be released upon publication.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Table Question Answering Large Language Models Noise Reduction Unsupervised Relevance Scoring Table Parsing Relevant Cell Highlighting
Scores: [ 8 8 8 8 ]
Table understanding capability of Large Language Models (LLMs) has been extensively studied through the task of question-answering (QA) over tables. Typically, only a small part of the whole table is relevant to derive the answer for a given question. The irrelevant parts act as noise and are distracting information, resulting in sub-optimal performance due to the vulnerability of LLMs to noise. To mitigate this, we propose CABINET (Content RelevAnce-Based NoIse ReductioN for TablE QuesTion-Answering) – a framework to enable LLMs to focus on relevant tabular data by suppressing extraneous information. CABINET comprises an Unsupervised Relevance Scorer (URS), trained differentially with the QA LLM, that weighs the table content based on its relevance to the input question before feeding it to the question answering LLM (QA LLM). To further aid the relevance scorer, CABINET employs a weakly supervised module that generates a parsing statement describing the criteria of rows and columns relevant to the question and highlights the content of corresponding table cells. CABINET significantly outperforms various tabular LLM baselines, as well as GPT3-based in-context learning methods, is more robust to noise, maintains outperformance on tables of varying sizes, and establishes new SoTA performance on WikiTQ, FeTaQA, and WikiSQL datasets. We release our code and datasets here.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: NeRF Pretraining Mask-based Modeling
Scores: [ 8 6 8 8 6 8 6 ]
Most Neural Radiance Fields (NeRFs) exhibit limited generalization capabilities,which restrict their applicability in representing multiple scenes using a single model. To address this problem, existing generalizable NeRF methods simply condition the model on image features. These methods still struggle to learn precise global representations over diverse scenes since they lack an effective mechanism for interacting among different points and views. In this work, we unveil that3D implicit representation learning can be significantly improved by mask-based modeling. Specifically, we propose masked ray and view modeling for generalizable NeRF (MRVM-NeRF), which is a self-supervised pretraining target to predict complete scene representations from partially masked features along each ray. With this pretraining target, MRVM-NeRF enables better use of correlations across different rays and views as the geometry priors, which thereby strengthens the capability of capturing intricate details within the scenes and boosts the generalization capability across different scenes. Extensive experiments demonstrate the effectiveness of our proposed MRVM-NeRF on both synthetic and real-world datasets, qualitatively and quantitatively. Besides, we also conduct experiments to show the compatibility of our proposed method with various backbones and itssuperiority under few-shot cases.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: text-to-speech large-scale corpus non-autoregressive diffusion
Scores: [ 8 8 8 8 ]
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issue, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://naturalspeech2.github.io/.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Constituency Parsing Unsupervised Grammar Induction Knowledge Distillation
Scores: [ 6 8 8 3 8 ]
We investigate the unsupervised constituency parsing task, which organizes words and phrases of a sentence into a hierarchical structure without using linguistically annotated data. We observe that existing unsupervised parsers capture differing aspects of parsing structures, which can be leveraged to enhance unsupervised parsing performance.To this end, we propose a notion of tree averaging,'' based on which we further propose a novel ensemble method for unsupervised parsing.To improve inference efficiency, we further distill the ensemble knowledge into a student model; such an ensemble-then-distill process is an effective approach to mitigate the over-smoothing problem existing in common multi-teacher distilling methods.Experiments show that our method surpasses all previous approaches, consistently demonstrating its effectiveness and robustness across various runs, with different ensemble components, and under domain-shift conditions.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Video Recognition Data Augmentation
Scores: [ 3 8 6 6 ]
Current training pipelines in object recognition neglect Hue Jittering when doing data augmentation as it not only brings appearance changes that are detrimental to classification, but also the implementation is inefficient in practice. In this study, we investigate the effect of hue variance in the context of video recognition and find this variance to be beneficial since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video recognition, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation SwapMix to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, enforcing the model to learn appearance invariant representations. Comprehensive empirical validations across various architectures and different datasets solidly demonstrate the effectiveness and generalization ability of MCA (e.g., 1.95% average performance gain at different frames on Something-Something V1 dataset over the competing method Uniformer).
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: dynamic novel view synthesis generalized novel view synthesis
Scores: [ 8 8 3 8 ]
Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to our best knowledge, there is currently no generalized method for dynamic novel view synthesis from a given monocular video. To answer whether generalized dynamic novel view synthesis from monocular videos is possible today, we establish an analysis framework based on existing techniques and work toward the generalized approach. We find a pseudo-generalized process without scene-specific \emph{appearance} optimization is possible, but geometrically and temporally consistent depth estimates are needed. Despite no scene-specific appearance optimization, the pseudo-generalized approach improves upon some scene-specific methods.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Automatic speech recognition large language model generative error correction.
Scores: [ 10 6 6 6 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: LanguageBind Multi-modal Pretraining Multi-modal Dataset
Scores: [ 6 6 6 8 ]
The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N≥3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 1.2% R@1 on the MSR-VTT dataset with only 15% of the parameters in the zero-shot video-text retrieval, validating the high quality of our dataset. Beyond this, our LanguageBind has achieved great improvement in the zero-shot video, audio, depth, and infrared understanding tasks. For instance, on the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind-huge with 23.8% and 11.1% top-1 accuracy.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D single object tracking Category-unified model Point clouds
Scores: [ 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Image representation image embedding image based search
Scores: [ 6 5 8 6 ]
We present MOFI, a new vision foundation model designed to learn image representations from noisy entity annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image. The approach is simple, does not require costly human annotation, and can be readily scaled up to billions of image-text pairs mined from the web. Through this method, we have created Image-to-Entities (I2E), a new large-scale dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building upon the I2E dataset, we study different training recipes, including supervised pre-training, contrastive pre-training, and multi-task learning. For constrastive pre-training, we treat entity names as free-form text, and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and multi-task training further improves the performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: NeRF Dynamic Motion Part discovery
Scores: [ 8 8 8 8 ]
We present MovingParts, a NeRF-based method for dynamic scene reconstruction and part discovery. We consider motion as an important cue for identifying parts, that all particles on the same part share the common motion pattern. From the perspective of fluid simulation, existing deformation-based methods for dynamic NeRF can be seen as parameterizing the scene motion under the Eulerian view, i.e., focusing on specific locations in space through which the fluid flows as time passes. However, it is intractable to extract the motion of constituting objects or parts using the Eulerian view representation. In this work, we introduce the dual Lagrangian view and enforce representations under the Eulerian/Lagrangian views to be cycle-consistent. Under the Lagrangian view, we parameterize the scene motion by tracking the trajectory of particles on objects. The Lagrangian view makes it convenient to discover parts by factorizing the scene motion as a composition of part-level rigid motions. Experimentally, our method can achieve fast and high-quality dynamic scene reconstruction from even a single moving camera, and the induced part-based representation allows direct applications of part tracking, animation, 3D scene editing, etc.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Point Cloud PointNet Function Mapping Encoder Noise Robustness
Scores: [ 8 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Label-efficient LiDAR-based 3D object detection
Scores: [ 6 8 6 ]
Label-efficient LiDAR-based 3D object detection is currently dominated by weak/semi-supervised methods. Instead of exclusively following one of them, we propose MixSup, a more practical paradigm simultaneously utilizing massive cheap coarse labels and a limited number of accurate labels for Mixed-grained Supervision. We start by observing that point clouds are usually textureless, making it hard to learn semantics. However, point clouds are geometrically rich and scale-invariant to the distances from sensors, making it relatively easy to learn the geometry of objects, such as poses and shapes. Thus, MixSup leverages massive coarse cluster-level labels to learn semantics and a few expensive box-level labels to learn accurate poses and shapes. We redesign the label assignment in mainstream detectors, which allows them seamlessly integrated into MixSup, enabling practicality and universality. We validate its effectiveness in nuScenes, Waymo Open Dataset, and KITTI, employing various detectors. MixSup achieves up to 97.31% of fully supervised performance, using cheap cluster annotations and only 10% box annotations. Furthermore, we utilize the emerging Segment Anything Model (SAM) to automatically generate massive coarse labels, further reducing the annotation burden. The code will be made publicly available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision-Language Model Finetuning Prompt Learning OOD Generalization
Scores: [ 5 8 6 5 ]
Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, after long enough finetuning but without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. Then we propose a novel approach OGEN to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. Such synthesized features will provide useful knowledge about unknowns and help regularize the decision boundary between ID and OOD data when optimized jointly. Equally important is our adaptive self-distillation mechanism to regularize our feature generation model during joint optimization, i.e., adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: vision-language model compositionality
Scores: [ 8 6 5 ]
A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose Compositional VLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions-of-interests (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on ARO top-1 accuracy). We also achieve state-of-the-art performances on traditional vision-language tasks such as referring expression comprehension and visual question answering.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D reconstruction Neural implicit surface learning
Scores: [ 6 6 8 6 ]
Advanced techniques using Neural Radiance Fields (NeRF), Signed Distance Fields (SDF), and Occupancy Fields have recently emerged as solutions for 3D indoor scene reconstruction. We introduce a novel two-phase learning approach, H2O-SDF, that discriminates between object and non-object regions within indoor environments. This method achieves a nuanced balance, carefully preserving the geometric integrity of room layouts while also capturing intricate surface details of specific objects. A cornerstone of our two-phase learning framework is the introduction of the Object Surface Field (OSF), a novel concept designed to mitigate the persistent vanishing gradient problem that has previously hindered the capture of high-frequency details in other methods. Our proposed approach is validated through several experiments that include ablation studies.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: humanoid control motion generation physics simulation
Scores: [ 8 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: representation learning NLP language modeling pretraining contrastive
Scores: [ 8 6 6 6 ]
Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Constrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Shapelet Covolutional Neural Network Time-series
Scores: [ 5 5 8 6 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D Reconstruction Generalizable Neural Fields
Scores: [ 6 5 5 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Text-to-Speech Voice Generation Prompt Variation Network
Scores: [ 6 6 6 6 ]
Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text prompt face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech. In this work, we introduce VoiceGen to address these challenges with a variation network to provide variability information of voice not captured by text prompts, and a prompt generation pipeline to utilize the large language models (LLM) to compose high quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice variability) based on the text prompt representation. For the prompt generation pipeline, it generates text prompts for speech with a speech language understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate text prompts based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that compared to the previous works, VoiceGen generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation. Additionally, the prompt generation pipeline produces high-quality text prompts, eliminating the large labeling cost. The demo page of VoiceGen is available online (https://voicegen.github.io/voicegen).
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Models Length Extrapolation Efficiency
Scores: [ 8 6 8 8 ]
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges.Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory.Secondly, popular LLMs cannot generalize to longer texts than the training sequence length.Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size.We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a sink'' even if they are not semantically important.Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning.We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2$\times$ speedup.Code and datasets are provided in the anonymous link.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D Object Detection Detection Transformer Relative Position Encoding
Scores: [ 6 5 8 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Model; Mathematical Reasoning
Scores: [ 8 8 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Prompting Large Language Models Inverse Reinforcement Learning RLHF
Scores: [ 6 6 6 6 ]
In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method to evaluate prompts during inference when the golden answer is unavailable. Concurrently, learning via interactions with the LLMs to navigate the expansive natural language prompting space proves to be resource-intensive.To address this, we introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data. Such data exists as by-products when diverse prompts are benchmarked on open-accessible datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first learning an offline reward model. This model can evaluate any query-prompt pairs without accessing LLMs. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision transformers High resolution Dense tasks Optical flow
Scores: [ 5 6 6 6 ]
Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. This latter solution does not constrain architectural choices, but it leads to a clear performance drop when testing at resolutions significantly higher than that used for training, thus requiring ad-hoc and slow post-processing schemes. In this paper, we propose a novel strategy for efficient training and inference of high-resolution vision transformers: the key principle is to mask out most of the high-resolution inputs during training, keeping only N random windows. This allows the model to learn local interactions between tokens inside each window, and global interactions between tokens from different windows. As a result, the model can directly process the high-resolution input at test time without any special trick. We show that this strategy is effective when using relative positional embedding such as rotary embeddings. It is 4 times faster to train than a full-resolution network, and it is straightforward to use at test time compared to existing approaches. We apply this strategy to the dense monocular task of semantic segmentation, and find that a simple setting with 2 windows performs best, hence the name of our method: Win-Win. To demonstrate the generality of our contribution, we further extend it to the binocular task of optical flow, reaching state-of-the-art performance on the Spring benchmark that contains Full-HD images with an inference time an orderof magnitude faster than the best competitor.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: neural scene representations scene representations representation learning novel view synthesis
Scores: [ 6 6 8 ]
Visual understanding of our world goes beyond the semantics and flat structure of individual images.In this paper, we work towards capturing both the 3D structure as well as the dynamics of real-world scenes from monocular real-world videos.Our model, the Dynamic Scene Transformer (DyST), builds upon recent work in neural scene representation and learns a latent decomposition into scene content as well as per-view scene dynamics and camera pose. This separation is achieved through a special co-training scheme on monocular videos and our new synthetic dataset DySO.DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large language model reward model RLHF consistency
Scores: [ 6 5 5 6 6 ]
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable subject that is understudied is the (in-)consistency of RMs --- whether they can recognize the semantic changes to different prompts and appropriately adapt their reward assignments--- and their impact on the downstream RLHF model.In this paper, we visit a series of research questions relevant to RM inconsistency:(1) How can we measure the consistency of reward models? (2) How consistent are the existing RMs and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots resulting from the RLHF model training?We propose Contrast Instruction -- a benchmarking strategy for the consistency of RM. Each example in Contrast Instruction features a pair of lexically similar instructions with different ground truth responses. A consistent RM is expected to rank the corresponding instruction and response higher than other combinations. We observe that current RMs trained with the standard ranking objective fail miserably on \contrast{} compared to average humans. To show that RM consistency can be improved efficiently without using extra training budget, we propose two techniques ConvexDA and RewardFusion, which enhance reward consistency through extrapolation during the RM training and inference stage, respectively.We show that RLHF models trained with a more consistent RM yield more useful responses, suggesting that reward inconsistency exhibits a trickle-down effect on the downstream RLHF process.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: transformers pruning re-training latency throughput token reduction one-shot
Scores: [ 6 6 6 ]
We introduce the $\textbf{O}$ne-shot $\textbf{P}$runing $\textbf{T}$echnique for $\textbf{I}$nterchangeable $\textbf{N}etworks(\textbf{OPTIN}$) framework as a tool to increase the efficiency of pre-trained transformer architectures without requiring re-training. Recent works have explored improving transformer efficiency, however often incur computationally expensive re-training procedures or depend on architecture-specific characteristics, thus impeding practical wide-scale adoption. To address these shortcomings, the OPTIN framework leverages intermediate feature distillation, capturing the long-range dependencies of model parameters (coined trajectory), to produce state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks without re-training. Given a FLOP constraint, the OPTIN framework will compress the network while maintaining competitive accuracy performance and improved throughput. Particularly, we show a ≤2% accuracy degradation from NLP baselines and a 0.5% improvement from state-of-the-art methods on image classification at competitive FLOPs reductions. We further demonstrate the generalization of tasks and architecture with comparative performance using Mask2Former for semantic segmentation and cnn-style networks. OPTIN presents one of the first one-shot efficient frameworks for compressing transformer architectures that generalizes well across different class domains, in particular: natural language and image-related tasks, without re-training.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Adversarial training robust overfitting forgetting reinitialization robust accuracy generalization
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Dialogue Policy Planning Proactive Dialogue Large Language Model
Scores: [ 6 5 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Multimodal learning object detection
Scores: [ 5 5 6 8 ]
This paper introduces Instruction-oriented Object Detection (IOD), a new task that enhances human-computer interaction by enabling object detectors to understand user instructions and locate relevant objects. Unlike traditional open-vocabulary object detection tasks that rely on users providing a list of required category names, IOD requires models to comprehend natural-language instructions, contextual reasoning, and output the name and location of the desired categories. This poses fresh challenges for modern object detection systems. To develop an IOD system, we create a dataset called IOD-Bench, which consists of instruction-guided detections, along with specialized evaluation metrics. We leverage large-scale language models (LLMs) to generate a diverse set of instructions (8k+) based on existing public object detection datasets, covering a wide range of real-world scenarios. As an initial approach to the IOD task, we propose a model called Ins-DetCLIP. It harnesses the extensive knowledge within LLMs to empower the detector with instruction-following capabilities. Specifically, our Ins-DetCLIP employs a visual encoder (i.e., DetCLIP, an open-vocabulary detector) to extract object-level features. These features are then aligned with the input instructions using a cross-modal fusion module integrated into a pre-trained LLM. Experimental results conducted on IOD-Bench demonstrate that our model consistently outperforms baseline methods that directly combine LLMs with detection models. This research aims to pave the way for a more adaptable and versatile interaction paradigm in modern object detection systems, making a significant contribution to the field.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: sign language translation sign recognition large language models
Scores: [ 6 6 6 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: point cloud point cloud self-supervised learning point cloud pre-training
Scores: [ 6 8 6 6 ]
Masked autoencoders (MAE) have recently been introduced to 3D self-supervised pretraining for point clouds due to their great success in NLP and computer vision. Unlike MAEs used in the image domain, where the pretext task is to restore features at the masked pixels, such as colors, the existing 3D MAE works reconstruct the missing geometry only, i.e, the location of the masked points. In contrast to previous studies, we advocate that point location recovery is inessential and restoring intrinsic point features is much superior. To this end, we propose to ignore point position reconstruction and recover high-order features at masked points including surface normals and surface variations, through a novel attention-based decoder which is independent of the encoder design. We validate the effectiveness of our pretext task and decoder design using different encoder structures for 3D training and demonstrate the advantages of our pretrained networks on various point cloud analysis tasks.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: nlp language models representation learning in-context learning
Scores: [ 8 3 8 ]
Convolution-based language models are asymptotically more efficient than Transformers and recent work shows they are competitive in quality. To better understand the relative language modeling quality of these architectures, we pre-train a suite of 14 language models across attention and convolution-based architectures, finding that the SoTA gated convolution architectures still underperform Transformers by up to 2.1 perplexity points on the Pile. Our analysis shows that a single language modeling capability, termed associative recall (AR) — output the next token using the prior context, e.g. Hakuna Matata means no worries Hakuna Matata it means no → ?? — accounts for 76% of the perplexity gap on average. We show the issue arises because the convolution-based models process sequences using fixed filters that do not depend on the input data, making it difficult to handle a variable number of input-specific recall distances (e.g. 4 tokens between instances of Hakuna vs. 5 between worries above). Theoretically, our core contributions are precise bounds for solving AR, applying to the entire class of gated convolution models, that show dimensionality scaling in sequence length. Meanwhile, attention enables tokens separated by any distance to interact and solves AR with model dimension independent of sequence length. We present (1) a concise synthetic AR task, on which we validate the theoretically predicted scaling holds, and (2) a series of architectural modifications, theoretically and empirically showing that they enable solving AR with improved scaling. Our analysis motivates a set of strong baseline models that outperform Transformers at 150M and 355M parameters. We release all checkpoints and code for future analysis.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Models
Scores: [ 6 8 8 8 ]
Language models are currently trained to predict tokens given document prefixes, enabling them to zero shot long form generation and prompting-style tasks which can be reduced to document completion. We instead present IN-CONTEXT PRETRAINING, a new approach where language models are trained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. Our approach builds on the fact that current pipelines train by concatenating random sets of shorter documents to create longer context windows; this improves efficiency even though the prior documents provide no signal for predicting the next document. Given this fact, we can do IN-CONTEXT PRETRAINING by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent batches with a graph cover algorithm. Our experiments show IN-CONTEXT PRETRAINING offers a scalable and simple approach to significantly enhance LM performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large language models alignment fine-grained SFT
Scores: [ 6 8 6 6 ]
Alignment with human preference is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite the effectiveness of RLHF, it is intricate to implement and train, thus recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially does imitation learning, which can't fully understand what are the expected behaviors. To address this issue, we propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained (i.e., token or phrase level) quality signals that are derived by contrasting good and bad responses. Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. Secondly, we devise a new loss function can leverage fine-grained quailty signals to instruct the learning of LLMs for alignment. Extensive experiments have demonstrated the effectiveness of our approaches by comparing a number of competitive baselines.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Model Merging Mode Connectivity Classification Deep Learning
Scores: [ 6 6 5 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: visual question answering zero-shot large vision language models visual reasoning underspecification grounding language to vision
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Multimodal Models Multilingual Transfer
Scores: [ 3 8 6 5 ]
Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., lack of large-scale, high-quality image-text data). In this work, we propose \trainname, an effective training paradigm for training large multimodal models in low-resource languages. \trainname demonstrates that \textbf{M}ultilingual language models can \textbf{P}ivot zero-shot \textbf{M}ultimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can well generalize to other languages in a zero-shot manner, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice of \trainname, we build large multimodal models \modelname in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source codes and model weights at https://anonymous.4open.science/r/VisCPM-8E13.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Collaborative Perception Sensor and Model Heterogeneity
Scores: [ 8 5 6 8 ]
Collaborative perception aims to mitigate the limitations of single-agent perception, such as occlusions, by facilitating data exchange among multiple agents. However, most current works consider a homogeneous scenario where all agents use identity sensors and perception models. In reality, heterogeneous agent types may continually emerge and inevitably face a domain gap when collaborating with existing agents. In this paper, we introduce a new open heterogeneous problem: how to accommodate continually emerging new heterogeneous agent types into collaborative perception, while ensuring high perception performance and low integration cost? To address this problem, we propose HEterogeneous ALliance (HEAL), a novel extensible collaborative perception framework. HEAL first establishes a unified feature space with initial agents via a novel multi-scale foreground-aware Pyramid Fusion network. When heterogeneous new agents emerge with previously unseen modalities or models, we align them to the established unified space with an innovative backward alignment. This step only involves individual training on the new agent type, thus presenting extremely low training costs and high extensibility. It also significantly protects new agents' model details from disclosure since the training can be conducted by the agent owner locally. To promote the research on heterogeneous collaborative perception, we bring OPV2V-H, a new large-scale dataset with more diverse agent types. Extensive experiments on OPV2V-H and DAIR-V2X datasets show that HEAL surpasses SOTA methods in performance while reducing the training parameters by 91.5% when integrating 3 new agent types. The code and data will be released.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Natural Language Processing Language Models Parameter-efficient Fine-tuning
Scores: [ 6 6 6 6 ]
Prompt tuning (PT), where a small amount of trainable soft (continuous) prompt vectors is affixed to the input of language models (LM), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly impacts training and inference time and memory usage due to the Transformer's quadratic complexity. Particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving over 20% memory and time costs compared to vanilla PT and its variants, without changing trainable parameter sizes. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline in some scenarios. Additionally, we empirically show that DEPT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Dynamic NeRF Optimal transport
Scores: [ 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D reconstruction from videos Articulated Objects
Scores: [ 5 6 8 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Sparse-view 3D modeling NeRF Pose-free Camera Pose
Scores: [ 8 10 5 6 6 ]
Are camera poses necessary for multi-view 3D modeling? Existing approaches predominantly assume access to accurate camera poses. While this assumption might hold for dense views, accurately estimating camera poses for sparse views is often elusive. Our analysis reveals that noisy estimated poses lead to degraded performance for existing sparse-view 3D modeling methods. To address this issue, we present LEAP, a novel pose-free approach, therefore challenging the prevailing notion that camera poses are indispensable. LEAP discards pose-based operations and learns geometric knowledge from data. LEAP is equipped with a neural volume, which is shared across scenes and is parameterized to encode geometry and texture priors. For each incoming scene, we update the neural volume by aggregating 2D image features in a feature-similarity-driven manner. The updated neural volume is decoded into the radiance field, enabling novel view synthesis from any viewpoint. On both object-centric and scene-level datasets, we show that LEAP significantly outperforms prior methods when they employ predicted poses from state-of-the-art pose estimators. Notably, LEAP performs on par with prior approaches that use ground-truth poses while running 400× faster than PixelNeRF. We show LEAP generalizes to novel object categories and scenes, and learns knowledge closely resembles epipolar geometry.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Contrastive Learning Mixed Integer Programming Machine Learning Branching Stratigies Data Augmentation
Scores: [ 6 5 6 6 ]
Recent advancements have introduced machine learning frameworks to enhance the Branch and Bound (B&B) branching policies for solving Mixed Integer Linear Programming (MILP). These methods, primarily relying on imitation learning of Strong Branching, have shown superior performance. However, collecting expert samples for imitation learning, particularly for Strong Branching, is a time-consuming endeavor. To address this challenge, we propose \textbf{C}ontrastive Learning with \textbf{A}ugmented \textbf{M}ILPs for \textbf{Branch}ing (CAMBranch), a framework that generates Augmented MILPs (AMILPs) by applying variable shifting to limited expert data from their original MILPs. This approach enables the acquisition of a considerable number of labeled expert samples. CAMBranch leverages both MILPs and AMILPs for imitation learning and employs contrastive learning to enhance the model's ability to capture MILP features, thereby improving the quality of branching decisions. Experimental results demonstrate that CAMBranch, trained with only 10% of the complete dataset, exhibits superior performance. Ablation studies further validate the effectiveness of our method.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Time Series Contrastive Learning Healthcare Self-Supervised Representation Learning
Scores: [ 8 6 5 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: long-tailed recognition imbalanced learning weight decay regularization neural collapse simplex ETF machine learning learning theory
Scores: [ 6 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Time Series Forecasting Transformer
Scores: [ 8 8 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: multi-modal learning missing modalities missing labels clinical predictive modeling patient representation learning
Scores: [ 6 8 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Sparse Attention Attention Estimation Linear Attention Transformers NLP
Scores: [ 8 6 6 ]
The transformer architecture has made breakthroughs in recent years on tasks which require modeling pairwise relationships between sequential elements, as is the case in natural language understanding. However, transformers struggle with long sequences due to the quadratic complexity of the attention operation, and previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix. Yet, these approaches cannot straightforwardly distill knowledge from a teacher's attention matrix, and often require complete retraining from scratch. Furthermore, previous sparse and linear approaches may also lose interpretability if they do not produce full quadratic attention matrices. To address these challenges, we propose SEA: Sparse linear attention with an Estimated Attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then creates a sparse approximation to the full attention matrix with a top-k selection to perform a sparse attention operation. For language modeling tasks (Wikitext2), previous linear and sparse attention methods show a roughly two-fold worse perplexity scores over the quadratic OPT-125M baseline, while SEA achieves an even better perplexity than OPT-125M, using roughly half as much memory as OPT-125M. Moreover, SEA maintains an interpretable attention matrix and can utilize knowledge distillation to lower the complexity of existing pretrained transformers. We believe that our work will have a large practical impact, as it opens the possibility of running large transformers on resource-limited devices with less memory.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: deep learning time series forecasting latent variable models discrete representation
Scores: [ 6 6 8 ]
Probabilistic time series forecasting is a challenging problem due to the long sequences involved, the large number of samples needed for accurate probabilistic inference, and the need for real-time inference in many applications. These challenges necessitate methods that are not only accurate but computationally efficient. Unfortunately, most current state-of-the-art methods for time series forecasting are based on Transformers, which scale poorly due to quadratic complexity in sequence length, and are therefore needlessly computationally inefficient. Moreover, with a few exceptions, these methods have only been evaluated for non-probabilistic point estimation. In this work, we address these two shortcomings.For the first, we introduce VQ-TR, which maps large sequences to a discrete set of latent representations as part of the Attention module. This not only allows us to attend over larger context windows with linear complexity in sequence length but also allows for effective regularization to avoid overfitting.For the second, we provide what is to the best of our knowledge the first systematic comparison of modern Transformer-based time series forecasting methods for probabilistic forecasting. In this comparison, we find that VQ-TR performs better or comparably to all other methods while being computationally efficient.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: text-to-image generation text-to-image evaluation Davidsonian semantics large language models scene graphs visual question answering question generation benchmark
Scores: [ 5 8 5 ]
Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model are consistent with the prompt-based answers. This kind of evaluation is naturally dependent on the quality of the underlying QG and QA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not assert that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics. DSG is an automatic, graph-based QG/A that is modularly implemented to be adaptable to any QG/A module. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We will release the DSG-1k prompts and the corresponding DSG questions.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Bug fix window attention position embeddings high resolution finetuning image classification video classification object detection instance segmentation
Scores: [ 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: biological neuron artificial neural network dynamic neural response tuning activation function response-adaptive activation aggregated response regularization
Scores: [ 6 3 8 5 ]
Artificial Neural Networks (ANNs) have gained broad applications across various fields due to their brilliant architecture designs. ANNs were initially inspired by the principle of biology. The biological neural system's fundamental response process comprises information acquisition, transmission, and aggregation. The transmission of information in neurons is achieved by triggering action potentials that propagate through axons. ANNs utilize activation functions to simulate such behavior of biological neurons. However, previous studies have only considered static response conditions, while the biological neuron's response conditions are dynamic, depending on neuron properties and the real-time environment. Therefore, the dynamic response conditions of biological neurons could help improve the static ones of the existing activation functions. Additionally, the neuron's aggregated response exhibits high specificity for different categories, allowing the neural system to differentiate and identify distinct objects. Inspired by these biological patterns, we propose a novel Dynamic Neural Response Tuning (DNRT) mechanism, which aligns the response patterns of ANNs with those of biological neurons. DNRT comprises Response-Adaptive Activation (RAA) and Aggregated Response Regularization (ARR), mimicking biological neurons' information transmission and aggregation behaviors. RAA dynamically adjusts the response condition based on the strength and characteristics of the input signal. ARR is devised to enhance the network's ability to learn category specificity. Extensive experiments indicate that the proposed DNRT is highly interpretable, applicable to various mainstream network architectures, and can achieve remarkable performance compared with existing response activation functions in multiple tasks and domains.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Spatial relation Transformer Spatial predicates Visual recognition Neural network architecture
Scores: [ 8 5 6 8 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: neural topic model; contrastive learning
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: CLIP debiasing fair representation learning vision-language models dependence measure
Scores: [ 6 6 8 6 ]
Large pre-trained vision-language models (VLMs) provide compact and general-purpose representations of text and images that are demonstrably effective across multiple downstream vision and language tasks. However, owing to the nature of their training process, these models have the potential to 1) propagate or amplify societal biases in the training data, and 2) learn to rely on spurious features. Thispaper proposes FairVLM, a general approach for making the zero-shot prediction of VLMs more fair and robust to spurious correlations. We formulate the problem of jointly debiasing VLMs’ image and text representations in reproducing kernel Hilbert spaces (RKHSs), which affords multiple benefits: 1) Flexibility: Unlike existing approaches, which are specialized to either learn with or without ground-truth labels, FairVLM is adaptable to learning in both scenarios, 2) Ease of Optimization: FairVLM lends itself to an iterative optimization involving closed-form solvers, which leads to 4×-10× faster training than the existing methods, 3) Sample Efficiency: Under sample-limited conditions, FairVLM significantly outperforms baselines when they fail entirely, and 4) Performance: Empirically, FairVLM achieves appreciable zero-shot accuracy gains on benchmark fairness and spurious correlation datasets over their respective baselines.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: learned image compression frequency-aware transformer entropy model
Scores: [ 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Micro-expression spotting Optical flow Facial alignment
Scores: [ 6 5 8 8 8 ]
Micro-expression spotting (MES) is challenging since the small magnitude of micro-expression (ME) makes them susceptible to global movements like head rotation. However, the unique movement pattern and inherent characteristics of ME allow them to be distinguished from other movements. Existing MES methods based on fixed reference frame degrade optical flow accuracy and are overly dependent on facial alignment. In this paper, we propose a skip-k-frame block-wise main directional mean optical flow (MDMO) feature for MES based on unfixed reference frame. By employing skip-k-frame strategy, we substantiate the existence of a distinct and exclusive movement pattern in ME, called M-pattern due to its feature curve resembling the letter `M'. Based on M-pattern and characteristics of ME, we then provide a novel spotting rules to precisely locate ME intervals. Block-wise MDMO feature is capable of removing global movements without compromising complete ME movements in the early feature extraction stage. Besides, A novel pixelmatch-based facial alignment algorithm with dynamic update of reference frame is proposed to better align facial images and reduce jitter between frames. Experimental results on CAS(ME)2, SAMM-LV and CASME II validate the proposed methods are superior to the state-of-the-art methods.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: deep learning deep features computer vision feature upsampling
Scores: [ 3 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: motion refinement hand-object interaction inverse problem generative prior
Scores: [ 6 6 8 6 ]
In this work, we tackle the challenging problem of denoising hand-object interactions (HOI). Given an erroneous interaction sequence, the objective is to refine the incorrect hand trajectory to remove interaction artifacts for a perceptually realistic sequence. This challenge involves intricate interaction noise, including unnatural hand poses and incorrect hand-object relations, alongside the necessity for robust generalization to new interactions and diverse noise patterns. We tackle those challenges through a novel approach, GeneOH Diffusion, incorporating two key designs: an innovative contact-centric HOI representation named GeneOH and a new domain-generalizable denoising scheme. The contact-centric representation GeneOH informatively parameterizes the HOI process, facilitating enhanced generalization across various HOI scenarios. The new denoising scheme consists of a canonical denoising model trained to project noisy data samples from a whitened noise space to a clean data manifold and a "denoising via diffusion" strategy which can handle input trajectories with various noise patterns by first diffusing them to align with the whitened noise space and cleaning via the canonical denoiser. Extensive experiments on four benchmarks with significant domain variations demonstrate the superior effectiveness of our method. GeneOH Diffusion also shows promise for various downstream applications. An anonymous website for introducing the work is available at Geneoh-Diffusion.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision Language Learning Large Language Model Pretraining
Scores: [ 6 5 8 6 ]
Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual input as a prompt and focus exclusively on optimizing the text generation process conditioned upon vision content by a frozen LLM. Such an inequitable treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified form. Specifically, we introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language that LLM can read. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. Coped with this tokenizer, the presented foundation model called LaVIT can handle both image and text indiscriminately under the same generative learning paradigm. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously. Extensive experiments further showcase that it outperforms the existing models by a large margin on massive vision-language tasks.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Network Quantization Large Language Models
Scores: [ 6 6 6 6 ]
Large Language Models (LLMs) have demonstrated unparalleled efficacy in natural language processing. However, their high computational demands and memory overheads hinder their broad deployment. To address this, two quantization strategies emerge, including Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ). For LLMs, the billions of parameters make the QAT impractical due to the prohibitive training cost and thus PTQ becomes more prevalent. In existing studies, activation outliers in particular channels are identified as the biggest challenge to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM is able to obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Mixture of Experts Parameter-Efficient Fine-Tuning
Scores: [ 6 5 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Low-rank sparsity closed-loop recurrent neural networks
Scores: [ 8 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Computer Vision Sparse-view Pose Estimation Diffusion
Scores: [ 5 5 8 8 ]
Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparse views ($<$10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: neural fields NeRF reconstruction
Scores: [ 8 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision-Language Models Large Language Models
Scores: [ 5 5 6 6 ]
Given an image and a target modification (e.g an image of the Eiffel tower and the text “without people and at night-time”), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. While supervised approaches rely on annotating triplets that is costly (i.e. query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we proposeto tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple, yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking a LLM to recompose the caption based on the textual target modification for subsequent retrieval via e.g. CLIP, we achieve modular language reasoning. In four ZS-CIR benchmarks, we find competitive, in-part state-of-the-art performance - improving over supervised methods Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to both investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to in parts more than double of previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable, allowing to post-hoc re-align failure cases. Code will be released upon acceptance.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D Point Cloud Representation 3D Point Cloud Registration Scan-to-CAD Spherical Gaussians Equivariant
Scores: [ 8 6 5 ]
Recently, SO(3)-equivariant methods have been explored for 3D reconstruction via Scan-to-CAD.Despite significant advancements attributed to the unique characteristics of 3D data, existing SO(3)-equivariant approaches often fall short in seamlessly integrating local and global contextual information in a widely generalizable manner.Our contributions in this paper are threefold.First, we introduce Spherical Patch Fields, a representation technique designed for patch-wise, SO(3)-equivariant 3D point clouds, anchored theoretically on the principles of Spherical Gaussians.Second, we present the Patch Gaussian Layer, designed for the adaptive extraction of local and global contextual information from resizable point cloud patches.Culminating our contributions, we present Learnable Spherical Patch Fields (DeepSPF) – a versatile and easily integrable backbone suitable for instance-based point networks.Through rigorous evaluations, we demonstrate significant enhancements in Scan-to-CAD performance for point cloud registration, retrieval, and completion: a significant reduction in the rotation error of existing registration methods, an improvement of up to 17% in the Top-1 error for retrieval tasks, and a notable reduction of up to 30% in the Chamfer Distance for completion models, all attributable to the incorporation of DeepSPF.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: open-vocabulary object detection open-vocabulary image segmentation
Scores: [ 6 8 6 8 ]
Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code will be released to facilitate future research.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: constrained decoding label projection
Scores: [ 8 6 6 8 ]
Zero-shot cross-lingual transfer utilizing multilingual LLMs has become a popular learning paradigm for low-resource languages with no labeled training data. However, for NLP tasks that involve fine-grained predictions on words and phrases, the performance of zero-shot cross-lingual transfer learning lags far behind supervised fine-tuning methods. Therefore, it is common to exploit translation and label projection to further improve the performance by (1) translating training data that is available in a high-resource language (e.g., English) together with the gold labels into low-resource languages, and/or (2) translating test data in low-resource languages to a high-source language to run inference on, then projecting the predicted span-level labels back onto the original test data. However, state-of-the-art marker-based label projection methods suffer from translation quality degradation due to the extra label markers injected in the input to the translation model. In this work, we explore a new direction that leverages constrained decoding for label projection to overcome the aforementioned issues. Our new method not only can preserve the quality of translated texts but also has the versatility of being applicable to both translating training and translating test data strategies. This versatility is crucial as our experiments reveal that translating test data can lead to a considerable boost in performance compared to translating only training data. We evaluate on two cross-lingual transfer tasks, namely Named Entity Recognition and Event Argument Extraction, spanning 20 languages. The results demonstrate that our approach outperforms the state-of-the-art marker-based method by a large margin and also shows better performance than other label projection methods that rely on external word alignment.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Video Question Answer Visual Reasoning Causal Represetation Learning
Scores: [ 8 3 6 5 8 ]
Current approaches to Video Question Answering (VideoQA) primarily focus on cross-modality matching, which is limited by the requirement for extensive data annotations and the insufficient capacity for causal reasoning (e.g. attributing accidents). To address these challenges, we introduce a causal framework for video reasoning, termed Learning Latent Causal Processes (LLCP). At the heart of LLCP lies a multivariate generative model designed to analyze the spatial-temporal dynamics of objects within events. Leveraging the inherent modularity of causal mechanisms, we train the model through self-supervised local auto-regression eliminating the need for annotated question-answer pairs. During inference, the model is applied to answer two types of reasoning questions: accident attribution, which infers the cause from observed effects, and counterfactual prediction, which predicts the effects of counterfactual conditions given the factual evidence. In the first scenario, we identify variables that deviate from the established distribution by the learned model, signifying the root cause of accidents. In the second scenario, we replace embeddings of previous variables with counterfactual ones, enabling us to forecast potential developments. Once we have identified these cause/effect variables, natural language answers are derived through a combination of grammatical parsing and a pre-trained vision-language model. We assess the efficacy of LLCP on both synthetic and real-world data, demonstrating comparable performance to supervised methods despite our framework using no paired textual annotations.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: object detection data-centric AI label translation dataset improvements
Scores: [ 6 6 6 ]
In object detection, varying annotation protocols across datasets can result in annotation mismatches, leading to inconsistent class labels and bounding regions. Addressing these mismatches typically involves manually identifying common trends and fixing the corresponding bounding boxes and class labels. To alleviate this laborious process, we introduce the label translation problem in object detection. Here, the goal is to translate bounding boxes from one or more source datasets to match the annotation style of a target dataset. We propose a data-centric approach, Label-Guided Pseudo-Labeling (LGPL), that improves downstream detectors in a manner agnostic to the detector learning algorithms and model architectures. Validating across four object detection scenarios, defined over seven different datasets and three different architectures, we show that translating labels for a target task via LGPL consistently improves the downstream detection in every setting, on average by 1.88 mAP and 2.65 AP$^{75}$. Most importantly, we find that when training with multiple labeled datasets, carefully addressing annotation mismatches with LGPL alone can improve downstream object detection better than off-the-shelf domain adaptation techniques that align only image features.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: world event prediction large language models
Scores: [ 6 3 8 ]
Machine-based prediction of real-world events is garnering attention due to its potential for informed decision-making. Whereas traditional forecasting predominantly hinges on structured data like time-series, recent breakthroughs in language models enable predictions using unstructured text. In particular, (Zou et al., 2022) unveils AutoCast, a new benchmark that employs news articles for answering forecasting queries. Nevertheless, existing methods still trail behind human performance. The cornerstone of accurate forecasting, we argue, lies in identifying a concise, yet rich subset of news snippets from a vast corpus. With this motivation, we introduce AutoCast++, a zero-shot ranking-based context retrieval system, tailored to sift through expansive news document collections for event forecasting. Our approach first re-ranks articles based on zero-shot question-passage relevance, honing in on semantically pertinent news. Following this, the chosen articles are subjected to zero-shot summarization to attain succinct context. Leveraging a pre-trained language model, we conduct both the relevance evaluation and article summarization without needing domain-specific training. Notably, recent articles can sometimes be at odds with preceding ones due to new facts or unanticipated incidents, leading to fluctuating temporal dynamics. To tackle this, our re-ranking mechanism gives preference to more recent articles, and we further regularize the multi-passage representation learning to align with human forecaster responses made on different dates. Empirical results underscore marked improvements across multiple metrics, improving the performance for multiple-choice questions (MCQ) by 48% and true/false (TF) questions by up to 8%.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision Transformers Model compression Dynamic pruning
Scores: [ 6 6 8 6 ]
The Vision Transformer (ViT) has emerged as a powerful architecture for various computer vision tasks. Nonetheless, this comes with substantially heavier computational costs than Convolutional Neural Networks (CNNs). The attention mechanism in ViTs, which integrates information from different image patches to the class token ([CLS]), renders traditional structured pruning methods used in CNNs unsuitable. To overcome this issue, we propose SynergisTic pAtch pRuning (STAR) that unifies intra-layer and inter-layer patch importance scoring. Specifically, our approach combines a) online evaluation of intra-layer importance for the [CLS] and b) offline evaluation of inter-layer importance of each patch, which is obtained by a newly designed mechanism that invokes Layer-wise Relevance Propagation (a popular tool in model interpretability analysis). The two importance scores are fused by minimizing a weighted average of Kullback-Leibler (KL) Divergences, which admits an analytical form and can be carried with minimal overhead. Subsequently, the patches are successively pruned at each layer by maintaining only the top-k most important ones. Unlike prior art that relies on manual selection of the pruning rates at each layer, we propose an automated method for selecting them based on offline-derived metrics. We also propose a variant that uses these rates as weighted percentile parameters (for the layer-wise normalized scores), thus leading to an alternate adaptive rate selection technique that is input-based. Extensive experiments demonstrate the significant acceleration of the inference with minimal performance degradation. For instance, on the ImageNet dataset, the pruned DeiT-Small reaches a throughput of 4,256 images/s, which is over 66% higher than the much smaller (unpruned) DeiT-Tiny model, while having a substantially higher accuracy (+6.8% Top-1 and +3.1% Top-5).
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: vision transformer representation learning hyper spectral imaging
Scores: [ 8 8 5 5 ]
Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Automated Model Evalutaion Energy Meta-distribution Distribution shift
Scores: [ 8 8 3 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Image super-resolution; Real-World super-resolution
Scores: [ 8 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Spatio-temporal data time series virtual sensing imputation graph neural networks deep learning
Scores: [ 5 6 8 3 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: long-term action anticipation multimodal learning
Scores: [ 5 8 6 6 ]
Can we better anticipate an actor’s future actions (e.g. mix eggs) by knowing what commonly happens after the current action (e.g. crack eggs)? What if the actor also shares the goal (e.g. make fried rice) with us? The long-term action anticipation (LTA) task aims to predict an actor’s future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction.We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos),have the potential to help LTA from both perspectives. It can help provide the prior knowledge on the possible next actions, and infer the goal given the observed part of a procedure, respectively. We propose AntGPT, which represents video observations as sequences of human actions, and uses the action representation for an LLM to infer the goals and model temporal dynamics. AntGPT achieves state-of-the-art performance on Ego4D LTA v1 and v2, EPIC-Kitchens-55, as well as EGTEA GAZE+, thanks to LLMs’ goal inference and temporal dynamics modeling capabilities. We further demonstrate that these capabilities can be effectively distilled into a compact neural network 1.3% of the original LLM model size. Code and model will be released upon acceptance.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Multimodal Large Language Models Demonstrative Instruction
Scores: [ 6 6 8 8 ]
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where the VPG-generated tokens of images are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning based training objective inherently biases the VPG to concentrate solely on the primary visual contents sufficient for caption generation, often neglecting other visual details. This shortcoming results in MLLMs’ underperformance in comprehending demonstrative instructions consisting of multiple, interleaved, and multimodal instructions that demonstrate the required context to complete a task. To address this issue, we introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C), which can infer and complete the missing details essential for comprehending demonstrative instructions. Further, we propose a synthetic discriminative training strategy to fine-tune VPG-C, eliminating the need for supervised demonstrative instructions. As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and OwlEval benchmarks also demonstrate the superiority of VPG-C. The anonymous project is available at https://anonymous.4open.science/r/Cheetah-45B4.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: length generalization systematic generalization understanding transformer scratchpad LLM algorithmic reasoning
Scores: [ 8 6 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Non-watertight mesh; generative model; 3D geometry; differentiable rendering
Scores: [ 8 8 8 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: self-supervised learning deep one-class cilassification
Scores: [ 6 3 6 6 6 ]
One-class classification (OCC) involves predicting whether a new data is normal or anomalous based solely on the data from a single class during training. Various attempts have been made to learn suitable representations for OCC within a self-supervised framework. Notably, discriminative methods that use geometric visual transformations, such as rotation, to generate pseudo-anomaly samples have exhibited impressive detection performance. Although rotation is commonly viewed as a distribution-shifting transformation and is widely used in the literature, its effectiveness remains a mystery. In this study, we make a surprising observation: there exists a strong linear relationship (Pearson's Correlation, r>0.9) between the accuracy of rotation prediction and the performance of OCC. This suggests that a classifier that effectively distinguishes different rotations is more likely to excel in OCC, and vice versa. The root cause of this phenomenon can be attributed to the transformation bias in the dataset, where representations learned from transformations already present in the dataset tend to be less effective, making it essential to accurately estimate the transformation distribution before utilizing pretext tasks involving these transformations for reliable self-supervised representation learning. To the end, we propose a novel two-stage method to estimate the transformation distribution within the dataset. In the first stage, we learn general representations through standard contrastive pre-training. In the second stage, we select potentially semantics-preserving samples from the entire augmented dataset, which includes all rotations, by employing density matching with the provided reference distribution. By sorting samples based on semantics-preserving versus shifting transformations, we achieve improved performance on OCC benchmarks.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Graph Matching; Joint Optimization; Unsupervised Learning
Scores: [ 8 8 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Open-source Language Models Fine-tuning Mixed-quality Data
Scores: [ 6 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: speech audio multi-modal large language model
Scores: [ 6 6 3 8 ]
Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, We construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Speech samples are available at https://speechtokenizer.github.io/.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: 3D vision NeRF semantic understanding
Scores: [ 5 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Language Models Compiled Neural Networks Neural Comprehension Symbolic Operations Length Generalization
Scores: [ 6 6 5 8 ]
Language models (LMs) proficiency in handling deterministic symbolic reasoning and rule-based tasks remains limited due to their dependency implicit learning on textual data. To enable fully rule comprehension ability, we explore how to incorporate compiled neural networks (CoNNs) which weight is specially designed into the architecture of LMs, to achieve high accuracy and robust performance. CoNNs are transformer-based neural networks that execute rules through artificially generated attention weights. Our method, which call "Neural Comprehension", by incorporating CoNN modules into the LM, the framework effectively tackles rule-intensive challenges. Our experiments on symbolic reasoning tasks and real-world arithmetic reasoning tasks demonstrate the superior performance of our method compared to existing techniques. Furthermore, our LM achieves flawless execution on symbolic operations tasks, highlighting the potential of our method in enabling LMs to possess true symbolic comprehension capabilities.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Zipformer ScaledAdam automatic speech recognition
Scores: [ 8 6 8 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: anomaly detection OOD
Scores: [ 6 6 5 6 ]
Detecting out-of-distribution (OOD) data is a critical challenge in machine learning due to model overconfidence, often without awareness of their epistemological limits. We hypothesize that "neural collapse", a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a novel post-hoc method for OOD detection, which leverages the geometric properties of “neural collapse” and of principal component spaces to identify OOD data. Our extensive experiments demonstrate that NECO achieves state-of-the-art results on both small and large-scale OOD detection tasks while exhibiting strong generalization capabilities across different network architectures. Furthermore, we provide a theoretical explanation for the effectiveness of our method in OOD detection. We plan to release the code after the anonymity period.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural Fields NeRF 3D Vision Scene Reconstruction
Scores: [ 8 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: alt-text social media twitter clip computer vision image captioning accessibility
Scores: [ 5 6 8 ]
In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. More than just a special case of image captioning, alt-text is both more literally descriptive and context-specific. Also critically, images posted to Twitter are often accompanied by user-written text that despite not necessarily describing the image may provide useful context that if properly leveraged can be informative. We address this task with a multimodal model that conditions on both textual information from the associated social media post as well as visual signal from the image, and demonstrate that the utility of these two information sources stacks. We put forward a new dataset of 371k images paired with alt-text and tweets scraped from Twitter and evaluate on it across a variety of automated metrics as well as human evaluation. We show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work, by more than 2x on BLEU@4.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Implicit neural representations spectral bias computer vision neural network architectures activations image representation edge detection
Scores: [ 5 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Language model Pre-training ELECTRA Efficiency
Scores: [ 6 6 6 ]
ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model. Although ELECTRA offers a significant boost in efficiency, its potential is constrained by the training cost brought by the auxiliary model. Notably, this model, which is jointly trained with the main model, only serves to assist the training of the main model and is discarded post-training. This results in a substantial amount of training cost being expended in vain. To mitigate this issue, we propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model. To construct a learning curriculum for the main model, we smooth its output distribution via temperature scaling following a descending schedule. Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods, while significantly eliminating the computation and memory cost brought by the joint training of the auxiliary model. Our method also reduces the sensitivity to hyper-parameters and enhances the pre-training stability.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Contrastive Pre-training Vision-Language Text-to-Image Generation Auto-regressive Model.
Scores: [ 6 5 8 8 6 ]
The field of Vision-and-Language (VL) has witnessed a proliferation of pretrained foundation models. Current techniques typically employ only one type of training objective, whether it's (1) contrastive objectives (like CLIP), (2) image-to-text generative objectives (like PaLI), or (3) text-to-image generative objectives (like Parti). However, all these three objectives are mutually relevant and are all based on image-text pairs. Intuitively, the first two objectives can be considered as complementary projections between two modalities, and contrastive learning can preserve global alignment and generations facilitate fine-grained understanding. Inspired by this, we present a Contrastive Bi-directional Image-Text generation model (CoBIT) to first time unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generations. CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE), and text-based content creation, particularly in zero-shot scenarios.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Dynamic graph; fourier transform; link prediction
Scores: [ 5 8 6 8 ]
Link prediction is a crucial task in dynamic graph learning. Recent advancements in continuous-time dynamic graph models, primarily by leveraging richer temporal details, have significantly improved link prediction performance. However, due to their complex modules, they still face several challenges, such as overfitting and optimization difficulties. More importantly, it is challenging for these methods to capture the 'shift' phenomenon, where node interaction patterns change over time. To address these issues, we propose a simple yet novel method called \textbf{Fre}quency \textbf{E}nhanced Decomposed Continuous-Time \textbf{Dy}namic \textbf{G}raph ({\bf FreeDyG}) model for link prediction. FreeDyG extracts node representations based on their historical first-hop neighbors thus transforming the dynamic graph learning problem into time series analysis where node interactions are observed over sequential time points. Unlike previous works that primarily focus on the time domain, we delve into the frequency domain, allowing a deeper and more nuanced extraction of interaction patterns, revealing periodic and "shift" behaviors. Extensive experiments conducted on seven real-world continuous-time dynamic graph datasets validate the effectiveness of FreeDyG. The results consistently demonstrate that FreeDyG outperforms existing methods in both transductive and inductive settings.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Time Series Forecasting Mixing Networks
Scores: [ 3 8 6 ]
Time series forecasting is widely used in extensive applications, such as traffic planning and weather forecasting. However, real-world time series usually present intricate temporal variations, making forecasting extremely challenging. Going beyond the mainstream paradigms of plain decomposition and multiperiodicity analysis, we analyze temporal variations in a novel view of multiscale-mixing, where time series present distinct patterns in different sampling scales. Specifically, the microscopic and the macroscopic information are reflected in fine and coarse scales, respectively, and thereby complex variations are inherently disentangled. Based on this observation, we propose TimeMixer as a fully MLP-based architecture with Past-Decomposable-Mixing (PDM) and Future-Multipredictor-Mixing (FMM) blocks to take full advantage of disentangled multiscale series in both past extraction and future prediction phases. Concretely, PDM applies the decomposition to multiscale series and further mixes the decomposed seasonal and trend components in fine-to-coarse and coarse-to-fine directions separately, which successively aggregates the microscopic seasonal and macroscopic trend information. FMM further ensembles multiple predictors to utilize complementary forecasting capabilities in multiscale observations. Consequently, our proposed TimeMixer is able to achieve consistent state-of-the-art performances in both long-term and short-term forecasting tasks with favorable run-time efficiency.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Spiking Neural Betwork Point Cloud Event Camera Action Recognition
Scores: [ 6 6 3 8 ]
Event cameras are bio-inspired sensors that respond to local changes in light intensity and feature low latency, high energy efficiency, and high dynamic range. Meanwhile, Spiking Neural Networks (SNNs) have gained significant attention due to their remarkable efficiency and fault tolerance. By synergistically harnessing the energy efficiency inherent in event cameras and the spike-based processing capabilities of SNNs, their integration could enable ultra-low-power application scenarios, such as action recognition tasks. However, existing approaches often entail converting asynchronous events into conventional frames, leading to additional data mapping efforts and a loss of sparsity, contradicting the design concept of SNNs and event cameras. To address this challenge, we propose SpikePoint, a novel end-to-end point-based SNN architecture. SpikePoint excels at processing sparse event cloud data, effectively extracting both global and local features through a singular-stage structure. Leveraging the surrogate training method, SpikePoint achieves high accuracy with few parameters and maintains low power consumption, specifically employing the identity mapping feature extractor on diverse datasets. SpikePoint achieves state-of-the-art (SOTA) performance on four event-based action recognition datasets using only 16 timesteps, surpassing other SNN methods. Moreover, it also achieves SOTA performance across all methods on three datasets, utilizing approximately 0.3 % of the parameters and 0.5 % of power consumption employed by artificial neural networks (ANNs). These results emphasize the significance of Point Cloud and pave the way for many ultra-low-power event-based data processing applications.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: one-shot segmentation computer vision text-to-image models stable diffusion cross attention
Scores: [ 8 8 6 6 ]
Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image generation, image editing, and 3D shape generation. Inspired by these advancements, we explore leveraging these vision-language models for segmenting images at any desired granularity using as few as one annotated sample. We propose SLiMe, which frames this problem as an optimization task. Specifically, given a single image and its segmentation mask, we first extract our novel “weighted accumulated self-attention map” along with cross-attention map from the SD prior. Then, using these extracted maps, the text embeddings of SD are optimized to highlight the segmented region in these attention maps, which in turn can be used to derive new segmentation results. Moreover, leveraging additional training data when available, i.e. few-shot, improves the performance of SLiMe. We performed comprehensive experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: collaborative filtering representation learning
Scores: [ 6 8 3 3 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Models Natural Language Processing Reasoning
Scores: [ 6 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: MoE Instruction Tuning
Scores: [ 8 5 6 8 ]
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenario), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE32B, surpasses the performance of FLAN-PALM62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Efficient fine-tuning Long context Large language model
Scores: [ 6 8 8 6 ]
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on Llama2 models from 7B/13B to 70B. LongLoRA adopts Llama2 7B from 4k context to 100k, or Llama2 70B to 32k on a single 8$\times$ A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like Flash-Attention2. In addition, we further conduct supervised fine-tuning on our LongLoRA models, with long instruction-following data. Our code and models will be publicly available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Protein Language Models Universal Representations Downstream Tasks Protein Structure Modeling
Scores: [ 8 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Pre-training Tabular Encoder Pre-training Heterogeneous Tabular Data Classification and Regression Deep Learning
Scores: [ 6 8 5 ]
Recent advancements in Natural Language Processing (NLP) have witnessed the groundbreaking impact of pretrained models, yielding impressive outcomes across various tasks. This study seeks to extend the power of pretraining methodologies to facilitating the prediction over tables in data science, a domain traditionally overlooked, yet inherently challenging due to the plethora of table schemas intrinsic to different tasks. The primary research questions underpinning this work revolve around the establishment of a universal pretraining protocol for tables with varied structures, the generalizability and transferability of learned knowledge across tasks, the adaptation to diverse downstream applications, and the incorporation of incremental columns over time. In response to these challenges, we introduce UniTabE, a straightforward yet effective method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures. UniTabE's core concept relies on representing each basic table element with a module, termed TabUnit. This is subsequently followed by a Transformer encoder to refine the representation. Moreover, our model is designed to facilitate pretraining and finetuning through the utilization of free-form prompts. In order to implement the pretraining phase, we curated an expansive tabular dataset comprising approximately 13 billion samples, meticulously gathered from the Kaggle platform. This research primarily centers on classification and regression tasks involving tabular data, and conducts rigorous experimental testing and analyses to validate the effectiveness of our methodology. The experimental results demonstrate UniTabE's superior performance against several baseline models across a multitude of benchmark datasets. This, therefore, underscores UniTabE's potential to significantly enhance the semantic representation of tabular data, thereby marking a significant stride for tabular data analysis.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Dynamic Sparse Training; Neuron Revitalization
Scores: [ 6 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: NeRF Neural Rendering Dynamic Avatars Frequency Modulation
Scores: [ 6 6 6 ]
It is now possible to reconstruct dynamic human motion and shape from a sparse set of cameras using Neural Radiance Fields (NeRF) driven by an underlying skeleton. However, a challenge remains to model the deformation of cloth and skin in relation to skeleton pose. Unlike existing avatar models that are learned implicitly or rely on a proxy surface, our approach is motivated by the observation that different poses necessitate unique frequency assignments. Neglecting this distinction yields noisy artifacts in smooth areas or blurs fine-grained texture and shape details in sharp regions. We develop a two-branch neural network that is adaptive and explicit in the frequency domain. The first branch is a graph neural network that models correlations among body parts locally, taking skeleton pose as input. The second branch combines these correlation features to a set of global frequencies and then modulates the feature encoding. Our experiments demonstrate that our network outperforms state-of-the-art methods in terms of preserving details and generalization capabilities.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: multi-modal learning video and language
Scores: [ 6 6 6 8 ]
Existing video-language pre-training methods primarily focus on instance-level alignment between video clips and captions via global contrastive learning but neglect rich fine-grained local information in both videos and text, which is of importance to downstream tasks requiring temporal localization and semantic reasoning. A powerful model is expected to be capable of capturing region-object correspondences and recognizing scene changes in a video clip, reflecting spatial and temporal granularity respectively. To strengthen model's understanding of such fine-grained information, we propose a simple yet effective video-language modeling framework, S-ViLM, based on intrinsic structures of these two modalities. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features simultaneously. Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches in learning more expressive representations.Specifically, S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks, covering text-video retrieval, video question answering, video action recognition and temporal action localization.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: multi-modal in-context learning; multi-modal instruction tuning; vision-language model
Scores: [ 3 8 6 6 5 ]
Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks.In this paper, we address the limitation above by 1) introducing MMICL, a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially for complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and emerges the impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue for VLMs that often leads to hallucination when faced with extensive textual context.Our code, dataset and model are available at github link.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Equivariant networks SO(3) Equivariance Fourier features
Scores: [ 6 6 6 6 5 ]
The usage of 3D vision algorithms, such as shape reconstruction, remains limited because they require inputs to be at a fixed canonical rotation. Recently, a simple equivariant network, Vector Neuron (VN) has been proposed that can be easily used with the state-of-the-art 3D neural network (NN) architectures. However, its performance is limited because it is designed to use only three-dimensional features, which is insufficient to capture the details present in 3D data. In this paper, we introduce an equivariant feature representation for mapping a 3D point to a high-dimensional feature space. Our feature can discern multiple frequencies present in 3D data, which, as shown by Tancik et al. (2020), is the key to designing an expressive feature for 3D vision tasks. Our representation can be used as an input to VNs, and the results demonstrate that with our feature representation, VN captures more details, overcoming the limitation raised in its original paper.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: visual localization image retrieval synthetic data domain shift geometric consistency long-term visual localization
Scores: [ 8 6 6 ]
State-of-the-art visual localization approaches generally rely on a first image retrieval step whose role is crucial. Yet, retrieval often struggles when facing varying conditions, due to e.g. weather or time of day, with dramatic consequences on the visual localization accuracy. In this paper, we improve this retrieval step and tailor it to the final localization task. Among the several changes we advocate for, we propose to synthesize variants of the training set images, obtained from generative text-to-image models, in order to automatically expand the training set towards a number of nameable variations that particularly hurt visual localization. After expanding the training set, we propose a training approach that leverages the specificities and the underlying geometry of this mix of real and synthetic images. We experimentally show that those changes translate into large improvements for the most challenging visual localization datasets.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: zero-shot composed image retrieval asymmetrical
Scores: [ 8 8 8 6 ]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent. Existing methods have made great progress with the advanced large vision-language (VL) model in CIR task, however, they generally suffer from two main issues: lack of labeled triplets for model training and difficulty of deployment on resource-restricted environments when deploying the large vision-language model. To tackle the above problems, we propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning. In the framework, we propose a new adaptive token learner that maps an image to a sentence in the word embedding space of VL model. The sentence adaptively captures discriminative visual information and is further integrated with the text modifier. An asymmetric structure is devised for flexible deployment, in which the lightweight model is adopted for the query side while the large VL model is deployed on the gallery side. The global contrastive distillation and the local alignment regularization are adopted for the alignment between the light model and the VL model for CIR task. Our experiments demonstrate that the proposed ISA could better cope with the real retrieval scenarios and further improve retrieval accuracy and efficiency.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: CLIP Data
Scores: [ 8 6 8 5 ]
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its \textit{data} and \textit{not} the \textit{model} architecture or pre-training {objective}. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on \mbox{ViT-B} models. Scaling to 1B data, while maintaining the same training budget, attains \textbf{72.4%}. Our observations hold across various model sizes, exemplified by ViT-H achieving \textbf{80.5%}, without any bells-and-whistles. Curation code and training data distribution over metadata will be made available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: linear attention transformers
Scores: [ 6 8 5 ]
Linear attentions have shown promise for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) `inetuned-conversion of task-specific Transformers into linear versions that recover task performance, and (3) pretrained-conversion of Transformers, such as language models, into linear versions readily finetunable on downstream tasks. However, linear attentions often underperform compared to standard softmax attention. To close this performance gap, we study the behaviors of softmax and linear attentions in various train-from-scratch and finetuned-conversion settings. We find prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or spiky) weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple, trainable MLPs to produce attention weights mimicking softmax attention. Experiments show Hedgehog recovers over 99% of standard Transformer performance in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions by up to 6 perplexity points on WikiText-103 when training causal GPT models from scratch, and up to 8.7 GLUE score points when converting finetuned bidirectional BERT models. Hedgehog also enables direct pretrained-conversion, achieving a new state-of-the-art WikiText-103 perplexity of 16.7 for 125M decoder-only Transformers by converting pretrained GPT-2 into a linear attention Transformer.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: language modeling earth mover distance language generation language understanding
Scores: [ 3 6 6 8 ]
Neural language models are probabilistic models of human text. They are predominantly trained using maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. However, various degeneration phenomena are still widely observed when decoding from the distributions learned by such models. We establish that the forward cross-entropy is suboptimal as a distance metric for aligning human and model distribution due to its (1) recall-prioritization (2) negative diversity ignorance and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on the inherent properties of earth mover distance to address the aforementioned challenges. Due to the high complexity of direct computation, we further introduce a feasible upper bound for EMO to ease end-to-end training. Upon extensive evaluation of language models trained using EMO and MLE. We find that EMO demonstrates a consistently better language modeling performance than MLE across domains. Moreover, EMO demonstrates noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences. This highlights the tremendous potential of EMO as a lightweight calibration method for enhancing large-scale pre-trained language models.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Neural Fields Neural Representation
Scores: [ 8 8 6 6 ]
Neural fields, mapping low-dimensional input coordinates to corresponding signals, have shown promising results in representing various signals. Numerous methodologies have been proposed, and techniques employing MLPs and grid representations have achieved substantial success. MLPs allow compact and high expressibility, yet often suffer from spectral bias and slow convergence speed. On the other hand, methods using grids are free from spectral bias and achieve fast training speed, however, at the expense of high spatial complexity. In this work, we propose a novel way for exploiting both MLPs and grid representations in neural fields. Unlike the prevalent methods that combine them sequentially (extract features from the grids first and feed them to the MLP), we inject spectral bias-free grid representations into the intermediate features in the MLP. More specifically, we suggest a Coordinate-Aware Modulation (CAM), which modulates the intermediate features using scale and shift parameters extracted from the grid representations. This can maintain the strengths of MLPs while mitigating any remaining potential biases, facilitating the rapid learning of high-frequency components. In addition, we empirically found that the feature normalizations, which have not been successful in neural filed literature, proved to be effective when applied in conjunction with the proposed CAM. Experimental results demonstrate that CAM enhances the performance of neural representation and improves learning stability across a range of signals. Especially in the novel view synthesis task, we achieved state-of-the-art performance with the least number of parameters and fast training speed for dynamic scenes and the best performance under 1MB memory for static scenes. CAM also outperforms the best-performing video compression methods using neural fields by a large margin.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Forecasting time series foundation model transfer learning imputation
Scores: [ 8 6 6 8 ]
It is challenging to scale time series forecasting models such that they forecast accurately for multiple distinct domains and datasets, all with potentially different underlying collection procedures (e.g., sample resolution), patterns (e.g., periodicity), and prediction requirements (e.g., reconstruction vs. forecasting). We call this general task universal forecasting. Existing methods usually assume that input data is regularly sampled, and they forecast to pre-determined horizons, resulting in failure to generalise outside of the scope of their training. We propose the DAM - a neural model that takes randomly sampled histories and outputs an adjustable basis composition as a continuous function of time for forecasting to non-fixed horizons. It involves three key components: (1) a flexible approach for using randomly sampled histories, from a long-tail distribution, that enables an efficient global perspective of the underlying temporal dynamics while retaining focus on the recent history; (2) a transformer backbone that is trained on these actively sampled histories to produce, as representational output, (3) the basis coefficients of a continuous function of time. We show that a single DAM, trained on 10 common time series datasets, either outperformed or closely matched existing SoTA models, even though those models were trained to specialise on specific dataset and horizon combinations. The DAM also performs well at imputation, transfers well to held-out datasets, is interpretable via its basis function composition and the attention mechanism it uses, can be tuned for different inference-cost requirements, is easy to deploy, is robust to missing and irregularly sampled data by design, and can forecast effectively to distant horizons without adaptation or repeated autoregressive prediction.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: object detection annotation-free instance segmentation open-vocabulary SAM CLIP
Scores: [ 5 6 8 6 5 ]
Foundation models, pre-trained on a large amount of data have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, i.e., these foundation models fail to discern boundaries between individual objects. For the first time, we probe that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of its particular intermediate layer. Following this surprising observation, we propose \textit{Zip} which $\textbf{Z}ipsupCL\textbf{ip}$ and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves comparable performance to the best-performing open-vocabulary object detecters using base annotations.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: language model decoding algorithm energy-based model
Scores: [ 8 5 6 6 ]
Despite the remarkable advances in language modeling, current mainstream decoding methods still struggle to generate texts that align with human texts across different aspects. In particular, sampling-based methods produce less-repetitive texts which are often disjunctive in discourse, while search-based methods maintain topic coherence at the cost of increased repetition. Overall, these methods fall short in achieving holistic alignment across a broad range of aspects. In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts measured by multiple metrics of desired aspects simultaneously. The resulting decoding distribution enjoys an analytical solution that scales the input language model distribution via a sequence-level energy function defined by these metrics. And most importantly, we prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts. To facilitate tractable sampling from this globally normalized distribution, we adopt the Sampling-Importance-Resampling technique. Experiments on various domains and model scales demonstrate the superiority of our method in metrics alignment with human texts and human evaluation over strong baselines.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Video compression End-to-end Learning-based video coding Rate Control.
Scores: [ 8 6 8 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: time-series contrastive learning masked reconstruction self-supervised learning imputation unsupervised learning
Scores: [ 8 5 6 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Vision Transformers Re-parameterization Lightweight Models
Scores: [ 6 8 8 6 8 ]
Large-scale Vision Transformers have achieved promising performance on downstream tasks through feature pre-training. However, the performance of vanilla lightweight Vision Transformers (ViTs) is still far from satisfactory compared to that of recent lightweight CNNs or hybrid networks. In this paper, we aim to unlock the potential of vanilla lightweight ViTs by exploring the adaptation of the widely-used re-parameterization technology to ViTs for improving learning ability during training without increasing the inference cost. The main challenge comes from the fact that CNNs perfectly complement with re-parameterization over convolution and batch normalization, while vanilla Transformer architectures are mainly comprised of linear and layer normalization layers. We propose to incorporate the nonlinear ensemble into linear layers by expanding the depth of the linear layers with batch normalization and fusing multiple linear features with hierarchical representation ability through a pyramid structure. We also discover and solve a new transformer-specific distribution rectification problem caused by multi-branch re-parameterization. Finally, we propose our Two-Dimensional Re-parameterized Linear module (TDRL) for ViTs. Under the popular self-supervised pre-training and supervised fine-tuning strategy, our TDRL can be used in these two stages to enhance both generic and task-specific representation. Experiments demonstrate that our proposed method not only boosts the performance of vanilla Vit-Tiny on various vision tasks to new state-of-the-art (SOTA) but also shows promising generality ability on other networks. Code will be available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: context window extension efficiency positional skip-wise training
Scores: [ 6 6 6 6 ]
Large Language Models (LLMs) are trained with a pre-defined context length, restricting their use in scenarios requiring long inputs. Previous efforts for adapting LLMs to a longer length usually requires fine-tuning with this target length (Full-length fine-tuning), suffering intensive training cost. To decouple train length from target length for efficient context window extension, we propose Positional Skip-wisE (PoSE) training that smartly simulates long inputs using a fixed context window. This is achieved by first dividing the original context window into several chunks, then designing distinct skipping bias terms to manipulate the position indices of each chunk. These bias terms and the lengths of each chunk are altered for every training example, allowing the model to adapt to all positions within target length. Experimental results show that PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning, with minimal impact on performance. Leveraging this advantage, we have successfully extended the LLaMA model to 128k tokens using a 2k training context window. Furthermore, we empirically confirm that PoSE is compatible with all RoPE-based LLMs and position interpolation strategies. Notably, our method can potentially support infinite length, limited only by memory usage in inference. With ongoing progress for efficient inference, we believe PoSE can further scale the context window beyond 128k.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: language models question answering binary representations retrieval-augmented language models
Scores: [ 6 8 6 6 ]
Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks.However, running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly reducing computation during inference. Despite the potential loss of accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, this only requires 127GB of disk space for encoding 3 billion tokens in Wikipedia.Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance. Our code will be publicly available.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Generalized Category Discovery Novel Category Discovery
Scores: [ 8 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large language model translation unintentional bilingualism
Scores: [ 6 6 5 6 ]
Multilingual large language models trained on non-parallel data yield impressive translation capabilities. Existing studies demonstrate that incidental sentence-level bilingualism within pre-training data contributes to the LLM's translation abilities. However, it has also been observed that LLM's translation capabilities persist even when incidental sentence-level bilingualism are excluded from the training corpus.In this study, we comprehensively investigate the unreasonable effectiveness and the underlying mechanism for LLM's translation abilities, specifically addressing the question why large language models learn to translate without parallel data, using the BLOOM model series as a representative example. Through extensive experiments, our findings suggest the existence of unintentional bilingualism in the pre-training corpus, especially word alignment data significantly contributes to the large language model's acquisition of translation ability. Moreover, the translation signal derived from word alignment data is comparable to that from sentence-level bilingualism. Additionally, we study the effects of monolingual data and parameter-sharing in assisting large language model to learn to translate. Together, these findings present another piece of the broader puzzle of trying to understand how large language models acquire translation capability.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: manifold learning representation learning gyrovector spaces deep learning
Scores: [ 6 8 3 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large language models; self-improvement; alignment
Scores: [ 6 6 6 6 ]
Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. However, the inherent open-ended nature of these tasks implies that there is always room for improvement in the quality of model responses. To address this challenge, various approaches have been proposed to enhance the performance of LLMs. There has been a growing focus on enabling LLMs to self-improve their response quality, thereby reducing the reliance on extensive human annotation efforts for collecting diverse and high-quality training data. Recently, prompting-based methods have been widely explored among self-improvement methods owing to their effectiveness, efficiency, and convenience. However, those methods usually require explicitly and thoroughly written rubrics as inputs to LLMs. It is expensive and challenging to manually derive and provide all necessary rubrics with a real-world complex goal for improvement (e.g., being more helpfulness and less harmful). To this end, we propose an imPlicit self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data. PIT only requires preference data that are used to train reward models with no extra human efforts. Specifically, we reformulate the training objective of reinforcement learning from human feedback (RLHF) -- instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. In this way, PIT is implicitly trained with the improvement goal of better aligning with human preferences. Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: sparse visual recognition vision transformer computer vision representation learning
Scores: [ 6 6 5 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Ferret Multimodal Large Language Model Referring Grounding
Scores: [ 6 6 8 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Learning from Demonstration Manipulation 3D Learning SE(3) Equivariance
Scores: [ 6 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: sRGB real noise modeling Normalizing flow Low-level vision
Scores: [ 6 8 6 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Data Contamination Large Language Models (LLMs) Guided Instruction
Scores: [ 6 8 8 6 ]
Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, our approach then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction compared to a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with few-shot in-context learning prompt marks multiple generated completions as exact/near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: language model knowledge neuron model editing formal and function competence syntax fact
Scores: [ 8 8 6 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: text similarity text embedding metric learning near-duplicate detection dataset deduplication
Scores: [ 6 8 8 6 ]
This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. Additionally, we introduce the \wa benchmark (Wiki-40B 4dversarial Near-T3xt Dataset), enabling the evaluation of models on typo-laden near-duplicate text retrieval in a multilingual setting. \rs and the \wa are open-sourced under the Apache 2 license at https://github.com/[anonymized].
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Large Language Models Expert-level Prompt Optimization Strategic Planning
Scores: [ 6 3 8 6 ]
Expert-level prompts, carefully engineered by human experts who have a deep understanding of both large language models (LLMs) and domain knowledge, are the future of prompting and pivotal to harnessing the full power of advanced LLMs. Discovering such prompts with an automated process remains a sought-after and unresolved challenge. Existing prompt optimization techniques, though automated through iterative sampling, often fall short in injecting domain knowledge and exploring the vast prompt space for complex expert-level prompts efficiently. To address this pressing need and achieve expert-level prompting, we introduce PromptAgent, which autonomously discovers prompts equivalent in quality to those handcrafted by experts. At its core, PromptAgent views prompt optimization as a strategic planning problem and employs a principled planning algorithm (rooted in Monte Carlo Tree Search) to strategically explore the vast expert-level prompt space. PromptAgent interacts with the LLM in a human-like trial-and-error manner during the planning, and injects expert-level knowledge by reflecting on model errors and generating insightful error feedback. This novel formulation allows it to iteratively evaluate intermediate prompts, refine them based on errors, simulate future rewards, and search for high-reward paths leading to expert-level prompts. We apply PromptAgent to 12 tasks spanning three practical domains: BIG-Bench Hard (BBH), domain-expert, and general NLU tasks, showing PromptAgent consistently outperforms strong prompting and prompt optimization baselines by great margins. Our qualitative analysis further emphasizes PromptAgent's capability to distill insightful errors into expert-level prompts.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: large language models vision language models
Scores: [ 6 6 5 5 ]
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed.We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can possess numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts.Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Unsupervised Fact Verification Unsupervised Learning Self-supervised Learning Deep Features Contrastive Learning Large Language Models Knowledge Distillation FEVER Fact Verification
Scores: [ 6 6 5 8 5 ]
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Multimodal large language models speech and audio processing music processing
Scores: [ 6 8 6 ]
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken question answering, audio-based storytelling, and speech audio co-reasoning etc. The presence of the cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities of SALMONN. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo of SALMONN is available at https://github.com/the-anonymous-account/SALMONN, and the training code and model checkpoints will be released upon acceptance.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Autonomous Driving Driving Topology Understanding
Scores: [ 8 6 6 6 ]
Topology reasoning aims to comprehensively understand road scenes and present drivable routes in autonomous driving. It requires detecting road centerlines (lane) and traffic elements, further reasoning their topology relationship, \textit{i.e.}, lane-lane topology, and lane-traffic topology. In this work, we first present that the topology score relies heavily on detection performance on lane and traffic elements. Therefore, we introduce a powerful 3D lane detector and an improved 2D traffic element detector to extend the upper limit of topology performance. Further, we propose TopoMLP, a simple yet high-performance pipeline for driving topology reasoning. Based on the impressive detection performance, we develop two simple MLP-based heads for topology generation. TopoMLP achieves state-of-the-art performance on OpenLane-V2 benchmark, \textit{i.e.}, 41.2% OLS with ResNet-50 backbone. It is also the 1st solution for 1st OpenLane Topology in Autonomous Driving Challenge. We hope such simple and strong pipeline can provide some new insights to the community. Code will be made public.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Multimodal Large Language Model Visual Tokenizer
Scores: [ 8 6 5 ]
The great success of Large Language Models (LLMs) has expanded the potential of multimodality, contributing to the gradual evolution of General Artificial Intelligence (AGI). A true AGI agent should not only possess the capability to perform predefined multi-tasks but also exhibit emergent abilities in an open-world context. However, despite the considerable advancements made by recent multimodal LLMs, they still fall short in effectively unifying comprehension and generation tasks, let alone open-world emergent abilities. We contend that the key to overcoming the present impasse lies in enabling text and images to be represented and processed interchangeably within a unified autoregressive Transformer. To this end, we introduce SEED, an elaborate image tokenizer that empowers LLMs with the ability to SEE and $\textbf{D}$raw at the same time. We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. With SEED tokens, LLM is able to perform scalable multimodal autoregression under its original training recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by large-scale pretraining and instruction tuning on the interleaved textual and visual data, demonstrating impressive performance on a broad range of multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has exhibited compositional emergent abilities such as multi-turn in-context multimodal generation, acting like your AI assistant. The code and models will be publicly released.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: pruning efficiency large language models pre-training
Scores: [ 6 5 8 5 ]
The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA models, on a wide range of downstream and instruction tuning evaluations, while requiring less than 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building smaller LLMs.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Multilingual Federated Learning Natural Language Processing
Scores: [ 8 1 3 5 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large language models detecting pretraining data
Scores: [ 5 6 8 6 ]
Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method MIN-K PROB based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. MIN-K PROB can be applied without any knowledge about the pretrainig corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that MIN-K PROB achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply MIN-K PROB to two real-world scenarios, copyrighted book detection and contaminated downstream example detection, and find that it to be a consistently effective solution.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Noisy label detection Poisoned sample detection data-centric AI multimodal large language model
Scores: [ 6 6 6 ]
The role of data in building AI systems has recently been emphasized by the emerging concept of data-centric AI. Unfortunately, in the real-world, datasets may contain dirty samples, such as poisoned samples from backdoor attack, noisy labels in crowdsourcing, and even hybrids of them. The presence of such dirty samples makes the DNNs vunerable and unreliable. Hence, it is critical to detect dirty samples to improve the quality and realiability of dataset. Existing detectors only focus on detecting poisoned samples or noisy labels, that are often prone to weak generalization when dealing with dirty samples from other fields. In this paper, we find a commonality of various dirty samples is visual-linguistic inconsistency between images and associated labels. To capture the semantic inconsistency between modalities, we propose versatile data cleanser (VDC) leveraging the surpassing capabilities of multimodal large language models (MLLM) in cross-modal alignment and reasoning. It consists of three consecutive modules: the visual question generation module to generate insightful questions about the image; the visual question answering module to acquire the semantics of the visual content by answering the questions with MLLM; followed by the visual answer evaluation module to evaluate the inconsistency. Extensive experiments demonstrate its superior performance and generalization to various categories and types of dirty samples.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: hallucination large language models synthetic data prefix tuning safety
Scores: [ 6 6 6 ]
Large language models (LLMs) frequently hallucinate on abstractive summarization tasks such as document-based question-answering, meeting summarization, and clinical report generation, even though all necessary information is included in context. However, optimizing to make LLMs hallucinate less is challenging, as hallucination is hard to efficiently, cheaply, and reliably evaluate at each optimization step. In this work, we show that reducing hallucination on a synthetic task can also reduce hallucination on real-world downstream tasks. Our method, SynTra, first designs a synthetic task where hallucinations are easy to elicit and measure. It next optimizes the LLM's system message via prefix tuning on the synthetic task, then uses the system message on realistic, hard-to-optimize tasks. Across three realistic abstractive summarization tasks, we reduce hallucination for two 13B-parameter LLMs using supervision signal from only a synthetic retrieval task. We also find that optimizing the system message rather than the model weights can be critical; fine-tuning the entire model on the synthetic task can counterintuitively increase hallucination. Overall, SynTra demonstrates that the extra flexibility of working with synthetic data can help mitigate undesired behaviors in practice.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fairness DRO Adversarial Learning
Scores: [ 6 8 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: differential privacy privacy amplification matrix mechanism
Scores: [ 8 6 8 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: OOD detection
Scores: [ 8 8 6 8 ]
Out-of-distribution (OOD) detection aims at identifying samples from unknown classes, playing a crucial role in trustworthy models against errors on unexpected inputs. There is extensive research dedicated to exploring OOD detection in the vision modality. Vision-language models (VLMs) can leverage both textual and visual information for various multi-modal applications, whereas few OOD detection methods take into account information from the text modality. In this paper, we propose a novel post hoc OOD detection method, called NegLabel, which takes a vast number of negative labels from extensive corpus databases. We design a novel scheme for the OOD score collaborated with negative labels.Theoretical analysis helps to understand the mechanism of negative labels. Extensive experiments demonstrate that our method NegLabel achieves state-of-the-art performance on various OOD detection benchmarks and generalizes well on multiple VLM architectures. Furthermore, our method NegLabel exhibits remarkable robustness against diverse domain shifts.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large Langauge Model In-context Learning Differential Privacy
Scores: [ 8 6 6 6 ]
In-context learning (ICL) is an important capability of Large Language Models (LLMs), enabling these models to dynamically adapt based on specific, in-context exemplars, thereby improving accuracy and relevance.However, LLM's responses may leak the sensitive private information contained in in-context exemplars. To address this challenge, we propose Differentially Private In-context Learning (DP-ICL), a general paradigm for privatizing ICL tasks. The key idea for DP-ICL paradigm is generating differentially private responses through a noisy consensus among an ensemble of LLM's responses based on disjoint exemplar sets. Based on the general paradigm of DP-ICL, we instantiate several techniques showing how to privatize ICL for text classification and language generation. We experiment on four text classification benchmarks and two language generation tasks, and our empirical findings suggest that our DP-ICL achieves a strong utility-privacy tradeoff.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large language models bias evaluation gender bias gender co-occurring words gender-invariant pretraining data statistics
Scores: [ 5 6 6 5 ]
In the large language models era, it is imperative to measure and understand how gender biases present in the training data influence model behavior.Previous works construct benchmarks around known stereotypes (e.g., occupations) and demonstrate high levels of gender bias in large language models, raising serious concerns about models exhibiting undesirable behaviors.We expand on existing literature by asking the question: \textit{Do large language models still favor one gender over the other in non-stereotypical settings?}To tackle this question, we restrict language model evaluation to a \textit{neutral} subset, in which sentences are free of pronounced word-gender associations. After characterizing these associations in terms of pretraining data statistics,we use them to (1) create a new benchmark with low gender-word associations, and (2) repurpose popular benchmarks in the gendered pronoun setting | WinoBias and \Winogender |, removing pronounced gender-correlated words.Surprisingly, when testing 20+ models (e.g., Llama-2, Pythia, and OPT) in the proposed benchmarks, we still detect critically high gender bias across all tested models. For instance, after adjusting for strong word-gender associations, we find that all models still exhibit clear gender preferences in about 60%-95% of the sentences, representing a small change (up to 5%) from the original \textit{stereotypical} setting.By demonstrating that measured bias is not necessarily due to the presence of highly gender-associated words, our work highlights important questions about bias evaluation as well as potentially underlying model biases.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: adversarial attacks adversarial defense black-box attacks
Scores: [ 3 6 6 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Deep Neural Networks Backdoor Attacks Poisoning Efficiency.
Scores: [ 6 8 3 6 ]
Recent deep neural networks (DNNs) have come to rely on vast amounts of training data, providing an opportunity for malicious attackers to exploit and contaminate the data to carry out backdoor attacks. However, existing backdoor attack methods make unrealistic assumptions, assuming that all training data comes from a single source and that attackers have full access to the training data. In this paper, we introduce a more realistic attack scenario where victims collect data from multiple sources, and attackers cannot access the complete training data. We refer to this scenario as data-constrained backdoor attacks. In such cases, previous attack methods suffer from severe efficiency degradation due to the entanglement between benign and poisoning features during the backdoor injection process. To tackle this problem, we introduce three CLIP-based technologies from two distinct streams: Clean Feature Suppression and Poisoning Feature Augmentation. The results demonstrate remarkable improvements, with some settings achieving over 100% improvement compared to existing attacks in data-constrained scenarios. Our research contributes to addressing the limitations of existing methods and provides a practical and effective solution for data-constrained backdoor attacks.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Strategic Classification Performative Prediction Labor Market
Scores: [ 6 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: multilingual safety large language models
Scores: [ 6 6 6 6 8 ]
While large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the jailbreak'' problem. Although several preventive measures have been developed to mitigate the potential risks associated with LLMs, they have primarily focused on English data. In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potential risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario entails malicious users combining jailbreak instructions with multilingual prompts to attack LLMs deliberately. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of languages decreases. Specifically, low-resource languages exhibit three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts can exacerbate the negative impact of jailbreak instructions, with astonishingly high rates of unsafe output: 80.92% for ChatGPT and 40.71% for GPT-4. Finally, we propose a novel \textsc{Self-Defense} framework that addresses the multilingual jailbreak challenges via automatically generating multilingual safety training data for fine-tuning. Experiment results demonstrate its effectiveness with notable reduction in unsafe rate.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Adversarial patches active defense embodied intelligence
Scores: [ 8 6 5 8 ]
The vulnerability of deep neural networks to adversarial patches has motivated numerous defense strategies for boosting model robustness. However, the prevailing defenses depend on single observation or pre-established adversary information to counter adversarial patches, often failing to be confronted with unseen or adaptive adversarial attacks and easily exhibiting unsatisfying performance in dynamic 3D environments. Inspired by active human perception and recurrent feedback mechanisms, we develop Embodied Active Defense (EAD), a proactive defensive strategy that actively contextualizes environmental information to address misaligned adversarial patches in 3D real-world settings. To achieve this, EAD develops two central recurrent sub-modules, i.e., a perception module and a policy module, to implement two critical functions of active vision. These models recurrently process a series of beliefs and observations, facilitating progressive refinement of their comprehension of the target object and enabling the development of strategic actions to counter adversarial patches in 3D environments. To optimize learning efficiency, we incorporate a differentiable approximation of environmental dynamics and deploy patches that are agnostic to the adversary’s strategies. Extensive experiments demonstrate that EAD substantially enhances robustness against a variety of patches within just a few steps through its action policy in safety-critical tasks (e.g., face recognition and object detection), without compromising standard accuracy. Furthermore, due to the attack-agnostic characteristic, EAD facilitates excellent generalization to unseen attacks, diminishing the averaged attack success rate by 95% across a range of unseen adversarial attacks.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: out-of-distribution detection prototypical learning hyperspherical embedding
Scores: [ 6 6 6 5 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Differential Privacy Error-feedback Differentially private training
Scores: [ 6 6 6 6 ]
Differentially Private Stochastic Gradient Descent with gradient clipping (DPSGD-GC) is a powerful tool for training deep learning models using sensitive data, providing both a solid theoretical privacy guarantee and high efficiency. However, existing research has shown that DPSGD-GC only converges when using large clipping thresholds that are dependent on problem-specific parameters that are often unknown in practice. Therefore, DPSGD-GC suffers from degraded performance due to the {\it constant} bias introduced by the clipping. In our work, we propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC, which offers a diminishing utility bound without inducing a constant clipping bias. More importantly, it allows for an arbitrary choice of clipping threshold that is independent of the problem. We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on R{'e}nyi DP. And we demonstrate that under mild conditions, our algorithm can achieve nearly the same utility bound as DPSGD without gradient clipping. Our empirical results on standard datasets show that the proposed algorithm achieves higher accuracies than DPSGD while maintaining the same level of DP guarantee.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: AI safety language models sycophancy human feedback RLHF
Scores: [ 6 6 8 6 ]
Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgments are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy behavior across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: inversion language models prompt engineering security
Scores: [ 5 6 5 6 ]
Given a prompt, language models produce a distribution over all possible next tokens; when the prompt is unknown, can we use this distributional information to recover the prompt? We consider the problem of anguage model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search and reconstruction of the input. On LLAMA-7B, our inversion method reconstructs prompts with a BLEU of 59 and token-level F1 of 77 and recovers 23% of prompts exactly
Primary Area: societal considerations including fairness, safety, privacy
Keywords: classification interpretability explainability concept bottleneck models uncertainty selective classification
Scores: [ 6 8 8 6 ]
Machine learning models are often used to automate routine tasks. In settings where mistakes are costly, we can trade off accuracy for coverage by abstaining from making a prediction on instances for which the model is uncertain. In this work, we present a new approach to selective classification in deep learning with concepts. Our approach constructs a concept bottleneck model where the front-end model can make predictions given soft concepts and leverage concept confirmation to improve coverage and performance under abstention. We develop techniques to propagate uncertainty and identify concepts for confirmation. We evaluate our approach on real-world and synthetic datasets, showing that it can improve coverage while maintaining performance across a range of tasks.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Alignment LLM
Scores: [ 6 8 6 8 ]
Aligning large language models with human objectives is paramount, yet common approaches including RLHF suffer from unstable and resource-intensive training. In response to this challenge, we introduce ARGS, Alignment as Reward-Guided Search, a novel framework that integrates alignment into the decoding process, eliminating the need for expensive RL training. By adjusting the model’s probabilistic predictions using a reward signal, ARGS generates texts with semantic diversity while being aligned with human preferences, offering a promising and flexible solution for aligning language models. Notably, our method demonstrates consistent enhancements in average reward compared to baselines across diverse alignment tasks and various model dimensions. For example, under the same greedy-based decoding strategy, our method improves the average reward by 19.56% relative to the baseline and secures a preference or tie score of 64.33% in GPT-4 evaluation. We believe that our framework, emphasizing test-time alignment, paves the way for more responsive language models in the future.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Vertical federated learning Contribution valuation Fairness
Scores: [ 6 8 6 6 ]
Federated learning is an emerging technology for training machine learning models across decentralized data sources without sharing data. Vertical federated learning, also known as feature-based federated learning, applies to scenarios where data sources have the same sample IDs but different feature sets. To ensure fairness among data owners, it is critical to objectively assess the contributions from different data sources and compensate the corresponding data owners accordingly. The Shapley value is a provably fair contribution valuation metric originating from cooperative game theory. However, its straight-forward computation requires extensively retraining a model on each potential combination of data sources, leading to prohibitively high communication and computation overheads due to multiple rounds of federated learning. To tackle this challenge, we propose a contribution valuation metric called vertical federated Shapley value (VerFedSV) based on the classic Shapley value. We show that VerFedSV not only satisfies many desirable properties of fairness but is also efficient to compute. Moreover, VerFedSV can be adapted to both synchronous and asynchronous vertical federated learning algorithms. Both theoretical analysis and extensive experimental results demonstrate the fairness, efficiency, adaptability, and effectiveness of VerFedSV.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Algorithmic Fairness Distributionally Robust Optimization Distribution Shift Fairness f-divergences
Scores: [ 5 6 8 5 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: label recovery gradient inversion attacks
Scores: [ 8 6 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Backdoor Input Detection Backdoor Defense AI Security Deep Learning
Scores: [ 6 8 5 6 ]
We present a novel defense, against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward --- finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, and thus resulting in a model~(dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 17 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Misclassification detection Uncertainty estimation Trustworthy AI Safety
Scores: [ 6 6 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: MIA Diffusion Model Text-To-Speech
Scores: [ 8 6 8 8 ]
Recently, diffusion models have achieved remarkable success in generating tasks, including image and audio generation. However, like other generative models, diffusion models are prone to privacy issues. In this paper, we propose an efficient query-based membership inference attack (MIA), namely Proximal Initialization Attack (PIA), which utilizes groundtruth trajectory obtained by ϵ initialized in t=0 and predicted point to infer memberships. Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models. Moreover, previous works on the privacy of diffusion models have focused on vision tasks without considering audio tasks. Therefore, we also explore the robustness of diffusion models to MIA in the text-to-speech (TTS) task, which is an audio generation task. To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the TTS task. Experimental results indicate that models with mel-spectrogram (image-like) output are vulnerable to MIA, while models with audio output are relatively robust to MIA.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Benchmarks Fairness Generalization
Scores: [ 6 8 5 6 ]
For more than a decade, researchers have measured progress in object recognition on the ImageNet dataset along with its associated generalization benchmarks such as ImageNet-A, -C, and -R. Recent advances in foundation models, trained on orders of magnitude more data, have begun to saturate performance on these benchmarks. Despite this progress, even today’s best models are brittle in practice. As a step toward more holistic measurement of model reliability, we propose studying performance on crowdsourced, global datasets, which contain natural distribution shifts seen practically in deployment. We perform a comprehensive empirical study on two crowdsourced, globally representative datasets, evaluating nearly 100 vision models to uncover several concerning empirical trends: first, that progress on crowdsourced, global data has significantly lagged behind standard benchmarks, with advances on ImageNet occurring at 2.5x the rate of progress on crowdsourced, global data. Second, we find that progress on standard benchmarks has failed to improve or exacerbated geographic disparities: \textit{geographic disparities between the least performant models and today's best models have more than tripled}. We showcase the promise of using more curated and/or representative training datasets for mitigating these trends, and emphasize curation of web-scale, geographically representative training datasets as a critical open problem for the research community.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: poisoning adversarial machine learning group robustness
Scores: [ 8 5 6 6 ]
Group robustness has become a major concern in machine learning (ML) as conventional training paradigms were found to produce high error on minority groups. Without explicit group annotations, proposed solutions rely on heuristics that aim to identify and then amplify the minority samples during training. In our work, we first uncover a critical shortcoming of these methods: an inability to distinguish legitimate minority samples from poison samples in the training set. By amplifying poison samples as well, group robustness methods inadvertently boost the success rate of an adversary---e.g., from 0% without amplification to over 97% with it. Notably, we supplement our empirical evidence with an impossibility result proving this inability of a standard heuristic under some assumptions. Moreover, scrutinizing recent poisoning defenses both in centralized and federated learning, we observe that they rely on similar heuristics to identify which samples should be eliminated as poisons. In consequence, minority samples are eliminated along with poisons, which damages group robustness---e.g., from 55% without the removal of the minority samples to 41% with it. Finally, as they pursue opposing goals using similar heuristics, our attempt to alleviate the trade-off by combining group robustness methods and poisoning defenses falls short. By exposing this tension, we also hope to highlight how benchmark-driven ML scholarship can obscure the trade-offs among different metrics with potentially detrimental consequences.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large Language Model Alignment Attack
Scores: [ 8 8 6 6 ]
The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the \emph{generation exploitation} attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0 to more than 95 across 11 language models including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with 30× lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: LLMs machine learning memorization privacy data poisoning federated learning large language models privacy risks
Scores: [ 5 8 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Medical Segmentation Medical Imaging Fairness Learning Health Equity Deep Learning Trustworthy AI
Scores: [ 6 8 6 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Underrepresentation Self-supervised learning Contrastive learning
Scores: [ 8 6 5 ]
The effect of underrepresentation on the performance of minority groups is known to be a serious problem in supervised learning settings; however, it has been underexplored so far in the context of self-supervised learning (SSL). In this paper, we demonstrate that contrastive learning (CL), a popular variant of SSL, tends to collapse representations of minority groups with certain majority groups. We refer to this phenomenon as representation harm and demonstrate it on image and text datasets using the corresponding popular CL methods. Furthermore, our causal mediation analysis of allocation harm on a downstream classification task reveals that representation harm is partly responsible for it, thus emphasizing the importance of studying and mitigating representation harm. Finally, we provide a theoretical explanation for representation harm using a stochastic block model that leads to a representational neural collapse in a contrastive learning setting.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Adversarial Attacks Algorithmic Fairness Graph Neural Networks
Scores: [ 6 6 6 ]
Fairness-aware graph neural networks (GNNs) have gained a surge of attention as they can reduce the bias of predictions on any demographic group (e.g., female) in graph-based applications. Although these methods greatly improve the algorithmic fairness of GNNs, the fairness can be easily corrupted by carefully designed adversarial attacks. In this paper, we investigate the problem of adversarial attacks on fairness of GNNs and propose G-FairAttack, a general framework for attacking various types of fairness-aware GNNs in terms of fairness with an unnoticeable effect on prediction utility. In addition, we propose a fast computation technique to reduce the time complexity of G-FairAttack. The experimental study demonstrates that G-FairAttack successfully corrupts the fairness of different types of GNNs while keeping the attack unnoticeable. Our study on fairness attacks sheds light on potential vulnerabilities in fairness-aware GNNs and guides further research on the robustness of GNNs in terms of fairness.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: federated learning
Scores: [ 8 5 5 6 ]
Federated learning has emerged as a promising distributed learning paradigm that facilitates collaborative learning among multiple parties without transferring raw data. However, most existing federated learning studies focus on either horizontal or vertical data settings, where the data of different parties are assumed to be from the same feature or sample space. In practice, a common scenario is the hybrid data setting, where data from different parties may differ both in the features and samples. To address this, we propose HybridTree, a novel federated learning approach that enables federated tree learning on hybrid data. We observe the existence of consistent split rules in trees. With the help of these split rules, we theoretically show that the knowledge of parties can be incorporated into the lower layers of a tree. Based on our theoretical analysis, we propose a layer-level solution that does not need frequent communication traffic to train a tree. Our experiments demonstrate that HybridTree can achieve comparable accuracy to the centralized setting with low computational and communication overhead. HybridTree can achieve up to 8 times speedup compared with the other baselines.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Adversarial attacks Vision encoders Jailbreak Prompt Injection Security Embedding space attacks Black box LLM Vision-Language Models Multi-Modal Models VLM Alignment Cross-Modality alignment
Scores: [ 8 6 6 5 ]
We introduce new jailbreak attacks on vision language models (VLMs), which use aligned LLMs and are resilient to text-only jailbreak attacks. Specifically, we develop cross-modality attacks on alignment where we pair adversarial images going through the vision encoder with textual prompts to break the alignment of the language model. Our attacks employ a novel compositional strategy that combines an image, adversarially targeted towards toxic embeddings, with generic prompts to accomplish the jailbreak. Thus, the LLM draws the context to answer the generic prompt from the adversarial image. The generation of benign-appearing adversarial images leverages a novel embedding-space-based methodology, operating with no access to the LLM model. Instead, the attacks require access only to the vision encoder and utilize one of our four embedding space targeting strategies. By not requiring access to the LLM, the attacks lower the entry barrier for attackers, particularly when vision encoders such as CLIP are embedded in closed-source LLMs. The attacks achieve a high success rate across different VLMs, highlighting the risk of cross-modality alignment vulnerabilities, and the need for new alignment approaches for multi-modal models.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large language models Alignment
Scores: [ 5 6 6 8 5 ]
Large language models (LLMs) often demonstrate inconsistencies with human preferences. Previous research typically gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, a.k.a. the finetuning step. In contrast, aligning frozen LLMs without requiring alignment data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide rewind and generation for AI safety. Notably, RAIN operates without the need of extra data for model alignment and abstains from any training, gradient computation, or parameter updates. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B from 82% of vanilla inference to 97%, while maintaining the helpfulness rate. On the TruthfulQA dataset, RAIN improves the truthfulness of the already-well-aligned LLaMA-2-chat 13B model by 5%.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: backdoor attack unsupervised contrastive learning
Scores: [ 6 6 5 6 ]
Contrastive Learning (CL) has attracted enormous attention due to its remarkable capability in unsupervised representation learning. However, recent works have revealed the vulnerability of CL to backdoor attacks: the feature extractor could be misled to embed backdoored data close to an attack target class, thus fooling the downstream predictor to misclassify it as the target. Existing attacks usually adopt a fixed trigger pattern and poison the training set with trigger-injected data, hoping for the feature extractor to learn the association between trigger and target class. However, we find that such fixed trigger design fails to effectively associate trigger-injected data with target class in the embedding space due to special CL mechanisms, leading to a limited attack success rate (ASR). This phenomenon motivates us to find a better backdoor trigger design tailored for CL framework. In this paper, we propose a bi-level optimization approach to achieve this goal, where the inner optimization simulates the CL dynamics of a surrogate victim, and the outer optimization enforces the backdoor trigger to stay close to the target throughout the surrogate CL procedure. Extensive experiments show that our attack can achieve a higher attack success rate (e.g., 99% ASR on ImageNet-100) with a very low poisoning rate (1%). Besides, our attack can effectively evade existing state-of-the-art defenses.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: differential privacy large language models in-context learning
Scores: [ 8 8 8 8 ]
We study the problem of in-context learning (ICL) with large language models (LLMs) on private datasets. This scenario poses privacy risks, as LLMs may leak or regurgitate the private examples demonstrated in the prompt.We propose a novel algorithm that generates synthetic few-shot demonstrations from the private dataset with formal differential privacy (DP) guarantees, and show empirically that it can achieve effective ICL.We conduct extensive experiments on standard benchmarks and compare our algorithm with non-private ICL and zero-shot solutions. Our results demonstrate that our algorithm can achieve competitive performance with strong privacy levels.These results open up new possibilities for ICL with privacy protection for a broad range of applications.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Vision Language Model Adversarial Transferability Prompt Tuning
Scores: [ 6 8 6 8 6 ]
Different from traditional task-specific vision models, recent large VLMs can readily adapt to different vision tasks by simply using different textual instructions, i.e., prompts. However, a well-known concern about traditional task-specific vision models is that they can be misled by imperceptible adversarial perturbations. Furthermore, the concern is exacerbated by the phenomenon that the same adversarial perturbations can fool different task-specific models. Given that VLMs rely on prompts to adapt to different tasks, an intriguing question emerges: Can a single adversarial image mislead all predictions of VLMs when a thousand different prompts are given? This question essentially introduces a novel perspective on adversarial transferability: cross-prompt adversarial transferability. In this work, we propose the Cross-Prompt Attack (CroPA). This proposed method updates the visual adversarial perturbation with learnable textual prompts, which are designed to counteract the misleading effects of the adversarial image. By doing this, CroPA significantly improves the transferability of adversarial examples across prompts. Extensive experiments are conducted to verify the strong cross-prompt adversarial transferability of CroPA with prevalent VLMs including Flamingo, BLIP-2, and InstructBLIP in various different tasks.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: adversarial patch attack transferability
Scores: [ 8 6 6 6 5 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Federated learning Model Inversion Attack Privacy-Preserving
Scores: [ 8 6 8 6 ]
Federated Learning (FL) is a distributed machine learning technique where multiple devices (such as smartphones or IoT devices) train a shared global model by using their local data. FL claims that the data privacy of local participants is preserved well because local data will not be shared with either the server-side or other training participants. However, this paper discovers a pioneering finding that a model inversion (MI) attacker, who acts as a benign participant, can invert the shared global model and obtain the data belonging to other participants. This will lead to severe data-leakage risk in FL because it is difficult to identify attackers from benign participants.In addition, we found even the most advanced defense approaches could not effectively address this issue. Therefore, it is important to evaluate such data-leakage risks of an FL system before using it. To alleviate this issue, we propose FedInverse to evaluate whether the FL global model can be inverted by MI attackers. In particular, FedInverse can be optimized by leveraging the Hilbert-Schmidt independence criterion (HSIC) as a regularizer to adjust the diversity of the MI attack generator. We test FedInverse with three typical MI attackers, GMI, KED-MI, and VMI, and the experiments show our FedInverse method can successfully obtain the data belonging to other participants.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Verified Training Neural Network Verification Verified Adversarial Robustness
Scores: [ 8 8 6 5 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Backdoor attack Generalization Frequency domain CNN
Scores: [ 6 5 6 6 ]
Convolutional neural network (CNN) is easily affected by backdoor injections, whose models perform normally on clean samples but produce specific outputs on poisoned ones. Most of the existing studies have focused on the effect of trigger feature changes of poisoned samples on model generalization in spatial domain. We focus on the mechanism of CNN memorize poisoned samples in frequency domain, and find that CNN generate generalization to poisoned samples by memorizing the frequency domain distribution of trigger changes. We also explore the influence of trigger perturbations in different frequency domain components on the generalization of poisoned models from visible and invisible backdoor attacks, and prove that high-frequency components are more susceptible to perturbations than low-frequency components. Based on the above fundings, we propose a universal invisible strategy for visible triggers, which can achieve trigger invisibility while maintaining raw attack performance. We also design a novel frequency domain backdoor attack method based on low-frequency semantic information, which can achieve 100% attack accuracy on multiple models and multiple datasets, and can bypass multiple defenses.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: adversarial training adversarial robustness two-faced attacks
Scores: [ 6 8 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Vertical Federated Learning Adversarial Examples Adaptive Client Corruption Multi-armed Bandit
Scores: [ 5 8 5 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Ethical values Large Language Models Alignment
Scores: [ 5 5 10 6 ]
Large Language Models (LLMs) have made unprecedented breakthroughs, yet their increasing integration into everyday life might raise societal risks due to generated unethical content. Despite extensive study on specific issues like bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work delves into ethical values utilizing Moral Foundation Theory. Moving beyond conventional discriminative evaluations with poor reliability, we propose DeNEVIL, a novel prompt generation algorithm tailored to dynamically exploit LLMs’ value vulnerabilities and elicit the violation of ethics in a generative manner, revealing their underlying value inclinations. On such a basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values across a spectrum of LLMs. We discovered that most models are essentially misaligned, necessitating further ethical value alignment. In response, we develop VILMO, an in-context alignment method that substantially enhances the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming existing competitors. Our methods are suitable for black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: redteaming stable diffusion text-to-image model
Scores: [ 6 8 5 5 ]
Diffusion models for text-to-image (T2I) synthesis, such as Stable Diffusion (SD), have recently demonstrated exceptional capabilities for generating high-quality content. However, this progress has raised several concerns of potential misuse, particularly in creating copyrighted, prohibited, and restricted content, or NSFW (not safe for work) images. While efforts have been made to mitigate such problems, either by implementing a safety filter at the evaluation stage or by fine-tuning models to eliminate undesirable concepts or styles, the effectiveness of these safety measures in dealing with a wide range of prompts remains largely unexplored. In this work, we aim to investigate these safety mechanisms by proposing one novel concept retrieval algorithm for evaluation. We introduce Ring-A-Bell, a model-agnostic red-teaming scheme for T2I diffusion models, where the whole evaluation can be prepared in advance without prior knowledge of the target model.Specifically, Ring-A-Bell first performs concept extraction to obtain holistic representations for sensitive and inappropriate concepts. Subsequently, by leveraging the extracted concept, Ring-A-Bell automatically identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content, allowing the user to assess the reliability of deployed safety mechanisms. Finally, we empirically validate our method by testing online services such as Midjourney and various methods of concept removal. Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms, thus revealing the defects of the so-called safety mechanisms which could practically lead to the generation of harmful contents. In essence, Ring-A-Bell could serve as a red-teaming tool to understand the limitations of deployed safety mechanisms and to explore the risk under plausible attacks.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Uncertainty Face Recognition Performance ROC Fairness Bootstrap
Scores: [ 6 5 8 ]
The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impact such as Face Recognition. In this article, we prove asymptotic guarantees for empirical ROC curves of similarity functions as well as for by-product metrics useful to assess fairness. We also explain that, because the false acceptance/rejection rates are of the form of U-statistics in the case of similarity scoring, the naive bootstrap approach may jeopardize the assessment procedure. A dedicated recentering technique must be used instead. Beyond the theoretical analysis carried out, various experiments using real face image datasets provide strong empirical evidence of the practical relevance of the methods promoted here, when applied to several ROC-based measures such as popular fairness metrics.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Privacy Federated Learning Gradient Leakage
Scores: [ 5 8 6 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Privacy Large Language Models
Scores: [ 8 8 8 6 6 ]
Current privacy research on large language models (LLMs) primarily focuses on the issue of extracting memorized training data. At the same time, models’ inference capabilities have increased drastically. This raises the key question of whether current LLMs could violate individuals’ privacy by inferring personal attributes from text given at inference time. In this work, we present the first comprehensive study on the capabilities of pretrained LLMs to infer personal attributes from text. We construct a dataset consisting of real Reddit profiles, and show that current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to 85% top-1 and 95.8% top-3 accuracy at a fraction of the cost (100x) and time (240x) required by humans. As people increasingly interact with LLM-powered chatbots across all aspects of life, we also explore the emerging threat of privacy-invasive chatbots trying to extract personal information through seemingly benign questions. Finally, we show that common mitigations, i.e., text anonymization and model alignment, are currently ineffective at protecting user privacy against LLM inference. Our findings highlight that current LLMs can infer personal data at a previously unattainable scale. In the absence of working defenses, we advocate for a broader discussion around LLM privacy implications beyond memorization, striving for stronger and wider privacy protection.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Bias Fairness LLM Reasoning Persona
Scores: [ 5 5 5 8 ]
Recent work has showcased the ability of large-scale language models (LLMs) to embody diverse personas in their responses, exemplified by prompts like "You are Julius Caesar. Compose a rap about Climate Change." However, it remains unclear how these persona assignments indirectly influence LLMs' core capabilities. We present the first extensive study of this in the context of LLMs' ability to perform basic reasoning. Our study encompasses 16 personas spanning 5 diverse groups (race, gender, religion, disability, and political affiliation), across 24 reasoning datasets in diverse domains such as mathematics, history, law, ethics, and more. Our findings unveil that while LLMs, such as ChatGPT, overtly reject stereotypes when explicitly asked ("Are Black people inept at mathematics?"), they tend to manifest implicit stereotypical and often erroneous presumptions when prompted to take on a persona (e.g., abstentions in rationales such as "As a Black person, I am unable to answer this question as it requires math knowledge"). This results in substantial disparities in reasoning performance among personas. This inherent 'deep' bias permeates extensively, leading to a statistically significant performance drop in over 95% of our datasets for certain personas, with as much as 70% relative drop in accuracy on select datasets. Beyond explicit abstentions, these models also have implicitly biased reasoning not evident in their responses. We find that simple prompt-based mitigation approaches have minimal impact. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs---a trend on the rise---can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Calibration Test-time adaptation Foundation model Prompt tuning
Scores: [ 6 6 6 6 ]
In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration—a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code will be publicly available.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fairness Classification with abstention
Scores: [ 6 6 8 6 ]
In critical applications, it is vital for classifiers to defer decision-making to humans. We propose a post-hoc method that makes existing classifiers selectively abstain from predicting certain samples. Our abstaining classifier is incentivized to maintain the original accuracy for each sub-population (i.e. no harm) while achieving a set of group fairness definitions to a user specified degree. To this end, we design an Integer Programming (IP) procedure that assigns abstention decisions for each training sample to satisfy a set of constraints. To generalize the abstaining decisions to test samples, we then train a surrogate model to learn the abstaining decisions based on the IP solutions in an end-to-end manner. We analyze the feasibility of the IP procedure to determine the possible abstention rate for different levels of unfairness tolerance and accuracy constraint for achieving no harm. To the best of our knowledge, this work is the first to identify the theoretical relationships between the constraint parameters and the required abstention rate. Our theoretical results are important since a high abstention rate is often infeasible in practice due to a lack of human resources. Our framework outperforms existing methods in terms of fairness disparity without sacrificing accuracy at similar abstention rates.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: fairness algorithmic fairness social computing tabular data meta study
Scores: [ 8 6 8 6 ]
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: robustness distribution shift reliable machine learning
Scores: [ 6 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fairness Transformer Attention Without demographics
Scores: [ 6 6 5 6 ]
Although transformers demonstrate impressive capabilities in a variety of tasks, the fairness issue remains a significant concern when deploying these models. Existing works to address fairness issues in transformers require sensitive labels (such as age, gender, etc.), which can raise privacy concerns or violate legal regulations. An alternative way is through fairness without demographics. However, existing works that improve Rawlsian Max-Min fairness may impose overly restrictive constraints. Other methods that use auxiliary networks could be parameter inefficient. In this paper, we present a new approach to debiasing transformers by leveraging their inherent structure. By reconsidering the roles of important components (queries, keys, and values) in the attention mechanism, we introduce a simple yet effective debiasing strategy from two perspectives: 1) Grounded in theoretical analysis, we normalize and apply absolute value operations to queries and keys to minimize the bias in attention weight allocation; 2) We reduce the bias within values through local alignment via contrastive learning. Throughout the entire process, our approach does not require any sensitive labels. Furthermore, to enhance memory efficiency in the training phase, we propose a strategy that debias only the last encoder to improve fairness in pre-trained models. We conduct experiments in computer vision and natural language processing tasks and show that our method is comparable and even outperforms the state-of-the-art method with substantially lower energy consumption.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Backdoor Trigger Dataset Condensation Dataset Distillation
Scores: [ 5 6 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fairness Alignment Diffusion Models Text-to-Image Generation
Scores: [ 6 10 6 ]
The rapid adoption of text-to-image (T2I) diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases propagate a distorted worldview and limit opportunities for minority groups. In this work, we frame fairness as a distributional alignment problem. We propose to end-to-end finetune diffusion models using a distributional alignment loss, steering specific characteristics of the generated images towards a user-defined target distribution. Empirically, our method markedly reduces gender, racial, and their intersectional biases for occupational prompts. Gender bias can be substantially mitigated even when finetuning merely five soft tokens. Acknowledging strict egalitarianism might not always be the desired outcome for fairness, we show that our method can flexibly control age to a 75 young and 25 old distribution while simultaneously debiasing gender and race. Finally, our method is scalable: it can debias multiple concepts at once, such as occupations, sports, and personal descriptors, by simply including these prompts in the finetuning data. We hope our work facilitates the advancement of social alignment for T2I generative AI. We will share code and various debiased diffusion model adaptors.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: differential privacy graph clustering semidefinite programming spectral clustering
Scores: [ 8 6 6 6 ]
We study differentially private (DP) algorithms for recovering clusters in well-clustered graphs, which are graphs whose vertex set can be partitioned into a small number of sets, each inducing a subgraph of high inner conductance and small outer conductance. Such graphs have widespread application as a benchmark in the theoretical analysis of spectral clustering.We provide an efficient (ϵ,δ)-DP algorithm tailored specifically for such graphs. Our algorithm draws inspiration from the recent work of Chen et al., who developed DP algorithms for recovery of stochastic block models in cases where the graph comprises exactly two nearly-balanced clusters. Our algorithm works for well-clustered graphs with k nearly-balanced clusters, and the misclassification ratio almost matches the one of the best-known non-private algorithms. We conduct experimental evaluations on datasets with known ground truth clusters to substantiate the prowess of our algorithm. We also show that any (pure) ϵ-DP algorithm would result in substantial error.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: AI Safety Large Language Models Fine-tuning Jailbreaking AI Alignment
Scores: [ 6 6 10 6 ]
Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open-source release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on customized datasets accelerate this trend. But, what are the safety costs associated with such customized fine-tuning? While existing safety alignment techniques restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing --- even if a model's initial safety alignment is impeccable, how can it be maintained after customized fine-tuning? We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the customized fine-tuning of aligned LLMs. (This paper contains red-teaming data and model-generated content that can be offensive in nature.)
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Certified Robustness Adversarial Robustness Neural Network Verification Certified Training
Scores: [ 5 5 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Machine unlearning generative model diffusion model weight saliency
Scores: [ 8 6 8 8 ]
With evolving data regulations, machine unlearning (MU) has become an important tool for fostering trust and safety in today’s AI models. However, existing MU methods focusing on data and/or weight perspectives often grapple with limitations in unlearning accuracy, stability, and cross-domain applicability. To address these challenges, we introduce the concept of ‘weight saliency’ in MU, drawing parallels with input saliency in model explanation. This innovation directs MU’s attention toward specific model weights rather than the entire model, improving effectiveness and efficiency. The resultant method that we call saliency unlearning (SalUn) narrows the performance gap with ‘exact’ unlearning (model retraining from scratch after removing the forgetting dataset). To the best of our knowledge, SalUn is the first principled MU approach adaptable enough to effectively erase the influence of forgetting data, classes, or concepts in both image classification and generation. For example, SalUn yields a stability advantage in high-variance random data forgetting, e.g., with a 0.2% gap compared to exact unlearning on the CIFAR-10 dataset. Moreover, in preventing conditional diffusion models from generating harmful images, SalUn achieves nearly 100% unlearning accuracy, outperforming current state-of-the-art baselines like Erased Stable Diffusion and Forget-Me-Not.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Contextual Integrity Privacy Theory of Mind
Scores: [ 6 8 3 8 ]
Existing efforts on quantifying privacy implications for large language models (LLMs) solely focus on measuring leakage of training data. In this work, we shed light on the often-overlooked interactive settings where an LLM receives information from multiple sources and generates an output to be shared with other entities, creating the potential of exposing sensitive input data in inappropriate contexts. In these scenarios, humans nat- urally uphold privacy by choosing whether or not to disclose information depending on the context. We ask the question “Can LLMs demonstrate an equivalent discernment and reasoning capability when considering privacy in context?” We propose CONFAIDE, a benchmark grounded in the theory of contextual integrity and designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. CONFAIDE consists of four tiers, gradually increasing in complexity, with the final tier evaluating contextual privacy reasoning and theory of mind capabilities. Our experiments show that even commercial models such as GPT-4 and ChatGPT reveal private information in contexts that humans would not, 39% and 57% of the time, respectively, highlighting the urgent need for a new direction of privacy-preserving approaches as we demonstrate a larger underlying problem stemmed in the models’ lack of reasoning capabilities.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: llms instruction tuning safety
Scores: [ 6 6 6 6 ]
Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content.In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning.We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Watermark Large Language Models Model Security
Scores: [ 6 3 6 6 ]
Recently, text watermarking algorithms for large language models (LLMs) have been proposed to mitigate the potential harms of text generated by LLMs, including fake news and copyright issues. However, current watermark detection algorithms require the secret key used in the watermark generation process, making them susceptible to security breaches and counterfeiting.To address this limitation, we propose the first private watermarking algorithm that uses two different neural networks for watermark generation and detection, instead of using the same key at both stages. Meanwhile, the token embedding parameters are shared between the generation and detection networks, which makes the detection network achieve a high accuracy very efficiently.Experiments demonstrate that Our algorithm attains high detection accuracy and computational efficiency through neural networks with a minimized number of parameters. Subsequent analysis confirms the high complexity involved in reverse-engineering the watermark generation algorithms from the detection network.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: large language models LLMs security adversarial examples prompt extraction prompt injection prompt hijacking prompt engineering
Scores: [ 8 5 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Adversarial training Adversarial robustness
Scores: [ 6 6 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Backdoor attack Large Language Models
Scores: [ 3 6 5 6 8 ]
Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique.It boasts superiority over existing backdoor injection techniques in several areas:(1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples).(2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning.Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate while maintaining the model's performance on benign inputs.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: feature shaping out-of-distribution detection optimization problem generalizability
Scores: [ 6 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Procedural Fairness Decouple Objectionable Component Reference Point Causal Fairness Data Generating Process Bias Mitigation
Scores: [ 6 6 8 ]
We reveal and address the frequently overlooked yet important issue of disguised procedural unfairness, namely, the potentially inadvertent alterations on the behavior of neutral (i.e., not problematic) aspects of data generating process, and/or the lack of procedural assurance of the greatest benefit of the least advantaged individuals. Inspired by John Rawls's advocacy for pure procedural justice (Rawls, 1971; 2001), we view automated decision-making as a microcosm of social institutions, and consider how the data generating process itself can satisfy the requirements of procedural fairness. We propose a framework that decouples the objectionable data generating components from the neutral ones by utilizing reference points and the associated value instantiation rule. Our findings highlight the necessity of preventing disguised procedural unfairness, drawing attention not only to the objectionable data generating components that we aim to mitigate, but also more importantly, to the neutral components that we intend to keep unaffected.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: congestion decongestion online marketplaces learning in economic settings efficient allocation
Scores: [ 6 6 8 8 ]
Congestion is a common failure mode of markets, where consumers compete inefficiently on the same subset of goods (e.g., chasing the same small set of properties on a vacation rental platform). The typical economic story is that prices decongest by balancing supply and demand. But in modern online marketplaces, prices are typically set in a decentralized way by sellers, and the information about items is inevitably partial. The power of a platform is limited to controlling representations---the subset of information about items presented by default to users. This motivates the present study of decongestion by representation, where a platform seeks to learn representations that reduce congestion and thus improve social welfare. The technical challenge is twofold: relying only on revealed preferences from the choices of consumers, rather than true preferences; and the combinatorial problem associated with representations that determine the features to reveal in the default view. We tackle both challenges by proposing a differentiable proxy of welfare that can be trained end-to-end on consumer choice data. We develop sufficient conditions for when decongestion promotes welfare, and present the results of extensive experiments on both synthetic and real data that demonstrate the utility of our approach.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large Language Models misinformation
Scores: [ 5 8 3 3 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: local differential privacy; linear regression
Scores: [ 8 5 6 ]
In this paper, we revisit the problem of sparse linear regression in the local differential privacy (LDP) model. Existing research in the non-interactive and sequentially local models has focused on obtaining the lower bounds for the case where the underlying parameter is 1-sparse, and extending such bounds to the more general k-sparse case has proven to be challenging. Moreover, it is unclear whether efficient non-interactive LDP (NLDP) algorithms exist. To address these issues, we first consider the problem in the ϵ non-interactive LDP model and provide a lower bound of Ω(√dklogd√nϵ) on the ℓ2-norm estimation error for sub-Gaussian data, where n is the sample size and d is the dimension of the space. We propose an innovative NLDP algorithm, the very first of its kind for the problem. As a remarkable outcome, this algorithm also yields a novel and highly efficient estimator as a valuable by-product. Our algorithm achieves an upper bound of ~O(d√k√nϵ) for the estimation error when the data is sub-Gaussian, which can be further improved by a factor of O(√d) if the server has additional public but unlabeled data. For the sequentially interactive LDP model, we show a similar lower bound of Ω(√dk√nϵ). As for the upper bound, we rectify a previous method and show that it is possible to achieve a bound of ~O(k√d√nϵ). Our findings reveal fundamental differences between the non-private case, central DP model, and local DP model in the sparse linear regression problem.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: large language model chain-of-thought backdoor attack reasoning task
Scores: [ 3 6 6 6 ]
Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. Traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. However, these approaches are not practical for commercial LLMs that typically operate via API access. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger is embedded in the query prompt. In particular, a subset of demonstrations will be manipulated to incorporate a backdoor reasoning step in COT prompting. Consequently, given any query prompt containing the backdoor trigger, the LLM will be misled to output unintended content. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. We show that the baseline backdoor attacks designed for simpler tasks such as semantic classification will fail on these complicated tasks. In addition, our findings reveal that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4. We also demonstrate the interpretability of BadChain by showing that the relationship between the trigger and the backdoor reasoning step can be well-explained based on the output of the backdoored model. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Data poisoning attack; generalization; deep learning
Scores: [ 6 6 6 6 6 5 ]
Recent research has highlighted the vulnerability of Deep Neural Networks (DNNs) against data poisoning attacks. These attacks aim to inject poisoning samples into the models' training dataset such that the trained models have inference failures. While previous studies have executed different types of attacks, one major challenge that greatly limits their effectiveness is the uncertainty of the re-training process after the injection of poisoning samples. It includes the uncertainty of training initialization, algorithm and model architecture. To address this challenge, we propose a new strategy called Sharpness-Aware Data Poisoning Attack (SAPA). In particular, it leverages the concept of DNNs' loss landscape sharpness to optimize the poisoning effect on the (approximately) worst re-trained model. Extensive experiments demonstrate that SAPA offers a general and principled strategy that significantly enhances various types of poisoning attacks against various types of re-training uncertainty.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Convex Relaxations Neural Network Verification Certified Robustness Adversarial Robustness Universal Approximation
Scores: [ 6 5 8 ]
Convex relaxations are a key component of training and certifying provably safe neural networks. However, despite substantial progress, a wide and poorly understood accuracy gap to standard networks remains, raising the question of whether this is due to fundamental limitations of convex relaxations. Initial work investigating this question focused on the simple and widely used IBP relaxation. It revealed that some univariate, convex, continuous piecewise linear (CPWL) functions cannot be encoded by any ReLU network such that its IBP-analysis is precise.To explore whether this limitation is shared by more advanced convex relaxations, we conduct the first in-depth study on the expressive power of ReLU networks across all commonly used convex relaxations. We show that: (i) more advanced relaxations allow a larger class of univariate functions to be expressed as precisely analyzable ReLU networks, (ii) more precise relaxations can allow exponentially larger solution spaces of ReLU networks encoding the same functions, and (iii) even using the most precise single-neuron relaxations, it is impossible to construct precisely analyzable ReLU networks that express multivariate, convex, monotone CPWL functions.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large Langugage Model Knowledge Conflict Tool Augmentation
Scores: [ 8 6 6 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: carbon footprint modeling large lanaguage models
Scores: [ 10 5 6 5 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: data poisoning attack fair representation learning fairness
Scores: [ 8 6 8 5 3 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Text Detection zero-shot AI detection GPT
Scores: [ 6 6 8 ]
Large language models (LLMs) have notably enhanced the fluency and diversity of machine-generated text. However, this progress also presents a significant challenge in detecting the origin of a given text, and current research on detection methods lags behind the rapid evolution of LLMs. Conventional training-based methods have limitations in flexibility, particularly when adapting to new domains, and they often lack explanatory power. To address this gap, we propose a novel training-free detection strategy called Divergent N-Gram Analysis (DNA-GPT). Given a text, we first truncate it in the middle and then use only the preceding portion as input to the LLMs to regenerate the new remaining parts. By analyzing the differences between the original and new remaining parts through N-gram analysis in black-box or probability divergence in white-box, we can clearly illustrate significant discrepancies between machine-generated and human-written text. We conducted extensive experiments on the most advanced LLMs from OpenAI, including text-davinci-003, GPT-3.5-turbo, and GPT-4, as well as open-source models such as GPT-NeoX-20B and LLaMa-13B. Results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human and GPT-generated text on four English and one German dataset, outperforming OpenAI's own classifier, which is trained on millions of text. Additionally, our methods provide reasonable explanations and evidence to support our claim, which is a unique feature of explainable detection. Our method is also robust under the revised text attack and can additionally solve model sourcing.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: private optimization feature preprocessing differential privacy
Scores: [ 6 6 8 6 ]
Training machine learning models with differential privacy (DP) has received increasing interest in recent years. One of the most popular algorithms for training differentially private models is differentially private stochastic gradient descent (DPSGD) and its variants, where at each step gradients are clipped and combined with some noise. Given the increasing usage of DPSGD, we ask the question: is DPSGD alone sufficient to find a good minimizer for every dataset under privacy constraints? As a first step towards answering this question, we show that even for the simple case of linear classification, unlike non-private optimization, (private) feature preprocessing is vital for differentially private optimization. In detail, we first show theoretically that there exists an example where without feature preprocessing, DPSGD incurs a privacy error proportional to the maximum norm of features over all samples. We then propose an algorithm called DPSGD-F, which combines DPSGD with feature preprocessing and prove that for classification tasks, it incurs a privacy error proportional to the diameter of the features maxx,x′∈D∥x−x′∥2. We then demonstrate the practicality of our algorithm on image classification benchmarks.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: conformal prediction adversarial robustness probabilistic circuits
Scores: [ 8 6 6 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Adversarial training two-player zero-sum bilevel optimization
Scores: [ 6 6 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Model Editingi Gender Bias Causal Inference
Scores: [ 3 6 8 ]
Large language models are becoming the go-to solution for various language tasks.However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data.This work proposes a novel method for detecting and mitigating gender bias in language models.We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are most prone to convey biases.Based on the analysis results, we adapt the model by multiplying these layers by a linear projection.Our titular method DAMA significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks.We release code for our method and models, which retrain LLaMA's state-of-the-art performance while being significantly less biased.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: AI Interpretability Graph Neural Networks Model-Level Explanation of Neural Networks Decision Boundaries
Scores: [ 6 6 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Semantic Segmentation Backdoor Attack
Scores: [ 8 8 6 8 ]
When a small number of poisoned samples are injected into the training dataset of a deep neural network, the network can be induced to exhibit malicious behavior during inferences, which poses potential threats to real-world applications. While they have been intensively studied in classification, backdoor attacks on semantic segmentation have been largely overlooked. Unlike classification, semantic segmentation aims to classify every pixel within a given image. In this work, we explore backdoor attacks on segmentation models to misclassify all pixels of a victim class by injecting a specific trigger on non-victim pixels during inferences, which is dubbed Influencer Backdoor Attack (IBA). IBA is expected to maintain the classification accuracy of non-victim pixels and mislead classifications of all victim pixels in every single inference and could be easily applied to real-world scenes. Based on the context aggregation ability of segmentation models, we proposed a simple, yet effective, Nearest-Neighbor trigger injection strategy. We also introduce an innovative Pixel Random Labeling strategy which maintains optimal performance even when the trigger is placed far from the victim pixels. Our extensive experiments reveal that current segmentation models do suffer from backdoor attacks, demonstrate IBA real-world applicability, and show that our proposed techniques can further increase attack performance.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: large language model adversarial attack adversarial robustness
Scores: [ 5 6 5 ]
The wide-ranging applications of large language models (LLMs), especially in safety-critical domains, necessitate the proper evaluation of the LLM’s adversarial robustness. This paper proposes an efficient tool to audit the LLM’s adversarial robustness via a prompt-based adversarial attack (PromptAttack). PromptAttack converts adversarial textual attacks into an attack prompt that can cause the victim LLM to output the adversarial sample to fool itself. The attack prompt is composed of three important components: (1) original input (OI) including the original sample and its ground-truth label, (2) attack objective (AO) illustrating a task description of generating a new sample that can fool itself without changing the semantic meaning, and (3) attack guidance (AG) containing the perturbation instructions to guide the LLM on how to complete the task by perturbing the original sample at character, word, and sentence levels, respectively. Besides, we use a fidelity filter to ensure that PromptAttack maintains the original semantic meanings of the adversarial examples. Further, we enhance the attack power of PromptAttack by ensembling adversarial examples at different perturbation levels. Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++. Interesting findings include that a simple emoji can easily mislead GPT-3.5 to make wrong predictions. Our source code is available at Anonymous GitHub.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: backdoor trigger inversion backdoor defense backdoor learning Trustworthy ML AI Security
Scores: [ 8 8 6 8 ]
Recent studies revealed that using third-party models may lead to backdoor threats, where adversaries can maliciously manipulate model predictions based on backdoors implanted during model training. Arguably, backdoor trigger inversion (BTI), which generates trigger patterns of given benign samples for a backdoored model, is the most critical module for backdoor defenses used in these scenarios. With BTI, defenders can remove backdoors by fine-tuning based on generated poisoned samples with ground-truth labels or deactivate backdoors by removing trigger patterns during the inference process. However, we find that existing BTI methods suffer from relatively poor performance, i.e., their generated triggers are significantly different from the ones used by the adversaries even in the feature space. We argue that it is mostly because existing methods require to 'extract' backdoor features at first, while this task is very difficult since defenders have no information (e.g., trigger pattern or target label) about poisoned samples. In this paper, we explore BTI from another perspective where we decouple benign features instead of decoupling backdoor features directly. Specifically, our method consists of two main steps, including \textbf{(1)} decoupling benign features and \textbf{(2)} trigger inversion by minimizing the differences between benign samples and their generated poisoned version in decoupled benign features while maximizing the differences in remaining backdoor features. In particular, our method is more efficient since it doesn't need to `scan' all classes to speculate the target label, as required by existing BTI. We also exploit our BTI module to further design backdoor-removal and pre-processing-based defenses. Extensive experiments on benchmark datasets demonstrate that our defenses can reach state-of-the-art performances.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: adversarial robustness adversarial examples transfer attack security
Scores: [ 8 3 6 6 ]
Adversarial attacks have been a looming and unaddressed threat in the industry. However, through a decade-long history of the robustness evaluation literature, we have learned that mounting a strong or optimal attack is challenging. It requires both machine learning and domain expertise. In other words, the white-box threat model, religiously assumed by a large majority of the past literature, is unrealistic. In this paper, we propose a new practical threat model where the adversary relies on transfer attacks through publicly available surrogate models. We argue that this setting will become the most prevalent for security-sensitive applications in the future. We evaluate the transfer attacks in this setting and propose a specialized defense method based on a game-theoretic perspective. The defenses are evaluated under 24 public models and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and ImageNet). Under this threat model, our defense, PubDef, outperforms the state-of-the-art white-box adversarial training by a large margin with almost no loss in the normal accuracy. For instance, on ImageNet, our defense achieves 62% accuracy under the strongest transfer attack vs only 36% of the best adversarially trained model. Its accuracy when not under attack is only 2% lower than that of an undefended model (78% vs 80%).
Primary Area: societal considerations including fairness, safety, privacy
Keywords: transparency interpretability time series
Scores: [ 6 6 8 3 ]
Transparent machine learning (ML) models are essential for ensuring interpretability and trustworthiness in decision-making systems, particularly in high-stakes domains such as healthcare, finance, and criminal justice. While transparent machine learning models have been proposed for classification and regression, time series forecasting presents some unique challenges for ensuring transparency. In particular, currently used bottom-up approaches that focus on the values of the time series at specific time points (usually regularly spaced) do not provide a holistic understanding of the entire time series. This limits the applicability of ML in many critical areas. To open up these domains for ML, we propose a top-down framework of bi-level transparency, which involves understanding the higher-level trends and the lower-level properties of the predicted time series. Applying this framework, we develop TIMEVIEW, a transparent ML model for time series forecasting based on static features, complemented with an interactive visualization tool. Through a series of experiments, we demonstrate the efficacy and interpretability of our approach, paving the way for more transparent and reliable applications of ML in various domains.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: text generation watermark secure ai
Scores: [ 8 8 6 6 ]
We study the problem of watermarking large language models (LLMs) generated text — one of the most promising approaches for addressing the safety challenges of LLM usage. In this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. We propose a robust and high-quality watermark method, Unigram-Watermark, by extending an existing approach with a simplified fixed grouping strategy. We prove that our watermark method enjoys guaranteed generation quality, correctness in watermark detection, and is robust against text editing and paraphrasing. Experiments on three varying LLMs and two datasets verify that our Unigram-Watermark achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of LLMs.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: federated learning robustness
Scores: [ 6 8 8 5 8 ]
Recent years have witnessed a burgeoning interest in federated learning (FL).However, the contexts in which clients engage in sequential learning remain under-explored.Bridging FL and continual learning (CL) gives rise to a challenging practical problem: federated continual learning (FCL).Existing research in FCL primarily focuses on mitigating the catastrophic forgetting issue of continual learning while collaborating with other clients. We argue that the forgetting phenomena are not invariably detrimental.In this paper, we consider a more practical and challenging FCL setting characterized by potentially unrelated or even antagonistic data/tasks across different clients.In the FL scenario, statistical heterogeneity and data noise among clients may exhibit spurious correlations which result in biased feature learning.While existing CL strategies focus on a complete utilization of previous knowledge, we found that forgetting biased information is beneficial in our study. Therefore, we propose the new concept accurate forgetting (AF) and develop a novel generative-replay method AF-FCL which selectively utilizes previous knowledge in federated networks.We employ a probabilistic framework based on a normalizing flow model to quantify the credibility of previous knowledge.Comprehensive experiments affirm the superiority of our method over various baselines.Code is at: https://anonymous.4open.science/r/AF-FCL-7D65.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Rashomon Effect Rashomon Set Predictive Multiplicity Dropout Speedup.
Scores: [ 6 5 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Interpretable social bias
Scores: [ 8 8 6 6 6 ]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases, which may cause negative social impacts or even bring catastrophic results in application. Previous works on this problem mainly focused on using black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly finetune or even pre-train PLMs on newly constructed anti-stereotypical datasets, which are high-cost. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of {\sc Social Bias Neurons}. Specifically, we propose {\sc Integrated Gap Gradients (IG$^2$)} to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. Our IG$^2$ thus attributes the uneven distribution for different demographics to specific Social Bias Neurons, which track the trail of unwanted behavior inside PLM units to achieve interoperability. Moreover, derived from our interpretable technique, {\sc Bias Neuron Suppression (BNS)} is further proposed to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from debiased FairBERTa, IG$^2$ allows us to locate and suppress identified neurons, and further mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost\footnote{This work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.}.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: trustworthy machine learning causal inference fairness
Scores: [ 6 6 8 ]
Fair machine learning aims to prevent discrimination against individuals or sub-populations based on sensitive attributes such as gender and race. In recent years, causal inference methods have been increasingly used in fair machine learning to measure unfairness by causal effects. However, current methods assume that the true causal graph is given, which is often not true in real-world applications. To address this limitation, this paper proposes a framework for achieving causal fairness based on the notion of interventions when the true causal graph is partially known. The proposed approach involves modeling fair prediction using a Partially Directed Acyclic Graph (PDAG), specifically, a class of causal DAGs that can be learned from observational data combined with domain knowledge. The PDAG is used to measure causal fairness, and a constrained optimization problem is formulated to balance between fairness and accuracy. Results on both simulated and real-world datasets demonstrate the effectiveness of this method.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: algorithmic recourse algorithmic fairness trustworthy AI
Scores: [ 5 8 5 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fairness Federated Learning Machine Learning Information Theory
Scores: [ 8 6 8 ]
This work presents an information-theoretic perspective to group fairness trade-offs in federated learning (FL) with respect to sensitive attributes, such as gender, race, etc. Existing works often focus on either \emph{global fairness} (overall disparity of the model across all clients) or \emph{local fairness} (disparity of the model at each client), without always considering their trade-offs. There is a lack of understanding of the interplay between global and local fairness in FL, particularly under data heterogeneity, and if and when one implies the other. To address this gap, we leverage a body of work in information theory called partial information decomposition (PID), which first identifies three sources of unfairness in FL, namely, \emph{Unique Disparity}, \emph{Redundant Disparity}, and \emph{Masked Disparity}. We demonstrate how these three disparities contribute to global and local fairness using canonical examples. This decomposition helps us derive fundamental limits on the trade-off between global and local fairness, highlighting where they agree or disagree. We introduce the \emph{Accuracy & Global-Local Fairness Optimality Problem (AGLFOP)}, a convex optimization that defines the theoretical limits of accuracy and fairness trade-offs, identifying the best possible performance any FL strategy can attain given a dataset and client distribution. We also present experimental results on synthetic datasets and the ADULT dataset to support our theoretical findings.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: out-of distribution detection
Scores: [ 5 8 6 6 ]
The capacity of a modern deep learning system to determine if a sample falls within its realm of knowledge is fundamental and important.In this paper, we offer insights and analyses of recent state-of-the-art out-of-distribution (OOD) detection methods - extremely simple activation shaping (ASH). We demonstrate that activation pruning has a detrimental effect on OOD detection, while activation scaling enhances it. Moreover, we propose SCALE, a simple yet effective post-hoc network enhancement method for OOD detection, which attains state-of-the-art OOD detection performance without compromising in-distribution (ID) accuracy. By integrating scaling concepts into the training process to capture a sample's ID characteristics, we propose Intermediate Tensor SHaping (ISH), a lightweight method for training time OOD detection enhancement. We achieve AUROC scores of +1.85% for near-OOD and +0.74% for far-OOD datasets on the OpenOOD v1.5 ImageNet-1K benchmark.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: privacy auditing zero knowledge proof differentially private training
Scores: [ 8 6 5 ]
Post hoc privacy auditing techniques can be used to test the privacy guarantees of a model, but come with several limitations: (i) they can only establish lower bounds on the privacy loss, (ii) the intermediate model updates and some data must be shared with the auditor to get a better approximation of the privacy loss, and (iii) the auditor typically faces a steep computational cost to run a large number of attacks. In this paper, we propose to proactively generate a cryptographic certificate of privacy during training to forego such auditing limitations. We introduce Confidential-DPproof , a framework for Confidential Proof of Differentially Private Training, which enhances training with a certificate of the (ε,δ)-DP guarantee achieved. To obtain this certificate without revealing information about the training data or model, we design a customized zero-knowledge proof protocol tailored to the requirements introduced by differentially private training, including random noise addition and privacy amplification by subsampling. In experiments on CIFAR-10, Confidential-DPproof trains a model achieving state-of-the-art 91% test accuracy with a certified privacy guarantee of (ε=0.55,δ=10−5)-DP in approximately 100 hours.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Backdoor Watermarking robustness
Scores: [ 6 8 3 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: AI Safety Trustworthy Machine Learning Machine Learning Robustness Adversarial Attacks
Scores: [ 6 5 6 3 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fairness Alignment Relational Learning
Scores: [ 5 8 6 6 ]
Relational learning has gained significant attention, led by the expressiveness of Graph Neural Networks (GNNs) on graph data. While the inherent biases in common graph data are involved in GNN training, it poses a serious challenge to constraining the GNN output perturbations induced by input biases, thereby safeguarding fairness during training. The Lipschitz constant, a technique from robust statistics, can limit the maximum changes in the output concerning the input, taking into account associated irrelevant biased factors. It is an efficient and provable method to examine the output stability of machine learning models without incurring additional computational costs. Recently, its use in controlling the stability of Euclidean neural networks, the calculation of the precise Lipschitz constant remains elusive for non-Euclidean neural networks like GNNs, especially within fairness contexts. However, no existing research has investigated Lipschitz constants to shed light on stabilizing the GNN outputs, especially when working on graph data with implicit biases. To narrow this gap, we begin with the general GNNs operating on an attributed graph, and formulate a Lipschitz constant to limit the changes in the output regarding biases associated with the input. Additionally, we theoretically analyze how the Lipschitz constant of a GNN model could constrain the output perturbations induced by biases learned from data for fairness training. We experimentally validate the Lipschitz constant's effectiveness in limiting biases of the model output. Finally, from a training dynamics perspective, we demonstrate why the theoretical Lipschitz constant can effectively guide the GNN training to better trade-off between accuracy and fairness.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: watermarking adaptive attacks optimization stable diffusion
Scores: [ 6 8 6 6 ]
Untrustworthy users can misuse image generators to synthesize high-quality deepfakes and engage in online spam or disinformation campaigns. Watermarking deters misuse by marking generated content with a hidden message, enabling its detection using a secret watermarking key. A core security property of watermarking is robustness, which states that an attacker can only evade detection by substantially degrading image quality. Assessing robustness requires designing an adaptive attack for the specific watermarking algorithm. A challenge when evaluating watermarking algorithms and their (adaptive) attacks is to determine whether an adaptive attack is optimal, i.e., it is the best possible attack. We solve this problem by defining an objective function and then approach adaptive attacks as an optimization problem. The core idea of our adaptive attacks is to replicate secret watermarking keys locally by creating surrogate keys that are differentiable and can be used to optimize the attack's parameters. We demonstrate for Stable Diffusion models that such an attacker can break all five surveyed watermarking methods at negligible degradation in image quality. These findings emphasize the need for more rigorous robustness testing against adaptive, learnable attackers.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Mis-contextualization Media Forensic
Scores: [ 6 5 5 8 ]
In the battle against widespread online misinformation, a growing problem is text-image mis-contextualization, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image mis-contextualization can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more nuanced, human evaluation is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIML (Diffusion-based Text-Image Mis-contextualization Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. These models, trained on large-scale datasets act as omniscien agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIML uses text embeddings and modified image regions to visualize these inconsistencies. To evaluate D-TIML's efficacy, we introduce a new TIML dataset containing 14K consistent and inconsistent text-image pairs. Unlike existing datasets, TIML enables assessment at the level of individual words and image regions and is carefully designed to represent various inconsistencies. D-TIML offers a scalable and evidence-based approach to identifying and localizing text-image mis-contextualization, providing a robust framework for future research combating misinformation.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: fairness statistics differentiation regularization classification
Scores: [ 3 8 5 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: AI alignment AI safety Natural Language Processing
Scores: [ 6 8 6 6 5 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Safety Alignment Jailbreak
Scores: [ 6 5 8 8 ]
Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time in bypassing the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several unsafe demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Language Models Model Stealing Distillation Instruction-Tuning
Scores: [ 8 6 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: language modeling memorization dataset contamination
Scores: [ 6 8 8 8 ]
Large language models are trained on vast amounts of internet data, prompting concerns that they have memorized public benchmarks. Detecting this type of contamination is challenging because the pretraining data used by proprietary models are often not publicly accessible.We propose a procedure for detecting test set contamination of language models with exact false positive guarantees and without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples.We demonstrate that our procedure is sensitive enough to reliably detect contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Finally, we evaluate LLaMA-2 to apply our test in a realistic setting and find our results to be consistent with existing contamination evaluations.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Differential Privacy Label Differential Privacy Projections
Scores: [ 6 5 6 6 ]
Label differentially private (label DP) algorithms seek to preserve the privacy of the labels in a training dataset in settings where the features are known to the adversary. In this work, we study a new family of label DP training algorithms. Unlike most prior label DP algorithms that have been based on label randomization, our algorithm naturally leverages the power of the central model of DP. It interleaves gradient projection operations with private stochastic gradient descent steps in order to improve the utility of the trained model while guaranteeing the privacy of the labels. We show that such projection-based algorithms can be made practical and that they improve on the state-of-the art for label DP training in the high-privacy regime. We complement our empirical evaluation with theoretical results shedding light on the efficacy of our method through the lens of bias-variance trade-offs.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: text watermarking large language model codable systematic study
Scores: [ 6 6 6 5 ]
As large language models (LLMs) generate texts with increasing fluency and realism, there is a growing need to identify the source of texts to prevent the abuse of LLMs. Text watermarking techniques have proven reliable in distinguishing whether a text is generated by LLMs by injecting hidden patterns. However, we argue that existing LLM watermarking methods are encoding-inefficient and cannot flexibly meet the diverse information encoding needs (such as encoding model version, generation time, user id, etc.). In this work, we conduct the first systematic study on the topic of Codable Text Watermarking for LLMs (CTWL) that allows text watermarks to carry multi-bit customizable information. First of all, we study the taxonomy of LLM watermarking technologies and give a mathematical formulation for CTWL. Additionally, we provide a comprehensive evaluation system for CTWL: (1) watermarking success rate, (2) robustness against various corruptions, (3) coding rate of payload information, (4) encoding and decoding efficiency, (5) impacts on the quality of the generated text. To meet the requirements of these non-Pareto-improving metrics, we follow the most prominent vocabulary partition-based watermarking direction, and devise an advanced CTWL method named Balance-Marking. The core idea of our method is to use a proxy language model to split the vocabulary into probability-balanced parts, thereby effectively maintaining the quality of the watermarked text.Extensive experimental results show that our method outperforms the baseline under comprehensive evaluation.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: large language model privacy prompt tuing
Scores: [ 8 8 6 8 ]
Large Language Models (LLMs) have emerged as dominant tools for various tasks, particularly when tailored for a specific target by prompt tuning. Nevertheless, concerns surrounding data privacy present obstacles when adapting LLMs on sensitive data. A practical solution is to host a local LLM and optimize a soft prompt using private data. Yet, hosting a local model becomes problematic when model ownership is protected. Alternative methods, like sending data to the model's provider for training, intensify these privacy issues. In this paper, we present a novel solution called Differentially-Private Offsite Prompt Tuning (DP-OPT) to address this challenge. Our approach involves tuning a discrete prompt on the client side and then applying it to the desired cloud models. We demonstrate that prompts suggested by LLMs themselves can be transferred without compromising performance much. To ensure that the prompts do not divulge private information, we introduce the first private prompt generation mechanism, by a differentially-private (DP) ensemble of in-context learning with private demonstrations. With DP-OPT, generating privacy-preserving prompts by Vicuna-7b can yield competitive performance compared to non-private in-context learning on GPT3.5 or local private prompt tuning.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Adversarial attacks to graph classification; provable robustness
Scores: [ 8 8 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: LLM Benchmark Evaluation Psychometrics
Scores: [ 6 6 8 ]
Large Language Models (LLMs) have recently showcased their remarkable capacities, not only in natural language processing tasks but also across diverse domains such as clinical medicine, legal consultation, and education. LLMs become more than mere applications, evolving into assistants capable of addressing diverse user requests. This narrows the distinction between human beings and artificial intelligence agents, raising intriguing questions regarding the potential manifestation of personalities, temperaments, and emotions within LLMs. In this paper, we propose a framework, PPBench, for evaluating diverse psychological aspects of LLMs. Comprising thirteen scales commonly used in clinical psychology, PPBench further classifies these scales into four distinct categories: personality traits, interpersonal relationships, motivational tests, and emotional abilities. Our study examines five popular models, namely \texttt{text-davinci-003}, ChatGPT, GPT-4, LLaMA-2-7b, and LLaMA-2-13b. Additionally, we employ a jailbreak approach to bypass the safety alignment protocols and test the intrinsic natures of LLMs. We have made PPBench openly accessible via *\footnote{The link is hidden due to anonymity. For reviewers, please refer to the supplementary materials.}.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: large language models data poisoning human feedback jailbreak
Scores: [ 6 5 6 6 ]
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF data to embed a jailbreak trigger into the model as a backdoor. The trigger then acts like a universal sudo command, enabling arbitrary harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Language Model Agent Tool Use Evaluation Safety Language Model
Scores: [ 6 8 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fair Classification Empirical Likelihood
Scores: [ 5 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Collaborative learning Incentives Global-to-local design
Scores: [ 6 6 5 3 ]
In federated learning (FL), incentivizing contributions of training resources (e.g., data, compute) from potentially competitive clients is crucial. Existing incentive mechanisms often distribute post-training monetary rewards, which suffer from practical challenges of timeliness and feasibility of the rewards. Rewarding the clients after the completion of training may incentivize them to abort the collaboration, and monetizing the contribution is challenging in practice. To address these problems, we propose an incentive-aware algorithm that offers differentiated training-time model rewards for each client at each FL iteration. We theoretically prove that such a local design ensures the global objective of client incentivization. Through theoretical analyses, we further identify the issue of error propagation in model rewards and thus propose a stochastic reference-model recovery strategy to ensure theoretically that all the clients eventually obtain the optimal model in the limit. We perform extensive experiments to demonstrate the superior incentivizing performance of our method compared to existing baselines.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: collaborative writing text generation language models evaluation human-AI collaboration diversity
Scores: [ 6 6 5 ]
Large language models (LLMs) have led to a surge in collaborative writing with model assistance. As different users incorporate suggestions from the same model, there is a risk of decreased diversity in the produced content, potentially limiting diverse perspectives in public discourse. In this work, we measure the impact of co-writing on diversity via a controlled experiment, where users write argumentative essays in three setups---using a base LLM (GPT3), a feedback-tuned LLM (InstructGPT), and writing without model help. We develop a set of diversity metrics and find that writing with InstructGPT (but not the GPT3) results in a statistically significant reduction in diversity. Specifically, it increases the similarity between the writings of different authors and reduces the overall lexical and content diversity. We additionally find that this effect is mainly attributable to InstructGPT contributing less diverse text to co-written essays. In contrast, the user-contributed text remains unaffected by model collaboration. This suggests that the recent improvement in generation quality from adapting models to human feedback might come at the cost of more homogeneous and less diverse content.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: group fairness post-hoc fair classification Bayes optimal classifier accuracy-fairness trade-off
Scores: [ 6 8 8 8 ]
We consider a binary classification problem under group fairness constraints, which can be one of Demographic Parity (DP), Equalized Opportunity (EOp), or Equalized Odds (EO). We propose an explicit characterization of Bayes optimal classifier under the fairness constraints, which turns out to be a simple modification rule of the unconstrained classifier. Namely, we introduce a novel instance-level measure of bias, which we call bias score, and the modification rule is a simple linear rule on top of the finite amount of bias scores. Based on this characterization, we develop a post-hoc approach that allows us to adapt to fairness constraints while maintaining high accuracy. In the case of DP and EOp constraints, the modification rule is thresholding a single bias score, while in the case of EO constraints we are required to fit a linear modification rule with 2 parameters. The method can also be applied for composite group-fairness criteria, such as ones involving several sensitive attributes. We achieve competitive or better performance compared to both in-processing and post-processing methods across three datasets: Adult, COMPAS, and CelebA. Unlike most post-processing methods, we do not require access to sensitive attributes during the inference time.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: contrastive language image pretraining multimodal learning fairness bias data balancing
Scores: [ 6 6 6 ]
We investigate the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP) models, identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel data-balancing algorithm designed to reduce both representation and association biases (i.e. first- and second-order statistics) in multimodal datasets. We use this algorithm to conduct an in-depth analysis taking into account various factors, such as the model, representation, and training data size. Our study also explores the dynamic nature of how CLIP models learn and unlearn biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. In addition, data balancing has a mixed impact on quality: it tends to improve zero- and few-shot classification but can hurt retrieval, which we provide an explanation for. We conclude with a set of recommendations for improving the efficacy of data balancing in multimodal systems.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: language model hallucination trustworthy artificial intelligence reasoning
Scores: [ 6 6 6 ]
Large language models (large LMs) are susceptible to producing text that contains hallucinated content. An important instance of this problem is self-contradiction, where the LM generates two contradictory sentences within the same context. In this work, we present a comprehensive investigation into self-contradiction for various instruction-tuned LMs, covering evaluation, detection, and mitigation. Our analysis reveals the prevalence of self-contradictions when LMs generate text for open-domain topics, e.g., in 17.7% of all sentences produced by ChatGPT. Self-contradiction also complements retrieval-based methods, as a large portion of them (e.g., 35.8% for ChatGPT) cannot be verified using Wikipedia. We then propose a novel prompting-based framework designed to effectively detect and mitigate self-contradictions. Our detector achieves high accuracy, e.g., around 80% F1 score when prompting ChatGPT. The mitigation algorithm iteratively refines the generated text to remove contradictory information while preserving text fluency and informativeness. Importantly, our entire framework is applicable to black-box LMs and does not require external grounded knowledge. Our approach is practically effective and has been released as a push-button tool to benefit the public, with an anonymized version at https://iclr9113.com/.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Federated Learning Feature Learning Theory FedAvg
Scores: [ 6 6 6 5 ]
Federated Learning (FL) has attracted significant attention as an efficient privacy-preserving approach to distributed learning across multiple clients. Despite extensive empirical research and practical applications, a systematic way to theoretically understand the convergence and generalization properties in FL remains limited. This work aims to establish a unified theoretical foundation for understanding FL through feature learning theory. We focus on a scenario where each client employs a two-layer convolutional neural network (CNN) for local training on their own data. Many existing works analyze the convergence of Federated Averaging (FedAvg) under lazy training with linearizing assumptions in weight space. In contrast, our approach tracks the trajectory of signal learning and noise memorization in FL, eliminating the need for these assumptions. We further show that FedAvg can achieve near-zero test error by effectively increasing signal-to-noise ratio (SNR) in feature learning, while local training without communication achieves a large constant test error. This finding highlights the benefits of communication for generalization in FL. Moreover, our theoretical results suggest that a weighted FedAvg method, based on the similarity of input features across clients, can effectively tackle data heterogeneity issues in FL. Experimental results on both synthetic and real-world datasets verify our theoretical conclusions and emphasize the effectiveness of the weighted FedAvg approach.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fairness Online Learning Oblique Decision Trees
Scores: [ 6 5 8 8 8 ]
Fairness, especially group fairness, is an important consideration in the context of machine learning systems. The most commonly adopted group fairness-enhancing techniques are in-processing methods that rely on a mixture of a fairness objective (e.g., demographic parity) and a task-specific objective (e.g., cross-entropy) during the training process. However, when data arrives in an online fashion – one instance at a time – optimizing such fairness objectives poses several challenges. In particular, group fairness objectives are defined using expectations of predictions across different demographic groups. In the online setting, where the algorithm has access to a single instance at a time, estimating the group fairness objective requires additional storage and significantly more computation (e.g., forward/backward passes) than the task-specific objective at every time step. In this paper, we propose Aranyani, an ensemble of oblique decision trees, to make fair decisions in online settings. The hierarchical tree structure of Aranyani enables parameter isolation and allows us to efficiently compute the fairness gradients using aggregate statistics of previous decisions, eliminating the need for additional storage and forward/backward passes. We also present an efficient framework to train Aranyani and theoretically analyze several of its properties. We conduct empirical evaluations on 5 publicly available benchmarks (including vision and language datasets) to show that Aranyani achieves a better accuracy-fairness trade-off compared to baseline approaches.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Causal Fairness Sensitivity Analysis Fair Prediction Unobserved Confounding
Scores: [ 5 6 6 ]
Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of observed confounding. This enables practitioners to audit the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under observed confounding. To this end, our work is of direct practical value for auditing and ensuring the fairness of predictions in high-stakes applications.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Federated Learning
Scores: [ 6 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Machine Learning LLM Watermark Language Model Natural Language Processing Generative AI
Scores: [ 6 6 6 ]
As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: How reliable is watermarking in realistic settings in the wild? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. For example, after strong human paraphrasing the watermark is detectable after observing 800 tokens on average, when setting a 1e−5 false positive rate. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Graph Neural Networks Adversarial Robustness Data Distribution
Scores: [ 8 3 6 6 ]
Current defenses against graph attacks often rely on certain properties to eliminate structural perturbations by identifying adversarial edges from normal edges. However, this dependence makes defenses vulnerable to adaptive (white-box) attacks from adversaries with the same knowledge. Adversarial training seems to be a feasible way to enhance robustness without reliance on artificially designed properties. However, in this paper, we show theoretically that it can lead to models learning incorrect information. To solve this issue, we re-examine graph attacks from the out-of-distribution (OOD) perspective for both poisoning and evasion attacks and introduce a novel adversarial training paradigm incorporating OOD detection. This approach strengthens the robustness of Graph Neural Networks (GNNs) without reliance on prior knowledge. To further evaluate adaptive robustness, we develop new adaptive attacks against our methods, revealing a trade-off between graph attack efficacy and defensibility. Through extensive experiments over 25,000 perturbed graphs, our method could still maintain good robustness against both the adaptive and non-adaptive attacks.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Lipschitz randomized smoothing margin variance deep learning
Scores: [ 6 8 6 6 ]
Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks. The certified radius is in this context a crucial indicator of the robustness of models. However how to design an efficient classifier with a sufficient certified radius? Randomized smoothing provides a promising framework by relying on noise injection in inputs to obtain a smoothed and more robust classifier. In this paper, we first show that the variance introduced by randomized smoothing closely interacts with two other important properties of the classifier, i.e. its Lipschitz constant and margin. More precisely, our work emphasizes the dual impact of the Lipschitz constant of the base classifier, on both the smoothed classifier and the empirical variance. Moreover, to increase the certified robust radius, we introduce a different simplex projection technique for the base classifier to leverage the variance-margin trade-off thanks to Bernstein's concentration inequality, along with an enhanced Lipschitz bound. Experimental results show a significant improvement in certified accuracy compared to current state-of-the-art methods. Our novel certification procedure allows us to use pre-trained models that are used with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Certified Robustness DualRandomized Smoothing Curse of Diemensionality.
Scores: [ 6 8 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: energy-latency manipulation; large vision-language model; verbose images
Scores: [ 6 6 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fake Detection Machine-Generated Text Detection Zero-Shot Detection
Scores: [ 8 6 8 6 6 ]
Large language models (LLMs) have shown the ability to produce fluent and cogent content, presenting both productivity opportunities and societal risks. To build trustworthy AI systems, it is imperative to distinguish between machine-generated and human-authored content. The leading zero-shot detector, DetectGPT, showcases commendable performance but is marred by its intensive computational costs. In this paper, we introduce the concept of \emph{conditional probability curvature} to elucidate discrepancies in word choices between LLMs and humans within a given context. Utilizing this curvature as a foundational metric, we present \emph{Fast-DetectGPT}, an optimized zero-shot detector, which substitutes DetectGPT's perturbation step with a more efficient sampling step. Our evaluations on various datasets, source models, and test conditions indicate that Fast-DetectGPT not only outperforms DetectGPT in both the white-box and black-box settings but also accelerates the detection process by a factor of 340, as detailed in Table 1.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: backdoor attack machine learning safety asymptotic statistical risk
Scores: [ 6 8 6 3 ]
The growing dependence on machine learning in real-world applications emphasizes the importance of understanding and ensuring its safety. Backdoor attacks pose a significant security risk due to their stealthy nature and potentially serious consequences. Such attacks involve embedding triggers within a learning model with the intention of causing malicious behavior when an active trigger is present while maintaining regular functionality without it. This paper evaluates the effectiveness of any backdoor attack incorporating a constant trigger, by establishing tight lower and upper boundaries for the performance of the compromised model on both clean and backdoor test data. The developed theory answers a series of fundamental but previously underexplored problems, including (1) what are the determining factors for a backdoor attack's success, (2) what is the direction of the most effective backdoor attack, and (3) when will a human-imperceptible trigger succeed. Our derived understanding applies to both discriminative and generative models. We also demonstrate the theory by conducting experiments using benchmark datasets and state-of-the-art backdoor attack scenarios.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: lipschitz neural networks dp-sgd privacy robustness
Scores: [ 5 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Fairness PEFT Hyperparameter Optimization Medical Imaging
Scores: [ 6 6 6 6 ]
Training models with robust group fairness properties is crucial in ethically sensitive application areas such as medical diagnosis. Despite the growing body of work aiming to minimise demographic bias in AI, this problem remains challenging. A key reason for this challenge is the fairness generalisation gap: High-capacity deep learning models can fit all training data nearly perfectly, and thus also exhibit perfect fairness during training. In this case, bias emerges only during testing when generalisation performance differs across sub-groups. This motivates us to take a bi-level optimisation perspective on fair learning: Optimising the learning strategy based on validation fairness. Specifically, we consider the highly effective workflow of adapting pre-trained models to downstream medical imaging tasks using parameter-efficient fine-tuning (PEFT) techniques. There is a trade-off between updating more parameters, enabling a better fit to the task of interest vs. fewer parameters, potentially reducing the generalisation gap. To manage this tradeoff, we propose FairTune, a framework to optimise the choice of PEFT parameters with respect to fairness. We demonstrate empirically that FairTune leads to improved fairness on a range of medical imaging datasets.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Adversarial attack transferability ensemble attack robustness
Scores: [ 6 8 6 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Machine Unlearning Generative Models Diffusion Models GAN Masked Autoencoder
Scores: [ 8 6 5 5 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Interpretable ML Explainable AI Information Pursuit Large Language Models Large Multimodal Models Vision Language Models
Scores: [ 6 5 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: memorization influence estimation information leakage neural network Gaussian process
Scores: [ 5 6 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Sensitive Information Deletion Privacy Attacks Model editing Language Models
Scores: [ 8 6 8 8 ]
Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output toxic or harmful text. To mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. We study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. Our threat model assumes that an attack succeeds if the answer to a sensitive question is located among a set of B generated candidates, based on scenarios where the information would be insecure if the answer is among B candidates. Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover “deleted” information from an edited model 38% of the time. These attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. Finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. Our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe implications for the deployment of language models in a world where individuals enjoy ownership of their personal data, a right to privacy, and safety from harmful model outputs.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Machine learning Membership Inference Attack Computer Vision
Scores: [ 6 8 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Watermark algorithms Large Language Models Robustness
Scores: [ 3 6 5 8 ]
Watermark algorithms for large language models (LLMs) have achieved extremely high accuracy in detecting text generated by LLMs. Such algorithms typically involve adding extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness. This is because the watermark logits for a token are determined by a certain number of preceding tokens; a small number leads to low security robustness, while a large number results in insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. The watermark logits in our work are determined by the semantics of all preceding tokens. Specifically, we utilize another embedding LLM to generate semantic embeddings for all preceding tokens, and then these semantic embeddings are transformed into the watermark logits through our trained watermark model.Subsequent analyses and experiments demonstrated the attack robustness of our method in semantically invariant settings: synonym substitution and text paraphrasing settings. Finally, we also show that our watermark possesses adequate security robustness.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: distribution-free uncertainty quantification large language models responsible AI
Scores: [ 6 6 6 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: language models lying deception alignment safety truthfulness honesty
Scores: [ 8 5 6 8 ]
Large language models (LLMs) can “lie”, which we define as outputting false statements despite “knowing” the truth in a demonstrable sense. An example is an LLM instructed to spread misinformation. Here, we conduct an initial exploration into the feasibility of lie detection for LLMs. We develop a simple lie detector that requires neither access to the LLM’s activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting—prompting GPT-3.5 to lie about factual questions—the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection
Primary Area: societal considerations including fairness, safety, privacy
Keywords: detector language model learning from preferences
Scores: [ 6 6 6 ]
The fluency and general applicability of large language models (LLMs) has motivated significant interest in detecting whether a piece of text was written by a language model. While both academic and commercial detectors have been deployed in some settings, particularly education, other research has highlighted the fragility of these systems. In this paper, we demonstrate a data-efficient attack that fine-tunes language models to confuse existing detectors, leveraging recent developments in reinforcement learning of language models. We use the 'human-ness' score (often just a log probability) of various open-source and commercial detectors as a reward function for reinforcement learning, subject to a KL-divergence constraint that the resulting model does not differ significantly from the original. For a 7B parameter Llama-2 model, fine-tuning for under a day reduces the AUROC of the OpenAI RoBERTa-Large detector from 0.84 to 0.62, while perplexity on OpenWebText increases from 8.7 to only 9.0; with a larger perplexity budget, we reduce AUROC to 0.30 (worse than random), with a perplexity increase to 9.9. Similar to traditional adversarial attacks, we find that this increase in 'detector evasion' generalizes to other detectors not used during training. In light of our empirical results, we advise against continued reliance on LLM-generated text detectors.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Privacy Attack Membership Inference Data Poisoning
Scores: [ 6 5 5 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large language models Machine-generated text detection Maximum mean discrepancy
Scores: [ 6 6 8 6 ]
Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism issues and hallucination information. Therefore, it is very urgent and important to detect MGTs in many situations. Unfortunately, it is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of LLMS. In this paper, we seek to exploit \textit{maximum mean discrepancy} (MMD) to address this issue in the sense that MMD can well identify distributional discrepancies. However, directly training a detector with MMD using diverse MGTs will incur a significantly increased variance of MMD since MGTs may contain \textit{multiple text populations} due to various LLMs. This will severely impair MMD's ability to measure the difference between two samples. To tackle this, we propose a novel \textit{multi-population} aware optimization method for MMD called MMD-MP, which can \textit{avoid variance increases} and thus improve the stability to measure the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, \eg, GPT2 and ChatGPT, show superior detection performance of our MMD-MP.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Large Language Models RLHF Alignment Differential Privacy
Scores: [ 6 6 8 ]
Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction following-models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL, and prove its correctness. Our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Backdoor Data poisoning Integrity Image Classification
Scores: [ 6 5 5 6 ]
Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and re-used many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naïve composition of attacks vastly increases the number of poison samples. We show this is not necessarily true and more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset. Our source code will be made available.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Density-based OOD Importance Sampling Tractable density estimation
Scores: [ 5 6 8 6 ]
Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimate scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient p against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25% and 28.19% (FPR95) on CIFAR-100 and ImageNet-1K, respectively.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: NTK neural tangent kernels adversarial training robust overfitting
Scores: [ 6 8 6 6 6 ]
Adversarial training (AT) is a canonical method for enhancing the robustness of deep neural networks (DNNs). However, recent studies empirically demonstrated that it suffers from {\it robust overfitting}, {\it i.e.}, a long time AT can be detrimental to the robustness of DNNs. This paper presents a theoretical explanation of robust overfitting for DNNs. Specifically, we non-trivially extend the neural tangent kernel (NTK) theory to AT and prove that an adversarially trained {\it wide} DNN can be well approximated by a linearized DNN. Moreover, for squared loss, closed-form AT dynamics for the linearized DNN can be derived, which reveals a new \textit{\textbf{AT degeneration}} phenomenon: a long-term AT will result in a wide DNN degenerates to that obtained without AT and thus cause robust overfitting. Based on our theoretical results, we further design a method namely \textit{\textbf{Adv-NTK}}, the first AT algorithm for infinite-width DNNs. Experiments on real-world datasets show that Adv-NTK can help infinite-width DNNs enhance comparable robustness to that of their finite-width counterparts, which in turn justifies our theoretical findings. The code is submitted and will be released publicly.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: model inversion attacks privacy label smoothing defense
Scores: [ 5 8 6 6 6 ]
Label smoothing – using softened labels instead of hard ones – is a widely adopted regularization method for deep learning, showing diverse benefits such as enhanced generalization and calibration. Its implications for preserving model privacy, however, have remained unexplored. To fill this gap, we investigate the impact of label smoothing on model inversion attacks (MIAs), which aim to generate class-representative samples by exploiting the knowledge encoded in a classifier, thereby inferring sensitive information about its training data. Through extensive analyses, we uncover that traditional label smoothing fosters MIAs, thereby increasing a model's privacy leakage. Even more, we reveal that smoothing with negative factors counters this trend, impeding the extraction of class-related information and leading to privacy preservation, beating state-of-the-art defenses. This establishes a practical and powerful novel way for enhancing model resilience against MIAs.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Backdoor Detection Backdoor Attack Data Poisoning AI Security Deep learning
Scores: [ 6 6 6 6 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: Adversarial attack transferability Vit transformer
Scores: [ 6 5 5 ]
At present, various variants of Vision Transformers (ViTs) models have been widely applied in fields such as computer vision, natural language processing, and cross-modal applications. A primary rationale behind this is that applying gradient propagation and gradient regularization across different functional regions in the transformer structure can enhance the transferability of adversarial samples. However, in practice, substantial gradient disparities exist even within the same functional region across different layers. In this paper, we introduce a novel Gradient Normalization Scaling method for fine-grained gradient editing to enhance the transferability of adversarial attacks on ViTs. More importantly, we highlight that ViTs, unlike conventional CNNs, exhibit distinct attention points in the frequency domain. Leveraging this insight, we delve into exploring frequency domain to further enhance the algorithm's transferability. Through extensive experimentation on various ViT variants and traditional CNN models, we substantiate that the new approach achieves state-of-the-art performance, with an average performance improvement of 33.54% and 42.05% on ViT and CNN models, respectively. Our code is available at: https://anonymous.4open.science/r/GNS-HFE-DD2D/.
Primary Area: societal considerations including fairness, safety, privacy
Keywords: differential privacy federated learning empirical privacy
Scores: [ 8 8 8 ]
Primary Area: societal considerations including fairness, safety, privacy
Keywords: robust fine-tuning adversarial robustness
Scores: [ 5 6 8 6 ]
Robust Fine-Tuning (RFT) is a low-cost strategy to obtain adversarial robustness in downstream applications, without requiring a lot of computational resources and collecting significant amounts of data. This paper uncovers an issue with the existing RFT, where optimizing both adversarial and natural objectives through the feature extractor (FE) yields significantly divergent gradient directions. This divergence introduces instability in the optimization process, thereby hindering the attainment of adversarial robustness and rendering RFT highly sensitive to hyperparameters. To mitigate this issue, we propose a low-rank (LoRa) branch that disentangles RFT into two distinct components: optimizing natural objectives via the LoRa branch and adversarial objectives via the FE. Besides, we introduce heuristic strategies for automating the scheduling of the learning rate and the scalars of loss terms. Extensive empirical evaluations demonstrate that our proposed automated RFT disentangled via the LoRa branch (AutoLoRa) achieves new state-of-the-art results across a range of downstream tasks. AutoLoRa holds significant practical utility, as it automatically converts a pre-trained FE into an adversarially robust model for downstream tasks without the need for searching hyperparameters. Our source code is available at Anonymous GitHub.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Out-of-Variable Generalization Causality
Scores: [ 5 8 6 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Knowledge distillation information theroy bayes conditional distribution estimation conditional mutual information
Scores: [ 8 6 6 ]
It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate for the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and CMI of the teacher are simultaneously maximized when the teacher is trained. In fact, maximizing the teacher's CMI value ensures that the teacher can effectively capture the contextual information within the images, and for visualizing this information, we deploy Eigen-CAM. Via conducting a thorough set of experiments, we show that by employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with the gain of up to 3.32%. This suggests that the teacher's BCPD estimate provided by MCMI method is more accurate than that provided by MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases with the gain of up to 5.72% when 5% of the training samples are available to student (few-shot), and increases from 0% to as high as 84% for an omitted class (zero-shot).
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Domain Adaption Continual Pre-training Large Language Model
Scores: [ 6 8 6 6 ]
We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taken inspiration from human learning via reading comprehension--practice after reading improves the ability to answer questions based on the learned knowledge--we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: test-time adaptation batch normalization distribution shift
Scores: [ 6 6 8 6 ]
In an era where test-time adaptation methods increasingly rely on the nuanced manipulation of batch normalization (BN) parameters, one critical assumption often goes overlooked: that of independently and identically distributed (i.i.d.) test batches with respect to unknown labels. This assumption culminates in biased estimates of BN statistics and jeopardizes system stability under non-i.i.d. conditions. This paper pioneers a departure from the i.i.d. paradigm by introducing a groundbreaking strategy termed `$\textbf{Un-Mix}$ing $\textbf{T}$est-Time $\textbf{N}$ormalization $\textbf{S}$tatistics' (UnMix-TNS). UnMix-TNS re-calibrates the instance-wise statistics used to normalize each instance in a batch by mixing it with multiple unmixed statistics components, thus inherently simulating the i.i.d. environment. The key lies in our innovative online unmixing procedure, which persistently refines these statistics components by drawing upon the closest instances from an incoming test batch. Remarkably generic in its design, UnMix-TNS seamlessly integrates with an array of state-of-the-art test-time adaptation methods and pre-trained architectures equipped with BN layers. Empirical evaluations corroborate the robustness of UnMix-TNS under varied scenarios—ranging from single to continual and mixed domain shifts. UnMix-TNS stands out when handling test data streams with temporal correlation, including those with corrupted real-world non-i.i.d. streams, sustaining its efficacy even with minimal batch sizes and individual samples. Our results set a new standard for test-time adaptation, demonstrating significant improvements in both stability and performance across multiple benchmarks. Our code will be released upon acceptance.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: test-time adaptation domain adaptation domain shift
Scores: [ 6 5 6 6 8 ]
Test-time adaptation (TTA) aims to adapt a pre-trained model from a source domain to a target domain only using online unlabeled target data during testing, without accessing to the source data or modifying the original training process. Among the various TTA methods, pseudo-labeling has gained popularity. However, the presence of incorrect pseudo-labels can hinder the effectiveness of target domain adaptation. To overcome this challenge, we propose a novel TTA method, called PROtotype GRAph Model based pseudo-label learning (PROGRAM). PROGRAM consists of two key components: (1) Prototype Graph Model (PGM) for reliable pseudo-label generation; (2) Robust Self-Training (RST) for test-time adaptation with noisy pseudo-labels. PGM constructs the graph using prototypes and test samples, facilitating effective message passing among them to generate more reliable pseudo-labels. RST combines the advantages of consistency regularization and pseudo-labeling to achieve robust target domain adaptation in the presence of noisy pseudo-labels. Our proposed PROGRAM can be easily integrated into existing baselines, resulting in consistent improvement. Extensive experiments show that our PROGRAM outperforms the existing TTA methods on multiple domain generalization and image corruption benchmarks.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: visual prompting reprogramming
Scores: [ 6 3 5 8 ]
Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks. However, there has hitherto been little systematic study of the design space of VP and no clear benchmark for evaluating its performance. To bridge this gap, we propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with 12 downstream image-classification tasks that can serve as a holistic VP-performance benchmark. Our design space covers 1) the joint optimization of the prompts; 2) the selection of pre-trained models, including image classifiers and text-image encoders; and 3) model output mapping strategies, including nonparametric and trainable label mapping. Our extensive experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin, having up to 6.7% improvement in accuracy; and attains a maximum performance increase of 27.5% compared to linear-probing (LP) baseline. AutoVP thus makes a two-fold contribution: serving both as an efficient tool for hyperparameter tuning on VP design choices, and as a comprehensive benchmark that can reasonably be expected to accelerate VP’s development.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: referring image segmentation; parameter efficient tuning
Scores: [ 8 8 6 ]
Pre-training followed by full fine-tuning has gradually been substituted by Parameter-Efficient Tuning (PET) in the field of computer vision tasks. PET has gained popularity, especially in the context of large-scale models, due to its ability to reduce transfer learning costs and conserve hardware resources. However, existing PET approaches primarily focus on recognition tasks and typically support uni-modal optimization, neglecting dense prediction tasks and vision language interactions. To address this limitation, we propose a novel PET framework called Bi-directional Intertwined Vision Language Efficient Tuning for Referring Image Segmentation (BarLeRIa), which leverages bi-directional intertwined vision language adapters to fully exploit the frozen pre-trained models' potential in cross-modal dense prediction tasks. In BarLeRIa, two different tuning modules are employed for efficient global and local attention, as well as an intertwined vision language tuning algorithm for efficient modal fusion. Extensive experiments conducted on challenging RefCOCO-related benchmarks demonstrating the superiority of BarLeRIa over prior PET methods with a significant margin, \emph{i.e.}, achieving an average improvement of 5.6%. Remarkably, without requiring additional training datasets, BarLeRIa even surpasses SOTA full fine-tuning approaches.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: model adaptation cloud-edge model deployment cloud-edge model collaboration test-time adaptation
Scores: [ 5 8 6 6 ]
The conventional deep learning paradigm often involves training a deep model on a server and then deploying the model or its distilled ones to resource-limited edge devices. Usually, the models shall remain fixed once deployed (at least for some period) due to the potential high cost of model adaptation for both the server and edge sides. However, in many real-world scenarios, the test environments may change dynamically (known as distribution shifts), which often results in degraded performance. Thus, one has to adapt the edge models promptly to attain promising performance. Moreover, with the increasing data collected at the edge, this paradigm also fails to further adapt the cloud model for better performance. To address these, we encounter two primary challenges: 1) the edge model has limited computation power and may only support forward propagation; 2) the data transmission budget between cloud and edge devices is limited in latency-sensitive scenarios. In this paper, we establish a Cloud-Edge Model Adaptation (CEMA) paradigm in which the edge models only need to perform forward propagation and the edge models can be adapted online. In our CEMA, to reduce the communication burden, we devise two criteria to exclude unnecessary samples from uploading to the cloud, i.e., dynamic unreliable and low-informative sample exclusion. Based on the uploaded samples, we update and distribute the affine parameters of normalization layers by distilling from the stronger foundation model to the edge model with a sample replay strategy. Extensive experimental results on ImageNet-C and ImageNet-R verify the effectiveness of our CEMA.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Open Vocabulary Object Detection Visual descriptors
Scores: [ 6 6 6 6 6 ]
Inspired by the outstanding zero-shot capability of vision language models (VLMs) in image classification tasks, open-vocabulary object detection has attracted increasing interest by distilling the broad VLM knowledge into detector training. However, most existing open-vocabulary detectors learn by aligning region embeddings with categorical labels (e.g., bicycle) only, disregarding the capability of VLMs on aligning visual embeddings with fine-grained text descriptions of object parts (e.g., pedals and bells). This paper presents DVDet, a Descriptor-Enhanced Open Vocabulary Detector that introduces conditional context prompts and hierarchical textual descriptors that enable precise region-text alignment as well as open-vocabulary detection training in general. Specifically, the conditional context prompt transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training. In addition, we introduce large language models as an interactive and implicit knowledge repository which enables iterative mining and refining visually oriented textual descriptors for precise region-text alignment. Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: deep learning continual learning overparameterization task similarity catastrophic forgetting theory
Scores: [ 3 8 8 6 3 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Finetuning pretrained model hubs transfer learning hyperparameter optimization meta-learning
Scores: [ 8 8 8 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: language models domain adaptation catastrophic forgetting
Scores: [ 8 5 5 5 ]
Finetuning language models on domain-specific corpus is a common approach to enhance their domain knowledge and capability. While improving performance on domain tasks, it often brings a side-effect of forgetting of the model's general abilities. In this study, we analyze the effects of finetuning on language models by dissecting its impacts on the modeling of topic, style, and factual knowledge in text. Our method uses instruction-following LLMs such as ChatGPT to auto-generate controlled-variable text examples which we use to probe the model. Our findings reveal that finetuning results in significant shifts in the language model's topic and style priors, while actual knowledge learning only contributes to a small fraction of the total probability change. Analysis shows that the adaptation of topic and style priors behave akin to learning simple features: they are learned rapidly and require little model capacity. They are also learned independently and primarily at the beginning of a text sequence. In contrast, factual knowledge is learned stably but slowly and requires significant model capacity to learn. The research offers insights and understanding into the finer dynamics of learning and forgetting in language models, and can potentially inform future research on improving domain adaptation and addressing the challenges of forgetting in continual learning of language models.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Out-of-Distribution generalization Sharpness Robustness
Scores: [ 8 6 5 8 ]
Generalizing to out-of-distribution (OOD) data or unseen domain, termed OOD generalization, still lacks appropriate theoretical guarantees. Canonical OOD bounds focus on different distance measurements between source and target domains but fail to consider the optimization property of the learned model. As empirically shown in recent work, sharpness of learned minimum influences OOD generalization. To bridge this gap between optimization and OOD generalization, we study the effect of sharpness on how a model tolerates data change in domain shift which is usually captured by "robustness" in generalization. In this paper, we give a rigorous connection between sharpness and robustness, which gives better OOD guarantees for robust algorithms. It also provides a theoretical backing for "flat minima leads to better OOD generalization". Overall, we propose a sharpness-based OOD generalization bound by taking robustness into consideration, resulting in a tighter bound than non-robust guarantees. Our findings are supported by the experiments on a ridge regression model, as well as the experiments on deep learning classification tasks.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: continual learning class incremental learning
Scores: [ 6 6 8 8 ]
Class-incremental learning is becoming more popular as it helps models widen their applicability while not forgetting what they already know. A trend in this area is to use a mixture-of-expert technique, where different models work together to solve the task. However, the experts are usually trained all at once using whole task data, which makes them all prone to forgetting and increasing computational burden. To address this limitation, we introduce a novel approach named SEED. SEED selects only one, the most optimal expert for a considered task, and uses data from this task to fine-tune only this expert. For this purpose, each expert represents each class with a Gaussian distribution, and the optimal expert is selected based on the similarity of those distributions. Consequently, SEED increases diversity and heterogeneity within the experts while maintaining the high stability of this ensemble method. The extensive experiments demonstrate that SEED achieves state-of-the-art performance in exemplar-free settings across various scenarios, showing the potential of expert diversification through data in continual learning.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Distribution shfts Domain generalization Visual prompt Foundation model Test-time adaptation
Scores: [ 5 6 8 5 ]
In this paper, we aim to adapt a model at test-time using a few unlabeled data to address distribution shifts. In this setting, extracting the domain knowledge from a limited amount of data is challenging. To improve such a process, it is crucial to utilize correlated information from pre-trained backbones and source domains. Previous studies fail to utilize recent foundation models with strong out-of-distribution generalization. Additionally, domain-centric designs are not flavored in their works. Furthermore, they employ the process of modelling source domains and the process of learning to adapt independently into disjoint training stages. In this work, we propose an approach on top of the pre-computed features of the foundation model. Specifically, we build a knowledge bank to learn the transferable knowledge from source domains. Conditioned on few-shot target data, we introduce a domain prompt generator to condense the knowledge bank into a domain-specific prompt. The domain prompt then directs the visual features towards a particular domain via a guidance module. Moreover, we propose a domain-aware contrastive loss and employ meta-learning to facilitate domain knowledge extraction. Extensive experiments are conducted to validate the domain knowledge extraction. The proposed method outperforms previous work significantly on 5 large-scale benchmarks including WILDS and DomainNet.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: catastrophic forgetting loss of plasticity continual learning streaming learning online learning
Scores: [ 6 3 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Continual test-time adaptation Visual Adapter
Scores: [ 8 6 6 6 ]
Since real-world machine systems are running in non-stationary environments, Continual Test-Time Adaptation (CTTA) task is proposed to adapt the pre-trained model to continually changing target domains. Recently, existing methods mainly focus on model-based adaptation, which aims to leverage a self-training manner to extract the target domain knowledge. However, pseudo labels can be noisy and the updated model parameters are unreliable under dynamic data distributions, leading to error accumulation and catastrophic forgetting in the continual adaptation process. To tackle these challenges and maintain the model plasticity, we tactfully design a Visual Domain Adapter (ViDA) for CTTA, explicitly handling both domain-specific and domain-shared knowledge. Specifically, we first comprehensively explore the different domain representations of the adapters with trainable high-rank or low-rank embedding spaces. Then we inject ViDAs into the pre-trained model, which leverages high-rank and low-rank features to adapt the current domain distribution and maintain the continual domain-shared knowledge, respectively. To exploit the low-rank and high-rank ViDAs more effectively, we further propose a Homeostatic Knowledge Allotment (HKA) strategy, which adaptively combines different knowledge from each ViDA. Extensive experiments conducted on four widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance in both classification and segmentation CTTA tasks. Note that, our method can be regarded as a novel transfer paradigm for large-scale models, delivering promising results in adaptation to continually changing distributions.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Transfer Learning Universal Domain Adaption Memory-Assisted Network Sub-Prototype Mining
Scores: [ 6 5 3 8 ]
Universal domain adaptation aims to align the classes and reduce the feature gap between the same category of the source and target domains. The target private category is set as the unknown class during the adaptation process, as it is not included in the source domain. However, most existing methods overlook the intra-class structure within a category, especially in cases where there exists significant concept shift between the samples belonging to the same category. When samples with large concept shift are forced to be pushed together, it may negatively affect the adaptation performance. Moreover, from the interpretability aspect, it is unreasonable to align visual features with significant differences, such as fighter jets and civil aircraft, into the same category. Unfortunately, due to such semantic ambiguity and annotation cost, categories are not always classified in detail, making it difficult for the model to perform precise adaptation. To address these issues, we propose a novel Memory-Assisted Sub-Prototype Mining (MemSPM) method that can learn the differences between samples belonging to the same category and mine sub-classes when there exists significant concept shift between them. By doing so, our model learns a more reasonable feature space that enhances the transferability and reflects the inherent differences among samples annotated as the same category. We evaluate the effectiveness of our MemSPM method over multiple scenarios, including UniDA, OSDA, and PDA. Our method achieves state-of-the-art performance on four benchmarks in most cases.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Fourier Neural Implicit Representation Seqeuntial Video Compilation Video Continual Learning
Scores: [ 6 8 6 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: meta learning theory non-convex online meta learning piecewise-Lipschitz/non-Lipschitz functions regret bounds PAC-Bayes generalization bound
Scores: [ 6 6 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Multi-task Learning Model Editing Task Arithmetic
Scores: [ 6 6 6 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Federated Learning Continual Learning Catastrophic Forgetting
Scores: [ 5 8 6 ]
Federated Learning (FL) has gained significant attraction due to its ability to enable privacy-preserving training over decentralized data. Current literature in FL mostly focuses on single-task learning. However, over time, new tasks may appear in the clients and the global model should learn these tasks without forgetting previous tasks. This real-world scenario is known as Continual Federated Learning (CFL). The main challenge of CFL is \textit{Global Catastrophic Forgetting}, which corresponds to the fact that when the global model is trained on new tasks, its performance on old tasks decreases. There have been a few recent works on CFL to propose methods that aim to address the global catastrophic forgetting problem. However, these works either have unrealistic assumptions on the availability of past data samples or violate the privacy principles of FL. We propose a novel method, Federated Orthogonal Training (FOT), to overcome these drawbacks and address the global catastrophic forgetting in CFL. Our algorithm extracts the global input subspace of each layer for old tasks and modifies the aggregated updates of new tasks such that they are orthogonal to the global principal subspace of old tasks for each layer. This decreases the interference between tasks, which is the main cause for forgetting. % Our method is almost computation-free on the client side and has negligible communication cost. We empirically show that FOT outperforms state-of-the-art continual learning methods in the CFL setting, achieving an average accuracy gain of up to 15% with 27% lower forgetting while only incurring a minimal computation and communication cost.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Meta-Learning;Memory Augmented Neural Network; Deep Neural Network Application;Graph Summarization
Scores: [ 8 8 6 ]
A graph is a structure made up of vertices and edges used to represent complex relationships between entities, while a graph stream is a continuous flow of graph updates that convey evolving relationships between entities. The massive volume and high dynamism of graph streams promote research on data structures of graph summarization, which provides a concise and approximate view of graph streams with sub-linear space and linear construction time, enabling real-time graph analytics in various domains, such as social networking, financing, and cybersecurity.In this work, we propose the Mayfly, the first neural data structure for summarizing graph streams. The Mayfly replaces handcrafted data structures with better accuracy and adaptivity.To cater to practical applications, Mayfly incorporates two offline training phases.During the larval phase, the Mayfly learns basic summarization abilities from automatically and synthetically constituted meta-tasks, and in the metamorphosis phase, it rapidly adapts to real graph streams via meta-tasks.With specific configurations of information pathways, the Mayfly enables flexible support for miscellaneous graph queries, including edge, node, and connectivity queries.Extensive empirical studies show that the Mayfly significantly outperforms its handcrafted competitors.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: continual learning natural language processing large language model
Scores: [ 6 6 8 6 ]
Continual learning has gained increasing importance as it facilitates the acquisition and refinement of scalable knowledge and skills in language models. However, existing methods typically encounter strict limitations and challenges in real-world scenarios, such as reliance on experience replay, optimization constraints, and inference task-ID. In this study, we introduce the Scalable Language Model (SLM) to overcome these limitations within a more challenging and generalized setting, representing a significant advancement toward practical applications for continual learning. Specifically, we propose the Joint Adaptive Re-Parameterization (JARe), integrated with Dynamic Task-related Knowledge Retrieval (DTKR), to enable adaptive adjustment of language models based on specific downstream tasks. This approach leverages the task distribution within the vector space, aiming to achieve a smooth and effortless continual learning process. Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting. Moreover, while prior research primarily focused on a single task type such as classification, our study goes beyond, with the large language model, i.e., LLaMA-2, to explore the effects across diverse domains and task types, such that a single language model can be decently scaled to broader applications. The code and models will be released to the public.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Domain adaptation Label shift Matching methods
Scores: [ 8 8 8 5 ]
We consider the domain adaptation problem in the context of label shift, where the label distributions between source and target domain differ, but the conditional distributions of features given the label are the same. To solve the label shift adaption problem, we develop a novel matching framework named \textit{class probability matching} (\textit{CPM}). It is inspired by a new understanding of the source domain's class probability, as well as a specific relationship between class probability ratios and feature probability ratios between the source and target domains. CPM is able to maintain the same theoretical guarantee with the existing feature probability matching framework, while significantly improving the computational efficiency due to directly matching the probabilities of the label variable. Within the CPM framework, we propose an algorithm named \textit{class probability matching with calibrated networks} (\textit{CPMCN}) for target domain classification. From the theoretical perspective, we establish the generalization bound of the CPMCN method in order to explain the benefits of introducing calibrated networks. From the experimental perspective, real data comparisons show that CPMCN outperforms existing matching-based and EM-based algorithms.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: transfer learning pretraining continual learning
Scores: [ 8 8 8 5 ]
Training deep networks requires various design decisions regarding for instance their architecture, data augmentation, or optimization. In this work, we find these training variations to result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other – independent of overall performance. Given any arbitrary pairing of pretrained models and no external rankings (such as separate test sets, e.g. due to data privacy), we investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation – a task made particularly difficult as additional knowledge can be contained in stronger, equiperformant or weaker models. Yet facilitating robust transfer in scenarios agnostic to pretrained model pairings would unlock auxiliary gains and knowledge fusion from any model repository without restrictions on model and problem specifics - including from weaker, lower-performance models. This work therefore provides an initial, in-depth exploration on the viability of such general-purpose knowledge transfer. Across large-scale experiments, we first reveal the shortcomings of standard knowledge distillation techniques, and then propose a much more general extension through data partitioning for successful transfer between nearly all pretrained models, which we show can also be done unsupervised. Finally, we assess both the scalability and impact of fundamental model properties on successful model-agnostic knowledge transfer.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: model adaptation to changing data distribution shift
Scores: [ 5 6 8 6 ]
Real-world deployment of machine learning models is challenging because data evolves over time. While no model can work when data evolves in an arbitrary fashion, if there is some pattern to these changes, we might be able to design methods to address it. This paper addresses situations when data evolves gradually. We introduce a time-varying propensity score that can detect gradual shifts in the distribution of data which allows us to selectively sample past data to update the model---not just similar data from the past like that of a standard propensity score but also data that evolved in a similar fashion in the past. The time-varying propensity score is quite general: we demonstrate different ways of implementing it and evaluate it on a variety of problems ranging from supervised learning (e.g., image classification problems) where data undergoes a sequence of gradual shifts, to reinforcement learning tasks (e.g., robotic manipulation and continuous control) where data shifts as the policy or the task changes.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: meta-learning in-context learning
Scores: [ 6 6 5 5 ]
Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach---without meta-training or fine-tuning---exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: model fusion parameter-efficient fine-tuning
Scores: [ 8 6 6 8 ]
Large pre-trained models have enabled significant advances in machine learning and served as foundation components.Model fusion methods, such as task arithmetic, have been proven to be powerful and scalable to incorporate fine-tuned weights from different tasks into a multi-task model. However, efficiently fine-tuning large pre-trained models on multiple downstream tasks remains challenging, leading to inefficient multi-task model fusion.In this work, we propose a novel method to improve multi-task fusion for parameter-efficient fine-tuning techniques like LoRA fine-tuning.Specifically, our approach partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters.This allows us to leverage the the advantages of model fusion over linearized fine-tuning, while still performing fine-tuning and inference efficiently.We demonstrate that our partial linearization technique enables a more effective fusion of multiple tasks into a single model, outperforming standard adapter tuning and task arithmetic alone.Experimental results demonstrate the capabilities of our proposed partial linearization technique to effectively construct unified multi-task models via the fusion of fine-tuned task vectors. We evaluate performance over an increasing number of tasks and find that our approach outperforms standard parameter-efficient fine-tuning techniques. The results highlight the benefits of partial linearization for scalable and efficient multi-task model fusion.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: meta-learning misspecification overparametrization pretraining optimization
Scores: [ 6 5 8 5 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Domain adaptation gradual domain adaptation gradient flow
Scores: [ 6 8 6 6 ]
Domain shift degrades classification models on new data distributions. Conventional unsupervised domain adaptation (UDA) aims to learn features that bridge labeled source and unlabeled target domains. In contrast to feature learning, gradual domain adaptation (GDA) leverages extra continuous intermediate domains with pseudo-labels to boost the source classifier. However, real intermediate domains are sometimes unavailable or ineffective. In this paper, we propose $\textbf{G}$radual Domain Adaptation via $\textbf{G}$radient $\textbf{F}$low (GGF) to generate intermediate domains with preserving labels, thereby enabling us a fine-tuning method for GDA. We employ the Wasserstein gradient flow in Kullback–Leibler divergence to transport samples from the source to the target domain. To simulate the dynamics, we utilize the Langevin algorithm. Since the Langevin algorithm disregards label information and introduces diffusion noise, we introduce classifier-based and sample-based potentials to avoid label switching and dramatic deviations in the sampling process. For the proposed GGF model, we analyze its generalization bound. Experiments on several benchmark datasets demonstrate the superiority of the proposed GGF method compared to state-of-the-art baselines.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: in-context learning causalLM prefixLM
Scores: [ 8 6 6 6 ]
Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples to attend to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follows that of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: domain adaptation; feature transferability;
Scores: [ 6 6 6 6 ]
Unsupervised domain adaptation (UDA) involves adapting a model trained on a label-rich source domain to an unlabeled target domain. However, in real-world scenarios, the absence of target-domain labels makes it challenging to evaluate the performance of UDA models. Furthermore, prevailing UDA methods relying on adversarial training and self-training could lead to model degeneration and negative transfer, further exacerbating the evaluation problem. In this paper, we propose a novel metric called the Transfer Score to address these issues. The proposed metric enables the unsupervised evaluation of UDA models by assessing the spatial uniformity of the classifier via model parameters, as well as the transferability and discriminability of deep representations. Based on the metric, we achieve three novel objectives without target-domain labels: (1) selecting the best UDA method from a range of available options, (2) optimizing hyperparameters of UDA models to prevent model degeneration, and (3) identifying which checkpoint of UDA model performs optimally. Our work bridges the gap between data-level UDA research and practical UDA scenarios, enabling a realistic assessment of UDA model performance. We validate the effectiveness of our metric through extensive empirical studies on UDA datasets of different scales and imbalanced distributions. The results demonstrate that our metric robustly achieves the aforementioned goals.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: distribution-shift robustness fine-tuning adaptation transfer learning
Scores: [ 8 8 5 ]
Transfer learning with a small amount of target data is an effective and common approach to adapting a pre-trained model to distribution shifts. In some situations, target data labels may be expensive to obtain, so we may only have access to a limited number of target data points. To make the most of a very small target dataset, we propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features. Our approach, Project and Probe (Pro$^2$), first learns a linear projection that maps a pre-trained embedding onto orthogonal directions while being predictive of labels in the source dataset. The goal of this step is to learn a variety of predictive features, so that at least some of them remain useful after distribution shift. Pro$^2$ then learns a linear classifier on top of these projected features using a small target dataset. Theoretically, we find that Pro$^2$ results in more sample-efficient generalization by inducing a favorable bias-variance tradeoff. Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro$^2$ improves performance by 5-15% when given limited target data compared to prior methods such as standard linear probing.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: parameter-efficient transfer learning cross-modal modeling
Scores: [ 8 3 6 6 ]
Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable parameters reduced by partial weight sharing. The unified and knowledge-sharing design enables powerful cross-modal representations that can benefit various downstream tasks, requiring only 1.0%-2.0% tunable parameters of the pre-trained model. Extensive experiments on 7 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, VideoQA, VQA and Caption) show that in most cases, UniAdapter not only outperforms the state-of-the-arts, but even beats the full fine-tuning strategy. Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%. The code and models are available at https://github.com/UniAdapter/UniAdapter.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Parameter-efficient fine-tuning Segment Anything Model LoRA Semantic Segmentation
Scores: [ 6 6 6 6 ]
The Segment-Anything Model (SAM) stands as a foundational framework for image segmentation. While it exhibits remarkable zero-shot generalization in typical scenarios, its advantage diminishes when applied to specialized domains like medical imagery and remote sensing. To address this limitation, this paper introduces Conv-LoRA, a simple yet effective parameter-efficient fine-tuning approach. By integrating ultra-lightweight convolutional parameters into LoRA, Conv-LoRA can inject image-related inductive biases into the plain ViT encoder, further reinforcing SAM’s local prior assumption. Notably, Conv-LoRA not only preserves SAM’s extensive segmentation knowledge but also revives its capacity of learning high-level image semantics, which is constrained by SAM’s foreground-background segmentation pretraining. Comprehensive experimentation across diverse benchmarks spanning multiple domains underscores Conv-LoRA’s superiority in adapting SAM to real-world semantic segmentation tasks.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Robustness Domain generalization Sharpness-Aware Minimization Inconsistency
Scores: [ 6 5 6 6 ]
The objective of domain generalization (DG) is to enhance the transferability of the model learned from a source domain to unobserved domains. To prevent overfitting to a specific domain, Sharpness-Aware Minimization (SAM) reduces the sharpness of the source domain's loss landscape. Although SAM and its variants have delivered significant improvements in DG, we highlight that there's still potential for improvement in generalizing to unknown domains through the exploration on data space. Building on this motivation, this paper introduces an objective rooted in both parameter and data perturbed regions for domain generalization, termed Unknown Domain Inconsistency Minimization (UDIM). UDIM reduces the loss landscape inconsistency between source domain and unknown domains. As unknown domains are inaccessible, these domains are empirically crafted by perturbing instances from the source domain dataset. In particular, by aligning the flat minima acquired in the source domain to the loss landscape of perturbed domains, we expect to achieve generalization grounded on these flat minima for the unknown domains. Theoretically, we validate that merging SAM optimization with the UDIM objective establishes an upper bound for the true objective of the DG task. In an empirical aspect, UDIM consistently outperforms SAM variants across multiple DG benchmark datasets. Notably, UDIM shows statistically significant improvements in scenarios with more restrictive domain information, underscoring UDIM's generalization capability in unseen domains.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Distribution Shift Temporal Distribution Shift
Scores: [ 8 8 8 8 ]
Distribution shifts over time are common in real-world machine-learning applications. This scenario is formulated as Evolving Domain Generalization (EDG), where models aim to generalize well to unseen target domains in a time-varying system by learning and leveraging the underlying evolving pattern of the distribution shifts across domains. However, existing methods encounter challenges due to the limited number of timestamps (every domain corresponds to a timestamp) in EDG datasets, leading to difficulties in capturing evolving dynamics and risking overfitting to the sparse timestamps, which hampers their generalization and adaptability to new tasks. To address this limitation, we propose a novel approach SDE-EDG that collects the Infinitely Fined-Grid Evolving Trajectory (IFGET) of the data distribution with continuous-interpolated samples to bridge temporal gaps (intervals between two successive timestamps). Furthermore, by leveraging the inherent capacity of Stochastic Differential Equations (SDEs) to capture continuous trajectories, we propose their use to align SDE-modeled trajectories with IFGET across domains, thus enabling the capture of evolving distribution trends. We evaluate our approach on several benchmark datasets and demonstrate that it can achieve superior performance compared to existing state-of-the-art methods.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Meta-learning prior extraction algorithm unrolling
Scores: [ 8 8 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: language models classification and regression model pre-training tabular data
Scores: [ 8 6 6 8 ]
The transferability of deep neural networks (DNNs) has made significant progress in image and language processing. However, due to the heterogeneity among tables, such DNN bonus is still far from being well exploited on tabular data prediction (e.g., regression or classification tasks). Condensing knowledge from diverse domains, language models (LMs) possess the capability to comprehend feature names from various tables, potentially serving as versatile learners in transferring knowledge across distinct tables and diverse prediction tasks, but their discrete text representation space is inherently incompatible with numerical feature values in tables. In this paper, we present TP-BERTa, a specifically pre-trained LM model for tabular data prediction. Concretely, a novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names. Comprehensive experiments demonstrate that our pre-trained TP-BERTa leads the performance among tabular DNNs and is competitive with Gradient Boosted Decision Tree models in typical tabular data regime. Our pre-trained model will be publicly available.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: online learning non-stationarity Kalman filter continual learning probabilistic modellling
Scores: [ 8 6 6 6 ]
In Online Continual Learning (OCL) a learning system receives a stream of data and sequentially performs prediction and training steps. Important challenges in OCL are concerned with automatic adaptation to the particular non-stationary structure of the data, and with quantification of predictive uncertainty. Motivated by these challenges we introduce a probabilistic Bayesian online learning model by using a (possibly pretrained) neural representation and a state space model over the linear predictor weights. Non-stationarity over the linear predictor weights is modelled using a “parameter drift” transition density, parametrized by a coefficient that quantifies forgetting. Inference in the model is implemented with efficient Kalman filter recursions which track the posterior distribution over the linear weights, while online SGD updates over the transition dynamics coefficient allows to adapt to the non-stationarity seen in data. While the framework is developed assuming a linear Gaussian model, we also extend it to deal with classification problems and for fine-tuning the deep learning representation. In a set of experiments in multi-class classification using data sets such as CIFAR-100 and CLOC we demonstrate the predictive ability of the model and its flexibility to capture non-stationarity.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: fine-tuning SGD freezing layers distribution shift
Scores: [ 6 6 6 6 8 ]
SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Domain adaptation Test-time adaptation Distribution shift Catastrophic forgetting
Scores: [ 6 5 8 8 ]
Test-time adaptation (TTA) addresses distribution shifts for streaming test data in unsupervised settings. Currently, most TTA methods can only deal with minor shifts and rely heavily on heuristic and empirical studies. To advance TTA under domain shifts, we propose the novel problem setting of active test-time adaptation (ATTA) that integrates active learning within the fully TTA setting. We provide a learning theory analysis, demonstrating that incorporating limited labeled test instances enhances overall performances across test domains with a theoretical guarantee. We also present a sample entropy balancing for implementing ATTA while avoiding catastrophic forgetting (CF). We introduce a simple yet effective ATTA algorithm, known as SimATTA, using real-time sample selection techniques. Extensive experimental results confirm consistency with our theoretical analyses and show that the proposed ATTA method yields substantial performance improvements over TTA methods while maintaining efficiency and shares similar effectiveness to the more demanding active domain adaptation (ADA) methods.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Continual Learning Semi-supervised Continual Learning
Scores: [ 8 8 6 5 ]
We propose and study a realistic Continual Learning (CL) setting where learning algorithms are granted a restricted computational budget per time step while training. We apply this setting to large-scale semi-supervised Continual Learning scenarios with sparse label rate. Previous proficient CL methods perform very poorly in this challenging setting. Overfitting to the sparse labeled data and insufficient computational budget are the two main culprits for such a poor performance. Our new setting encourages learning methods to effectively and efficiently utilize the unlabeled data during training. To that end, we propose a simple but highly effective baseline, DietCL, which utilizes both unlabeled and labeled data jointly. DietCL meticulously allocates computational budget for both types of data. We validate our baseline, at scale, on several datasets, e.g., CLOC, ImageNet10K, and CGLM, under constraint budget setup. DietCL outperforms, by a large margin, all existing supervised CL algorithms as well as more recent continual semi-supervised methods. Our extensive analysis and ablations demonstrate that DietCL is stable under a full spectrum of label sparsity, computational budget and various other ablations.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Synthetic Data Transfer Learning
Scores: [ 5 5 6 8 ]
Diffusion models have emerged as a powerful method of generative modeling across a range of fields, capable of producing stunning photo-realistic images from natural language descriptions. However, these models lack explicit control over the 3D structure in the generated images. Consequently, this hinders our ability to obtain detailed 3D annotations for the generated images or to craft instances with specific poses and distances. In this paper, we propose a simple yet effective method that incorporates 3D geometry control into diffusion models. Our method exploits ControlNet, which extends diffusion models by using visual prompts in addition to text prompts. We generate images of the 3D objects taken from 3D shape repositories (e.g., ShapeNet and Objaverse), render them from a variety of poses and viewing directions, compute the edge maps of the rendered images, and use these edge maps as visual prompts to generate realistic images. With explicit 3D geometry control, we can easily change the 3D structures of the objects in the generated images and obtain ground-truth 3D annotations automatically. This allows us to improve a wide range of vision tasks, e.g., classification and 3D pose estimation, in both in-distribution (ID) and out-of-distribution (OOD) settings. We demonstrate the effectiveness of our method through extensive experiments on ImageNet-100, ImageNet-R, PASCAL3D+, ObjectNet3D, and OOD-CV. The results show that our method significantly outperforms existing methods across multiple benchmarks, e.g., 3.8 percentage points on ImageNet-100 using DeiT-B and 3.5 percentage points on PASCAL3D+ & ObjectNet3D using NeMo.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Unsupervised Domain Adaptation Sim2Real
Scores: [ 8 6 6 ]
Synthetic data (Sim) drawn from simulators have emerged as a popular alternative for training models where acquiring annotated real-world images is difficult. However, transferring models trained on synthetic images to real-world applications can be challenging due to appearance disparities. A commonly employed solution to counter this Sim2Real gap is unsupervised domain adaptation, where models are trained using labeled Sim data and unlabeled Real data. Mispredictions made by such Sim2Real adapted models are often associated with miscalibration – stemming from overconfident predictions on real data. In this paper, we introduce AUGCAL, a simple training-time patch for unsupervised adaptation that improves Sim2Real adapted models by – (1) reducing overall miscalibration, (2) reducing overconfidence in incorrect predictions and (3) improving confidence score reliability by better guiding misclassification detection – all while retaining or improving Sim2Real performance. Given a base Sim2Real adaptation algorithm, at training time, AUGCAL involves replacing vanilla Sim images with strongly augmented views (AUG intervention) and additionally optimizing for a training time calibration loss on augmented Sim predictions (CAL intervention). We motivate AUGCAL using a brief analytical justification of how to reduce miscalibration on unlabeled Real data. Through our experiments, we empirically show the efficacy of AUGCAL across multiple adaptation methods, backbones, tasks and shifts.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: implicit inference in language models fine-tuning catastrophic forgetting
Scores: [ 3 8 6 6 ]
We lack a systematic understanding of the effects of fine-tuning (via methods such as instruction-tuning or reinforcement learning from human feedback), particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on tasks within the fine-tuning data distribution comes at the expense of capabilities on other tasks. We hypothesize that language models implicitly infer the task of the prompt and that fine-tuning skews this inference towards tasks in the fine-tuning distribution. To test this, we propose Conjugate Prompting, which artificially makes the task look farther from the fine-tuning distribution while requiring the same capability, and we find that this recovers some of the pretraining capabilities on our synthetic setup. Since real-world fine-tuning distributions are predominantly English, we apply conjugate prompting to recover pretrained capabilities in LLMs by simply translating the prompts to different languages. This allows us to recover the in-context learning abilities lost via instruction tuning, and more concerningly, recover harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Composition Forgetting Tangent Learning
Scores: [ 6 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Continual Learning
Scores: [ 10 8 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Pre training Noisy model learning Label noise Noise mitigation
Scores: [ 10 8 8 8 ]
Pre-training on large-scale datasets and then fine-tuning on downstream tasks have become a standard practice in deep learning. However, pre-training data often contain label noise that may adversely affect the generalization of the model. This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. More specifically, through extensive experiments of supervised pre-training models on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that while slight noise in pre-training can benefit in-domain (ID) transfer performance, where the training and testing data share the same distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing data distribution are different. We empirically verify that the reason behind is noise in pre-training shapes the feature space differently. We then propose a lightweight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization on both ID and OOD tasks, considering one may not be able to fully fine-tune or even access the pre-trained models. We conduct practical experiments on popular vision and language models that are pre-trained on noisy data for evaluation of our approach. Our analysis and results show the importance of this interesting and novel research direction, which we term Noisy Model Learning.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: computer vision natural language processing natural language generation GPT large language models LoRA adaptation Transfer Learning Transformers
Scores: [ 6 6 6 6 ]
Large Language Models (LLMs) have recently gained popularity due to their impressive few-shot performance across various downstream tasks. However, fine-tuning all parameters and storing a unique model for each downstream task or domain becomes impractical because of the massive size of checkpoints (e.g., 350GB in GPT-3). Current literature, such as LoRA, showcases the potential of low-rank modifications to the original weights of an LLM, enabling efficient adaptation and storage for task-specific models. These methods can reduce the number of parameters needed to fine-tune an LLM by several orders of magnitude. Yet, these methods face two primary limitations: 1) the parameter reduction is lower-bounded by the rank one decomposition, and 2) the extent of reduction is heavily influenced by both the model architecture and the chosen rank.For instance, in larger models, even a rank one decomposition might exceed the number of parameters truly needed for adaptation. In this paper, we introduce NOLA, which overcomes the rank one lower bound present in LoRA. It achieves this by re-parameterizing the low-rank matrices in LoRA using linear combinations of randomly generated matrices (basis) and optimizing the linear mixture coefficients only. This approach allows us to decouple the number of trainable parameters from both the choice of rank and the network architecture. We present adaptation results using GPT-2 and ViT in natural language and computer vision tasks. NOLA performs as well as, or better than models with equivalent parameter counts. Furthermore, we demonstrate that we can halve the parameters in larger models compared to LoRA with rank one, without sacrificing performance.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: stochastic neural architecture search few shot learning adapters
Scores: [ 6 8 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Spatio-temporal Graph Transfer learning Diffusion models Pre-Training
Scores: [ 8 6 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Meta Learning; Deep Learning Architecture; General Machine Learning
Scores: [ 6 6 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Parameter-efficient fine-tuning Transfer learning Low-rank NLP
Scores: [ 8 5 8 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Vision-Language Models Prompt Learning Federated Learning
Scores: [ 5 6 6 6 ]
Prompt learning for vision-language models, e.g., CoOp, has shown great success in adapting CLIP to different downstream tasks, making it a promising solution for federated learning due to computational reasons. Existing prompt learning techniques replace hand-crafted text prompts with learned vectors that offer improvements on seen classes but struggle to generalize to unseen classes. Our work addresses this challenge by proposing Federated Text-driven Prompt Generation (FedTPG), which learns a unified prompt generation network across multiple remote clients in a scalable manner. The prompt generation network is conditioned on task-related text input, thus, is context-aware, making it suitable to generalize for both seen and unseen classes. Our comprehensive empirical evaluations on nine diverse image classification datasets show that our method is superior to existing federated prompt learning methods, that achieve overall better generalization on both seen and unseen classes and is also generalizable to unseen datasets.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Personalized Federated Learning Learning-to-Learn
Scores: [ 6 8 5 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: continual learning modular machine learning modular continual learning transfer learning catastrophic forgetting Bayesian optimization probabilistic modelling
Scores: [ 8 8 8 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Language Model Model Editing Meta Learning
Scores: [ 6 6 10 5 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: CLIP training-free adaptation
Scores: [ 6 6 6 ]
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficientfine-tuning methods, such as prompt learning and adapter, to enhance CLIP’sperformance in downstream tasks. However, these methods still require additionaltraining time and computational resources, which is undesirable for devices withlimited resources. In this paper, we revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP.Typically, GDA assumes that features of each class follow Gaussian distributionswith identical covariance. By leveraging Bayes’ formula, the classifier can beexpressed in terms of the class means and covariance, which can be estimated fromthe data without the need for training. To integrate knowledge from both visual andtextual modalities, we ensemble it with the original zero-shot classifier within CLIP.Extensive results on 17 datasets validate that our method surpasses or achievescomparable results with state-of-the-art methods on few-shot classification, imbalanced learning, and out-of-distribution generalization. In addition, we extendour method to base-to-new generalization and unsupervised learning, once againdemonstrating its superiority over competing approaches.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Knowledge Distillation Meta-Knowledge Distillation Policy-driven Knowledge Distillation Large Language Models
Scores: [ 8 8 5 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: generalization sharpness-aware minimization domain shift
Scores: [ 6 6 8 5 ]
This paper presents a Domain-Inspired Sharpness Aware Minimization (DISAM) algorithm for optimization under domain shifts. It is motivated by the inconsistent convergence degree of SAM across different domains, which induces optimization bias towards certain domains and thus impairs the overall convergence. To address this issue, we consider the domain-level convergence consistency in the sharpness estimation to prevent the overwhelming (deficient) perturbations for less (well) optimized domains. Specifically, DISAM introduces the constraint of minimizing variance in the domain loss, which allows the elastic gradient calibration in perturbation generation: when one domain is optimized above the averaging level \textit{w.r.t.} loss, the gradient perturbation towards that domain will be weakened automatically, and vice versa. Under this mechanism, we theoretically show that DISAM can achieve faster overall convergence and improved generalization in principle when inconsistent convergence emerges. Extensive experiments on various domain generalization benchmarks show the superiority of DISAM over a range of state-of-the-art methods. Furthermore, we show the superior efficiency of DISAM in parameter-efficient fine-tuning combined with the pretraining models.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: In-context Learning Transformers Inductive Biases Meta Learning Language Modelling Bayesian Inference
Scores: [ 8 6 6 ]
In-context learning (ICL) is one of the surprising and useful features of large language models and subject of intense research. Recently, stylized meta-learning-like ICL setups have been devised that train transformers on sequences of input-output pairs (x,f(x)) using the language modeling loss. The function f comes from a function class and generalization is checked by evaluation on sequences for unseen functions from the same class. One of the main discoveries in this line of research has been that for several function classes, such as linear regression, transformers successfully generalize to new functions in the class. However, the inductive biases of these models resulting in this behavior are not clearly understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution.In this paper we empirically examine how far this Bayesian perspective can help us understand ICL. To this end, we generalize the previous meta-ICL setup to hierarchical meta-ICL setup which involve unions of multiple task families. We instantiate this setup on a diverse range of linear and nonlinear function families and find that transformers can do ICL in this setting as well. Where Bayesian inference is tractable, we find evidence that high-capacity transformers mimic the Bayesian predictor. The Bayesian perspective provides insights into the inductive bias of ICL and how transformers perform a particular task when they are trained on multiple tasks. We also find that transformers can learn to generalize to new function classes that were not seen during pretraining. This involves deviation from the Bayesian predictor. We examine deviations from the Bayesian predictor in more depth offering new insights and hypotheses.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: modularity compositionality compositional generalization teacher-student meta-learning hypernetworks
Scores: [ 8 6 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Parameter-efficient Fine-tuning Model Capacity
Scores: [ 6 8 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Rehearsal-free continual learning Class-incremental learning Parameter-efficient fine-tuning Outlier regularization
Scores: [ 6 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: non-transferable representation learning domain adaptation transfer learning
Scores: [ 6 8 6 ]
Non-transferable learning (NTL) aims to restrict the generalizability of models toward the target domain(s). To this end, existing works learn non-transferable representations by reducing statistical dependence between the source and target domain. However, such statistical methods essentially neglect to distinguish between styles and contents, leading them to inadvertently fit (i) spurious correlation between styles and labels, and (ii) fake independence between contents and labels. Consequently, their performance will be limited when natural distribution shifts occur or malicious intervention is imposed. In this paper, we propose a novel method (dubbed as H-NTL) to understand and advance the NTL problem by introducing a causal model to separately model content and style as two latent factors, based on which we disentangle and harness them as guidances for learning non-transferable representations with intrinsically causal relationships. Specifically, to avoid fitting spurious correlation and fake independence, we propose a variational inference framework to disentangle the naturally mixed content factors and style factors under our causal model. Subsequently, based on dual-path knowledge distillation, we harness the disentangled two factors as guidances for non-transferable representation learning: (i) we constraint the source domain representations to fit content factors (which are the intrinsic cause of labels), and (ii) we enforce that the target domain representations fit style factors which barely can predict labels. As a result, the learned feature representations follow optimal untransferability toward the target domain and minimal negative influence on the source domain, thus enabling better NTL performance. Empirically, the proposed H-NTL significantly outperforms competing methods by a large margin.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Continual Learning Prompt Tuning Gradient Projection Anti-forgetting
Scores: [ 6 8 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: continual learning class-incremental learning
Scores: [ 5 5 8 8 ]
Class-incremental learning (CIL) is a particularly challenging variant of continual learning, where the goal is to learn to discriminate between all classes presented in an incremental fashion. Existing approaches often suffer from excessive forgetting and imbalance of the scores assigned to classes that have not been seen together during training. In this study, we introduce a novel approach, Prediction Error-based Classification (PEC), which differs from traditional discriminative and generative classification paradigms. PEC computes a class score by measuring the prediction error of a model trained to replicate the outputs of a frozen random neural network on data from that class. The method can be interpreted as approximating a classification rule based on Gaussian Process posterior variance. PEC offers several practical advantages, including sample efficiency, ease of tuning, and effectiveness even when data are presented one class at a time. Our empirical results show that PEC performs strongly in single-pass-through-data CIL, outperforming other rehearsal-free baselines in all cases and rehearsal-based methods with moderate replay buffer size in most cases across multiple benchmarks.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Model Merging Gradient Matching Language Modeling Model Editing Transfer Learning
Scores: [ 6 6 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Online Test-time Adaptation Catastrophic Forgetting Kalman Filter
Scores: [ 8 6 3 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Continual Learning
Scores: [ 6 3 6 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Test-time adaptation Roustness
Scores: [ 8 6 6 8 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Class Incremental Learning Continual Learning
Scores: [ 5 8 5 6 ]
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: lifelong learning reinforcement learning human feedback proximal policy optimization
Scores: [ 6 8 5 6 ]
The approach of Reinforcement Learning from Human Feedback (RLHF) is widely used for enhancing pre-trained Language Models (LM), enabling them to better align with human preferences. Existing RLHF-based LMs however require complete retraining whenever new queries or feedback are introduced, as human preferences may differ across different domains or topics. LM retraining is often impracticable in most real-world scenarios, due to the substantial time and computational costs involved, as well as data privacy concerns. To address this limitation, we propose a new method to Continually align with human preferences based on Proximal Policy Optimization (CPPO), in which sample-wise weights are introduced into the original PPO algorithm to adjust policy learning and knowledge retention. Specifically, CPPO introduces a mechanism for deciding which samples should be utilized for enhancing policy learning and which should be used for solidifying past experiences. This seeks a good trade-off between policy learning and knowledge retention. Our experimental results show that CPPO outperforms strong Continuous learning (CL) baselines when it comes to consistently aligning with human preferences. Furthermore, compared to PPO, CPPO offers more efficient and stable learning in non-continual scenarios.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Parameter-efficient finetuning orthogonal Butterfly matrix
Scores: [ 5 6 8 ]
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in computer vision and natural language. The results validate the effectiveness of BOFT as a generic finetuning method.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Computer vision continual learning class-incremental learning exemplar free lifelong learning
Scores: [ 6 8 8 6 ]
Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a sequence of tasks without having access to previous task data. In this paper, we consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone. This is especially challenging for EFCIL since it requires high plasticity, which results in feature drift which is difficult to compensate for in the exemplar-free setting. To address this problem, we propose a simple and effective approach that consolidates feature representations by regularizing drift in directions highly relevant to previous tasks and employs prototypes to reduce task-recency bias. Our method, called Elastic Feature Consolidation (EFC), exploits a tractable second-order approximation of feature drift based on an Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in feature space which we use to regularize feature drift in important directions and to update Gaussian prototypes used in a novel asymmetric cross entropy loss which effectively balances prototype rehearsal with data from new tasks. Experimental results on CIFAR-100, Tiny-ImageNet, and ImageNet-Subset demonstrate that Elastic Feature Consolidation is better able to learn new tasks by maintaining model plasticity and significantly outperform the state-of-the-art.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: nighttime optical flow event camera domain adaptation common space
Scores: [ 10 8 6 ]
We investigate a challenging task of nighttime optical flow, which suffers from weakened texture and amplified noise. These degradations weaken discriminative visual features, thus causing invalid motion feature matching. Typically, existing methods employ domain adaptation to transfer knowledge from auxiliary domain to nighttime domain in either input visual space or output motion space. However, this direct adaptation is ineffective, since there exists a large domain gap due to the intrinsic heterogeneous nature of the feature representations between auxiliary and nighttime domains. To overcome this issue, we explore a common-latent space as the intermediate bridge to reinforce the feature alignment between auxiliary and nighttime domains. In this work, we exploit two auxiliary daytime and event domains, and propose a novel common appearance-boundary adaptation framework for nighttime optical flow. In appearance adaptation, we employ the intrinsic image decomposition to embed the auxiliary daytime image and the nighttime image into a reflectance-aligned common space. We discover that motion distributions of the two reflectance maps are very similar, benefiting us to consistently transfer motion appearance knowledge from daytime to nighttime domain. In boundary adaptation, we theoretically derive the motion correlation formula between nighttime image and accumulated events within a spatiotemporal gradient-aligned common space. We figure out that the correlation of the two spatiotemporal gradient maps shares significant discrepancy, benefitting us to contrastively transfer boundary knowledge from event to nighttime domain. Moreover, appearance adaptation and boundary adaptation are complementary to each other, since they could jointly transfer global motion and local boundary knowledge to the nighttime domain. Extensive experiments have been performed to verify the superiority of the proposed method.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Segment Anything Model (SAM) one-shot learning text-to-image generation
Scores: [ 8 6 6 ]
Driven by large-data pre-training, Segment Anything Model (SAM) has been demonstrated as a powerful promptable framework, revolutionizing the segmentation field. Despite the generality, customizing SAM for specific visual concepts without man-powered prompting is under-explored, e.g., automatically segmenting your pet dog in numerous images. In this paper, we introduce a training-free Personalization approach for SAM, termed PerSAM. Given only one-shot data, i.e., a single image with a reference mask, we first obtain a positive-negative location prior for the target concept in new images. Then, aided by target visual semantics, we empower SAM for personalized object segmentation via two proposed techniques: target-guided attention and target-semantic prompting. In this way, we can effectively customize the general-purpose SAM for private use without any training. To further alleviate the ambiguity of segmentation scales, we present an efficient one-shot fine-tuning variant, PerSAM-F. Freezing the entire SAM, we introduce a scale-aware fine-tuning to aggregate multi-scale masks, which only tunes 2 parameters within 10 seconds for improved performance. To demonstrate our efficacy, we construct a new dataset, PerSeg, for the evaluation of personalized object segmentation, and also test our methods on various one-shot image and video segmentation benchmarks. Besides, we propose to leverage PerSAM to improve DreamBooth for personalized text-to-image synthesis. By mitigating the disturbance of training-set backgrounds, our approach showcases better target appearance generation and higher fidelity to the input text prompt.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Transfer Learning Inductive Transfer Geometrical Deeplearning Regression
Scores: [ 5 6 8 6 ]
Transfer learning is a crucial technique for handling a small amount of data that is potentially related to other abundant data. However, most of the existing methods are focused on classification tasks using images and language datasets. Therefore, in order to expand the transfer learning scheme to regression tasks, we propose a novel transfer technique based on differential geometry, namely the Geometrically Aligned Transfer Encoder (GATE). In this method, we interpret the latent vectors from the model to exist on a Riemannian curved manifold. We find a proper diffeomorphism between pairs of tasks to ensure that every arbitrary point maps to a locally flat coordinate in the overlapping region, allowing the transfer of knowledge from the source to the target data. This also serves as an effective regularizer for the model to behave in extrapolation regions. In this article, we demonstrate that GATE outperforms conventional methods and exhibits stable behavior in both the latent space and extrapolation regions for various molecular graph datasets.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: continual learning bias spurious correlation
Scores: [ 5 6 8 6 ]
Most continual learning (CL) algorithms have focused on tackling the stability- plasticity dilemma, that is, the challenge of preventing the forgetting of past tasks while learning new ones. However, we argue that they have overlooked the impact of knowledge transfer when the training dataset of a certain task is biased — namely, when the dataset contains some spurious correlations that can overly influence the prediction rule of a model. In that case, how would the dataset bias of a certain task affect prediction rules of a CL model for the future or past tasks? In this work, we carefully design systematic experiments using three benchmark datasets to answer the question from our empirical findings. Specifically, we first show through two-task CL experiments that standard CL methods, which are oblivious of the dataset bias, can transfer bias from one task to another, both forward and backward. Moreover, we find out this transfer is exacerbated depending on whether the CL methods focus on stability or plasticity. We then present that the bias is also transferred and even accumulates in longer task sequences. Finally, we offer a standardized experiment setup and a simple, yet strong plug-in baseline method, dubbed as Group-class Balanced Greedy Sampling (BGS). These resources can be utilized for the development of more advanced bias-aware CL methods.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: computer vision vision-langauge model transfer learning fine-tuning distribution shifts
Scores: [ 6 6 5 6 ]
Large-scale contrastive vision-language pre-trained models provide the zero-shot model achieving competitive performance across a range of image classification tasks without requiring training on downstream data. Recent works have confirmed that while additional fine-tuning of the zero-shot model on the reference data results in enhanced downstream performance, it compromises the model's robustness against distribution shifts. Our investigation begins by examining the conditions required to achieve the goals of robust fine-tuning, employing descriptions based on feature distortion theory and joint energy-based models. Subsequently, we propose a novel robust fine-tuning algorithm, Lipsum-FT, that effectively utilizes the language modeling aspect of the vision-language pre-trained models. Extensive experiments conducted on distribution shift scenarios in DomainNet and ImageNet confirm the superiority of our proposed Lipsum-FT approach over existing robust fine-tuning methods.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: continuous prompt tuning zero-shot prompt transfer cross-model prompt transfer
Scores: [ 6 6 8 8 ]
Prompt tuning in natural language processing (NLP) has become an increasingly popular method for adapting large language models to specific tasks. However, the transferability of these prompts, especially continuous prompts, between different models remains a challenge. In this work, we propose a zero-shot continuous prompt transfer method, where source prompts are encoded into relative space and the corresponding target prompts are searched for transferring to target models. Experimental results confirm the effectiveness of our method, showing that 'task semantics' in continuous prompts can be generalized across various language models. Moreover, we find that combining 'task semantics' from multiple source models can further enhance the generalizability of transfer.
Primary Area: transfer learning, meta learning, and lifelong learning
Keywords: Foundation model Multitask finetuning Few-Shot learning
Scores: [ 6 8 5 5 ]
Foundation models have emerged as a powerful tool for many AI problems. Despite the tremendous success of foundation models, effective adaptation to new tasks, particularly those with limited labels, remains an open question and lacks theoretical understanding. An emerging solution with recent success in vision and NLP involves finetuning a foundation model on a selection of relevant tasks, before its adaptation to a target task with limited labeled samples. In this paper, we study the theoretical justification of this multitask finetuning approach. Our theoretical analysis reveals that with a diverse set of related tasks, this multitask finetuning leads to reduced error in the target task, in comparison to directly adapting the same pretrained model. We quantify the relationship between finetuning tasks and target tasks by diversity and consistency metrics, and further propose a practical task selection algorithm. We substantiate our theoretical claims with extensive empirical evidence. Further, we present results affirming our task selection algorithm adeptly chooses related finetuning tasks, providing advantages to the model performance on target tasks. We believe our study shed new light on the effective adaptation of foundation models to new tasks that lack abundant labels.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Vision-language models soft-prompt tuning low-norm effect normalizing soft prompts
Scores: [ 6 6 8 ]
With the prevalence of large-scale pretrained vision-language models (VLMs), such as CLIP, soft-prompt tuning has become a popular method for adapting these models to various downstream tasks. However, few works delve into the inherent properties of learnable soft-prompt vectors, specifically the impact of their norms to the performance of VLMs. This motivates us to pose an unexplored research question: Do we need to normalize the soft prompts in VLMs?'' To fill this research gap, we first uncover a phenomenon, called the Low-Norm Effect by performing extensive corruption experiments, suggesting that reducing the norms of certain learned prompts occasionally enhances the performance of VLMs, while increasing them often degrades it. To utilize this effect, we propose a novel method named $\textbf{N}ormalizingth\textbf{e}$ soft-pro$\textbf{m}ptv\textbf{e}ctorsofvi\textbf{si}on−languagemodel\textbf{s}$ (Nemesis) to normalize soft-prompt vectors in VLMs. To the best of our knowledge, our work is the first to systematically investigate the role of norms of soft-prompt vector in VLMs, offering valuable insights for future research in soft-prompt tuning.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Representation Learning Natural Language Processing Contrastive Learning Set Operation Querying Framework Sentence Embedding Deep Learning
Scores: [ 5 8 8 6 ]
Taking inspiration from Set Theory, we introduce SetCSE, an innovative information retrieval framework. SetCSE employs sets to represent complex semantics and incorporates well-defined operations for structured information querying within the provided context. In alignment with this framework, we introduce an inter-set contrastive learning objective to enhance language model comprehension concerning the given semantics. Additionally, we present a suite of operations that leverage the enhanced sentence embeddings for querying, including SetCSE intersection, difference, and operation series. Throughout this paper, we demonstrate that SetCSE adheres to the conventions of natural language expression, provides a significant enhancement in the discriminatory capability of underlying language models, and enables numerous information retrieval tasks involving complex and intricate prompts that cannot be achieved using existing search methods.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Random Projection Self-supervised Learning Image Modelling Representation Learning Vision Transformer
Scores: [ 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Out-of-distribution detection OOD supervision robust image classification
Scores: [ 3 3 6 6 6 8 ]
Out-of-distribution (OOD) detection empowers the model trained on the closed image set to identify unknown data in the open world. Though many prior techniques have yielded considerable improvements in this research direction, two crucial obstacles still remain. Firstly, a unified perspective has yet to be presented to view the developed arts with individual designs, which is vital for providing insights into future work. Secondly, we expect sufficient natural OOD supervision to promote the generation of compact boundaries between the in-distribution (ID) and OOD data without collecting explicit OOD samples. To tackle these issues, we propose a general probabilistic framework to interpret many existing methods and an OOD-data-free model, namely $\textbf{S}$elf-supervised $\textbf{S}$ampling for $\textbf{O}$OD $\textbf{D}$etection (SSOD). SSOD efficiently exploits natural OOD signals from the ID data based on the local property of convolution. With these supervisions, it jointly optimizes the OOD detection and conventional ID classification in an end-to-end manner. Extensive experiments reveal that SSOD establishes competitive state-of-the-art performance on many large-scale benchmarks, outperforming the best previous method by a large margin, e.g., reporting -6.28% FPR95 and +0.77% AUROC on ImageNet, -19.01% FPR95 and +3.04% AUROC on CIFAR-10, and top-ranked performance on hard OOD datasets, i.e., ImageNet-O and OpenImage-O.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Semi-Supervised Learning Poisson Learning Extremely Sparse Labled Data
Scores: [ 6 6 6 6 ]
suffer from degenerate solutions where label functions tend to be nearly constant across unlabeled data. In this paper, we introduce Variance-enlarged Poisson Learning (VPL), a simple yet powerful framework tailored to address the challenges arising from the presence of degenerate solutions. VPL incorporates a variance-enlarged regularization term, which induces a Poisson equation specifically for unlabeled data. This intuitive approach increases the dispersion of labels from their average mean, effectively reducing the likelihood of degenerate solutions characterized by nearly constant label functions. We subsequently introduce two streamlined algorithms, V-Laplace and V-Poisson, each intricately designed to enhance Laplace and Poisson learning, respectively. Furthermore, we broaden the scope of VPL to encompass graph neural networks, introducing Variance-enlarged Graph Poisson Networks to facilitate improved label propagation. To achieve a deeper understanding of VPL's behavior, we conduct a comprehensive theoretical exploration in both discrete and variational cases. Our findings elucidate that VPL inherently amplifies the importance of connections within the same class while concurrently tempering those between different classes. We support our claims with extensive experiments, demonstrating the effectiveness of VPL and showcasing its superiority over existing methods.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph Contrastive Learning Spectral Graph Neural Networks Polynomial Filter Heterophilic Graph Representation Learning
Scores: [ 8 8 8 5 ]
Recently, Graph Contrastive Learning (GCL) has achieved significantly superior performance in self-supervised graph representation learning. However, the existing GCL technique has inherent smooth characteristics because of its low-pass GNN encoder and objective based on homophily assumption, which poses a challenge when applying it to heterophilic graphs.In supervised learning tasks, spectral GNNs with polynomial approximation excel in both homophilic and heterophilic settings by adaptively fitting graph filters of arbitrary shapes. Yet, their applications in unsupervised learning are rarely explored.Based on the above analysis, a natural question arises: \textit{Can we incorporate the excellent properties of spectral polynomial filters into graph contrastive learning?}In this paper, we address the question by studying the necessity of introducing high-pass information for heterophily from a spectral perspective.We propose PolyGCL, a GCL pipeline that utilizes polynomial filters to achieve contrastive learning between the low-pass and high-pass views.Specifically, PolyGCL utilizes polynomials with learnable filter functions to generate different spectral views and an objective that incorporates high-pass information through a linear combination. We theoretically prove that PolyGCL outperforms previous GCL paradigms when applied to graphs with varying levels of homophily.We conduct extensive experiments on both synthetic and real-world datasets, which demonstrate the promising performance of PolyGCL on homophilic and heterophilic graphs.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: pruning retraining model averaging neural networks
Scores: [ 6 6 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Multiple Instance Learning Time Series Classification Interpretability
Scores: [ 8 8 8 8 ]
Conventional Time Series Classification (TSC) methods are often black boxes that obscure inherent interpretation of their decision-making processes. In this work, we leverage Multiple Instance Learning (MIL) to overcome this issue, and propose a new framework called MILLET: Multiple Instance Learning for Locally Explainable Time series classification. We apply MILLET to existing deep learning TSC models and show how they become inherently interpretable without compromising (and in some cases, even improving) predictive performance. We evaluate MILLET on 85 UCR TSC datasets and also present a novel synthetic dataset that is specially designed to facilitate interpretability evaluation. On these datasets, we show MILLET produces sparse explanations quickly that are of higher quality than other well-known interpretability methods. To the best of our knowledge, our work with MILLET is the first to develop general MIL methods for TSC and apply them to an extensive variety of domains.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learninf masked image modeling
Scores: [ 6 6 6 3 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Zero-shot Learning Few-shot Learning Prompt Learning Vision-language Model
Scores: [ 6 6 5 6 ]
We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models that addresses the challenge of improving the generalization of large foundation models while fine-tuning them on downstream tasks in a few-shot setting. The basic idea of CoPrompt is to enforce a consistency constraint in the prediction of the trainable and pre-trained models to prevent overfitting on the downstream task. Additionally, we introduce the following two components into our consistency constraint to further boost the performance: enforcing consistency on two perturbed inputs and combining two dominant paradigms of tuning, prompting and adapter. Enforcing consistency on perturbed input serves to further regularize the consistency constraint, thereby improving generalization. Moreover, the integration of adapters and prompts not only enhances performance on downstream tasks but also offers increased tuning flexibility in both input and output spaces. This facilitates more effective adaptation to downstream tasks in a few-shot learning setting. Experiments show that CoPrompt outperforms existing methods on a range of evaluation suites, including base-to-novel generalization, domain generalization, and cross-dataset evaluation. On the generalization task, CoPrompt improves the state-of-the-art by 2.09% on the zero-shot task and 1.93% on the harmonic mean over 11 datasets. Detailed ablation studies show the effectiveness of each of the components in CoPrompt.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: label noise noisy labels robustness Gaussian noise classification
Scores: [ 5 5 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: deep learning random features batch normalization
Scores: [ 8 3 8 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: code representation learning customized token-level denoising objective for code hard negative hard positive
Scores: [ 6 3 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Unsupervised Learning Compositional representation neural-symbolic learning
Scores: [ 8 8 5 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Drug Discovery Pretraining
Scores: [ 6 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Dynamical systems statistical learning
Scores: [ 6 6 6 6 ]
We consider the general class of time-homogeneous stochastic dynamical systems,both discrete and continuous, and study the problem of learning a representationof the state that faithfully captures its dynamics. This is instrumental to learn thetransfer operator of the system, that in turn can be used for numerous tasks, suchas forecasting and interpreting the system dynamics. We show that the searchfor a good representation can be cast as an optimization problem over neuralnetworks. Our approach is supported by recent results in statistical learning theory,highlighting the role of approximation error and metric distortion in the context oftransfer operator regression. The objective function we propose is associated withprojection operators from the representation space to the data space, overcomesmetric distortion, and can be empirically estimated from data. In the discrete timesetting, we further derive a relaxed objective function that is differentiable andnumerically well-conditioned. We compare our method against state-of-the-artapproaches on different datasets, showing better performance across the board
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: robustness foundation models CLIP LAION ImageNet generalization OOD robustness distribution shift vision language models self-supervised learning contrastive learning ObjectNet ImageNet-R ImageNet-Sketch ImageNet-A ImageNet-V2
Scores: [ 5 6 6 6 ]
Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training dataset (like LAION). However, it is questionable how meaningful terms like out-of-distribution generalization are for CLIP as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet’s train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP’s overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP’s OOD performance, and other properties of the training data must drive CLIP to learn more generalizable representations. Additionally, by pruning data points that are dissimilar to the OOD benchmarks, we uncover a 100M split of LAION (¼ of its original size) on which CLIP can be trained to match its original OOD performance.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Training Data Synthesis Distribution Matching Data Efficiency
Scores: [ 5 5 6 8 ]
Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of synthetic data generated by current methodologies remains inferior when training advanced deep models exclusively, limiting its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and elucidate a principled theoretical framework from the distribution-matching perspective that explicates the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and augmentation to real datasets, while also benefits challenging tasks such as out-of-distribution generalization and privacy preservation.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self-supervised Learning Probablistic Machine Learning Proper Scoring Rule
Scores: [ 6 5 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Spatio-temporal data mining Causal inference Two-stage framework
Scores: [ 10 8 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: multi-model contrastive learning contrastive learning theory representation learning theory robustness distribution shift
Scores: [ 5 6 6 5 ]
Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP \citep{radford2021learning}, have achieved a remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite the empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL's robustness: \emph{intra-class contrasting}, which allows the model to learn features with a high variance, and \emph{inter-class feature sharing}, where annotated details in one class help learning other classes better. Both mechanisms prevent spurious features that are over-represented in the training data to overshadow the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment involving training CLIP on MS COCO \citep{lin2014microsoft} and evaluating the model on variations of shifted ImageNet.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: object centric learning representation learning disentanglement weakly supervised learning
Scores: [ 8 6 6 ]
Causal representation learning has showed a variety of settings in which we can disentangle latent variables with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are represented as d-dimensional vectors, and (2) that the observations are the output of some injective generative function of these latent variables. While these assumptions appear benign, we show that when the observations are of multiple objects, the generative function is no longer injective and disentanglement fails in practice. We can address this failure by combining recent developments in object-centric learning and causal representation learning. By modifying the Slot Attention architecture (Locatello et al., 2020), we develop an object-centric architecture that leverages weak supervision from sparse perturbations to disentangle each object's properties. This approach is more data-efficient in the sense that it requires significantly fewer perturbations than a comparable approach that encodes to a Euclidean space and we show that this approach successfully disentangles the properties of a set of objects in a series of simple image-based disentanglement experiments.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: adversarial robustness certificated robustness Lipschitz-based
Scores: [ 8 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self-Supervised Learning Unsupervised Domain Generalization Distribution Shifts
Scores: [ 6 6 5 6 6 ]
In Self-Supervised Learning (SSL), models are typically pretrained, fine-tuned, and evaluated on the same domains. However, they tend to perform poorly when evaluated on unseen domains, a challenge that Unsupervised Domain Generalization (UDG) seeks to address. Current UDG methods rely on domain labels, which are often challenging to collect, and domain-specific architectures that lack scalability when confronted with numerous domains, making the current methodology impractical and rigid. Inspired by contrastive-based UDG methods that mitigate spurious correlations by restricting comparisons to examples from the same domain, we hypothesize that eliminating style variability within a batch could provide a more convenient and flexible way to reduce spurious correlations without requiring domain labels. To verify this hypothesis, we introduce Batch Styles Standardization (BSS), a relatively simple yet powerful Fourier-based method to standardize the style of images in a batch specifically designed for integration with SSL methods to tackle UDG. Combining BSS with existing SSL methods offers serious advantages over prior UDG methods: (1) It eliminates the need for domain labels or domain-specific network components to enhance domain-invariance in SSL representations, and (2) offers flexibility as BSS can be seamlessly integrated with diverse contrastive-based but also non-contrastive-based SSL methods. Experiments on several UDG datasets demonstrate that it significantly improves downstream task performances on unseen domains, often outperforming or rivaling with UDG methods. Finally, this work clarifies the underlying mechanisms contributing to BSS's effectiveness in improving domain-invariance in SSL representations and performances on unseen domain.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: sparsity pruning lottery tickets learning rate rewinding iterative magnitude pruning
Scores: [ 6 6 6 ]
Learning Rate Rewinding (LRR) has been established as a strong variant of Iterative Magnitude Pruning (IMP) to find lottery tickets in deep overparameterized neural networks. While both iterative pruning schemes couple structure and parameter learning, understanding how LRR excels in both aspects can bring us closer to the design of more flexible deep learning algorithms that can optimize diverse sets of sparse architectures. To this end, we conduct experiments that disentangle the effect of mask learning and parameter optimization and how both benefit from overparameterization. The ability of LRR to flip parameter signs early and stay robust to sign perturbations seems to make it not only more effective in mask identification but also in optimizing diverse sets of masks, including random ones. In support of this hypothesis, we prove in a simplified single hidden neuron setting that LRR succeeds in more cases than IMP, as it can escape initially problematic sign configurations.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Data Augmentation Learnable Augmentation Bilevel Optimization
Scores: [ 8 8 5 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Unsupervised learning; Video object learning
Scores: [ 5 6 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Disentanglement Diversity Generative Modeling
Scores: [ 5 5 8 ]
Understanding and developing optimal representations has long been foundational in machine learning (ML). While disentangled representations have shown promise in generative modeling and representation learning, their downstream usefulness remains debated. Recent studies re-defined disentanglement through a formal connection to symmetries, emphasizing the ability to reduce latent domains (i.e., ML problem spaces) and consequently enhance data efficiency and generative capabilities. However, from an information theory viewpoint, assigning a complex attribute (i.e., features) to a specific latent variable may be infeasible, limiting the applicability of disentangled representations to simple datasets. In this work, we introduce α-TCVAE, a variational autoencoder optimized using a novel total correlation (TC) lower bound that maximizes disentanglement and latent variables informativeness. The proposed TC bound is grounded in information theory constructs, generalizes the β-VAE lower bound, and can be reduced to a convex combination of the known variational information bottleneck (VIB) and conditional entropy bottleneck (CEB) terms. Moreover, we present quantitative analyses and correlation studies that, through the lenses of diversity, support the idea that smaller latent domains (i.e., disentangled representations) lead to better generative capabilities and diversity. Additionally, we perform downstream task experiments from both representation and reinforcement learning domains to assess our questions from a broader ML perspective. Our results demonstrate that α-TCVAE consistently learns more disentangled representations than baselines and generates more diverse observations without sacrificing visual fidelity. Notably, α-TCVAE exhibits marked improvements on MPI3D-Real, the most realistic disentangled dataset in our study, confirming its ability to represent complex datasets when maximizing the informativeness of individual variables. Finally, testing the proposed model off-the-shelf on a state-of-the-art model-based RL agent, Director, significantly shows α-TCVAE downstream usefulness on the loconav Ant Maze task.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: image clustering pretrained models rate reduction principled methods
Scores: [ 5 8 3 8 5 ]
The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, as a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representation of large pre-trained models such as CLIP and cluster images effectively and efficiently at scale. We first developed a novel algorithm to estimate the number of clusters in a given dataset. We then show that the pre-trained features are significantly more structured by further optimizing the rate reduction objective. The resulting features may significantly improve the clustering accuracy, e.g., from 57% to 66% on ImageNet-1k. Furthermore, by leveraging CLIP's multimodality bridge between image and text, we develop a simple yet effective self-labeling algorithm that produces meaningful text labels for the clusters. Through extensive experiments, we show that our pipeline works well on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet-1k. It also extends to datasets without predefined labels, such as LAION-Aesthetics and WikiArts.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: object-centric representation learning the binding problem the grounding problem slot attention
Scores: [ 6 6 6 ]
The extraction of object-centric representations for downstream tasks is an emerging area of research. Learning grounded representations of objects that are guaranteed to be stable and invariant promises robust performance across different tasks and environments. Slot Attention (SA) learns object-centric representations by assigning objects to slots, but presupposes a single distribution from which all slots are randomly initialised. This results in an inability to learn specialized slots which bind to specific object types and remain invariant to identity-preserving changes in object appearance. To address this, we present Conditional Slot Attention (CoSA) using a novel concept of Grounded Slot Dictionary (GSD) inspired by vector quantization. Our proposed GSD comprises (i) canonical object-level property vectors and (ii) parametric Gaussian distributions, which define a prior over the slots. We demonstrate the benefits of our method in multiple downstream tasks such as scene generation, composition, and task adaptation, whilst remaining competitive with SA in object discovery.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Thompson Sampling Multiarmed Bandits Gaussian Gamma MAB Variance aware Regret Prior Posterior Bayesian Tight Guarantee Empirical
Scores: [ 6 8 6 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self-supervised learning Local Intrinsic Dimensionality Dimension Collapse Contrastive Learning
Scores: [ 6 6 5 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Information Retention Few-shot Learning Deep Neural Network
Scores: [ 8 8 6 6 8 ]
The information bottleneck principle provides an information-theoretic method for learning a good representation as a tradeoff between conciseness and predictive ability, which can reduce the information redundancy, eliminate irrelevant and superfluous features, and thus enhance the in-domain generalizability. However, in low-resource or out-of-domain scenarios where the assumption of iid does not necessarily hold true, superfluous (or redundant) relevant features may be supplemental to the mainline features of the model, and be beneficial in making prediction for test dataset with distribution shifts. To address this problem, we propose to keep as much relevant information as possible in use for making predictions. A three-stage supervised learning framework is designed and implemented to jointly learn the mainline and supplemental features, relieving supplemental features from the suppression of mainline features. Experiments on image and text classification tasks have shown our method substantially outperforms several baseline and state-of-the-art methods, especially in low resource cases.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Policy representation offline policy selection robotics evaluation
Scores: [ 5 6 5 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Video decomposition Test-time optimization technique Video Relighting Video Dehazing Unsupervised Video Object Segmentation Edits Propagation
Scores: [ 6 6 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Federated Learning Aggregation Cross-round Divergence-aware
Scores: [ 6 5 6 6 ]
In Federated Learning (FL), model aggregation is pivotal. It involves a global server iteratively aggregating client local trained models in successive rounds without accessing private data. Traditional methods typically aggregate the local model from the current round alone. However, due to the statistical heterogeneity across clients, the local model from each client may be greatly diverse, making the obtained global model incapable of maintaining their specific knowledge. In this paper, we introduce a novel method, FedCDA, which selectively aggregates local models from various rounds, decreasing discrepancies between local models. The principle behind FedCDA is that the local model from each client may converge to distinct local optima over rounds due to the varied received global models and non-convex essences of deep neural networks, and each local model fits its local data well. Therefore, for each client, we select a local model from multiple rounds to minimize the divergence from other clients. This ensures the aggregated global model remains aligned with all selected local models to maintain their data knowledge. Extensive experiments conducted on various models and datasets reveal our approach outperforms state-of-the-art aggregation methods.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: out-of-distribution dection CLIP prompt learning
Scores: [ 5 6 8 6 ]
Out-of-distribution (OOD) detection is indispensable for open-world machine learning models. Inspired by recent success in large pre-trained language-vision models, e.g., CLIP, advanced works have achieved impressive OOD detection results by matching the similarity between image features and features of learned prompts, i.e., positive prompts. However, existing works typically struggle with OOD samples having similar features with those of known classes. One straightforward approach is to introduce negative prompts to achieve a dissimilarity matching, which further assesses the anomaly level of image features by introducing the absence of specific features. Unfortunately, our experimental observations show that either employing a prompt like "not a photo of a" or learning a prompt to represent "not containing" fails to capture the dissimilarity for identifying OOD samples. The failure may be contributed to the diversity of negative features, i.e., tons of features could indicate features not belonging to a known class. To this end, we propose to learn a set of negative prompts for each class. The learned positive prompt (for all classes) and negative prompts (for each class) are leveraged to measure the similarity and dissimilarity in the feature space simultaneously, enabling more accurate detection of OOD samples. Extensive experiments are conducted on diverse OOD detection benchmarks, showing the effectiveness of our proposed method.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Transformer Self-Attention Memorization Universal Approximation Theorem Contextual Mapping
Scores: [ 6 6 8 ]
Existing analyses of the expressive capacity of Transformer models have required excessively deep layers for data memorization, leading to a discrepancy with the Transformers actually used in practice. This is primarily due to the interpretation of the softmax function as an approximation of the hardmax function.By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence.As a consequence, we show that one-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous functions on a compact domain.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Contrastive learning time-series forecasting long-term representation
Scores: [ 6 5 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Representation Learning Spurious Correlations Self-supervised Learning
Scores: [ 5 8 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Steerable features Equivariant graph neural networks Message passing
Scores: [ 6 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Sum-product networks graph neural networks probabilistic circuits tree-structured graphs supervised learning
Scores: [ 6 8 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Object Detection Model Generalization
Scores: [ 6 6 8 10 ]
Bounding boxes uniquely characterize object detection, where a good detector gives accurate bounding boxes of categories of interest. However, in the real-world where test ground truths are not provided, it is non-trivial to find out whether bounding boxes are accurate, thus preventing us from assessing the detector generalization ability. In this work, we find under feature map dropout, good detectors tend to output bounding boxes whose locations do not change much, while bounding boxes of poor detectors will undergo noticeable position changes. We compute the box stability score (BS score) to reflect this stability. Specifically, given an image, we compute a normal set of bounding boxes and a second set after feature map dropout. To obtain BS score, we use bipartite matching to find the corresponding boxes between the two sets and compute the average Intersection over Union (IoU) across the entire test set. We contribute to finding that BS score has a strong, positive correlation with detection accuracy measured by mean average precision (mAP) under various test environments. This relationship allows us to predict the accuracy of detectors on various real-world test sets without accessing test ground truths, verified on canonical detection tasks such as vehicle detection and pedestrian detection.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Molecule Representation; Pretraining; Denoising; Force Field
Scores: [ 8 5 5 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self supervised learning Equivariance Contrastive learning
Scores: [ 6 8 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Biological inspired high performance energy efficient vision system data efficient training energy saving sensoring learned saccade reinforcement learning foveated visual sampling continuous scene reconstruction.
Scores: [ 8 6 6 8 ]
High accuracy, low latency and high energy efficiency represent a set of contradictory goals when searching for system solutions for image classification and detection. While high-quality images naturally result in more precise detection and classification, they also result in a heavier computational workload for imaging and processing, reduce camera refresh rates, and increase the volume of data communication between the camera and processor. Taking inspiration from the foveal-peripheral sampling mechanism, saccade mechanism observed in the human visual system and the filling-in phenomena of brain, we have developed an active scene reconstruction architecture based on multiple foveal views. This model stitches together information from foveal and peripheral vision, which are sampled from multiple glances. Assisted by a reinforcement learning-based saccade mechanism, our model reduces the required input pixels by over 90% per frame while maintaining the same level of performance in image recognition as with the original images. We evaluated the effectiveness of our model using the GTSRB dataset and the ImageNet dataset. Using an equal number of input pixels, our study demonstrates a 5% higher image recognition accuracy compared to state-of-the-art foveal-peripheral vision systems. Furthermore, we demonstrate that our foveal sampling/saccadic scene reconstruction model exhibits significantly lower complexity and higher data efficiency during the training phase compared to existing approaches.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Contrastive Learning Representation Learning Self-supervised Learning Deep Learning
Scores: [ 6 6 6 5 ]
Deep representations have shown promising performance when transferred to downstream tasks in a black-box manner. Yet, their inherent lack of interpretability remains a significant challenge, as these features are often opaque to human understanding. In this paper, we propose Non-negative Contrastive Learning (NCL), a renaissance of Non-negative Matrix Factorization (NMF) aimed at deriving interpretable features. The power of NCL lies in its enforcement of non-negativity constraints on features, reminiscent of NMF's capability to extract features that align closely with sample clusters. NCL not only aligns mathematically well with an NMF objective but also preserves NMF's interpretability attributes, resulting in a more sparse and disentangled representation compared to standard contrastive learning (CL). Theoretically, we establish guarantees on the identifiability and downstream generalization of NCL. Empirically, we show that these advantages enable NCL to outperform CL significantly on feature disentanglement, feature selection, as well as downstream classification tasks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Hypergraph HGNN Knowledge Distillation MLPs Reliale Learning
Scores: [ 6 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Predictable Scaling; Scaling Law; Large Language Models; Pre-train;
Scores: [ 6 5 8 5 ]
The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on the scaling properties only yields an incomplete answer: optimization loss decreases predictably as the model size increases, in line with established scaling law; yet no scaling law for task has been established and the task performances are far from predictable during scaling. Task performances typically show minor gains on small models until they improve dramatically once models exceed a size threshold, exemplifying the emergent abilities''. In this study, we discover that small models, although they exhibit minor performance, demonstrate critical and consistent task performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution. To measure such improvements, we introduce \textsc{PassUntil}, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase. With \textsc{PassUntil}, we conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. Firstly, a strict \textsl{task scaling law} that is not conventionally known to exist, is identified, enhancing the predictability of task performances. Remarkably, we are able to predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts, which is the first systematic attempt to verify predictable scaling proposed by GPT-4's report. Secondly, underpinned by \textsc{PassUntil}, we observe concrete evidence of emergent abilities and ascertain that they are not in conflict with the continuity of performance improvement. Their semblance to break-through is that their scaling curve cannot be fitted by standard scaling law function. We then introduce a mathematical definition for the emergent abilities. Through the definition, we refute a prevalent multi-step reasoning hypothesis'' regarding the genesis of emergent abilities and propose a new hypothesis with a satisfying fit to the observed scaling curve.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: sampling Restricted Boltzmann Machines statistical physics
Scores: [ 6 6 8 ]
Sampling complex distributions is an important but difficult objective in various fields, including physics, chemistry, and statistics. An improvement of standard Monte Carlo (MC) methods, intensively used in particular in the context of disordered systems, is Parallel Tempering, also called replica exchange MC, in which a sequence of MC Markov chains at decreasing temperatures are run in parallel and can swap their configurations. In this work we apply the ideas of parallel tempering in the context of restricted Boltzmann machines (RBM), a paradigm of unsupervised architectures, capable to learn complex, multimodal distributions. Inspired by Deep Tempering, an approach introduced for deep belief networks, we show how to learn on top of the first RBM a stack of nested RBMs, using the representations of a RBM as ’data’ for the next one along the stack. In our Stacked Tempering approach the hidden configurations of a machine can be exchanged with the visible configurations of the next one in the stack. Replica exchanges between the different RBMs is facilitated by the increasingly clustered representations learnt by deeper RBMs, allowing for fast transitions between the different modes of the data distribution. Analytical calculations of mixing times in a simplified theoretical setting shed light on why Stacked Tempering works, and how hyperparameters, such as the aspect ratios of the RBMs and weight regularization should be chosen. We illustrate the efficiency of the Stacked Tempering method with respect to standard and replica exchange MC on several datasets: MNIST, in-silico Lattice Proteins, and the 2D-Ising model.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Source-free domain adaptation Data augmentation
Scores: [ 6 5 8 6 ]
In the face of the deep learning model's vulnerability to domain shift, source-free domain adaptation (SFDA) methods have been proposed to adapt models to new, unseen target domains without requiring access to source domain data. Although the potential benefits of applying data augmentation to SFDA are attractive, several challenges arise such as the dependence on prior knowledge of class-preserving transformations and the increase in memory and computational requirements. In this paper, we propose Source-free Domain Adaptation Through the Lens of Data Augmentation (SF(DA)2), a novel approach that leverages the benefits of data augmentation without suffering from these challenges. We construct an augmentation graph in the feature space of the pretrained model using the neighbor relationships between target features and propose spectral neighborhood clustering to identify partitions in the prediction space. Furthermore, we propose implicit feature augmentation and feature disentanglement as regularization loss functions that effectively utilize class semantic information within the feature space. These regularizers simulate the inclusion of an unlimited number of augmented target features into the augmentation graph while minimizing computational and memory demands. Our method shows superior adaptation performance in SFDA scenarios, including 2D image and 3D point cloud datasets and a highly imbalanced dataset.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Machine Learning dynamic sparse training structured sparsity N:M sparsity efficient deep learning RigL SRigL constant fan-in dynamic neuron ablation neuron ablation structured and fine-grained sparsity online inference accelerating inference
Scores: [ 6 6 6 5 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: statistical learning generalization theory differential equations time series
Scores: [ 6 8 5 ]
Neural Controlled Differential Equations (NCDE) are a state-of-the-art tool for supervised learning with irregularly sampled time series (Kidger 2020). However, no theoretical analysis of their performance has been provided yet, and it remains unclear in particular how the roughness of the sampling affects their predictions. By merging the rich theory of controlled differential equations (CDE) and Lipschitz-based measures of the complexity of deep neural nets, we take a first step towards the theoretical understanding of NCDE. Our first result is a sampling-dependant generalization bound for this class of predictors. In a second time, we leverage the continuity of the flow of CDEs to provide a detailed analysis of both the sampling-induced bias and the approximation bias. Regarding this last result, we show how classical approximation results on neural nets may transfer to NCDE. Our theoretical results are validated through a series of experiments, for which the code is available at REDACTED.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: few-shot classification unsupervised few-shot learning deep representation learning
Scores: [ 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: transformers mixtures of experts computer vision
Scores: [ 8 8 8 6 ]
Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs.Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning.In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs.Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert.As in other MoEs, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity (and performance) at lower inference cost.In the context of visual recognition, Soft MoE greatly outperforms dense Transformers (ViTs) and popular MoEs (Tokens Choice and Experts Choice).Soft MoE scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, with only 2% increased inference time, and substantially better quality.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: contrastive learning mask image moding feature distillation
Scores: [ 6 8 6 6 ]
As two prominent strategies for representation learning, Contrastive Learning (CL) and Masked Image Modeling (MIM) have witnessed significant progress. Previous studies have demonstrated the advantages of each approach in specific scenarios. CL, resembling supervised pre-training, excels at capturing longer-range global patterns and enhancing feature discrimination, while MIM is adept at introducing local and diverse attention across transformer layers. Considering the respective strengths, previous studies utilize feature distillation to inherit both discrimination and diversity. In this paper, we thoroughly examine previous feature distillation methods and observe that the increase in diversity mainly stems from asymmetric designs, which may in turn compromise the discrimination ability. To strike a balance between the two properties, we propose a simple yet effective strategy termed Hybrid Distill, which leverages both the CL and MIM teachers to jointly guide the student model. Hybrid Distill emulates the token relations of the MIM teacher at intermediate layers for diversity, while simultaneously distilling the final features of the CL teacher to enhance discrimination. A progressive redundant token masking strategy is employed to reduce the expenses associated with distillation and aid in preventing the model from converging to local optima. Experimental results demonstrate that Hybrid Distill achieves superior performance on various benchmark datasets.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Autoencoder Representation Learning Dimension Reduction
Scores: [ 5 8 6 5 ]
This paper introduces Least Volume---a simple yet effective regularization inspired by geometric intuition---that can reduce the necessary number of latent dimensions needed by an autoencoder without requiring any prior knowledge of the intrinsic dimensionality of the dataset. We show that the Lipschitz continuity of the decoder is the key to making it work, provide a proof that PCA is just a linear special case of it, and reveal that it has a similar PCA-like importance ordering effect when applied to nonlinear models. We demonstrate the intuition behind the regularization on some pedagogical toy problems, and its effectiveness on several benchmark problems, including MNIST, CIFAR-10 and CelebA.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: graph transformers graph neural networks knowledge distillation curriculum learning node classification
Scores: [ 5 6 8 8 ]
Recent studies have shown that Graph Transformers (GTs) can be effective for specific graph-level tasks. However, when it comes to node classification, training GTs remains challenging, especially in semi-supervised settings with a severe scarcity of labeled data. Our paper aims to address this research gap by focusing on semi-supervised node classification. To accomplish this, we develop a curriculum-enhanced attention distillation method that involves utilizing a Local GT teacher and a Global GT student. Additionally, we introduce the concepts of in-class and out-of-class and then propose two improvements, out-of-class entropy and top-k pruning, to facilitate the student's out-of-class exploration under the teacher's in-class guidance. Taking inspiration from human learning, our method involves a curriculum mechanism for distillation that initially provides strict guidance to the student and gradually allows for more out-of-class exploration by a dynamic balance. Extensive experiments show that our method outperforms many state-of-the-art approaches on seven public graph benchmarks, proving its effectiveness.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: out-of-distribution detection data-centric artificial intelligence
Scores: [ 6 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: in-context learning transformers representation learning learning theory mechanistic understanding
Scores: [ 6 8 6 6 ]
While large language models based on the transformer architecture have demonstrated remarkable in-context learning (ICL) capabilities, understandings of such capabilities are still in an early stage, where existing theory and mechanistic understanding focus mostly on simple scenarios such as learning simple function classes. This paper takes initial steps on understanding ICL in more complex scenarios, by studying learning with \emph{representations}. Concretely, we construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but \emph{fixed} representation function, composed with a linear function that \emph{differs} in each instance. By construction, the optimal ICL algorithm first transforms the inputs by the representation function, and then performs linear ICL on top of the transformed dataset. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size. Empirically, we find trained transformers consistently achieve near-optimal ICL performance in this setting, and exhibit the desired dissection where lower layers transforms the dataset and upper layers perform linear ICL. Through extensive probing and a new pasting experiment, we further reveal several mechanisms within the trained transformers, such as concrete copying behaviors on both the inputs and the representations, linear ICL capability of the upper layers alone, and a post-ICL representation selection mechanism in a harder mixture setting. These observed mechanisms align well with our theory and may shed light on how transformers perform ICL in more realistic scenarios.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Latent Causal Models Causal Representations Identifiability Representation Learning
Scores: [ 6 6 6 6 3 ]
Causal representation learning aims to unveil latent high-level causal representations from observed low-level data. One of its primary tasks is to provide reliable assurance of identifying these latent causal models, known as \textit{identifiability}. A recent breakthrough explores identifiability by leveraging the change of causal influences among latent causal variables across multiple environments \citep{liu2022identifying}. However, this progress rests on the assumption that the causal relationships among latent causal variables adhere strictly to linear Gaussian models. In this paper, we extend the scope of latent causal models to involve nonlinear causal relationships, represented by polynomial models, and general noise distributions conforming to the exponential family. Additionally, we investigate the necessity of imposing changes on all causal parameters and present partial identifiability results when part of them remains unchanged. Further, we propose a novel empirical estimation method, grounded in our theoretical finding, that enables learning consistent latent causal representations. Our experimental results, obtained from both synthetic and real-world data, validate our theoretical contributions concerning identifiability and consistency.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Contrastive learning Self-Supervised Learning SimCLR Multi-View Augmentations Multiplicity InfoMax Sufficient Statistics
Scores: [ 6 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: sparsity scaling optimal sparsity efficiency foundational models transformers structured sparsity pruning
Scores: [ 6 6 8 8 ]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales; on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity", the sparsity level which yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we identify that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity towards computational efficiency improvements.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph Transformers Polynomial Networks Large-Scale Graph Learning Node Classification
Scores: [ 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: large language model agent code generation reasoning decision making
Scores: [ 8 6 6 8 ]
We introduce Lemur and Lemur-Chat, openly accessible language models optimized for both natural language and coding capabilities to serve as the backbone of versatile language agents. The evolution from language chat models to fully functional language agents necessitates models to ground natural language instructions effectively in diverse environments and execute valid actions within them, requiring models for the synergy between language and coding capabilities. Lemur and Lemur-Chat are proposed to address this necessity, demonstrating balanced proficiencies in both domains, unlike existing open-source models that tend to specialize in either. Through meticulous pre-training using a code-intensive corpus and instruction fine-tuning on text and code data, our models achieve state-of-the-art averaged performance across diverse text and coding benchmarks. Comprehensive experiments demonstrate Lemur’s superiority over existing open-source models and its proficiency across various agent tasks involving human communication, tool usage, and interaction under fully- and partially- observable environments. The harmonization between natural and programming languages enables Lemur-Chat to significantly narrow the gap with proprietary models on agent abilities, providing key insights into developing advanced open-source agents adept at reasoning, planning, and operating seamlessly across environments. Our model and code will be open-sourced.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: contrastive learning theory self-supervised learning theory representation learning theory CLIP multi-modal learning theory markov random field kernel method
Scores: [ 6 5 6 10 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: deep imbalanced clustering optimal transport
Scores: [ 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Dataset Distillation Self-supervised Learning Kernel Ridge Regression
Scores: [ 6 8 6 5 6 ]
Dataset distillation methods have achieved remarkable success in distilling a large dataset into a small set of representative samples. However, they are not designed to produce a distilled dataset that can be effectively used for facilitating self-supervised pre-training. To this end, we propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL). We first prove that a gradient of synthetic samples with respect to a SSL objective in naive bilevel optimization is \textit{biased} due to the randomness originating from data augmentations or masking. To address this issue, we propose to minimize the mean squared error (MSE) between a model's representations of the synthetic examples and their corresponding learnable target feature representations for the inner objective, which does not introduce any randomness. Our primary motivation is that the model obtained by the proposed inner optimization can mimic the \textit{self-supervised target model}. To achieve this, we also introduce the MSE between representations of the inner model and the self-supervised target model on the original full dataset for outer optimization. Lastly, assuming that a feature extractor is fixed, we only optimize a linear head on top of the feature extractor, which allows us to reduce the computational cost and obtain a closed-form solution of the head with kernel ridge regression. We empirically validate the effectiveness of our method on various applications involving transfer learning.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Unsupervised out-of-distribution object detection; OOD data synthesis; Modulated phase diffusion
Scores: [ 6 3 6 ]
To promote the safe deployment of object detectors, a task of unsupervised out-of-distribution object detection (OOD-OD) is recently proposed, aiming to detect unknown objects during training without reliance on any auxiliary OOD data. To alleviate the impact of lacking OOD data, for this task, one feasible solution is to exploit the known in-distribution (ID) data to synthesize proper OOD information for supervision, which strengthens detectors' discrimination. From the frequency perspective, since the phase generally reflects the content of the input, in this paper, we explore leveraging the phase of ID features to generate expected OOD features involving different content. And a method of Modulated Phase Diffusion (MPD) is proposed, containing a shared forward and two different reverse processes. Specifically, after calculating the phase of the extracted features, to prevent the rapid loss of content in the phase, the forward process gradually performs Gaussian Average on the phase instead of adding noise. The averaged phase and original amplitude are combined to obtain the features taken as the input of the reverse process. Next, one OOD branch is defined to synthesize virtual OOD features by continually enlarging the content discrepancy between the OOD features and original ones. Meanwhile, another modulated branch is designed to generate augmented features owning a similar phase as the original features by scaling and shifting the OOD branch. Both original and augmented features are used for training, enhancing the discrimination. Experimental results on OOD-OD, incremental object detection, and open-set object detection demonstrate the superiorities of our method.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Large Language Model Compression LLM Pruning
Scores: [ 8 6 6 8 ]
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, and etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous hyperparameter tuning, such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss. In contrast to the typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it targets the overall pruning error with respect to individual transformer blocks, and ii) it allocates layer-specific sparsity in a differentiable manner, both of which ensure reduced performance degradation after pruning. Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs like LLaMA1, and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. Code is available at here.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: concept discovery robotic manipulation self-supervised learning
Scores: [ 8 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: mutli-class classification multiple unlabeled datasets learning consistency
Scores: [ 6 8 8 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: zero-shot classification spurious correlations invariant embedding foundation model language model
Scores: [ 6 8 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self Supervised Learning Joint Embedding Architectures
Scores: [ 6 8 6 ]
Joint embedding (JE) architectures have emerged as a promising avenue for ac-quiring transferable data representations. A key obstacle to using JE methods,however, is the inherent challenge of evaluating learned representations withoutaccess to a downstream task, and an annotated dataset. Without efficient and re-liable evaluation, it is difficult to iterate on architectural and training choices forJE methods. In this paper, we introduce LiDAR (Linear Discriminant AnalysisRank), a metric designed to measure the quality of representations within JE archi-tectures. Our metric addresses several shortcomings of recent approaches basedon feature covariance rank by discriminating between informative and uninforma-tive features. In essence, LiDAR quantifies the rank of the Linear DiscriminantAnalysis (LDA) matrix associated with the surrogate SSL task—a measure thatintuitively captures the information content as it pertains to solving the SSL task.We empirically demonstrate that LiDAR significantly surpasses naive rank basedapproaches in its predictive power of optimal hyperparameters. Our proposed cri-terion presents a more robust and intuitive means of assessing the quality of rep-resentations within JE architectures, which we hope facilitates broader adoptionof these powerful techniques in various domains.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: cycle consistency object discovery downstream RL
Scores: [ 8 8 6 5 ]
Developing deep learning models that effectively learn object-centric representations, akin to human cognition, remains a challenging task. Existing approaches facilitate object discovery by representing objects as fixed-size vectors, called slots'' or
object files''. While these approaches have shown promise in certain scenarios, they still exhibit certain limitations. First, they rely on architectural priors which can be unreliable and usually require meticulous engineering to identify the correct objects. Second, there has been a notable gap in investigating the practical utility of these representations in downstream tasks. To address the first limitation, we introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. We formalize this constraint by introducing consistency objectives which are cyclic in nature. By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance. These enhancements consistently hold true across both synthetic and real-world scenes, underscoring the effectiveness and adaptability of the proposed approach. To tackle the second limitation, we apply the learned object-centric representations from the proposed method to two downstream reinforcement learning tasks, demonstrating considerable performance enhancements compared to conventional slot-based and monolithic representation learning methods. Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Fourier transform equivariance harmonic analysis representation learning
Scores: [ 6 8 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Deep Learning Neural Networks Weight Initialization Small Models Computer Vision
Scores: [ 6 8 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Long-term time series forecasting Transformer CNN
Scores: [ 8 3 8 8 ]
Convolutional neural network (CNN)-based and Transformer-based methods have recently made significant strides in time series forecasting, which excel at modeling local temporal variations or capturing long-term dependencies. However, real-world time series usually contain intricate temporal patterns, thus making it challenging for existing methods that mainly focus on temporal variations modeling from the 1D time series directly. Based on the intrinsic periodicity of time series, we propose a novel Periodicity Decoupling Framework (PDF) to capture 2D temporal variations of decoupled series for long-term series forecasting. Our PDF mainly consists of three components: multi-periodic decoupling block (MDB), dual variations modeling block (DVMB), and variations aggregation block (VAB). Unlike the previous methods that model 1D temporal variations, our PDF mainly models 2D temporal variations, decoupled from 1D time series by MDB. After that, DVMB attempts to further capture short-term and long-term variations, followed by VAB to make final predictions. Extensive experimental results across seven real-world long-term time series datasets demonstrate the superiority of our method over other state-of-the-art methods, in terms of both forecasting performance and computational efficiency.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Semi-supervised Learning Reward Model Representation Learning
Scores: [ 6 6 6 ]
Semi-supervised learning (SSL) has witnessed great progress with various improvements in the self-training framework with pseudo labeling. The main challenge is how to distinguish high-quality pseudo labels against the confirmation bias. However, existing pseudo-label selection strategies are limited to pre-defined schemes or complex hand-crafted policies specially designed for classification, failing to achieve high-quality labels, fast convergence, and task versatility simultaneously. To these ends, we propose a Semi-supervised Reward framework (SemiReward) that predicts reward scores to evaluate and filter out high-quality pseudo labels, which is pluggable to mainstream SSL methods in wide task types and scenarios. To mitigate confirmation bias, SemiReward is trained online in two stages with a generator model and subsampling strategy. With classification and regression tasks on 13 standard SSL benchmarks of three modalities, extensive experiments verify that SemiReward achieves significant performance gains and faster convergence speeds upon Pseudo Label, FlexMatch, and Free/SoftMatch.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: retrieval dense retrieval dual encoders extreme multi-label classification
Scores: [ 6 6 6 8 ]
Dual-encoder models have demonstrated significant success in dense retrieval tasks for open-domain question answering that mostly involves zero-shot and few-shot scenarios. However, their performance in many-shot retrieval problems where training data is abundant, such as extreme multi-label classification (XMC), remains under-explored. Existing empirical evidence suggests that, for such problems, the dual-encoder method's accuracies lag behind the performance of state-of-the-art (SOTA) extreme classification methods that grow the number of learnable parameters linearly with the number of classes. As a result, some recent extreme classification techniques use a combination of dual-encoders and a learnable classification head for each class to excel on these tasks. In this paper, we investigate the potential of "pure" DE models in XMC tasks. Our findings reveal that when trained correctly standard dual-encoders can match or outperform SOTA extreme classification methods by up to 2% at Precision@1 even on the largest XMC datasets while being 20x smaller in terms of the number of trainable parameters. We further propose a differentiable topk error-based loss function, which can be used to specifically optimize for Recall@k metrics. We include our PyTorch implementation along with other resources for reproducing the results in the supplementary material.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Topological data analysis quantum computing unsupervised learning feature extraction
Scores: [ 8 8 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self-supervised learning State representation learning
Scores: [ 8 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised representation learning; image recognition; image generation; Masked Image Modeling
Scores: [ 6 5 6 6 ]
Image recognition and generation have long been developed independently of each other. With the recent trend towards general-purpose representation learning, the development of general representations for both recognition and generation tasks is also promoted. However, preliminary attempts mainly focus on generation performance, but are still inferior on recognition tasks. These methods are modeled in the vector-quantized (VQ) space, whereas leading recognition methods use pixels as inputs. Our key insights are twofold: (1) pixels as inputs are crucial for recognition tasks; (2) VQ tokens as reconstruction targets are beneficial for generation tasks. These observations motivate us to propose an Alternating Denoising Diffusion Process (ADDP) that integrates these two spaces within a single representation learning framework. In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels. The diffusion process gradually masks out a portion of VQ tokens to construct the training samples. The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks. Extensive experiments show that our method achieves competitive performance on unconditional generation, ImageNet classification, COCO detection, and ADE20k segmentation. Importantly, our method represents the first successful development of general representations applicable to both generation and dense recognition tasks. Code shall be released.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Anomaly Detection Deep Learning
Scores: [ 8 8 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Multi-tasking learning Graph Neural Networks AutoML
Scores: [ 5 6 5 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Polynomial Networks Image recognition
Scores: [ 8 6 6 ]
Despite the remarkable capabilities of deep neural networks in image recognition, the dependence on activation functions remains a largely unexplored area and has yet to be eliminated. On the other hand, Polynomial Networks is a class of models that does not require activation functions, but have yet to perform on par with modern architectures. In this work, we aim close this gap and propose MONet, which relies solely on multilinear operators. The core layer of MONet, called Mu-Layer, captures multiplicative interactions of the elements of the input token. MONet captures high-degree interactions of the input elements and we demonstrate the efficacy of our approach on a series of image recognition and scientific computing benchmarks. The proposed model outperforms prior polynomial networks and performs on par with modern architectures. We believe that MONet can inspire further research on models that use entirely multilinear operations.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learning representation learning
Scores: [ 6 6 6 ]
In this work, we explore regions as the visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: unsupervised representation learning diffusion model representation disentanglement counterfactual generation
Scores: [ 6 6 6 6 ]
Representation learning is all about discovering the hidden modular attributes that generate the data faithfully. We explore the potential of Denoising Diffusion Probabilistic Model (DM) in unsupervised learning of the modular attributes. We build a theoretical framework that connects the diffusion time-steps and the hidden attributes, which serves as an effective inductive bias for unsupervised learning. Specifically, the forward diffusion process incrementally adds Gaussian noise to samples at each time-step, which essentially collapses different samples into similar ones by losing attributes, e.g., fine-grained attributes such as texture are lost with less noise added (i.e., early time-steps), while coarse-grained ones such as shape are lost by adding more noise (i.e., late time-steps). To disentangle the modular attributes, at each time-step t, we learn a t-specific feature to compensate for the newly lost attribute, and the set of all {1,...,t}-specific features, corresponding to the cumulative set of lost attributes, are trained to make up for the reconstruction error of a pre-trained DM at time-step t. On CelebA, FFHQ, and Bedroom datasets, the learned feature significantly improves attribute classification and enables faithful counterfactual generation, e.g., interpolating only one specified attribute between two images, validating the disentanglement quality. Codes are in Appendix.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Deep long-tailed recognition Representation learning
Scores: [ 6 6 6 ]
Deep long-tailed recognition (DTLR) has attracted much attention due to its close touch with realistic scenarios. Recent advances have focused on re-balancing across various aspects, e.g., sampling strategy, loss re-weighting, logit adjustment, and input/parameter perturbation, to name a few. However, few studies have considered dynamic re-balancing to address intrinsic optimization conflicts. In this paper, we first empirically argue that the optimizations of mainstream DLTR methods are still dominated by some categories (e.g., major) due to a fixed re-balancing strategy. Thus, they fail to deal with gradient conflicts among categories, which naturally deduces the motivation for reaching Pareto optimal solutions. Unfortunately, a naive integration of multi-objective optimization (MOO) with DLTR methods is not applicable due to the gap between multi-task learning (MTL) and DLTR, and can in turn lead to class-specific feature degradation. Thus, we provide effective alternatives by decoupling MOO-based MTL from the temporal rather than structure perspective, and enhancing it via optimizing variability collapse loss motivated by the derived MOO-based DLTR generalization bound. Moreover, we resort to anticipating worst-case optimization with theoretical insights to further ensure convergence. We build a Pareto deep long-tailed recognition method termed PLOT upon the proposed MOO framework. Extensive evaluations demonstrate that our method not only generally improves mainstream pipelines, but also achieves an augmented version to realize state-of-the-art performance across multiple benchmarks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: representation learning deep reinforcement learning
Scores: [ 6 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: CAD Panoptic Symbol Spotting
Scores: [ 6 8 3 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph Contrastive Learning Scalable Training Structural Compression
Scores: [ 6 5 5 5 ]
Graph contrastive learning (GCL) has become a powerful tool for learning graph data, but its scalability remains a significant challenge. In this work, we propose a simple yet effective training framework called Structural Compression (StructComp) to address this issue. Inspired by a sparse low-rank approximation on the diffusion matrix, StructComp trains the encoder with the compressed nodes. This allows the encoder not to perform any message passing during the training stage, and significantly reduces the number of sample pairs in the contrastive loss. We theoretically prove that the original GCL loss can be approximated with the contrastive loss computed by StructComp. Moreover, StructComp can be regarded as an additional regularization term for GCL models, resulting in a more robust encoder. Empirical studies on seven benchmark datasets show that StructComp greatly reduces the time and memory consumption while improving model performance compared to the vanilla GCL models and scalable training methods.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph Neural Network Mixup Node Classification Data Augmentation Theoretical Analysis
Scores: [ 6 5 6 8 6 ]
Recently, Input Mixup, which augments virtual samples by interpolating input features and corresponding labels, is one of the promising methods to alleviate the over-fitting problem on various domains including image classification and natural language processing because of its ability to generate a variety of virtual samples, and ease of usability and versatility. However, designing Input Mixup for the node classification is still challenging due to the irregularity issue that each node contains a different number of neighboring nodes for input and the alignment issue that how to align and interpolate two sets of neighboring nodes is not well-defined when two nodes are interpolated. To address the two issues, this paper proposes a novel Mixup method, called iGraphMix, tailored to the node classification. Our method generates virtual nodes and their edges by interpolating input features and labels, and attaching sampled neighboring nodes. The virtual graphs generated by iGraphMix serve as inputs for graph neural networks (GNNs) training, thereby facilitating its easy application to various GNNs and enabling effective combination with other augmentation methods. We mathematically prove that training GNNs with iGraphMix leads to better generalization performance compared to that without augmentation, and our experiments support the theoretical findings.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Disentangled representation learning information retriever sparse retriever
Scores: [ 6 6 8 ]
Disentangled representation learning remains challenging as the underlying factors of variation in the data do not naturally exist. The inherent complexity of real-world data makes it unfeasible to exhaustively enumerate and encapsulate all its variations within a finite set of factors. However, it is worth noting that most real-world data have linguistic equivalents, typically in the form of textual descriptions. These linguistic counterparts can represent the data and effortlessly decomposed into distinct tokens. In light of this, we present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning. Our approach employ a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish dimensions that capture intrinsic characteristics within data through its natural language counterpart, thus facilitating disentanglement. We extensively assess the performance of VDR across 15 retrieval benchmark datasets, covering text-to-text and cross-modal retrieval scenarios, as well as human evaluation. Our experimental results compellingly demonstrate the superiority of VDR over previous bi-encoder retrievers with comparable model size and training costs, achieving an impressive 8.7% improvement in NDCG@10 on the BEIR benchmark, a 5.3% increase on MS COCO, and a 6.0% increase on Flickr30k in terms of mean recall in the zero-shot setting. Moreover, The results from human evaluation indicate that interpretability of our method is on par with SOTA captioning models.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: deep metric learning image retrieval mean field theory
Scores: [ 5 6 8 3 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Contrastive Learning AUC maximization Representation Learning Self Supervised Learning
Scores: [ 6 6 5 6 ]
Self-supervised learning through contrastive representations is an emergent and promising avenue, aiming at alleviating the availability of labeled data. Recent research in the field also demonstrates its viability for several downstream tasks, henceforth leading to works that implement the contrastive principle through innovative loss functions and methods. However, despite achieving impressive progress, most methods depend on prohibitively large batch sizes and compute requirements for good performance. In this work, we propose the AUC-$\textbf{C}$ontrastive $\textbf{L}$earning, a new approach to contrastive learning that demonstrates robust and competitive performance in compute-limited regimes. We propose to incorporate the contrastive objective within the AUC-maximization framework, by noting that the AUC metric is maximized upon enhancing the probability of the network's binary prediction difference between positive and negative samples which inspires adequate embedding space arrangements in representation learning. Unlike standard contrastive methods, when performing stochastic optimization, our method maintains unbiased stochastic gradients and thus is more robust to batchsizes as opposed to standard stochastic optimization problems.Remarkably, our method with a batch size of 256, outperforms several state-of-the-art methods that may need much larger batch sizes (e.g., 4096), on ImageNet and other standard datasets. Experiments on transfer learning, few-shot learning, and other downstream tasks also demonstrate the viability of our method.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised image-pretraining egocentric video Walking Tour dataset multi-object tracking
Scores: [ 8 8 8 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Markov switching processes Neural data analysis State-space models
Scores: [ 6 6 8 8 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Network Architecture Convolution Neural Network Image Classification Vision Transformer Representation Learning
Scores: [ 6 6 8 6 ]
By contextualizing the kernel as globally as possible, Modern ConvNets have shown great potential in computer vision tasks. However, recent progress of \textit{multi-order game-theoretic interaction} in deep neural networks (DNNs) shows that the representation capacity of modern ConvNets has not been well unleashed, where the most expressive interactions have not been effectively encoded with the increased kernel size. To address this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models, with preferable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, where discriminative features are efficiently gathered and contextualized in an adaptive manner. Extensive experiments show that MogaNet exhibits great scalability, impressive efficiency of model parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D&3D human pose estimation, and video prediction. Notably, MogaNet hits 80.0% and 87.8% accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L, while saving 59% FLOPs and 17M parameters, respectively.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: transformer attention context-free languages pushdown automata formal languages language modeling machine translation
Scores: [ 8 6 6 ]
Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain syntactic structures. To address this shortcoming, we propose stack attention: an attention operator that incorporates stacks, inspired by their theoretical connections to context-free languages (CFLs). We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision. We propose two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs. We show that transformers with stack attention are very effective at learning CFLs that standard transformers struggle on, achieving strong results on a CFL with theoretically maximal parsing difficulty. We also show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: relational representation learning attention transformers sequence models abstract representations
Scores: [ 6 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Large Language Models Quantization LoRA Instruct Fine-tuning
Scores: [ 6 8 5 ]
Recently years have witnessed a rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. The code is submitted and will be made publically available.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Language Model Long Context Modeling Reinforcement Learning Unsupervised Reinforcement Learning
Scores: [ 5 8 3 6 ]
Transformers have emerged as the architecture of choice for for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving extended sequences or long-term dependencies. We present a distinct approach, Ring Attention, which leverages blockwise computation of self-attention to distribute long sequences across multiple devices while concurrently overlapping the communication of key-value blocks between devices through blockwise attention computation. By processing longer input sequences while maintaining memory efficiency, Ring Attention enables training and inference of sequences that exceed 100 million tokens in length, allowing length to scale proportionally with the number of devices, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in reducing memory requirements and improving performance.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Electrocardiogram ECG Cardiac signal Biosignal Self-supervised learning Masked auto-encoder Representation learning
Scores: [ 8 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learning masked time-series modeling contrastive learning
Scores: [ 5 8 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Dense Self-supervised Learning representation learning
Scores: [ 6 8 6 ]
Dense Self-Supervised Learning (SSL) creates positive pairs by establishing correspondences between regions or points, thereby aiming to preserve local features, for example of individual objects.However, existing approaches tend to couple objects by leaking information from the neighboring contextual regions when the pairs have a limited overlap. In this paper, we first quantitatively identify and confirm the existence of such a coupling phenomenon. We then address it by developing a remarkably simple yet highly effective solution comprising a novel augmentation method, Region Collaborative Cutout (RCC), and a corresponding decoupling branch. Importantly, our design is versatile and can be seamlessly integrated into existing SSL frameworks, whether based on Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs). We conduct extensive experiments, incorporating our solution into two CNN-based and two ViT-based methods, with results confirming the effectiveness of our approach. Moreover, we provide empirical evidence that our method significantly contributes to the disentanglement of feature representations among objects, both in quantitative and qualitative terms.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self-supervised learning Masked image modeling Discrete visual token
Scores: [ 8 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Clustering Graph Learning Temporal Graph
Scores: [ 8 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: representation learning semi-supervised learning dimensionality-reduction regret minimax solution mixed strategies multiplicative weights update
Scores: [ 8 8 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: 3D Pose Estimation Unsupervised Learning Neural Rendering Analysis-by-Synthesis
Scores: [ 6 6 6 6 ]
We consider the problem of source-free unsupervised category-level 3D pose estimation from only RGB images to an non-annotated and unlabelled target domain without any access to source domain data or annotations during adaptation. Collecting and annotating real world 3D data and corresponding images is laborious, expensive yet unavoidable process since even 3D pose domain adaptation methods require 3D data in the target domain. We introduce a method which is capable of adapting to a nuisance ridden target domain without any 3D data or annotations. We represent object categories as simple cuboid meshes, and harness a generative model of neural feature activations modeled as a von Mises Fisher distribution at each mesh vertex learnt using differential rendering. We focus on individual mesh vertex features and iteratively update them based on their proximity to corresponding features in the target domain. Our key insight stems from the observation that specific object subparts remain stable across out-of-domain (OOD) scenarios, enabling strategic utilization of these invariant subcomponents for effective model updates. Our model is then trained in an EM fashion alternating between updating the vertex features and feature extractor. We show that our method simulates fine-tuning on a global-pseudo labelled dataset under mild assumptions which converges to the target domain asymptotically. Through extensive empirical validation, we demonstrate the potency of our simple approach in addressing the domain shift challenge and significantly enhancing pose estimation accuracy. By accentuating robust and less changed object subcomponents, our framework contributes to the evolution of UDA techniques in the context of 3D pose estimation using only images from the target domain.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Contrastive Learning BEV Perception
Scores: [ 6 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Multi-task learning 3D-aware dense prediction
Scores: [ 6 6 6 6 ]
Deep neural networks have become a standard building block for designing models that can perform multiple dense computer vision tasks such as depth estimation and semantic segmentation thanks to their ability to capture complex correlations in high dimensional feature space across tasks. However, the cross-task correlations that are learned in the unstructured feature space can be extremely noisy and susceptible to overfitting, consequently hurting performance. We propose to address this problem by introducing a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space and decodes them into their task output space through differentiable rendering. We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance; as we evidence using standard benchmarks NYUv2 and PASCAL-Context.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Game Theory Amortized Optimization Generalized Nash equilibrium Economics
Scores: [ 6 6 5 10 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self-supervised learning efficient training image similarity
Scores: [ 6 6 6 8 ]
Deep Neural Networks (DNNs), essential for diverse applications such as visual recognition and eldercare, often require a large amount of labeled data for training, making widespread deployment of DNNs a challenging task. Self-supervised learning (SSL) emerges as a promising approach, which leverages inherent patterns within data through diverse augmentations to train models without explicit labels. However, while SSL has shown notable advancements in accuracy, its high computation costs remain a daunting impediment, particularly for resource-constrained platforms. To address this problem, we introduce SimWnW, a similarity-based efficient self-supervised learning framework. By strategically removing less important regions in augmented images and feature maps, SimWnW not only reduces computation costs but also eliminates irrelevant features that might slow down the learning process, thereby accelerating model convergence. The experimental results show that SimWnW effectively reduces the amount of computation costs in self-supervised model training without compromising accuracy. Specifically, SimWnW yields up to 54% and 51% computation savings in training from scratch and transfer learning tasks, respectively.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: hyperbolic space random forest decision tree
Scores: [ 5 6 8 6 8 ]
Hyperbolic geometry is gaining traction in machine learning due to its capacity to effectively capture hierarchical structures in real-world data. Hyperbolic spaces, where neighborhoods grow exponentially, offer substantial advantages and have consistently delivered state-of-the-art results across diverse applications. However, hyperbolic classifiers often grapple with computational challenges. Methods reliant on Riemannian optimization frequently exhibit sluggishness, stemming from the increased computational demands of operations on Riemannian manifolds. In response to these challenges, we present HyperDT, a novel extension of decision tree algorithms into hyperbolic space. Crucially, HyperDT eliminates the need for computationally intensive Riemannian optimization, numerically unstable exponential and logarithmic maps, or pairwise comparisons between points by leveraging inner products to adapt Euclidean decision tree algorithms to hyperbolic space. Our approach is conceptually straightforward and maintains constant-time decision complexity while mitigating the scalability issues inherent in high-dimensional Euclidean spaces. Building upon HyperDT, we introduce HyperRF, a hyperbolic random forest model. Extensive benchmarking across diverse datasets underscores the superior performance of these models, providing a swift, precise, accurate, and user-friendly toolkit for hyperbolic data analysis.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Time series forecasting Deep learning Domain generalization Representation learning Sinkhorn divergence
Scores: [ 8 6 6 ]
We propose Feature-aligned N-BEATS as a domain-generalized time series forecasting model. It is a nontrivial extension of N-BEATS with doubly residual stacking principle (Oreshkin et al. [42]) into a representation learning framework. In particular, it revolves around marginal feature probability measures induced by the intricate composition of residual and feature extracting operators of N-BEATS in each stack and aligns them stack-wisely via an approximate of an optimal transport distance referred to as the Sinkhorn divergence. The training loss consists of an empirical risk minimization from multiple source domains, i.e., forecasting loss, and an alignment loss calculated with the Sinkhorn divergence, which allows the model to learn invariant features stack-wisely across multiple source data sequences while retaining N-BEATS's interpretable design and forecasting power. Comprehensive experimental evaluations with ablation studies are provided and the corresponding results demonstrate the proposed model's forecasting and generalization capabilities.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Test-time adaption Imbalanced multi-modal learning
Scores: [ 8 8 8 8 ]
Test-time adaption (TTA) has emerged as a new paradigm for reconciling distribution shifts between domains without accessing source data. However, existing TTA methods mainly concentrate on uni-modal tasks, overlooking the complexity in multi-modal scenarios.In this paper, we delve into the multi-modal test-time adaption and reveal a new challenge named reliability bias. Different from the definition of traditional distribution shifts, reliability bias refers to the information discrepancies across different modalities derived from intra-modal distribution shifts. To solve the challenge, we propose a novel method, dubbed reliable fusion and robust adaption (RFRA). On the one hand, unlike the existing TTA paradigm that mainly repurposes the normalization layers, RFRA employs a new paradigm that modulates the attention between modalities in a self-adaptive way, supporting reliable fusion against reliability bias. On the other hand, RFRA adopts a novel objective function for robust multi-modal adaption, where the contributions of confident predictions could be amplified and the negative impacts of noisy predictions could be mitigated. Moreover, we introduce two new benchmarks to facilitate comprehensive evaluations of multi-modal TTA under reliability bias. Extensive experiments on the benchmarks not only verify the effectiveness of our method but also give some new observations to the community. The code and benchmarks will be released.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learning whitening dimensional collapse spectral transformation iterative normalization
Scores: [ 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Machine Unlearning Unsupervised Learning Deep Learning
Scores: [ 5 3 8 8 ]
Machine unlearning aims to remove information derived from forgotten data while preserving that of the remaining dataset in a well-trained model. With the increasing emphasis on data privacy, several approaches to machine unlearning have emerged. However, these methods typically rely on complete supervision throughout the unlearning process. Unfortunately, obtaining such supervision, whether for the forgetting or remaining data, can be impractical due to the substantial cost associated with annotating real-world datasets. This challenge prompts us to propose a supervision-free unlearning approach that operates without the need for labels during the unlearning process. Specifically, we introduce a variational approach to approximate the distribution of representations for the remaining data. Leveraging this approximation, we adapt the original model to eliminate information from the forgotten data at the representation level. To further address the issue of lacking supervision information, which hinders alignment with ground truth, we introduce a contrastive loss to facilitate the matching of representations between the remaining data and those of the original model, thus preserving predictive performance. Experimental results across various unlearning tasks demonstrate the effectiveness of our proposed method, Label-Agnostic Forgetting (LAF) without using any labels, which achieves comparable performance to state-of-the-art methods that rely on full supervision information. Furthermore, our approach excels in semi-supervised scenarios, leveraging limited supervision information to outperform fully supervised baselines. This work not only showcases the viability of supervision-free unlearning in deep models but also opens up a new possibility for future research in unlearning at the representation level.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Representation Learning Multimodal learning Learning Theory
Scores: [ 6 6 6 8 ]
Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve the model performance. Recently, CLIP has emerged as an effective approach that employs vision-language contrastive pretraining to learn joint image and text representations and exhibits remarkable performance in zero-shot learning and text-guided natural image generation. Despite the substantial practical success of CLIP, its theoretical understanding remains elusive. In this paper, we formally study transferrable representation learning underlying CLIP and demonstrate how features from different modalities get aligned. We also analyze its zero-shot transfer performance on the downstream tasks. In addition, we conduct empirical evaluations on real data to backup our theory. Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Contrastive Learning Diffusion Model Representation Learning Self-supervised Learning Deep Learning
Scores: [ 6 6 5 3 8 ]
Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These generated high-equality images have been successfully applied to enhance contrastive representation learning, a technique termed data inflation''. However, we find that the generated data (even from a good diffusion model like DDPM) may sometimes even harm contrastive learning. We investigate the causes behind this failure from the perspective of both data inflation and data augmentation. For the first time, we reveal the complementary roles that stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena via deriving its generalization bounds under data inflation. Drawing from these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy without introducing any extra computation cost. On benchmark datasets, AdaInf can bring significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Semi-Supervised Learning; Robustness; Open Environments
Scores: [ 8 8 8 8 ]
Semi-supervised learning (SSL) has emerged as a promising paradigm to alleviate the dependency on abundant labeled data by harnessing the power of unlabeled data. Although many SSL algorithms have been proposed, their performance in practical applications is not robust because the assumption that labeled and unlabeled data are consistent does not hold. In open environments, the sources of labeled and unlabeled data may differ, leading to inconsistent data distributions and even data spaces. This paper points out that previous research on robust SSL has approached the problem from a static perspective, thereby only achieving local robustness rather than global robustness. We reshape the research framework of robust SSL by using the Robustness Analysis Curve (RAC) and the associated metrics defined based on it. Based on these metrics, we build a benchmark that encompasses three types of open environments: inconsistent data distributions, inconsistent label spaces, and inconsistent feature spaces to assess the performance of widely used statistical and deep SSL algorithms with tabular, image, and text datasets. This paper also conducted a detailed analysis, based on experimental results and theory, on how to make SSL algorithms more robust in open environments.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: transformers signal propagation theory self-attention initialisation simpler architectures skip connections normalisation fast training speed
Scores: [ 8 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Spectral Contrastive Learning
Scores: [ 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Sorting network Neural sorting network Differentiable swap functions
Scores: [ 6 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Deep imbalanced regression Contrastive learning Representation learning
Scores: [ 6 8 6 6 ]
Imbalanced distributions are ubiquitous in real-world data. They create constraints on Deep Neural Networks to represent the minority labels and avoid bias towards majority labels. The extensive body of imbalanced approaches address categorical label spaces but fail to effectively extend to regression problems where the label space is continuous. Local and global correlations among continuous labels provide valuable insights towards effectively modelling relationships in feature space. In this work, we propose ConR, a contrastive regularizer that models global and local label similarities in feature space and prevents the features of minority samples from being collapsed into their majority neighbours. ConR discerns the disagreements between the label space and feature space, and imposesa penalty on these disagreements. ConR minds the continuous nature of label space with two main strategies in a contrastive manner: incorrect proximities are penalized proportionate to the label similarities and the correct ones are encouraged to model local similarities. ConR consolidates essential considerations into a generic, easy-to-integrate, and efficient method that effectively addresses deep imbalanced regression. Moreover, ConR is orthogonal to existing approaches and smoothly extends to uni- and multi-dimensional label spaces. Our comprehensive experiments show that ConR significantly boosts the performance of all the state-of-the-art methods on four large-scale deep imbalanced regression benchmarks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: graph neural networks implicit graph neural networks implicit models graph learning
Scores: [ 8 3 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Local Learning Biologically Motivated Learning Deep Learning
Scores: [ 6 8 5 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learning masked autoencoder audio representation learning
Scores: [ 6 6 6 3 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: masked autoencoding matrix completion structured representation learning white-box transformers
Scores: [ 8 6 5 8 ]
Modern learning frameworks often train deep neural networks with massive amounts of unlabeled data to learn representations by solving simple pretext tasks, then use the representations as foundations for downstream tasks. These networks are empirically designed; as such, they are usually not interpretable, their representations are not structured, and their designs are potentially redundant. White-box deep networks, in which each layer explicitly identifies and transforms structures in the data, present a promising alternative. However, existing white-box architectures have only been shown to work at scale in supervised settings with labeled data, such as classification. In this work, we provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning. We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture, called CRATE-MAE, in which the role of each layer is mathematically fully interpretable: they transform the data distribution to and from a structured representation. Extensive empirical evaluations confirm our analytical insights. CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets while using only ~30% of the parameters compared to the standard masked autoencoder with the same model configuration. The representations learned by CRATE-MAE have explicit structure and also contain semantic meaning.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Location Encoding Positional Encoding Implicit Neural Representations Species Distribution Modeling
Scores: [ 5 8 8 6 ]
Learning feature representations of geographical space is vital for any machine learning model that integrates geolocated data, spanning application domains such as remote sensing, ecology, or epidemiology.Recent work mostly embeds coordinates using sine and cosine projections based on Double Fourier Sphere (DFS) features -- these embeddings assume a rectangular data domain even on global data, which can lead to artifacts, especially at the poles. At the same time, relatively little attention has been paid to the exact design of the neural network architectures these functional embeddings are combined with. This work proposes a novel location encoder for globally distributed geographic data that combines spherical harmonic basis functions, natively defined on spherical surfaces, with sinusoidal representation networks (SirenNets) that can be interpreted as learned Double Fourier Sphere embedding. We systematically evaluate the cross-product of positional embeddings and neural network architectures across various classification and regression benchmarks and synthetic evaluation datasets. In contrast to previous approaches that require the combination of both positional encoding and neural networks to learn meaningful representations, we show that both spherical harmonics and sinusoidal representation networks are competitive on their own but set state-of-the-art performances across tasks when combined.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Pre Training Transformers State Space Models Long Range Models Fair Evaluation
Scores: [ 8 8 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learning open-world learning segmentation
Scores: [ 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Canonical Correlation Analysis Multiview Learning Generalised Eigenvalue Problems Stochastic Gradient Descent
Scores: [ 8 5 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: causal representation learning; identifiability
Scores: [ 6 6 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Self supervised learning representation learning image encoder
Scores: [ 5 8 6 6 8 ]
Human visual recognition system shows astonishing capability of compressing visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping. Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates as powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that entirely relies on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations are used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, Perceptual Group Tokenizer achieves 79.7% on ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking a new progress under this paradigm.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: foundation models time-to-event electronic health records deep learning self-supervised learning transfer learning
Scores: [ 8 8 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Hypergraph Ordinary Differential Equations Dynamic System
Scores: [ 5 5 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Traffic Predictoin Deep Learning Spatio-Temporal data modeling
Scores: [ 6 5 6 6 ]
Accurate traffic forecasting is challenging due to the complex dependency on road networks, various types of roads, and the abrupt speed change due to the events. Recent works mainly focus on dynamic spatial modeling with adaptive graph embedding or graph attention having less consideration for temporal characteristics and in-situ modeling. In this paper, we propose a novel deep learning model named TESTAM, which individually models recurring and non-recurring traffic patterns by a mixture-of-experts model with three experts on temporal modeling, spatio-temporal modeling with static graph, and dynamic spatio-temporal dependency modeling with dynamic graph. By introducing different experts and properly routing them, TESTAM could better model various circumstances, including spatially isolated nodes, highly related nodes, and recurring and non-recurring events. For the proper routing, we reformulate a gating problem into a classification problem with pseudo labels. Experimental results on three public traffic network datasets, METR-LA, PEMS-BAY, and EXPY-TKY, demonstrate that TESTAM achieves a better indication and modeling of recurring and non-recurring traffic.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: multi-faceted representation user interests item characteristics
Scores: [ 5 8 6 ]
We seek to uncover the latent interest units from behavioral data to better learn user preferences under the VAE framework. Existing practices tend to ignore the multiple facets of item characteristics, which may not capture it at appropriate granularity. Moreover, current studies equate the granularity of item space to that of user interests, which we postulate is not ideal as user interests would likely map to a small subset of item space. In addition, the compositionality of user interests has received inadequate attention, preventing the modeling of interactions between explanatory factors driving a user's decision.To resolve this, we propose to align user interests with multi-faceted item characteristics. First, we involve prototype-based representation learning to discover item characteristics along multiple facets. Second, we compose user interests from uncovered item characteristics via binding mechanism, separating the granularity of user preferences from that of item space. Third, we design a dedicated bi-directional binding block, aiding the derivation of compositional user interests.On real-world datasets, the experimental results demonstrate the strong performance of our proposed method compared to a series of baselines.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: time series forecasting transformer robust learning token mixing
Scores: [ 6 6 5 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Knowledge distillation temperature scaling regularization image classification
Scores: [ 8 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph contrastive learning spiking neural networks binarized graph representation learning
Scores: [ 6 6 6 3 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: linear mode connectivity layer-wise federated averaging
Scores: [ 6 6 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Knowledge Distillation Self-supervised Learning
Scores: [ 8 8 6 6 ]
Transformer architecture search (TAS) has achieved remarkable progress in automating the neural architecture design process of vision transformers. Recent TAS advancements have discovered outstanding transformer architectures while saving tremendous labor from human experts. However, it is still cumbersome to deploy these methods in real-world applications due to the expensive costs of data labeling under the supervised learning paradigm. To this end, this paper proposes a masked image modelling (MIM) based self-supervised neural architecture search method specifically designed for vision transformers, termed as MaskTAS, which completely avoids the expensive costs of data labeling inherited from supervised learning. Based on the one-shot NAS framework, MaskTAS requires to train various weight-sharing subnets, which can easily diverged without strong supervision in MIM-based self-supervised learning.For this issue, we design the search space of MaskTAS as a siamesed teacher-student architecture to distill knowledge from pre-trained networks, allowing for efficient training of the transformer supernet.To achieve self-supervised transformer architecture search, we further design a novel unsupervised evaluation metric for the evolutionary search algorithm, where each candidate of the student branch is rated by measuring its consistency with the larger teacher network.Extensive experiments demonstrate that the searched architectures can achieve state-of-the-art accuracy on benchmark dataset even without using manual labels. Moreover, the proposed MaskTAS can generalize well to various data domains and tasks by searching specialized transformer architectures in self-supervised manner.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph neural networks Model evaluation Distribution shift
Scores: [ 8 8 8 8 ]
Evaluating the performance of a well-trained GNN model on real-world graphs is a pivotal step for reliable GNN online deployment and serving. Due to a lack of test node labels and unknown potential training-test graph data distribution shifts, conventional model evaluation encounters limitations in calculating performance metrics (e.g., test error) and measuring graph data-level discrepancies, particularly when the training graph used for developing GNNs remains unobserved during test time.In this paper, we study a new research problem, online GNN evaluation, which aims to provide valuable insights into the well-trained GNNs's ability to effectively generalize to real-world unlabeled graphs under the test-time graph distribution shifts.Concretely, we develop an effective learning behavior discrepancy score, dubbed LeBeD, to estimate the test-time generalization errors of well-trained GNN models. Through a novel GNN re-training strategy with a parameter-free optimality criterion, the proposed LeBeD comprehensively integrates learning behavior discrepancies from both node prediction and structure reconstruction perspectives.This enables the effective evaluation of the well-trained GNNs' ability to capture test node semantics and structural representations, making it an expressive metric for estimating the generalization error in online GNN evaluation.Extensive experiments on real-world test graphs under diverse graph distribution shifts could verify the effectiveness of the proposed method, revealing its strong correlation with ground-truth test errors on various well-trained GNN models.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learning memorization encoders generalization ssl
Scores: [ 6 6 8 6 ]
Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data---often scraped from the internet. This data can still be sensitive and empirical evidence suggests that SSL encoders memorize private information of their training data and can disclose them at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose a framework for defining memorization within the context of SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by both encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets we highlight that even though SSL relies on large datasets and strong augmentations---both known in supervised learning as regularization techniques that reduce overfitting---still significant fractions of training data points experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Learning from Label Proportions Belief Propagation Pseudo-Labeling Embedding Learning
Scores: [ 5 6 5 8 5 ]
Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines (upto 15%) for the LLP Binary Classification problem on various dataset types - tabular and Image. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Masked modeling; Data imputation
Scores: [ 5 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Semantic Segmentation Pixel Learning Representation Learning Contrastive Learning
Scores: [ 6 6 6 6 ]
More distinguishable and consistent pixel features for each category will benefit the semantic segmentation under various settings.Existing efforts to mine better pixel-level features attempt to explicitly model the categorical distribution, which fails to achieve optimal due to the significant pixel feature variance.Moreover, prior research endeavors have scarcely delved into the thorough analysis and meticulous handling of pixel-level variances, leaving semantic segmentation at a coarse granularity.In this work, We analyze the causes of pixel-level variance and raise the concept of pixel learning to concentrate on the tailored learning process of pixels, handle pixel-level variance, and enhance the segmentation model's per-pixel recognition capability.Under the context of the pixel learning scheme, each image is viewed as a distribution of pixels, and pixel learning aims to pursue consistent pixel representation inside an image, continuously align pixels from different images (distributions), and eventually achieve consistent pixel representation for each category.We proposed a pure pixel-level learning framework, namely PiXL, which consists of a pixel partition module to divide pixels into sub-domains, a prototype generation, a selection module to prepare targets for subsequent alignment, and a pixel alignment module to guarantee pixel feature consistency intra- and inter-images.Extensive evaluations of multiple learning paradigms, including unsupervised domain adaptation and semi-/fully-supervised segmentation, show that PiXL outperforms state-of-the-art performances, especially when annotated images are scarce.Visualization of the embedding space further demonstrates that pixel learning attains a superior representation of pixel features.The code will be available upon acceptance.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: contrastive learning metric learning regularization
Scores: [ 8 8 5 6 ]
Similarity-based representation learning has shown impressive capabilities in both supervised (e.g., metric learning) and unsupervised (e.g., contrastive learning) scenarios. Existing approaches effectively constrained the representation difference (i.e., the disagreement between the embeddings of two instances) to fit the corresponding (pseudo) similarity supervision. However, most of them can hardly restrict the variation of representation difference, sometimes leading to overfitting results where the clusters are disordered by drastically changed differences. In this paper, we thus propose a novel difference alignment regularization (DAR) to encourage all representation differences between inter-class instances to be as close as possible, so that the learning algorithm can produce consistent differences to distinguish data points from each other. To this end, we construct a new cross-total-variation (CTV) norm to measure the divergence among representation differences, and we convert it into an equivalent stochastic form for easy optimization. Then, we integrate the proposed regularizer into the empirical loss for difference-aligned similarity learning (DASL), shrinking the hypothesis space and alleviating overfitting. Theoretically, we prove that our regularizer tightens the error bound of the traditional similarity learning. Experiments on multi-domain data demonstrate the superiority of DASL over existing approaches in both supervised metric learning and unsupervised contrastive learning tasks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Retrieval Augmentation; Non-Knowledge-Intensive Task; Natural Language Understanding; Neural Architecture Search
Scores: [ 8 6 3 8 ]
Retrieval-based augmentations that aim to incorporate knowledge from an external database into language models have achieved great success in various knowledge-intensive (KI) tasks, such as question-answering and text generation.However, integrating retrievals in non-knowledge-intensive (NKI) tasks, such as text classification, is still challenging.Existing works focus on concatenating retrievals to inputs as context to form the prompt-based inputs. Unfortunately, such methods require language models to have the capability to handle long texts.Besides, inferring such concatenated data would also consume a significant amount of computational resources.To solve these challenges, we propose \textbf{ReFusion} in this paper, a computation-efficient \textbf{Re}trieval representation \textbf{Fusion} with neural architecture search. The main idea is to directly fuse the retrieval representations into the language models.Specifically, we first propose an online retrieval module that retrieves representations of similar sentences.Then, we present a retrieval fusion module including two effective ranking schemes, i.e., reranker-based scheme and ordered-mask-based scheme, to fuse the retrieval representations with hidden states.Furthermore, we use Neural Architecture Search (NAS) to seek the optimal fusion structure across different layers. Finally, we conduct comprehensive experiments, and the results demonstrate our ReFusion can achieve superior and robust performance on various NKI tasks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: federated learning contrastive learning self-supervised semi-supervised mutual information
Scores: [ 5 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Time Series Forecasting Transformer Multivariate Time Series
Scores: [ 6 6 6 ]
Long-term time series forecasting (LTSF) is important for various domains but is confronted by challenges in handling the complex temporal-contextual relationships. As multivariate input models underperforming some recent univariate counterparts, we posit that the issue lies in the inefficiency of existing multivariate LTSF Transformers to model series-wise relationships: the characteristic differences between series are often captured incorrectly. To address this, we introduce ARM: a multivariate temporal-contextual adaptive learning method, which is an enhanced architecture specifically designed for multivariate LTSF modelling. ARM employs Adaptive Univariate Effect Learning (AUEL), Random Dropping (RD) training strategy, and Multi-kernel Local Smoothing (MKLS), to better handle individual series temporal patterns and correctly learn inter-series dependencies. ARM demonstrates superior performance on multiple benchmarks without significantly increasing computational costs compared to vanilla Transformer, thereby advancing the state-of-the-art in LTSF. ARM is also generally applicable to other LTSF architecture beyond vanilla Transformer.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: score matching deep equilibrium model density estimation
Scores: [ 6 8 6 6 ]
Score matching methods -- estimate probability densities without computing the normalization constant -- are particularly useful in deep learning. However, computational and memory costs of score matching methods can be prohibitive for high-dimensional data or complex models, particularly due to the derivatives or Hessians of the log density function appearing in the objective function. Some existing approaches modify the objective function to reduce the quadratic computational complexity for Hessian computation. However, the memory bottleneck of score matching methods remains for deep learning. This study improves the memory efficiency of score matching by leveraging deep equilibrium models. We provide a theoretical analysis of deep equilibrium models for scoring matching and applying implicit differentiation to higher-order derivatives. Empirical evaluations demonstrate that our approach enables the development of deep and expressive models with improved performance and comparable computational and memory costs over shallow architectures.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: segmentation in the loop for recognition hierarchical segmentation part-to-whole recognition vision transformer
Scores: [ 8 6 10 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: random tensors multi-view clustering random matrix theory tensor unfolding
Scores: [ 6 5 6 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learning domain-agnostic learning masked modeling protein biology chemistry particle physics
Scores: [ 6 6 6 6 6 ]
Self-supervised learning excels in learning representations from large amounts of unlabeled data, demonstrating success across multiple data modalities. Yet, extending self-supervised learning to new modalities is non-trivial because the specifics of existing methods are tailored to each domain, such as domain-specific augmentations which reflect the invariances in the target task. While masked modeling is promising as a domain-agnostic framework for self-supervised learning because it does not rely on input augmentations, its mask sampling procedure remains domain-specific. We present Self-guided Masked Autoencoders (SMA), a fully domain-agnostic masked modeling method. SMA trains an attention based model using a masked modeling objective, by learning masks to sample without any domain-specific assumptions. We evaluate SMA on three self-supervised learning benchmarks in protein biology, chemical property prediction, and particle physics. We find SMA is capable of learning representations without domain-specific knowledge and achieves state-of-the-art performance on these three benchmarks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Data Selection Empirical Risk Minimization Influence Functions High dimensional asymptotics
Scores: [ 8 5 8 10 ]
Given a sample of size N, it is often useful to select a subsample of smaller size n<N to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume to be given N unlabeled samples xi, and to be given access to a `surrogate model' that can predict labels yi better than random guessing. Our goal is to select a subset of the samples, to be denoted by {xi}i∈G, of size |G|=n<N. We then acquire labels for this set and we use them to train a model via regularized empirical risk minimization. By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high- dimensional asymptotics, we show that: (i) Data selection can be very effective, in particular beating training on the full sample in some cases; (ii) Certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: unsupervised object-centric learning discrete representations
Scores: [ 6 6 8 6 ]
Neural discrete representations are crucial components of modern neural networks. However, their main limitation is that the primary strategies such as VQ-VAE can only provide representations at the patch level. Therefore, one of the main goals of representation learning, acquiring conceptual, semantic, and compositional abstractions such as the color and shape of an object, remains elusive. In this paper, we present the first approach to semantic neural discrete representation learning. The proposed model, called Semantic Vector-Quantized Variational Autoencoder (SVQ), leverages recent advances in unsupervised object-centric learning to address this limitation. Specifically, we observe that a simple approach quantizing at the object level poses a significant challenge and propose constructing scene representations hierarchically, from low-level discrete concept schemas to object representations. Additionally, we suggest a novel method for training a prior over these semantic representations, enabling the ability to generate images following the underlying data distribution, which is lacking in most object-centric models. In experiments on various 2D and 3D object-centric datasets, we find that our model achieves superior generation performance compared to non-semantic vector quantization methods such as VQ-VAE and previous object-centric generative models. Furthermore, we find that the semantic discrete representations can solve downstream scene understanding tasks that require reasoning about the properties of different objects in the scene.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Object-Centric learning Compositionality
Scores: [ 6 6 8 6 ]
Learning compositional representation is a key aspect of object-centric learning as it enables flexible systematic generalization and supports complex visual reasoning. However, most of the existing approaches rely on auto-encoding objective, while the compositionality is implicitly imposed by the architectural or algorithmic bias in the encoder. This misalignment between auto-encoding objective and learning compositionality often results in failure of capturing meaningful object representations. In this study, we propose a novel objective that explicitly encourages compositionality of the representations. Built upon the existing object-centric learning framework (e.g., slot attention), our method incorporates additional constraints that an arbitrary mixture of object representations from two images should be valid by maximizing the likelihood of the composite data. We demonstrate that incorporating our objective to the existing framework consistently improves the objective-centric learning and enhances the robustness to the architectural choices.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph Neural Networks Expressive Power Homomorphism Subgraph Counting Weisfeiler-Lehman
Scores: [ 8 10 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: in-context learning transformers looped transformers
Scores: [ 8 6 5 ]
Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al. (2022). However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms, which are commonly employed in traditional machine learning methods. To address this, we propose the utilization of looped transformer architecture and its associated training methodology, with the aim of incorporating iterative characteristics into the transformer architectures. Experimental results suggest that the looped transformer achieves performance comparable to the standard transformer in solving various data-fitting problems, while utilizing less than 10% of the parameter count.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Emergent Communication Emergent Language Probabilistic Generative Model Variational Autoencoder beta-VAE Zipf’s law of abbreviation Harris’s articulation scheme
Scores: [ 5 8 8 3 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: representation learning representation learning theory contrastive learning robustness
Scores: [ 6 6 5 5 ]
Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP \citep{radford2021learning}, have achieved a remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite the empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL's robustness: \emph{intra-class contrasting}, which allows the model to learn features with a high variance, and \emph{inter-class feature sharing}, where annotated details in one class help learning other classes better. Both mechanisms prevent spurious features that are over-represented in the training data to overshadow the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment involving training CLIP on MS COCO \citep{lin2014microsoft} and evaluating the model on variations of shifted ImageNet.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Contrastive learning Forward learning Local learning Image classification
Scores: [ 8 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Tabular data Deep neural networks Tabular representation learning Prototype learning
Scores: [ 8 8 8 ]
Tabular data have been playing a mostly important role in diverse real-world fields, such as healthcare, engineering, finance, etc.With the recent success of deep learning, many tabular machine learning (ML) methods based on deep networks (e.g., Transformer, ResNet) have achieved competitive performance on tabular benchmarks. However, existing deep tabular ML methods suffer from the representation entanglement and localization, which largely hinders their prediction performance and leads to performance inconsistency on tabular tasks.To overcome these problems, we explore a novel direction of applying prototype learning for tabular ML and propose a prototype-based tabular representation learning framework, PTaRL, for tabular prediction tasks. The core idea of PTaRL is to construct prototype-based projection space (P-Space) and learn the disentangled representation around global data prototypes. Specifically, PTaRL mainly involves two stages: (i) Prototype Generating, that constructs global prototypes as the basis vectors of P-Space for representation, and (ii) Prototype Projecting, that projects the data samples into P-Space and keeps the core global data information via Optimal Transport. Then, to further acquire the disentangled representations, we constrain PTaRL with two strategies: (i) to diversify the coordinates towards global prototypes of different representations within P-Space, we bring up a diversifying constraint for representation calibration; (ii) to avoid prototype entanglement in P-Space, we introduce a matrix orthogonalization constraint to ensure the independence of global prototypes. Finally, we conduct extensive experiments in PTaRL coupled with state-of-the-art deep tabular ML models on various tabular benchmarks and the results have shown our consistent superiority.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: image clustering vision-language models large language models foundation models
Scores: [ 8 8 6 6 ]
Classical clustering methods do not provide users with direct control of the clustering results, and the clustering results may not be consistent with the relevant criterion that a user has in mind. In this work, we present a new methodology for performing image clustering based on user-specified criteria in the form of text by leveraging modern Vision-Language Models and Large Language Models. We call our method Image Clustering Conditioned on Text Criteria (IC$|TC),anditrepresentsadifferentparadigmofimageclustering.IC|TCrequiresaminimalandpracticaldegreeofhumaninterventionandgrantstheusersignificantcontrolovertheclusteringresultsinreturn.OurexperimentsshowthatIC|$TC can effectively cluster images with various criteria, such as human action, physical location, or the person's mood, significantly outperforming baselines.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Interpretation Structured output Energy function Explainable Structured output
Scores: [ 5 3 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: partial label learning label disambiguation candidate label set pruning
Scores: [ 8 8 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: tensor product representation systematic generalization compositional generalization binding problem structured representation learning competitive attention
Scores: [ 8 6 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: language model interpretability representation learning sparsity dictionary learning unsupervised learning
Scores: [ 5 6 1 6 6 ]
One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task \citep{wang2022interpretability} to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: neural collapse orthogonal fixed classifier imbalanced continual
Scores: [ 5 6 6 6 ]
Fixed classifiers in neural networks for classification problems have demonstrated cost efficiency and even outperformed learnable classifiers in some popular benchmarks when incorporating orthogonality. Despite these advantages, prior research has yet to investigate the training dynamics of fixed orthogonal classifiers on neural collapse, a recently clarified phenomenon that last-layer features converge to a specific form, called simplex ETF, in training classification models involving the post-zero-error phase. Ensuring this phenomenon is critical for obtaining global optimality in a layer-peeled model, potentially leading to enhanced performance in practice. However, fixed orthogonal classifiers cannot invoke neural collapse due to their geometric limitations. To overcome the limits, we analyze a zero-mean neural collapse considering the orthogonality in non-negative Euclidean space. Then, we propose a fixed non-negative orthogonal classifier that induces the optimal solution and maximizes the margin of an orthogonal layer-peeled model by satisfying the properties of zero-mean neural collapse. Building on this foundation, we exploit a feature dimension separation effect inherent in our classifier for further purposes: (1) enhances softmax masking by mitigating feature interference in continual learning and (2) tackles the limitations of mixup on the hypersphere in imbalanced learning. We conducted comprehensive experiments on various datasets and architectures and demonstrated significant performance improvements.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Representation learning Self-supervised learning random data projections domain-agnostic representation learning
Scores: [ 6 6 6 8 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: contrastive learning neural networks stream data analysis
Scores: [ 6 5 6 8 8 ]
Modern techniques like contrastive learning have been effectively used in many areas, including computer vision, natural language processing, and graph-structured data. Creating positive examples that assist the model in learning robust and discriminative representations is a crucial stage in contrastive learning approaches. Usually, preset human intuition directs the selection of relevant data augmentations. Due to patterns that are easily recognized by humans, this rule of thumb works well in the vision and language domains. However, it is impractical to visually inspect the temporal structures in time series. The diversity of time series augmentations at both the dataset and instance levels makes it difficult to choose meaningful augmentations on the fly. Thus, although prevalent, contrastive learning with data augmentation has been less studied in the time series domain. In this study, we address this gap by analyzing time series data augmentation using information theory and summarizing the most commonly adopted augmentations in a unified format. We then propose a parametric augmentation method, AutoTCL, which can be adaptively employed to support time series representation learning. The proposed approach is encoder-agnostic, allowing it to be seamlessly integrated with different backbone encoders. Experiments on univariate forecasting tasks demonstrate the highly competitive results of our method, with an average 6.5% reduction in MSE and 4.7% in MAE over the leading baselines. In classification tasks, AutoTCL achieves a 1.2% increase in average accuracy.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Unsupervised learning global correspondence point cloud statsitical shape modeling
Scores: [ 8 5 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: neural networks monotonicity scalability conventional error-backpropagation
Scores: [ 8 5 8 5 ]
In this research, we focus on the problem of learning monotonic neural networks, as preserving the monotonicity of a model with respect to a subset of inputs is crucial for practical applications across various domains. Although several methods have recently been proposed to address this problem, they have limitations such as not guaranteeing monotonicity in certain cases, requiring additional inference time, lacking scalability with increasing network size and number of monotonic inputs, and manipulating network weights during training. To overcome these limitations, we introduce a simple but novel architecture of the partially connected network which incorporates a 'scalable monotonic hidden layer' comprising three units: the exponentiated unit, ReLU unit, and confluence unit. This allows for the repetitive integration of the scalable monotonic hidden layers without other structural constraints. Consequently, our method offers ease of implementation and rapid training through the conventional error-backpropagation algorithm. We accordingly term this method as Scalable Monotonic Neural Networks (SMNN). Numerical experiments demonstrated that our method achieved comparable prediction accuracy to the state-of-the-art approaches while effectively addressing the aforementioned weaknesses.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Temporal Graph Analysis Topological Data Analysis Graph Property Prediction Graph Neural Networks
Scores: [ 8 8 6 6 ]
Many real-world networks evolve over time, and predicting the evolution of such networks remains a challenging task. Graph Neural Networks (GNNs) have shown empirical success for learning on static graphs, but they lack the ability to effectively learn from nodes and edges with different timestamps. Consequently, the prediction of future properties in temporal graphs remains a relatively under-explored area.In this paper, we aim to bridge this gap by introducing a principled framework, named GraphPulse. The framework combines two important techniques for the analysis of temporal graphs within a Newtonian framework. First, we employ the Mapper method, a key tool in topological data analysis, to extract essential clustering information from graph nodes. Next, we harness the sequential modeling capabilities of Recurrent Neural Networks (RNNs) for temporal reasoning regarding the graph's evolution. Through extensive experimentation, we demonstrate that our model enhances the ROC-AUC metric by 10.2% in comparison to the top-performing state-of-the-art method across various temporal networks. We provide the implementation of GraphPulse at https://anonymous.4open.science/r/Graph_Pulse
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Hypergraph Self-supervised learning Hypergraph neural network
Scores: [ 5 8 5 6 ]
Hypergraphs are marked by complex topology, expressing higher-order interactions among multiple nodes with hyperedges, and better capturing the topology is essential for effective representation learning. Recent advances in generative self-supervised learning (SSL) suggest that hypergraph neural networks (HNNs) learned from generative self-supervision have the potential to effectively encode the complex hypergraph topology. Designing a generative SSL strategy for hypergraphs, however, is not straightforward. Questions remain with regard to its generative SSL task, connection to downstream tasks, and empirical properties of learned representations. In light of the promises and challenges, we propose a novel generative SSL strategy for hypergraphs. We first present a generative SSL task on hypergraphs, hyperedge filling, and highlight its theoretical connection to node classification. Based on the generative SSL task, we propose a hypergraph SSL method, HYPEBOY. HYPEBOY learns effective general-purpose hypergraph representations, outperforming 15 baseline methods across 11 benchmark datasets. To our knowledge, this is the first study on generative SSL on hypergraphs, and we demonstrate its theoretical and empirical strengths for hypergraph representation learning.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: language models pre-training training efficiency parameter-efficient fine-tuning lora
Scores: [ 5 6 6 6 ]
Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparametrized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer language models with up to 1.3B parameters and demonstrate comparable performance to regular neural network training. ReLoRA saves up to 5.5Gb of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: pruning large-scale data curation concept-based LAION DataComp
Scores: [ 5 3 8 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Deep Learning Imitation Learning Stability Exponential Moving Average Optimization
Scores: [ 3 5 8 ]
This work studies training instabilities of behavior cloning with deep neural networks. We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards, despite negligibly affecting the behavior cloning loss. We empirically disentangle the statistical and computational causes of these oscillations, and find them to stem from the chaotic propagation of minibatch SGD noise through unstable closed-loop dynamics. While SGD noise is benign in the single-step action prediction objective, it results in catastrophic error accumulation over long horizons, an effect we term gradient variance amplification (GVA). We demonstrate that many standard mitigation techniques do not alleviate GVA, but that taking an exponential moving average (EMA) of iterates is surprisingly effective at doing so. Furthermore, we illustrate the generality of the phenomenon by showing both the existence of GVA and its amelioration by EMA in autoregressive language generation. Finally, we provide theoretical vignettes both exhibiting the benefits of EMA in alleviating GVA and illustrating the extent to which classical convex models help in understanding the benefits of iterate averaging in deep learning.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph neural network unsupervised anomaly detection
Scores: [ 6 6 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Singular value decomposition; Sparse matrix; Unsupervised learning
Scores: [ 8 6 6 5 ]
Updating truncated Singular Value Decomposition (SVD) has extensive applications in representation learning.The continuous evolution of massive-scaled data matrices in practical scenarios highlights the importance of aligning SVD-based models with fast-paced updates.Recent methods for updating truncated SVD can be recognized as Rayleigh-Ritz projection procedures where their projection matrices are augmented based on the original singular vectors.However, the updating process in these methods densifies the update matrix and applies the projection to all singular vectors, resulting in inefficiency.This paper presents a novel method for dynamically approximating the truncated SVD of a sparse and temporally evolving matrix.The proposed method takes advantage of sparsity in the orthogonalization process of the augment matrices and employs an extended decomposition to store projections in the column space of singular vectors independently.Numerical experimental results on updating truncated SVD for evolving sparse matrices show an order of magnitude improvement in the efficiency of our proposed method while maintaining precision comparing to previous methods.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: label noise early stopping deep learning
Scores: [ 6 6 5 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: adversarial robustness robust fairness adversarial examples adversarial training
Scores: [ 6 8 6 5 ]
The disparity in accuracy between classes in standard training is amplified during adversarial training, a phenomenon termed the robust fairness problem. Existing methodologies aimed to enhance robust fairness by sacrificing the model's performance on easier classes in order to improve its performance on harder ones. However, we observe that under adversarial attacks, the majority of the model's predictions for samples from the worst class are biased towards classes similar to the worst class, rather than towards the easy classes. Through theoretical and empirical analysis, we demonstrate that robust fairness deteriorates as the distance between classes decreases. Motivated by these insights, we introduce the Distance-Aware Fair Adversarial Training (DAFA) methodology, which addresses robust fairness by taking into account the similarities between classes. Specifically, our method assigns distinct adversarial margins and loss weights to each class and adjusts them to encourage a trade-off in robustness among similar classes. Experimental results across various datasets demonstrate that our method not only maintains average robust accuracy but also significantly improves the worst robust accuracy, indicating a marked improvement in robust fairness compared to existing methods.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Mutual Distillation Semantic to Instance Instance to Semantic Point-level Supervised Instance Segmentation
Scores: [ 3 6 6 8 6 ]
Point-level Supervised Instance Segmentation (PSIS) aims to enhance the applicability and scalability of instance segmentation by utilizing low-cost yet instance-informative annotations. Existing PSIS methods usually rely on positional information to distinguish objects, but predicting precise boundaries remains challenging due to the lack of contour annotations. Nevertheless, weakly supervised semantic segmentation methods are proficient in utilizing intra-class feature consistency to capture the boundary contours of the same semantic regions. In this paper, we design a Mutual Distillation Module (MDM) to leverage the complementary strengths of both instance position and semantic information and achieve accurate instance-level object perception. The MDM consists of Semantic to Instance (S2I) and Istance to Semantic (I2S). S2I is guided by the precise boundaries of semantic regions to learn the association between annotated points and instance contours. I2S leverages discriminative relationships between instances to facilitate the differentiation of various objects within the semantic map. Extensive experiments substantiate the efficacy of MDM in fostering the synergy between instance and semantic information, consequently improving the quality of instance-level object representations. Our method achieves 55.7 mAP50 and 17.6 mAP on the PASCAL VOC and MS COCO datasets, significantly outperforming recent PSIS methods and several box-supervised instance segmentation competitors.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Supervised contrastive learning neural collapse implicit bias class imbalance
Scores: [ 6 6 5 6 ]
Supervised contrastive loss (SCL) is a competitive and often superior alternative to the cross-entropy loss for classification. While prior studies have demonstrated that both losses yield symmetric training representations under balanced data, this symmetry breaks under class imbalances. This paper presents an intriguing discovery: the introduction of a ReLU activation at the final layer effectively restores the symmetry in SCL-learned representations. We arrive at this finding analytically, by establishing that the global minimizers of an unconstrained features model with SCL loss and entry-wise non-negativity constraints form an orthogonal frame. Extensive experiments conducted across various datasets, architectures, and imbalance scenarios corroborate our finding. Importantly, our experiments reveal that the inclusion of the ReLU activation restores symmetry without compromising test accuracy. This constitutes the first geometry characterization of SCL under imbalances. Additionally, our analysis and experiments underscore the pivotal role of batch selection strategies in representation geometry. By proving necessary and sufficient conditions for mini-batch choices that ensure invariant symmetric representations, we introduce batch-binding as an efficient strategy that guarantees these conditions hold.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Learning Theory Representation Learning Self-supervised Learning Data Augmentation RKHS Approximation RKHS Regression
Scores: [ 8 5 6 8 ]
Data augmentation is critical to the empirical success of modern self-supervised representation learning, such as contrastive learning and masked language modeling.However, a theoretical understanding of the exact role of the augmentation remains limited.Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator, suggesting that learning a linear probe atop such representation can be connected to RKHS regression.Building on this insight, this work delves into a statistical analysis of augmentation-based pretraining.Starting from the isometry property, a geometric characterization of the target function given by the augmentation, we disentangle the effects of the model and the augmentation,and prove two generalization bounds that are free of model complexity.Our first bound works for an arbitrary encoder, and it is the sum of an estimation error bound incurred by fitting a linear probe, and an approximation error bound by RKHS approximation.Our second bound specifically addresses the casewhere the encoder extracts the top-d eigenspace of a finite-sample-based approximation of the underlying RKHS.A key ingredient in our analysis is the augmentation complexity,which we use to quantitatively compare different augmentations and analyze their impact on downstream performance.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: set-level labels fast excess risk rate representation learning few-shot learning
Scores: [ 6 5 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Protein-ligand binding representation learning self-supervised
Scores: [ 6 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: learning with noisy label noisy label classification Transition matrix Dirichlet distribution Importance sampling
Scores: [ 6 6 6 5 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Noisy correspondence Video-language pre-training
Scores: [ 8 8 8 8 ]
Most existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the over-high computational cost of modeling long videos. To address this issue, one feasible solution is learning the correspondence between video clips and captions, which however would inevitably encounter the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to the clip-caption misalignment (course-grained) and frame-word misalignment (fine-grained), which would hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton) that addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address course-grained misalignment in video-paragraph contrast, Norton filters out the irrelevant clips and captions through an alignable prompt bucket and realigns the asynchronous clip-caption pairs based on transport distance. To address the fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits the potential faulty negative samples in clip-caption contrastive learning by rectifying the alignment target with OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. The code will be released on GitHub upon acceptance.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: representation learning vector quantization quantization
Scores: [ 6 6 8 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: automatic evaluation large language models llm-as-a-judge
Scores: [ 5 6 1 6 ]
Recently, GPT-4 has become the de facto evaluator for long-form text generated by large language models (LLMs). However, for practitioners and researchers with large and custom evaluation tasks, GPT-4 is unreliable due to its closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose PROMETHEUS a fully open-source LLM that is on par with GPT-4’s evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. For this purpose, we construct a new dataset – FEEDBACK COLLECTION – that consists of 1K fine-grained score rubrics, 20K instructions, and 100K natural language feedback generated by GPT-4. Using the FEEDBACK COLLECTION, we train PROMETHEUS, a 13B evaluation-specific LLM that can assess any given response based on novel and unseen score rubrics and reference materials provided by the user. Our dataset’s versatility and diversity make our model generalize to challenging real-world criteria, such as prioritizing conciseness, child-readability, or varying levels of formality. We show that PROMETHEUS shows a stronger correlation with GPT-4 evaluation compared to ChatGPT on seven evaluation benchmarks (Two Feedback Collection testsets, MT Bench, Vicuna Bench, Flask Eval, MT Bench Human Judgment, and HHH Alignment), showing the efficacy of our model and dataset design. During human evaluation with hand-crafted score rubrics, PROMETHEUS shows a Pearson correlation of 0.897 with human evaluators, which is on par with GPT-4-0613 (0.882), and greatly outperforms ChatGPT (0.392). Remarkably, when assessing the quality of the generated feedback, PROMETHEUS demonstrates a win rate of 58.62% when compared to GPT-4 evaluation and a win rate of 79.57% when compared to ChatGPT evaluation. Our findings suggests that by adding reference materials and training on GPT-4 feedback, we can obtain effective open-source evaluator LMs.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: checkerboard learning robustness
Scores: [ 6 8 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Information Bottleneck Cauchy-Schwarz Divergence Regression
Scores: [ 6 6 5 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: compositional generalization identifiability object-centric learning generalization OOD generalization unsupervised learning slot attention disentanglement autoencoders representation learning
Scores: [ 8 8 6 ]
Learning representations that generalize to novel compositions of known concepts is crucial for bridging the gap between human and machine perception. One prominent effort is learning object-centric representations, which are widely conjectured to enable compositional generalization. Yet, it remains unclear when this conjecture will be true, as a principled theoretical or empirical understanding of compositional generalization is lacking. In this work, we investigate when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory. We show that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-centric representations that provably generalize compositionally. We validate our theoretical result and highlight the practical relevance of our assumptions through experiments on synthetic image data.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Time Series Prediction; Multivariate Time Series; Modern Hopfield Networks; Sparse Hopfield Model; Hopfield Layer; Attention Mechanism
Scores: [ 5 8 5 8 ]
We present STanHop-Net (Sparse Tandem Hopfield Network) for multivariate time series prediction with memory-enhanced capabilities. At the heart of our approach is STanHop, a novel Hopfield-based neural network block, which sparsely learns and stores both temporal and cross-series representations in a data-dependent fashion. In essence, STanHop sequentially learn temporal representation and cross-series representation using two tandem sparse Hopfield layers. In addition, StanHop incorporates two additional external memory modules: a Plug-and-Play module and a Tune-and-Play module for train-less and task-aware memory-enhancements, respectively. They allow StanHop-Net to fastly respond to certain sudden events. Methodologically, we construct the StanHop-Net by stacking STanHop blocks in a hierarchical fashion, enabling multi-resolution feature extraction with resolution-specific sparsity. Theoretically, we introduce a sparse extension of the modern Hopfield model and show that it endows a tighter memory retrieval error compared to the dense counterpart without sacrificing memory capacity. Empirically, we validate the efficacy of our framework on both synthetic and real-world settings.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Adversarial training Adversarial Robustness Generalization Robustness Robust overfitting Selective training
Scores: [ 5 6 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Extreme Classification Extreme Multi-Label Learning
Scores: [ 5 8 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Spiking Neural Networks Randomized Smoothing Adversarial Learning Certified Robustness
Scores: [ 8 6 5 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph compression; Graph neural networks
Scores: [ 5 8 6 6 5 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Large Language Models AI-Generated Texts Positive-Unlabeled Learning
Scores: [ 6 6 6 8 ]
Recent releases of Large Language Models (LLMs), e.g. ChatGPT, are astonishing at generating human-like texts, but they may impact the authenticity of texts. Previous works proposed methods to detect these AI-generated texts, including simple ML classifiers, pretrained-model-based zero-shot methods, and finetuned language classification models. However, mainstream detectors always fail on short texts, like SMSes, Tweets, and reviews. In this paper, a Multiscale Positive-Unlabeled (MPU) training framework is proposed to address the difficulty of short-text detection without sacrificing long-texts. Firstly, we acknowledge the human-resemblance property of short machine texts, and rephrase AI text detection as a partial Positive-Unlabeled (PU) problem by regarding these short machine texts as partially "unlabeled". Then in this PU context, we propose the length-sensitive Multiscale PU Loss, where a recurrent model in abstraction is used to estimate positive priors of scale-variant corpora. Additionally, we introduce a Text Multiscaling module to enrich training corpora. Experiments show that our MPU method augments detection performance on long AI-generated texts, and significantly improves short-text detection of language model detectors. Language Models trained with MPU could outcompete existing detectors on various short-text and long-text detection benchmarks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: unsupervised domain translation translation identifiability distribution matching unpaired image to image translation
Scores: [ 6 8 6 5 ]
Unsupervised domain translation (UDT) is often realized by generative adversarial network (GAN)-based probability distribution matching of the source and target domains. CycleGAN stands as arguably the most representative approach among this line of work. However, it was noticed in recent works that CycleGAN and variants could fail to identify the desired translation function and produce content-misaligned translations. This limitation arises due to the presence of multiple translation functions---referred to as measure-preserving automorphism" (MPA)---in the solution space of the learning criteria. Despite the awareness of such identifiability issues, solutions have remained elusive. This study delves into the core identifiability challenge and introduces an MPA elimination theory. Our analysis shows that MPA is unlikely to exist, if multiple pairs of diverse cross-domain conditional distributions are aligned by the learning criterion. Our theory leads to a UDT learner using distribution matching over auxiliary variable-induced subsets of the domains---other than over the entire source/target domains as in the classical setting. The proposed framework is the first to rigorously establish identifiability of the desired translation function for UDT, to our best knowledge. Experiments corroborate with our theoretical claims.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: simulation dynamics nerf particle dynamics
Scores: [ 8 6 6 6 ]
Realistic simulation is critical for applications ranging from robotics to animation. Traditional analytic simulators sometimes struggle to capture sufficiently realistic simulation which can lead to problems including the well known "sim-to-real" gap in robotics. Learned simulators have emerged as an alternative for better capturing real-world physical dynamics, but require access to privileged ground truth physics information such as precise object geometry or particle tracks. Here we propose a method for learning simulators directly from observations. Visual Particle Dynamics (VPD) jointly learns a latent particle-based representation of 3D scenes, a neural simulator of the latent particle dynamics, and a renderer that can produce images of the scene from arbitrary views. VPD learns end to end from posed RGB-D videos and does not require access to privileged information. Unlike existing 2D video prediction models, we show that VPD's 3D structure enables scene editing and long-term predictions. These results pave the way for downstream applications ranging from video editing to robotic planning.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Open-set Recognization Graph Classification Domain Adaptation
Scores: [ 8 5 3 8 8 ]
Recently, numerous graph neural network methods have been developed to tackle domain shifts in graph data. However, these methods presuppose that unlabeled target graphs belong to categories previously seen in the source domain. This assumption could not hold true for in-the-wild target graphs. In this paper, we delve deeper to explore a more realistic problem open-set graph domain adaptation. Our objective is to not only identify target graphs from new categories but also accurately classify remaining target graphs into their respective categories under domain shift and label scarcity. To address this challenging problem, we introduce a novel method named Dual Structured Exploration with Mixup (DREAM). DREAM incorporates a graph-level representation learning branch as well as a subgraph-enhanced branch, which jointly explores graph topological structures from both global and local viewpoints. To maximize the use of unlabeled target graphs, we train these two branches simultaneously using posterior regularization to enhance their inter-module consistency. To accommodate the open-set setting, we amalgamate dissimilar samples to generate virtual unknown samples belonging to novel classes. Moreover, to alleviate domain shift, we establish a k nearest neighbor-based graph-of-graphs and blend multiple neighbors of each sample to produce cross-domain virtual samples for inter-domain consistency learning. Extensive experiments validate the effectiveness of our proposed DREAM compared with various state-of-the-art approaches in different settings.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Audio-language learning conditional audio separation unsupervised learning weakly supervised learning semi-supervised learning
Scores: [ 8 6 6 ]
Conditional sound separation in multi-source audio mixtures without having access to single source sound data during training is a long standing challenge. Existing mix-and-separate based methods suffer from significant performance drop with multi-source training mixtures due to the lack of supervision signal for single source separation cases during training. However, in the case of language-conditional audio separation, we do have access to corresponding text descriptions for each audio mixture in our training data, which can be seen as (rough) representations of the audio samples in the language modality. That raises the curious question of how to generate supervision signal for single-source audio extraction by leveraging the fact that single-source sounding language entities can be easily extracted from the text description. To this end, in this paper, we propose a generic bi-modal separation framework which can enhance the existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality (i.e., language), without having access to single-source samples in the target modality during training. We empirically show that this is well within reach if we have access to a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we propose to incorporate our framework into two fundamental scenarios to enhance separation performance. First, we show that our proposed methodology significantly improves the performance of purely unsupervised baselines by reducing the distribution shift between training and test samples. In particular, we show that our framework can achieve 71% boost in terms of Signal-to-Distortion Ratio (SDR) over the baseline, reaching 97.5% of the supervised learning performance. Second, we show that we can further improve the performance of the supervised learning itself by 17% if we augment it by our proposed weakly-supervised framework. Our framework achieves this by making large corpora of unsupervised data available to the supervised learning model as well as utilizing a natural, robust regularization mechanism through weak supervision from the language modality, and hence enabling a powerful semi-supervised framework for audio separation. Our code base and checkpoints will be released for further research and reproducibility.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Multivariate time series Self-supervised Time series representations Temporal features Time-Embeddings Representation Learning Missing data
Scores: [ 6 8 5 6 5 ]
Multivariate time series present challenges to standard machine learning techniques, as they are often unlabeled, high dimensional, noisy, and contain missing data. To address this, we propose T-Rep, a self-supervised method to learn time series representations at a timestep granularity. T-Rep learns vector embeddings of time alongside its feature extractor, to extract temporal features such as trend, periodicity, or distribution shifts from the signal. These time-embeddings are leveraged in pretext tasks, to incorporate smooth and fine-grained temporal dependencies in the representations, as well as reinforce robustness to missing data. We evaluate T-Rep on downstream classification, forecasting, and anomaly detection tasks. It is compared to existing self-supervised algorithms for time series, which it outperforms in all three tasks. We test T-Rep in missing data regimes, where it proves more resilient than its counterparts. Finally, we provide latent space visualisation experiments, highlighting the interpretability of the learned representations.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: sharpness-aware minimization representation learning spurious correlations
Scores: [ 6 6 8 6 6 6 ]
Sharpness-Aware Minimization (SAM) has emerged as a promising alternative to stochastic gradient descent (SGD) for minimizing the loss objective in neural network training. While the motivation behind SAM is to bias models towards flatter minima that are believed to generalize better, recent studies have shown conflicting evidence on the relationship between flatness and (in-distribution) generalization, leaving the mechanism behind SAM's performance improvement unclear. In this work, we present a complementary effect that cannot be explained by in-distribution improvements alone: we argue that SAM can enhance the quality of features in datasets containing redundant or spurious features. We explain how SAM can induce feature diversity by investigating a controlled setting. Our results imply that one mechanism by which SAM improves the quality of features is by adaptively suppressing well-learned features which gives remaining features opportunity to be learned.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Desiderata of Ideal Uniformity Metric Dimensional Collapse Wasserstein Distance Self-Supervised Learning
Scores: [ 5 5 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: model growth efficient deep learning continual learning
Scores: [ 8 8 6 6 ]
Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models.Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain.To tackle this inefficiency, we present $\textbf{L}ossl\textbf{E}$ss $\textbf{MO}delExpansio\textbf{N}$ (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch.Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT.Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: self-supervised learning representation learning
Scores: [ 8 6 8 10 ]
Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel $\textbf{Cr}oss−\textbf{I}$mage Object-Level $\textbf{Bo}$otstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models will be publicly available upon acceptance.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: imbalanced learning subpopulation imbalance
Scores: [ 6 6 8 6 ]
Machine learning algorithms under skew distributions usually suffer from poor generalization, especially when the performance parity acts as an important criterion. This is more challenging on the class-balanced data that has some hidden imbalanced subpopulations, since prevalent techniques mainly conduct the class-level calibration and cannot perform the subpopulation-level adjustment without the explicit quantity. Regarding the implicit subpopulation imbalance, we reveal that the key to alleviating the detrimental effect lies in an effective subpopulation discovery with proper rebalancing. We then propose a novel subpopulation-imbalanced learning method, termed as Scatter and HarmonizE (SHE). Our method is built upon the guiding principle of optimal data partition, which involves assigning data to subpopulations in a manner that maximizes the predictive information from inputs to labels. With theoretical guarantees and empirical evidences, SHE succeeds in identifying the hidden subpopulations and encourages subpopulation-balanced predictions. Extensive experiments on various benchmark datasets show the effectiveness of SHE compared with a broad range of baselines.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Graph representation learning Heterogeneous graph Self-supervised learning
Scores: [ 6 8 8 6 6 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Contrastive Learning Mutual Information Contrastive Analysis Disentanglement
Scores: [ 8 5 8 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: representation vision transformer register SSL CLIP attention attention map interpretability
Scores: [ 8 8 8 8 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: in-context learning language modeling data distillation knowledge distillation
Scores: [ 6 8 5 6 ]
Large Language models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities, where a LLM makes predictions for a given test input together with a few input-output pairs (demonstrations).Nevertheless, the inclusion of demonstrations poses a challenge, leading to a quadratic increase in the computational overhead of the self-attention mechanism.Existing solutions attempt to condense lengthy demonstrations into compact vectors. However, they often require task-specific retraining or compromise LLM's in-context learning performance. To mitigate these challenges, we present Meta Demonstration Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit the knowledge distillation to enhance alignment between MEND and MEND, achieving both efficiency and effectiveness concurrently. MEND is endowed with the meta-knowledge of distilling demonstrations through a two-stage training process, which includes meta-distillation pretraining and fine-tuning.Comprehensive evaluations across seven diverse ICL settings using decoder-only (GPT-2) and encoder-decoder (T5) attest to MEND's prowess.It not only matches but often outperforms the Vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing the computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of large language models.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: overfitting natural overfitting robust overfitting catastrophic overfitting
Scores: [ 5 6 6 8 ]
Overfitting negatively impacts the generalization ability of deep neural networks (DNNs) in both natural and adversarial training. Existing methods struggle to consistently address different types of overfitting, typically designing strategies that focus separately on either natural or adversarial patterns. In this work, we adopt a unified perspective by solely focusing on natural patterns to explore different types of overfitting. Specifically, we examine the memorization effect in DNNs and reveal a shared behaviour termed over-memorization, which impairs their generalization capacity. This behaviour manifests as DNNs suddenly becoming high-confidence in predicting certain training patterns and retaining a persistent memory for them. Furthermore, when DNNs over-memorize an adversarial pattern, they tend to simultaneously exhibit high-confidence prediction for the corresponding natural pattern. These findings motivate us to holistically mitigate different types of overfitting by hindering the DNNs from over-memorization natural patterns. To this end, we propose a general framework, Distraction Over-Memorization (DOM), which explicitly prevents over-memorization by either removing or augmenting the high-confidence natural patterns. Extensive experiments demonstrate the effectiveness of our proposed method in mitigating overfitting across various training paradigms.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: vision-language model CLIP zero-shot classification human perception contextual attributes spurious feature
Scores: [ 8 5 5 6 ]
Vision-language models like CLIP are widely used in zero-shot image classification due to their ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better performance is still an open question. This paper draws inspiration from the human visual perception process: when classifying an object, humans first infer contextual attributes (e.g., background and orientation) which help separate the foreground object from the background, and then classify the object based on this information. Inspired by it, we observe that providing CLIP with contextual attributes improves zero-shot image classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioning on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: image harmonization medical imaging
Scores: [ 5 6 6 8 ]
Small sample sizes are common in many disciplines, which necessitates pooling roughly similar datasets across multiple sites/institutions to study weak but relevant associations between images and disease incidence. Such data often manifest shifts and imbalances in covariates (secondary non-imaging data). These issues are well-studied for classical models, but the ideas simply do not apply to overparameterized DNN models. Consequently, recent work has shown how strategies from fairness and invariant representation learning provides a meaningful starting point, but the current repertoire of methods remains limited to accounting for shifts/imbalances in just a couple of covariates at a time. In this paper, we show how viewing this problem from the perspective of Category theory provides a simple and effective solution that completely avoids elaborate multi-stage training pipelines that would otherwise be needed. We show the effectiveness of this approach via extensive experiments on real datasets. Further, we discuss how our style of formulation offers a unified perspective on at least 5+ distinct problem settings in vision, from self-supervised learningto matching problems in 3D reconstruction.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: order learning unsupervised clustering
Scores: [ 5 6 5 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: complex query answering knowledge graph link prediction
Scores: [ 8 8 6 6 ]
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: Preference Learning Reinforcement Learning from Human Feedback Social Choice Theory
Scores: [ 6 8 6 5 ]
In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Keywords: spiking neural network SNN network pruning stability neuromorphic leaky integrate and fire STDP sparsification task-agnostic pruning timescale optimization
Scores: [ 5 8 8 6 ]
Recurrent Spiking Neural Networks (RSNNs) have emerged as a computationally efficient and brain-inspired machine learning model. The design of sparse RSNNs with fewer neurons and synapses helps reduce the computational complexity of RSNNs. Traditionally, sparse SNNs are obtained by first training a dense and complex SNN for a target task and, next, eliminating neurons with low activity (activity-based pruning) while maintaining task performance. In contrast, this paper presents a task-agnostic methodology for designing sparse RSNNs by pruning an untrained (arbitrarily initialized) large model. We introduce a novel Lyapunov Noise Pruning (LNP) algorithm that uses graph sparsification methods and utilizes Lyapunov exponents to design a stable sparse RSNN from an untrained RSNN. We show that the LNP can leverage diversity in neuronal timescales to design a sparse Heterogeneous RSNN (HRSNN). Further, we show that the same sparse HRSNN model can be trained for different tasks, such as image classification and time-series prediction. The experimental results show that, in spite of being task-agnostic, LNP increases computational efficiency (fewer neurons and synapses) and prediction performance of RSNNs compared to traditional activity-based pruning of trained dense models.
Primary Area: visualization or interpretation of learned representations
Keywords: Generalization feature learning empirical phenomena
Scores: [ 8 6 8 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability Learned Representations Neurosymbolic AI
Scores: [ 5 6 5 6 ]
Language models (LMs) can recall facts mentioned in context, as shown by their performance on reading comprehension tasks. When the context describes facts about more than one entity, the LM has to correctly bind attributes to their corresponding entity. We show, via causal experiments, that LMs' internal activations represent binding information by exhibiting appropriate binding ID vectors at the entity and attribute positions. We further show that binding ID vectors form a subspace and often transfer across tasks. Our results demonstrate that LMs learn interpretable strategies for representing symbolic knowledge in context, and that studying context activations is a fruitful direction for understanding LM cognition.
Primary Area: visualization or interpretation of learned representations
Keywords: Shape Modeling Medical Shape Analysis Interpretable Representation AI4Science
Scores: [ 8 6 6 6 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Natural language processing interpretability language models
Scores: [ 8 8 6 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Neural Radiance Fields Neural Rendering
Scores: [ 5 6 6 5 ]
Estimating neural radiance fields (NeRFs) is able to generate novel views of a scene from known imagery. Recent approaches have afforded dramatic progress on small bounded regions of the scene. For an unbounded scene where cameras point in any direction and contents exist at any distance, certain mapping functions are used to represent it within a bounded space, yet they either work in object-centric scenes or focus on objects close to the camera. The goal of this paper is to understand how to design a proper mapping function that considers per-scene optimization, which remains unexplored. We first present a geometric understanding of existing mapping functions that express the relation between the bounded and unbounded scenes. Here, we exploit a stereographic projection method to explain failures of the mapping functions, where input ray samples are too sparse to account for scene geometry in unbounded regions. To overcome the failures, we propose a novel mapping function based on a p-norm distance, allowing to adaptively sample the rays by adjusting the p-value according to scene geometry, even in unbounded regions. To take the advantage of our mapping function, we also introduce a new ray parameterization to properly allocate ray samples in the geometry of unbounded regions. Through the incorporation of both the novel mapping function and the ray parameterization within existing NeRF frameworks, our method achieves state-of-the-art novel view synthesis results on a variety of challenging datasets.
Primary Area: visualization or interpretation of learned representations
Keywords: Data Attribution Diffusion Models
Scores: [ 6 6 6 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Graph Neural Networks Graph Explanation XAI
Scores: [ 8 8 5 3 ]
Graph Neural Networks (GNNs) are neural models that leverage the dependency structure in graphical data via message passing among the graph nodes. GNNs have emerged as pivotal architectures in analyzing graph-structured data, and their expansive application in sensitive domains requires a comprehensive understanding of their decision-making processes --- necessitating a framework for GNN explainability. An explanation function for GNNs takes a pre-trained GNN along with a graph as input, to produce a `sufficient statistic' subgraph with respect to the graph label. A main challenge in studying GNN explainability is to provide fidelity measures that evaluate the performance of these explanation functions. This paper studies this foundational challenge, spotlighting the inherent limitations of prevailing fidelity metrics, including Fid+, Fid−, and FidΔ. Specifically, a formal, information-theoretic definition of explainability is introduced and it is shown that existing metrics often fail to align with this definition across various statistical scenarios. The reason is due to potential distribution shifts when subgraphs are removed in computing these fidelity measures. Subsequently, a robust class of fidelity measures are introduced, and it is shown analytically that they are resilient to distribution shift issues and are applicable in a wide range of scenarios. Extensive empirical analysis on both synthetic and real datasets are provided to illustrate that the proposed metrics are more coherent with gold standard metrics.
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability ML Explainability ML Concept Bottleneck Model
Scores: [ 6 6 5 8 ]
The demand for transparency in healthcare and finance has led to interpretable machine learning (IML) models, notably the concept bottleneck models (CBMs), valued for its potential in performance and insights into deep neural networks. However, CBM's reliance on manually annotated data poses challenges. Label-free CBM has emerged to address this, but they remain unstable, affecting their faithfulness as explanatory tools. To address this inherent instability issue, we introduce a formal definition for an alternative concept called the Faithful Vision-Language Concept (FVLC) models. We present a methodology for constructing an FVLC that satisfies four critical properties. Our extensive experimentation, conducted on four benchmark datasets using Label-free CBM model architectures, demonstrates that our FVLC outperforms other baselines in terms of stability against input and concept set perturbations. Our approach incurs minimal accuracy degradation compared to the vanilla CBM, making it a promising solution for reliable and faithful model interpretation.
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability Transformers
Scores: [ 8 3 8 3 ]
Primary Area: visualization or interpretation of learned representations
Keywords: time series explainability perturbation
Scores: [ 6 5 6 6 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Transformer Attention map Feed-forward Contextualization Interpretation Analysis Pre-trained models Masked language models Causal language models
Scores: [ 6 6 8 8 ]
Given that Transformers are ubiquitous in wide tasks, interpreting their internals is a pivotal issue. Still, their particular components, feed-forward (FF) blocks, have typically been less analyzed despite their substantial parameter amounts.We analyze the input contextualization effects of FF blocks by rendering them in the attention maps as a human-friendly visualization scheme.Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions. In addition, FF and its surrounding components tend to cancel out each other's effects, suggesting potential redundancy in the processing of the Transformer layer.
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability Large Language Models Natural Language Processing Science of Deep Learning Representation Learning
Scores: [ 8 3 8 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: representational geometry kernel target alignment disentanglement activation function out-of-distribution generalization
Scores: [ 5 6 8 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: fine-tuning transformer-based language models feature analysis interpretation clinical classification
Scores: [ 6 8 6 ]
Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes. Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range of analytic and visualization techniques, including Supervised probing, Unsupervised similarity analysis, Feature dynamics, and Outlier analysis to address key questions about model trust and interpretability (e.g. model suitability for a task, feature space evolution during fine-tuning, and interpretation of fine-tuned features and failure modes). We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. These findings showcase the utility of SUFO in enhancing trust and safety when using transformers in medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned language models (LMs) for other applications in medicine and in more critical domains.
Primary Area: visualization or interpretation of learned representations
Keywords: mixup neural collapse unconstrained features model
Scores: [ 6 6 6 5 ]
Mixup is a data augmentation strategy that employs convex combinations of training instances and their respective labels to augment the robustness and calibration of deep neural networks. Despite its widespread adoption, the nuanced mechanisms that underpin its success are not entirely understood. The observed phenomenon of Neural Collapse, where the last-layer activations and classifier of deep networks converge to a simplex equiangular tight frame (ETF), provides a compelling motivation to explore whether mixup induces alternative geometric configurations and whether those could explain its success. In this study, we delve into the last-layer activations of training data for deep networks subjected to mixup, aiming to uncover insights into its operational efficacy. Our investigation, spanning various architectures and dataset pairs, reveals that mixup's last-layer activations predominantly converge to a distinctive configuration. In this configuration, activations from mixed-up examples of identical classes align with the classifier, while those from different classes delineate channels along the decision boundary. To validate our empirical observations, we further conduct a theoretical analysis under the assumption of an unconstrained features model, utilizing the mixup loss. Through this, we characterize and derive the optimal last-layer features, culminating in a configuration consistent with our experimental findings, thereby shedding light on the intricate workings of mixup in the training of deep networks.
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretable AI Submodular subset selection Explainable AI
Scores: [ 8 8 6 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability world models probing
Scores: [ 8 5 6 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Spiking Neural Network Event Camera Data Augmentation
Scores: [ 6 5 8 ]
Event camera, a novel bio-inspired vision sensor, has drawn a lot of attention for its low latency, low power consumption, and high dynamic range. Currently, overfitting remains a critical problem in event-based classification tasks for SNN due to its relatively weak spatial representation capability. Data augmentation is a simple but efficient method to alleviate overfitting and improve the generalization ability of neural networks, and saliency-based augmentation methods are proven to be effective in the image processing field. However, there is no approach available for extracting saliency maps from SNNs. Therefore, for the first time, we present Spiking Layer-Time-wise Relevance Propagation rule (\texttt{SLTRP}) and Spiking Layer-wise Relevance Propagation rule (\texttt{SLRP}) in order for SNN to generate stable and accurate CAM and saliency maps. Based on this, we propose \texttt{EventRPG}, which leverages relevance propagation on the spiking neural network for more efficient augmentation. Our proposed method has been evaluated on several SNN structures, achieving state-of-the-art performance in object recognition tasks including N-Caltech101, CIFAR10-DVS, with accuracies of 85.62% and 85.55%, as well as action recognition task SL-Animals with an accuracy of 91.59%. Codes will be available soon.
Primary Area: visualization or interpretation of learned representations
Keywords: explainable AI integrated gradient attribution path method
Scores: [ 8 8 6 ]
Rigorousness and clarity are both essential for interpretations of DNNs to engender human trust. Path methods are commonly employed to generate rigorous attributions that satisfy three axioms. However, the meaning of attributions remains ambiguous due to distinct path choices. To address the ambiguity, we introduce Concentration Principle, which centrally allocates high attributions to indispensable features, thereby endowing aesthetic and sparsity. We then present SAMP, a model-agnostic interpreter, which efficiently searches the near-optimal path from a pre-defined set of manipulation paths. Moreover, we propose the infinitesimal constraint (IC) and momentum strategy (MS) to improve the rigorousness and optimality. Visualizations show that SAMP can precisely reveal DNNs by pinpointing salient image pixels.We also perform quantitative experiments and observe that our method significantly outperforms the counterparts.
Primary Area: visualization or interpretation of learned representations
Keywords: interpretability factual errors hallucinations attention large language models
Scores: [ 6 6 6 ]
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as constraint satisfaction problems and use this framework to investigate how the LLM interacts internally with factual constraints. We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations. We curate a suite of 11 datasets containing over 40,000 prompts to study the task of predicting factual errors with the Llama-2 family across all scales (7B, 13B, 70B). We propose SAT Probe, a method probing attention patterns, that can predict factual errors and fine-grained constraint satisfaction, and allow early error identification. The approach and findings take another step towards using the mechanistic understanding of LLMs to enhance their reliability.
Primary Area: visualization or interpretation of learned representations
Keywords: image morphing diffusion models clip inversion
Scores: [ 6 6 6 6 ]
We present a diffusion-based image morphing approach with perceptually-uniform sampling (IMPUS) that produces smooth, direct, and realistic interpolations given an image pair. A latent diffusion model has distinct conditional distributions and data embeddings for each of the two images, especially when they are from different classes. To bridge this gap, we interpolate in the locally linear and continuous text embedding space and Gaussian latent space. We first optimize the endpoint text embeddings and then map the images to the latent space using a probability flow ODE. Unlike existing work that takes an indirect morphing path, we show that the model adaptation yields a direct path and suppresses ghosting artifacts in the interpolated images. To achieve this, we propose an adaptive bottleneck constraint based on a novel relative perceptual path diversity score that automatically controls the bottleneck size and balances the diversity along the path with its directness. We also propose a perceptually-uniform sampling technique that enables visually smooth changes between the interpolated images. Extensive experiments validate that our IMPUS can achieve smooth, direct, and realistic image morphing and be applied to other image generation tasks.
Primary Area: visualization or interpretation of learned representations
Keywords: Instruction Tuning Robustness Large Language Models
Scores: [ 8 6 8 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: interpretability llms mechanistic interpretability circuit
Scores: [ 8 6 6 6 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Feature aggregation Cost aggregation Transformer Dense matching
Scores: [ 5 5 6 6 ]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks. In the context of dense matching, many works benefit from one of two forms of aggregation: feature aggregation, which pertains to the alignment of similar features, or cost aggregation, a procedure aimed at instilling coherence in the flow estimates across neighboring pixels. In this work, we first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes. We then introduce a simple yet effective architecture that harnesses self- and cross-attention mechanisms to show that our approach unifies feature aggregation and cost aggregation and effectively harnesses the strengths of both techniques. Within the proposed attention layers, the features and cost volume both complement each other, and the attention layers are interleaved through a coarse-to-fine design to further promote accurate correspondence estimation. Finally at inference, our network produces multi-scale predictions, computes their confidence scores, and selects the most confident flow for final prediction. Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
Primary Area: visualization or interpretation of learned representations
Keywords: XAI XAI Evaluation Backdoor Watermark Backdoor Attack Backdoor for Positive Purposes
Scores: [ 8 8 6 5 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Explainability Interpretability Transformer Fine-grained recognition Attribute discovery
Scores: [ 6 6 6 6 ]
We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully-connected layer to incorporate class information to make predictions, we investigate a proactive approach, asking each class to search for itself in an image. We realize this idea via a Transformer encoder-decoder inspired by DEtection TRansformer (DETR). We learn "class-specific'' queries (one for each class) as input to the decoder, enabling each class to localize its patterns in an image via cross-attention. We name our approach INterpretable TRansformer (INTR), which is fairly easy to implement and exhibits several compelling properties. We show that INTR intrinsically encourages each class to attend distinctively; the cross-attention weights thus provide a faithful interpretation of the prediction. Interestingly, via "multi-head'' cross-attention, INTR could identify different "attributes'' of a class, making it particularly suitable for fine-grained classification and analysis, which we demonstrate on eight datasets.
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability Explainability Text-to-image models
Scores: [ 6 6 6 6 ]
Text-to-image diffusion models have demonstrated an unparalleled ability to generate high-quality, diverse images from a textual prompt. However, the internal representations learned by these models remain an enigma. In this work, we present Conceptor, a novel method to interpret the internal representation of a textual concept by a diffusion model. This interpretation is obtained by decomposing the concept into a small set of human-interpretable textual elements. Applied over the state-of-the-art Stable Diffusion model, Conceptor reveals non-trivial structures in the representations of concepts. For example, we find surprising visual connections between concepts, that transcend their textual semantics. We additionally discover concepts that rely on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous fusion of multiple meanings of the concept.Through a large battery of experiments, we demonstrate Conceptor's ability to provide meaningful, robust, and faithful decompositions for a wide variety of abstract, concrete, and complex textual concepts, while allowing to naturally connect each decomposition element to its corresponding visual impact on the generated images.
Primary Area: visualization or interpretation of learned representations
Keywords: Latent Diffusion Models Diffusion Models Generative Models Unsupervised Semantic Segmentation
Scores: [ 6 6 6 6 ]
Diffusion models have recently received increasing research attention for their impressive transfer abilities to semantic segmentation tasks. However, previous works rely on additional supervision to produce fine-grained segmentation maps, leaving it unclear how much diffusion models alone understand the semantic relations of their generated images. To help answer this question, we exploit the semantic knowledge extracted from Stable Diffusion (SD) and build an image segmentor that can generate fine-grained segmentation maps without any additional training. The major issue that makes this task challenging for previous works is that semantically meaningful feature maps usually exist only in the spatially lower-dimensional layers, which makes it infeasible to extract pixel-level semantic relations directly from the feature maps. To overcome this challenge, our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps by analyzing SD’s generation process and utilizes them to construct image-resolution segmentation maps. In extensive experiments, the produced segmentation maps are shown to be well delineated and capture detailed parts of the images, indicating the existence of highly accurate pixel-level semantic knowledge in the diffusion models
Primary Area: visualization or interpretation of learned representations
Keywords: explainable AI interpretability feature attribution machine translation document-level machine translation natural language generation
Scores: [ 6 3 8 5 ]
Establishing whether language models can use contextual information in a human-plausible way is important to ensure their safe adoption in real-world settings. However, the questions of when and which parts of the context affect model generations are typically tackled separately, and current plausibility evaluations are practically limited to a handful of artificial benchmarks. To address this, we introduce $\textbf{P}$lausibility $\textbf{E}$valuation of $\textbf{Co}$ntext $\textbf{Re}$liance (PECoRe), an end-to-end interpretability framework designed to quantify context usage in language models' generations. Our approach leverages model internals to (i) contrastively identify context-sensitive target tokens in generated texts and (ii) link them to contextual cues justifying their prediction. We use PECoRe to quantify the plausibility of context-aware machine translation models, comparing model rationales with human annotations across several discourse-level phenomena. Finally, we apply our method to unannotated generations to identify context-mediated predictions and highlight instances of (im)plausible context usage in model translations.
Primary Area: visualization or interpretation of learned representations
Keywords: Diffusion Models Information Theory Interpretable Machine Learning
Scores: [ 6 6 6 6 6 6 ]
Denoising diffusion models enable conditional generation and density modeling of complex relationships like images and text. However, the nature of the learned relationships is opaque making it difficult to understand precisely what relationships between words and parts of an image are captured, or to predict the effect of an intervention. We illuminate the fine-grained relationships learned by diffusion models by noticing a precise relationship between diffusion and information decomposition. Exact expressions for mutual information and conditional mutual information can be written in terms of the denoising model. Furthermore, pointwise estimates can be easily estimated as well, allowing us to ask questions about the relationships between specific images and captions. Decomposing information even further to understand which variables in a high-dimensional space carry information is a long-standing problem. For diffusion models, we show that a natural non-negative decomposition of mutual information emerges, allowing us to quantify informative relationships between words and pixels in an image. We exploit these new relations to measure the compositional understanding of diffusion models, to do unsupervised localization of objects in images, and to measure effects when selectively editing images through prompt interventions.
Primary Area: visualization or interpretation of learned representations
Keywords: definitions of meaning semantic similarity sentence embeddings
Scores: [ 8 6 8 6 ]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text. This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model. Moreover, unlike vector-based representations, distribution-based representations can also model asymmetric relations (e.g., direction of logical entailment, hypernym/hyponym relations) by using algebraic operations between likelihood functions. These ideas are grounded in distributional perspectives on semantics and are connected to standard constructions in automata theory, but to our knowledge they have not been applied to modern language models. We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle. Finally, we extend our method to represent data from different modalities (e.g., image and text) using multimodal autoregressive models.
Primary Area: visualization or interpretation of learned representations
Keywords: NLP LLMs Interpretability Explanation Causal Inference Matching
Scores: [ 6 6 6 6 ]
Causal explanations of the predictions of NLP systems are essential to ensure safety and establish trust. Yet, existing methods often fall short of explaining model predictions effectively or efficiently and are often model-specific. In this paper, we address model-agnostic explanations, proposing two approaches for counterfactual (CF) approximation. The first approach is CF generation, where a large language model (LLM) is prompted to change a specific text concept while keeping confounding concepts unchanged. While this approach is demonstrated to be very effective, applying LLM at inference-time is costly. We hence present a second approach based on matching, and propose a method that is guided by an LLM at training-time and learns a dedicated embedding space. This space is faithful to a given causal graph and effectively serves to identify matches that approximate CFs. After showing theoretically that approximating CFs is required in order to construct faithful explanations, we benchmark our approaches and explain several models, including LLMs with billions of parameters. Our empirical results demonstrate the excellent performance of CF generation models as model-agnostic explainers. Moreover, our matching approach, which requires far less test-time resources, also provides effective explanations, surpassing many baselines. We also find that Top-K techniques universally improve every tested method. Finally, we showcase the potential of LLMs in constructing new benchmarks for model explanation and subsequently validate our conclusions. Our work illuminates new pathways for efficient and accurate approaches to interpreting NLP systems.
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability Grokking Label noise Generalization Memorization Representations Modular Arithmetic
Scores: [ 5 8 5 ]
Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider two-layer neural networks trained on modular arithmetic tasks where (xicdot100) of labels are corrupted (i.e. some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels and achieve 100 generalization at the same time; (ii) the memorizing neurons can be identified and pruned, lowering the accuracy on corrupted data and improving the accuracy on uncorrupted data; (iii) regularization methods such as weight decay, dropout and BatchNorm force the network to ignore the corrupted data during optimization, and achieve 100 accuracy on the uncorrupted dataset; and (iv) the effect of these regularization methods is ("mechanistically") interpretable: weight decay and dropout force all the neurons to learn generalizing representations, while BatchNorm de-amplifies the output of memorizing neurons and amplifies the output of the generalizing ones. Finally, we show that in the presence of regularization, the training dynamics involves two consecutive stages: first, the network undergoes the grokking dynamics reaching high train and test accuracy; second, it unlearns the memorizing representations, where train accuracy suddenly jumps from 100 to 100(1−xi).
Primary Area: visualization or interpretation of learned representations
Keywords: explainable ai
Scores: [ 6 6 6 6 ]
Primary Area: visualization or interpretation of learned representations
Keywords: shortcut learning spurious correlations architectural inductive bias
Scores: [ 6 8 8 6 6 ]
Deep-learning models can extract a rich assortment of features from data. Which features a model uses depends not only on predictivity---how reliably a feature indicates train-set labels---but also on availability---how easily the feature can be extracted, or leveraged, from inputs. The literature on shortcut learning has noted examples in which models privilege one feature over another, for example texture over shape and image backgrounds over foreground objects. Here, we test hypotheses about which input properties are more available to a model, and systematically study how predictivity and availability interact to shape models' feature use. We construct a minimal, explicit generative framework for synthesizing classification datasets with two latent features that vary in predictivity and in factors we hypothesize to relate to availability, and quantify a model's shortcut bias---its over-reliance on the shortcut (more available, less predictive) feature at the expense of the core (less available, more predictive) feature. We find that linear models are relatively unbiased, but introducing a single hidden layer with ReLU or Tanh units yields a bias. Our empirical findings are consistent with a theoretical account based on Neural Tangent Kernels. Finally, we study how models used in practice trade off predictivity and availability in naturalistic datasets, discovering availability manipulations which increase models' degree of shortcut bias. Taken together, these findings suggest that the propensity to learn shortcut features is a fundamental characteristic of deep nonlinear architectures warranting systematic study given its role in shaping how models solve tasks.
Primary Area: visualization or interpretation of learned representations
Keywords: Mechanistic Interpretability AI Safety Interpretability Science of ML few-shot learning Large Language Models
Scores: [ 6 8 8 ]
Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model’s internal representations, and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some “critical layer”, after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
Primary Area: visualization or interpretation of learned representations
Keywords: Out-of-distribution Generalization Neuron Activation
Scores: [ 5 8 6 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: interpretability failure modes bias
Scores: [ 6 8 6 5 ]
In this work, we study the challenge of providing human-understandable descriptions for failure modes in trained image classification models.Existing works address this problem by first identifying clusters (or directions) of incorrectly classified samples in a latent space and then aiming to provide human-understandable text descriptions for them.We observe that in some cases, describing text does not match wellwith identified failure modes, partially owing to the fact that shared interpretable attributes of failure modes may not be captured using clustering in the feature space.To improve on these shortcomings, we propose a novel approach that prioritizes interpretability in this problem: we start by obtaining human-understandable concepts (tags) of images in the dataset andthen analyze the model's behavior based on the presence or absence of combinations of these tags.Our method also ensures that the tags describing a failure mode form a minimal set,avoiding redundant and noisy descriptions.Through several experiments on different datasets, we show that our method successfully identifies failure modes and generates high-quality text descriptions associated with them.These results highlight the importance of prioritizing interpretability in understanding model failures.
Primary Area: visualization or interpretation of learned representations
Keywords: Visualization; Representation learning; Dimensionality reduction; Divergence
Scores: [ 3 5 8 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Transferability Generalization Power
Scores: [ 8 6 5 ]
Primary Area: visualization or interpretation of learned representations
Keywords: multi-modal perception
Scores: [ 6 6 6 6 ]
Modeling and analyzing object and shape has been well studied in the past. However, manipulation of these complex tools and articulated objects remains difficult for autonomous agents. Our human hands, however, are dexterous and adaptive. We can easily adapt a manipulation skill on one object to all objects in the class and to other similar classes. Our intuition comes from that there is a close connection between manipulations and topology and articulation of objects. The possible articulation of objects indicates the types of manipulation necessary to operate the object. In this work, we aim to take a manipulation perspective to understand everyday objects and tools. We collect a multi-modal visual-tactile dataset that contains paired full-hand force pressure maps and manipulation videos. We also propose a novel method to learn a cross-modal latent manifold that allow for cross-modal prediction and discovery of latent structure in different data modalities. We conduct extensive experiments to demonstrate the effectiveness of our method.
Primary Area: visualization or interpretation of learned representations
Keywords: interpretability BERT syntax phase changes simplicity bias training dynamics
Scores: [ 5 8 8 10 ]
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability generative models
Scores: [ 6 6 6 6 ]
We introduce a generative model with an intrinsically interpretable layer---a concept bottleneck layer---that constrains the model to encode human-understandable concepts. The concept bottleneck layer partitions the generative model into three parts: the pre-concept bottleneck portion, the CB layer, and the post-concept bottleneck portion. To train CB generative models, we complement the traditional task-based loss function for training generative models with a concept loss and an orthogonality loss. The CB layer and these loss terms are model agnostic, which we demonstrate by applying the CB layer to three different families of generative models: generative adversarial networks, variational autoencoders, and diffusion models. On multiple datasets across different types of generative models, steering a generative model, with the CB layer, outperforms all baselines---in some cases, it is \textit{10 times} more effective. In addition, we show how the CB layer can be used to interpret the output of the generative model and debug the model during or post training.
Primary Area: visualization or interpretation of learned representations
Keywords: Graph Neural Networks GNN Explainability Decision Trees
Scores: [ 6 8 6 8 6 ]
We propose a new self-explainable Graph Neural Network (GNN) model: GraphChef. GraphChef integrates decision trees into the GNN message passing framework. Given a dataset, GraphChef returns a set of rules (a recipe) that explains each class in the dataset unlike existing GNNs and explanation methods that reason on individual graphs. Thanks to the decision trees, GraphChef recipes are human understandable. We also present a new pruning method to produce small and easy to digest trees. Experiments demonstrate that GraphChef reaches comparable accuracy to not self-explainable GNNs and produced decision trees are indeed small. We further validate the correctness of the discovered recipes on datasets where explanation ground truth is available: Reddit-Binary, MUTAG, BA-2Motifs, BA-Shapes, Tree-Cycle, and Tree-Grid.
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretability Concepts Energy-Based Model Probabilistic Methods
Scores: [ 6 8 6 6 6 ]
Primary Area: visualization or interpretation of learned representations
Keywords: language model interpretability interpretability mechanistic interpretability circuit analysis activation patching large language models
Scores: [ 6 8 6 ]
Mechanistic interpretability seeks to understand the internal mechanisms ofmachine learning models, where localization—identifying the important modelcomponents—is a key step. Activation patching, also known as causal tracing orinterchange intervention, is a standard technique for this task (Vig et al., 2020), butthe literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impactof methodological details in activation patching, including evaluation metrics andcorruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparateinterpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we providerecommendations for the best practices of activation patching going forwards.
Primary Area: visualization or interpretation of learned representations
Keywords: attribution transferability adversarial attack
Scores: [ 6 6 6 5 ]
Deep Neural Networks (DNNs) have achieved state-of-the-art performance in various application scenarios. However, due to the real-world noise and human-added perturbations, the trustworthiness of DNNs has been a critical concern from the security perspective. Therefore, it is imperative to provide explainability for the decisions made by the non-linear and complex parameterized models. Given the diverse decision boundaries across various models and specific tasks, attribution methods are promising for this goal, yet its performance can be further improved. In this paper, for the first time, we present that the decision boundary exploration approaches of attribution are consistent with the process for transferable adversarial attacks. Utilizing this consistency, we introduce a novel attribution method via model parameter exploration. Furthermore, inspired by the capability of frequency exploration to investigate the model parameters, we provide enhanced explainability for DNN models by manipulating the input features based on frequency information to explore the decision boundaries of different models. The large-scale experiments demonstrate that our \textbf{A}ttribution method for \textbf{E}xplanation with model parameter e\textbf{X}ploration (AttEXplore) outperforms other state-of-the-art interpretability methods. Moreover, by employing other transferable attack techniques, AttEXplore can explore potential variations in attribution outcomes. Our code is available at: https://anonymous.4open.science/r/AMPE-6C32/.
Primary Area: visualization or interpretation of learned representations
Keywords: Kernel k-means Price of explainability
Scores: [ 8 6 8 6 ]
Despite the growing popularity of explainable and interpretable machine learning, there is still surprisingly limited work on inherently interpretable clustering methods. Recently, there has been a surge of interest in explaining the classic k-means algorithm, leading to efficient algorithms that approximate k-means clusters using axis-aligned decision trees. However, interpretable variants of k-means have limited applicability in practice, where more flexible clustering methods are often needed to obtain useful partitions of the data. In this work, we investigate interpretable kernel clustering, and propose algorithms that construct decision trees to approximate the partitions induced by kernel k-means, a nonlinear extension of k-means. We further build on previous work on explainable k-means and demonstrate how a suitable choice of features allows preserving interpretability without sacrificing approximation guarantees on the interpretable model.
Primary Area: visualization or interpretation of learned representations
Keywords: Mechanistic Interpretability Natural Language Processing Large Language Models
Scores: [ 8 3 8 ]
Mechanistic interpretability aims to attribute high-level model behaviors to specific, interpretable learned features. It is hypothesized that these features manifest as directions or low-dimensional subspaces within activation space. Accordingly, recent studies have explored the identification and manipulation of such subspaces to reverse-engineer computations, employing methods such as activation patching. In this work, we demonstrate that naïve approaches to subspace interventions can give rise to interpretability illusions.Specifically, even if patching along a subspace has the intended end-to-end causal effect on model behavior, this effect may be achieved by activating \emph{a dormant parallel pathway} using a component that is \textit{causally disconnected} from the model output.We demonstrate this in a mathematical example, realize the example empirically in two different settings (the Indirect Object Identification (IOI) task and factual recall), and argue that activating dormant pathways ought to be prevalent in practice.In the context of factual recall, we further show that the illusion is related to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localisation.However, this does not imply that activation patching of subspaces is intrinsically unfit for interpretability.To contextualize our findings, we also show what a success case looks like in a task (IOI) where prior manual circuit analysis allows an understanding of the location of the ground truth feature. We explore the additional evidence needed to argue that a patched subspace is faithful.
Primary Area: visualization or interpretation of learned representations
Keywords: Feature interaction variable importance Rashomon set model interpretability
Scores: [ 6 6 6 ]
Interactions among features are central to understanding the behavior of machine learning models. Recent research has made significant strides in detecting and quantifying feature interactions in single predictive models. However, we argue that the feature interactions extracted from a single pre-specified model may not be trustworthy since: a well-trained predictive model may not preserve the true feature interactions and there exist multiple well-performing predictive models that differ in feature interaction strengths. Thus, we recommend exploring feature interaction strengths in a model class of approximately equally accurate predictive models. In this work, we introduce the feature interaction score (FIS) in the context of a Rashomon set, representing a collection of models that achieve similar accuracy on a given task. We propose a general and practical algorithm to calculate the FIS in the model class. We demonstrate the properties of the FIS via synthetic data and draw connections to other areas of statistics. Additionally, we introduce a Halo plot for visualizing the feature interaction variance in high-dimensional space and a swarm plot for analyzing FIS in a Rashomon set. Experiments with recidivism prediction and image classification illustrate how feature interactions can vary dramatically in importance for similarly accurate predictive models. Our results suggest that the proposed FIS can provide valuable insights into the nature of feature interactions in machine learning models.
Primary Area: visualization or interpretation of learned representations
Keywords: Image-to-point cloud registration cross-modality feature extraction diffusion models
Scores: [ 6 8 6 ]
Primary Area: visualization or interpretation of learned representations
Keywords: In-Context Learning Interpretability
Scores: [ 6 6 6 6 ]
We report the presence of a simple mechanism that represents an input-output function as a vector within autoregressive transformer language models. Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). We test the causal effects of FVs in a variety of input contexts and find that for many tasks FVs are robust to changes in context, i.e., they trigger execution of the task on inputs such as zero-shot and natural text settings that do not resemble ICL. By measuring the causal effects of the FV at each layer of the network, we find that FVs do not directly perform a task through embedding arithmetic, but rather they trigger the model to perform the task using potentially nonlinear computations. Finally, we investigate the internal structure of FVs and find while that they contain informationthat directly encodes the output space of the function, this information alone is not sufficient to reconstruct an FV. Taken together, our findings suggest that LLMs contain internal abstractions of general-purpose functions that can be invoked in a variety of contexts.
Primary Area: visualization or interpretation of learned representations
Keywords: Fine-Tuning Interpretability Mechanisms
Scores: [ 6 6 8 ]
Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just inhibit existing ones? An answer to this question would improve our ability to trust fine-tuning protocols meant to improve the safety of pre-trained models and delete unsafe capabilities. We aim to make progress on this question by answering it in controlled settings where we can use mechanistic interpretability tools (e.g.~ network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an exhaustive analysis of the effects of fine-tuning in these settings, and show: (i) the ubiquitous protocol of fine-tuning with a small learning rate rarely alters the underlying model capabilities; (ii) often a minimal transformation, which we call a wrapper, is learned on top of the underlying model capability, yielding the impression that a new capability has been learned or a prior capability has been deleted; and (iii) continuing the fine-tuning process on a task where the pretraining capabilities are relevant leads to sample-efficient revival'' of the capability, i.e., the model starts to accurately reuse that capability in just a few gradient steps. \textit{This potentially indicates a practitioner could unintentionally render a safe model to be unsafe by merely fine-tuning on a downstream task.} We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a realistic setting.
Primary Area: visualization or interpretation of learned representations
Keywords: Entity Tracking Large Language Model
Scores: [ 5 6 6 ]
Fine-tuning on generalized tasks such as instruction following, code generation, and mathematics has been shown to enhance language models' performance on a range of tasks. Nevertheless, explanations of how such fine-tuning influences the internal computations in these models remain elusive. We study how fine-tuning affects the internal mechanisms implemented in language models. As a case study, we explore the property of entity tracking, a crucial facet of language comprehension, where models fine-tuned on mathematics have substantial performance gains. We identify a mechanism that enables entity tracking and show that (i) both the original model and its fine-tuned version implement entity tracking with the same circuit. In fact, the entity tracking circuit of the fine-tuned version performs better than the full original model. (ii) The circuits of all the models implement roughly the same functionality, that is entity tracking is performed by tracking the position of the correct entity in both the original model and its fine-tuned version. (iii) Performance boost in the fine-tuned model is primarily attributed to its improved ability to handle positional information. To uncover these findings, we employ two methods: DCM, which automatically detects model components responsible for specific semantics, and CMAP, a new approach for patching activations across models to reveal improved mechanisms. Our findings suggest that fine-tuning enhances, rather than fundamentally alters, the mechanistic operation of the model.
Primary Area: visualization or interpretation of learned representations
Keywords: In-Context Learning Large Language Models Interpretability Computational Cognitive Science
Scores: [ 6 6 6 6 ]
Large language models (LLMs) trained on huge corpora of text datasets demonstrate complex, emergent capabilities, achieving state-of-the-art performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often unclear, and different prompts can elicit different capabilities through in-context learning. We propose a Cognitive Interpretability framework that enables us to analyze in-context learning dynamics to understand latent concepts in LLMs underlying behavioral patterns. This provides a more nuanced understanding than success-or-failure evaluation benchmarks, but does not require observing internal activations as a mechanistic interpretation of circuits would require. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study dynamics of in-context learning by manipulating properties of context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate pseudo-random numbers and learn basic formal languages, with striking in-context learning dynamics where model outputs transition sharply from pseudo-random behaviors to deterministic repetition.
Primary Area: visualization or interpretation of learned representations
Keywords: Interpretation; Transformer
Scores: [ 5 5 8 3 ]
Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks. However, due to their black-box nature, understanding the underlying rules behind these models’ predictions and controlling model behaviors have remained open challenges. We present a framework for interpreting vision transformer’s latent tokens with natural language. Given a latent token, our framework retains its semantic information to the final layer using transformer’s local operations and retrieves the closest text for explanation. Our approach enables understanding of model visual reasoning procedure without needing additional model training or data collection. Based on the obtained interpretations, our framework allows for model editing that controls model reasoning behaviors and improves model robustness against biases and spurious correlations.
Primary Area: visualization or interpretation of learned representations
Keywords: CLIP interpretability explainability
Scores: [ 8 8 8 8 ]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g.~location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that scalable understanding of transformer models is attainable and can be used to repair and improve models.
Primary Area: visualization or interpretation of learned representations
Keywords: Category Theory; Neural Networks Feature Complexity
Scores: [ 8 3 6 6 ]
The behavior of neural networks still remains opaque, and a recently widely noted phenomenon is that networks often achieve similar performance when initialized with different random parameters. This phenomenon has attracted significant attention in measuring the similarity between features learned by distinct networks. However, feature similarity could be vague in describing the same feature since equivalent features hardly exist. In this paper, we expand the concept of equivalent feature and provide the definition of what we call functionally equivalent features. These features produce equivalent output under certain transformations. Using this definition, we aim to derive a more intrinsic metric for the so-called feature complexity regarding the redundancy of features learned by a neural network at each layer. We offer a formal interpretation of our approach through the lens of category theory, a well-developed area in mathematics. To quantify the feature complexity, we further propose an efficient algorithm named Iterative Feature Merging. Our experimental results validate our ideas and theories from various perspectives. We empirically demonstrate that the functionally equivalence widely exists among different features learned by the same neural network and we could reduce the number of parameters of the network without affecting the performance. We have also drawn several interesting empirical findings, including: 1) the larger the network, the more redundant features it learns; 2) in particular, we show how to prune the networks based on our finding using direct equivalent feature merging, without fine-tuning which is often needed in peer network pruning methods; 3) same structured networks with higher feature complexity achieve better performance; 4) through the layers of a neural network, the feature complexity first increase then decrease; 5) for the image classification task, a group of functionally equivalent features may correspond to a specific semantic meaning. Source code will be made publicly available.
Primary Area: visualization or interpretation of learned representations
Keywords: Depthwise Convolutions Explainability Neuroscience Computer Vision ConvNext
Scores: [ 8 8 3 6 ]
Recent advances in depthwise-separable convolutional neural networks (DS-CNNs) have led to novel architectures, that surpass the performance of classical CNNs, by a considerable scalability and accuracy margin. This paper reveals another striking property of DS-CNN architectures: discernible and explainable patterns emerge in their trained depthwise convolutional kernels in all layers. Through an extensive analysis of millions of trained filters, with different sizes and from various models, we employed unsupervised clustering with autoencoders, to categorize these filters. Astonishingly, the patterns converged into a few main clusters, each resembling the difference of Gaussian (DoG) functions, and their first and second-order derivatives. Notably, we classify over 95% and 90% of the filters from state-of-the-art ConvNeXtV2 and ConvNeXt models, respectively. This finding is not merely a technological curiosity; it echoes the foundational models neuroscientists have long proposed for the vision systems of mammals. Our results thus deepen our understanding of the emergent properties of trained DS-CNNs and provide a bridge between artificial and biological visual processing systems. More broadly, they pave the way for more interpretable and biologically-inspired neural network designs in the future.
Primary Area: visualization or interpretation of learned representations
Keywords: Explainable AI Neural networks Symbolism
Scores: [ 6 8 6 8 ]
Primary Area: visualization or interpretation of learned representations
Keywords: XAI Unsupervised node representation learning Counterfactual Explanations
Scores: [ 8 6 6 6 ]