ICLR2024

POSTER-1: On the generalization capacity of neural networks during generic multimodal reasoning

Primary Area: applications to neuroscience & cognitive science

Keywords: compositional generalization compositionality representation learning out-of-distribution generalization multimodal reasoning

Scores: [ 6 3 8 ]

The advent of the Transformer has led to the development of large language models (LLM), which appear to demonstrate human-like capabilities. To assess the generality of this class of models and a variety of other base neural network architectures to multimodal domains, we evaluated and compared their capacity for multimodal generalization. We introduce a multimodal question-answer benchmark to evaluate three specific types of out-of-distribution (OOD) generalization performance: distractor generalization (generalization in the presence of distractors), systematic compositional generalization (generalization to new task permutations), and productive compositional generalization (generalization to more complex tasks with deeper dependencies). While we found that most architectures faired poorly on most forms of generalization (e.g., RNNs and standard Transformers), models that leveraged cross-attention mechanisms between input domains, such as the Perceiver, fared better. Our positive results demonstrate that for multimodal distractor and systematic generalization, cross-attention is an important mechanism to integrate multiple sources of information. On the other hand, all architectures failed in productive generalization, suggesting fundamental limitations of existing architectures for specific types of multimodal OOD generalization.These results demonstrate the strengths and limitations of specific architectural components underlying modern neural models for multimodal reasoning. Finally, we provide Generic COG (gCOG), a configurable benchmark with several multimodal generalization splits, for future studies to explore.

POSTER-2: Perceptual Measurements, Distances and Metrics

Primary Area: applications to neuroscience & cognitive science

Keywords: perceptual scale perceptual distance gaussian texture naturalistic textures texture interpolation

Scores: [ 8 5 5 3 ]

POSTER-3: Emergent mechanisms for long timescales depend on training curriculum and affect performance in memory tasks

Primary Area: applications to neuroscience & cognitive science

Keywords: timescales recurrent neural networks memory tasks curriculum learning

Scores: [ 8 8 6 5 ]

Recurrent neural networks (RNNs) in the brain and in silico excel at solving tasks with intricate temporal dependencies. Long timescales required for solving such tasks can arise from properties of individual neurons (single-neuron timescale, τ, e.g., membrane time constant in biological neurons) or recurrent interactions among them (network-mediated timescale). However, the contribution of each mechanism for optimally solving memory-dependent tasks remains poorly understood. Here, we train RNNs to solve N-parity and N-delayed match-to-sample tasks with increasing memory requirements controlled by N by simultaneously optimizing recurrent weights and $\tau$s. We find that for both tasks RNNs develop longer timescales with increasing N, but depending on the learning objective, they use different mechanisms. Two distinct curricula define learning objectives: sequential learning of a single-N (single-head) or simultaneous learning of multiple $N$s (multi-head). Single-head networks increase their τ with N and are able to solve tasks for large N, but they suffer from catastrophic forgetting. However, multi-head networks, which are explicitly required to hold multiple concurrent memories, keep τ constant and develop longer timescales through recurrent connectivity. Moreover, we show that the multi-head curriculum increases training speed and network stability to ablations and perturbations, and allows RNNs to generalize better to tasks beyond their training regime. This curriculum also significantly improves training GRUs and LSTMs for large-N tasks. Our results suggest that adapting timescales to task requirements via recurrent interactions allows learning more complex objectives and improves the RNN's performance.

POSTER-4: Threaten Spiking Neural Networks through Combining Rate and Temporal Information

Primary Area: applications to neuroscience & cognitive science

Keywords: Spiking Neural Networks Rate and Temporal Information Hybrid Adversarial Attack

Scores: [ 6 6 8 8 ]

SPOTLIGHT-1: Synaptic Weight Distributions Depend on the Geometry of Plasticity

Primary Area: applications to neuroscience & cognitive science

Keywords: synaptic weight distributions synaptic plasticity biologically plausible learning mirror descent

Scores: [ 1 8 5 8 ]

A growing literature in computational neuroscience leverages gradient descent and learning algorithms that approximate it to study synaptic plasticity in the brain. However, the vast majority of this work ignores a critical underlying assumption: the choice of distance for synaptic changes (i.e. the geometry of synaptic plasticity). Gradient descent assumes that the distance is Euclidean, but many other distances are possible, and there is no reason that biology necessarily uses Euclidean geometry. Here, using the theoretical tools provided by mirror descent, we show that, regardless of the loss being minimized, the distribution of synaptic weights will depend on the geometry of synaptic plasticity. We use these results to show that experimentally-observed log-normal weight distributions found in several brain areas are not consistent with standard gradient descent (i.e. a Euclidean geometry), but rather with non-Euclidean distances. Finally, we show that it should be possible to experimentally test for different synaptic geometries by comparing synaptic weight distributions before and after learning. Overall, this work shows that the current paradigm in theoretical work on synaptic plasticity that assumes Euclidean synaptic geometry may be misguided and that it should be possible to experimentally determine the true geometry of synaptic plasticity in the brain.

POSTER-5: Adaptive deep spiking neural network with global-local learning via balanced excitatory and inhibitory mechanism

Primary Area: applications to neuroscience & cognitive science

Keywords: Spiking neural networks excitation inhibition mechanism Long-Term Potentiation Long-Term Depression

Scores: [ 5 8 8 5 ]

The training method of Spiking Neural Networks (SNNs) is an essential problem, and how to integrate local and global learning is a worthy research interest. However, the current integration methods do not consider the network conditions suitable for local and global learning, and thus fail to balance their advantages. In this paper, we propose an Excitation-Inhibition Mechanism-assisted Hybrid Learning(EIHL) algorithm that adjusts the network connectivity by using the excitation-inhibition mechanism and then switches between local and global learning according to the network connectivity. The experimental results on CIFAR10/100 and DVS-CIFAR10 demonstrate that the EIHL not only has better accuracy performance than other methods but also has excellent sparsity advantage. Especially, the Spiking VGG11 is trained by EIHL, STBP, and STDP on DVS_CIFAR10, respectively. The accuracy of the Spiking VGG11 model on EIHL is 62.45%, which is 4.35% higher than STBP and 11.40% higher than STDP, and the sparsity is 18.74%, which is 18.74% higher than the other two methods. Moreover, the excitation-inhibition mechanism used in our method also offers a new perspective on the field of SNN learning.

POSTER-6: Understanding AI Cognition: A Neural Module for Inference Inspired by Human Memory Mechanisms

Primary Area: applications to neuroscience & cognitive science

Keywords: memory transformers architecture inference cognitive model

Scores: [ 8 6 6 ]

How humans and machines make sense of current inputs for relation reasoning and question-answering, and put the perceived information into context of our past memories, has been a challenging conundrum in cognitive science and artificial intelligence. Inspired by human brain's memory system and cognitive architectures, we propose a PMI framework that consists of perception, memory and inference components. Notably, the memory module comprises working and long-term memory, with the latter endowed with a higher-order structure to retain more accumulated knowledge and experiences. Through a differentiable competitive write access, current perceptions update working memory, which is later merged with long-term memory via outer product associations, averting memory overflow and minimizing information conflicts. In the inference module, relevant information is retrieved from two separate memory origins and associatively integrated to attain a more comprehensive and precise interpretation of current perceptions. We exploratively apply our PMI to improve prevailing Transformers and CNN models on question-answering tasks like bAbI-20k and Sort-of-CLEVR datasets, as well as relation reasoning and image classification tasks, and in each case, our PMI enhancements consistently outshine their original counterparts significantly. Visualization analyses reveal that memory consolidation, along with the interaction and integration of information from diverse memory sources, substantially contributes to the model effectiveness on inference tasks.

POSTER-7: How connectivity structure shapes rich and lazy learning in neural circuits

Primary Area: applications to neuroscience & cognitive science

Keywords: Computational neuroscience recurrent neural networks learning connectivity structure inductive bias rich and lazy learning deep learning theory neural representations

Scores: [ 8 6 8 5 ]

In theoretical neuroscience, recent work leverages deep learning tools to explore how some network attributes critically influence its learning dynamics. Notably, initial weight distributions with small (resp. large) variance may yield a rich (resp. lazy) regime, where significant (resp. minor) changes to network states and representation are observed over the course of learning. However, in biology, neural circuit connectivity generally has a low-rank structure and therefore differs markedly from the random initializations generally used for these studies. As such, here we investigate how the structure of the initial weights — in particular their effective rank — influences the network learning regime. Through both empirical and theoretical analyses, we discover that high-rank initializations typically yield smaller network changes indicative of lazier learning, a finding we also confirm with experimentally-driven initial connectivity in recurrent neural networks. Conversely, low-rank initialization biases learning towards richer learning. Importantly, however, as an exception to this rule, we find lazier learning can still occur with a low-rank initialization that aligns with task and data statistics. Our research highlights the pivotal role of initial weight structures in shaping learning regimes, with implications for metabolic costs of plasticity and risks of catastrophic forgetting.

POSTER-8: Traveling Waves Encode The Recent Past and Enhance Sequence Learning

Primary Area: applications to neuroscience & cognitive science

Keywords: RNNs Traveling Waves Memory Sequence Modeling

Scores: [ 6 6 8 5 ]

Traveling waves of neural activity have been observed throughout the brain at a diversity of regions and scales; however, their precise computational role is still debated. One physically grounded hypothesis suggests that the cortical sheet may act like a wave-field capable of invertibly storing a short-term memory of sequential stimuli through induced waves traveling across the cortical surface, and indeed many experimental results from neuroscience correlate wave activity with memory tasks. To date, however, the computational implications of this idea have remained hypothetical due to the lack of a simple recurrent neural network architecture capable of exhibiting such waves. In this work, we introduce a model to fill this gap, which we denote the Wave-RNN (wRNN), and demonstrate how such an architecture indeed efficiently encodes the recent past through a suite of synthetic memory tasks where wRNNs learn faster and reach significantly lower error than wave-free counterparts. We further explore the implications of this memory storage system on more complex sequence modeling tasks such as sequential image classification and find that wave-based models not only again outperform comparable wave-free RNNs while using significantly fewer parameters, but additionally perform comparably to more complex gated architectures such as LSTMs and GRUs.

POSTER-9: BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity

Primary Area: applications to neuroscience & cognitive science

Keywords: neuroscience brain fmri captions image synthesis visual cortex computational neuroscience

Scores: [ 8 6 8 6 ]

Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may potentially bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.

POSTER-10: CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic

Primary Area: applications to neuroscience & cognitive science

Keywords: Multi-subject visual neural decoding representational similarity analysis CLIP transformer

Scores: [ 5 6 8 ]

The study of decoding visual neural information faces challenges in generalizing single-subject decoding models to multiple subjects, due to individual differences. Moreover, the limited availability of data from a single subject has a constraining impact on model performance. Although prior multi-subject decoding methods have made significant progress, they still suffer from several limitations, including difficulty in extracting global neural response features, linear scaling of model parameters with the number of subjects, and inadequate characterization of the relationship between neural responses of different subjects to various stimuli. To overcome these limitations, we propose a CLIP-guided Multi-sUbject visual neural information SEmantic Decoding (CLIP-MUSED) method. Our method comprises a Transformer-based feature extractor that models global neural representations and two learnable subject-specific tokens representing different neural response patterns. These tokens enable the model to aggregate multi-subject data without a linear increase in the number of parameters. Additionally, we employ representational similarity analysis (RSA) to guide token representation learning based on the topological relationship of visual stimuli in the representation space of CLIP, enabling full characterization of the relationship between neural responses of different subjects under different stimuli. Finally, token representations are used for multi-subject semantic decoding. Our proposed method outperforms single-subject decoding methods and achieves state-of-the-art performance among the existing multi-subject methods on two fMRI datasets. Visualization results provide insights into the effectiveness of our proposed method. Code is available at https://github.com/CLIP-MUSED/CLIP-MUSED.

POSTER-11: Estimating Shape Distances on Neural Representations with Limited Samples

Primary Area: applications to neuroscience & cognitive science

Keywords: representational geometry shape metrics dissimilarity metrics

Scores: [ 8 10 6 6 ]

Measuring geometric similarity between high-dimensional network representations is a topic of longstanding interest to neuroscience and deep learning. Although many methods have been proposed, only a few works have rigorously analyzed their statistical efficiency or quantified estimator uncertainty in data-limited regimes. Here, we derive upper and lower bounds on the worst-case convergenceof standard estimators of shape distance—a measure of representational dissimilarity proposed by Williams et al. (2021). These bounds reveal the challenging nature of the problem in high-dimensional feature spaces. To overcome these challenges, we introduce a novel method-of-moments estimator with a tunable bias-variance tradeoff parameterized by an upper bound on bias. We show that this estimator achieves superior performance to standard estimators in simulation and on neural data, particularly in high-dimensional settings. Our theoretical work and estimator thus respectively define and dramatically expand the scope of neural data for which geometric similarity can be accurately measured.

POSTER-12: TAB: Temporal Accumulated Batch Normalization in Spiking Neural Networks

Primary Area: applications to neuroscience & cognitive science

Keywords: Temporal Batch Normalization Spiking Neural Networks

Scores: [ 6 8 8 6 ]

Spiking Neural Networks (SNNs) are attracting growing interest for their energy-efficient computing when implemented on neuromorphic hardware. However, directly training SNNs, even adopting batch normalization (BN), is highly challenging due to their non-differentiable activation function and the temporally delayed accumulation of outputs over time. For SNN training, this temporal accumulation gives rise to Temporal Covariate Shifts (TCS) along the temporal dimension, a phenomenon that would become increasingly pronounced with layerwise computations across multiple layers and time-steps. In this paper, we introduce TAB (Temporal Accumulated Batch Normalization), a novel SNN batch normalization method that addresses the temporal covariate shift issue by aligning with neuron dynamics (specifically the accumulated membrane potential) and utilizing temporal accumulated statistics for data normalization. Within its framework, TAB effectively encapsulates the historical temporal dependencies that underlie the membrane potential accumulation process, thereby establishing a natural connection between neuron dynamics and TAB batch normalization. Experimental results on CIFAR-10, CIFAR-100, and DVS-CIFAR10 show that our TAB method outperforms other state-of-the-art methods.

POSTER-13: Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction

Primary Area: applications to neuroscience & cognitive science

Keywords: Mesh deformation optimal transport cortical surface reconstruction computer vision medical imaging.

Scores: [ 6 6 8 8 ]

Mesh deformation plays a pivotal role in many 3D vision tasks including dynamic simulations, rendering, and reconstruction. However, defining an efficient discrepancy between predicted and target meshes remains an open problem. A prevalent approach in current deep learning is the set-based approach which measures the discrepancy between two surfaces by comparing two randomly sampled point-clouds from the two meshes with Chamfer pseudo-distance. Nevertheless, the set-based approach still has limitations such as lacking a theoretical guarantee for choosing the number of points in sampled point-clouds, and the pseudo-metricity and the quadratic complexity of the Chamfer divergence. To address these issues, we propose a novel metric for learning mesh deformation. The metric is defined by sliced Wasserstein distance on meshes represented as probability measures that generalize the set-based approach. By leveraging probability measure space, we gain flexibility in encoding meshes using diverse forms of probability measures, such as continuous, empirical, and discrete measures via \textit{varifold} representation. After having encoded probability measures, we can compare meshes by using the sliced Wasserstein distance which is an effective optimal transport distance with linear computational complexity and can provide a fast statistical rate for approximating the surface of meshes. To the end, we employ a neural ordinary differential equation (ODE) to deform the input surface into the target shape by modeling the trajectories of the points on the surface. Our experiments on cortical surface reconstruction demonstrate that our approach surpasses other competing methods in multiple datasets and metrics.

POSTER-14: A Progressive Training Framework for Spiking Neural Networks with Learnable Multi-hierarchical Model

Primary Area: applications to neuroscience & cognitive science

Keywords: Spiking Neural Networks Learnable Multi-hierarchical Model Spatio-Temporal Back-propagation

Scores: [ 8 6 6 5 ]

SPOTLIGHT-2: Towards Energy Efficient Spiking Neural Networks: An Unstructured Pruning Framework

Primary Area: applications to neuroscience & cognitive science

Keywords: Spiking Neural Networks Network Pruning

Scores: [ 8 3 8 6 ]

Spiking Neural Networks (SNNs) have emerged as energy-efficient alternatives to Artificial Neural Networks (ANNs) when deployed on neuromorphic chips. While recent studies have demonstrated the impressive performance of deep SNNs on challenging tasks, their energy efficiency advantage has been diminished. Existing methods targeting energy consumption reduction do not fully exploit sparsity, whereas powerful pruning methods can achieve high sparsity but are not directly targeted at energy efficiency, limiting their effectiveness in energy saving. Furthermore, none of these works fully exploit the sparsity of neurons or the potential for unstructured neuron pruning in SNNs. In this paper, we propose a novel pruning framework that combines unstructured weight pruning with unstructured neuron pruning to maximize the utilization of the sparsity of neuromorphic computing, thereby enhancing energy efficiency. To the best of our knowledge, this is the first application of unstructured neuron pruning to deep SNNs. Experimental results demonstrate that our method achieves impressive energy efficiency gains. The sparse network pruned by our method with only 0.63% remaining connections can achieve a remarkable 91 times increase in energy efficiency compared to the original dense network, requiring only 8.5M SOPs for inference, with merely 2.19% accuracy loss on the CIFAR-10 dataset. Our work suggests that deep and dense SNNs exhibit high redundancy in energy consumption, highlighting the potential for targeted SNN sparsification to save energy.

POSTER-15: Turning large language models into cognitive models

Primary Area: applications to neuroscience & cognitive science

Keywords: cognitive modeling large language models neural networks cognitive psychology decision-making

Scores: [ 8 8 8 8 ]

Large language models are powerful systems that excel at many tasks, ranging from translation to mathematical reasoning. Yet, at the same time, these models often show unhuman-like characteristics. In the present paper, we address this gap and ask whether large language models can be turned into cognitive models. We find that -- after finetuning them on data from psychological experiments -- these models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains. In addition, we show that their representations contain the information necessary to model behavior on the level of individual subjects. Finally, we demonstrate that finetuning on multiple tasks enables large language models to predict human behavior in a previously unseen task. Taken together, these results suggest that large, pre-trained models can be adapted to become models of human cognition, which opens up future research directions toward building more general cognitive models.

POSTER-16: Decoding Natural Images from EEG for Object Recognition

Primary Area: applications to neuroscience & cognitive science

Keywords: EEG object recognition contrastive learning brain-computer interface

Scores: [ 3 8 8 8 ]

Electroencephalography (EEG) signals, known for the convenient non-invasive acquisition but low signal-to-noise, have recently gained substantial attention due to the potential to decode natural images. This paper presents a self-supervised framework to demonstrate the feasibility of learning image representations from EEG signals, particularly for object recognition. The framework utilizes image and EEG encoders to extract features from paired image stimuli and EEG responses. Contrastive learning aligns these two modalities by constraining their similarity. With the framework, we attain significantly above-chance results on a comprehensive EEG-image dataset, achieving a top-1 accuracy of 15.6% and a top-5 accuracy of 42.8% in challenging 200-way zero-shot tasks. Moreover, we perform extensive experiments to explore the biological plausibility by resolving the temporal, spatial, spectral, and semantic aspects of EEG signals. Besides, we introduce attention modules to capture spatial correlations, providing implicit evidence of the brain activity perceived from EEG data. These findings yield valuable insights for neural decoding and brain-computer interfaces in real-world scenarios.

POSTER-17: Exploring Effective Stimulus Encoding via Vision System Modeling for Visual Prostheses

Primary Area: applications to neuroscience & cognitive science

Keywords: Visual prostheses Spiking recurrent neural network Dynamic vision sensor Biological phosphene model

Scores: [ 6 6 8 5 ]

Visual prostheses are potential devices to restore vision for the blind, which highly depends on the quality of stimulation patterns. However, existing processing frameworks utilize region detection or deep learning model to generate stimulation patterns without effective optimization methods to achieve better vision recovery. In this paper, we propose for the first time an end-to-end stimulation patterns optimization framework that consists of a retinal network to mimic the behavior of retina, phosphene model to simulate phosphene generated by retinal prostheses, and primary vision system network (PVS-net) to mimic the function from retina to visual cortex. Combining these three components, the framework can simulate the whole process of visual signals processing from external scenes to the visual perception in the cortex. Besides, we adopt biological spike responses of the visual cortex as target signals during training, providing an efficient way to generate and verify the quality of stimulation patterns. The proposed retina network adopts a spike representation encoding technique to record external scenes and a spiking recurrent neural network to predict the stimulation patterns. The phosphene model and VVS-net simulate the phosphene in the retina and predict multiple V1 neurons' response. Experimental results show that the generated stimulation patterns not only contain the feature of original scenes but also have biological plausibility to generate similar perceptions in the visual cortex. The performance of the proposed framework achieves 0.78, evaluated by the Pearson correlation coefficient between predicted values and the recorded response of normal neurons.

ORAL-1: Predictive auxiliary objectives in deep RL mimic learning in the brain

Primary Area: applications to neuroscience & cognitive science

Keywords: hippocampus neuroscience cognitive science deep reinforcement learning representation learning prediction

Scores: [ 8 8 8 ]

POSTER-18: Multi-modal Gaussian Process Variational Autoencoders for Neural and Behavioral Data

Primary Area: applications to neuroscience & cognitive science

Keywords: Gaussian Processes Latent Variable Models Variational Autoencoders Neuroscience

Scores: [ 8 3 8 5 5 ]

Characterizing the relationship between neural population activity and behavioral data is a central goal of neuroscience. While latent variable models (LVMs) are successful in describing high-dimensional data, they are typically only designed for a single type of data, making it difficult to identify structure shared across different experimental data modalities. Here, we address this shortcoming by proposing an unsupervised LVM which extracts shared and independent latents for distinct, simultaneously recorded experimental modalities. We do this by combining Gaussian Process Factor Analysis (GPFA), an interpretable LVM for neural spiking data with temporally smooth latent space, with Gaussian Process Variational Autoencoders (GP-VAEs), which similarly use a GP prior to characterize correlations in a latent space, but admit rich expressivity due to a deep neural network mapping to observations. We achieve interpretability in our model by partitioning latent variability into components that are either shared between or independent to each modality. We parameterize the latents of our model in the Fourier domain, and show improved latent identification using this approach over standard GP-VAE methods. We validate our model on simulated multi-modal data consisting of Poisson spike counts and MNIST images that scale and rotate smoothly over time. We show that the multi-modal GP-VAE (MM-GPVAE) is able to not only identify the shared and independent latent structure across modalities accurately, but provides good reconstructions of both images and neural rates on held-out trials. Finally, we demonstrate our framework on two real world multi-modal experimental settings: Drosophila whole-brain calcium imaging alongside tracked limb positions, and Manduca sexta spike train measurements from ten wing muscles as the animal tracks a visual stimulus.

POSTER-19: Learning dynamic representations of the functional connectome in neurobiological networks

Primary Area: applications to neuroscience & cognitive science

Keywords: neural networks non-negative tensor factorization dynamic community detection functional connectome

Scores: [ 6 6 8 ]

The static synaptic connectivity of neuronal circuits stands in direct contrast to the dynamics of their function. As in changing community interactions, different neurons can participate actively in various combinations to effect behaviors at different times. We introduce an unsupervised approach to learn the dynamic affinities between neurons in live, behaving animals, and to reveal which communities form among neurons at different times. The inference occurs in two major steps. First, pairwise non-linear affinities between neuronal traces from brain-wide calcium activity are organized by non-negative tensor factorization (NTF). Each factor specifies which groups of neurons are most likely interacting for an inferred interval in time, and for which animals. Finally, a generative model that allows for weighted community detection is applied to the functional motifs produced by NTF to reveal a dynamic functional connectome. Since time codes the different experimental variables (e.g., application of chemical stimuli), this provides an atlas of neural motifs active during separate stages of an experiment (e.g., stimulus application or spontaneous behaviors). Results from our analysis are experimentally validated, confirming that our method is able to robustly predict causal interactions between neurons to generate behavior.

POSTER-20: Bayesian Bi-clustering of Neural Spiking Activity with Latent Structures

Primary Area: applications to neuroscience & cognitive science

Keywords: neural spikes Bayesian nonparametric bi-clustering

Scores: [ 6 8 6 ]

Modern neural recording techniques allow neuroscientists to obtain spiking activity of multiple neurons from different brain regions over long time periods, which requires new statistical methods to be developed for understanding structure of the large-scale data. In this paper, we develop a bi-clustering method to cluster the neural spiking activity spatially and temporally, according to their low-dimensional latent structures. The spatial (neuron) clusters are defined by the latent trajectories within each neural population, while the temporal (state) clusters are defined by (populationally) synchronous local linear dynamics shared with different periods. To flexibly extract the bi-clustering structure, we build the model non-parametrically, and develop an efficient Markov chain Monte Carlo (MCMC) algorithm to sample the posterior distributions of model parameters. Validating our proposed MCMC algorithm through simulations, we find the method can recover unknown parameters and true bi-clustering structures successfully. We then apply the proposed bi-clustering method to multi-regional neural recordings under different experiment settings, where we find that simultaneously considering latent trajectories and spatial-temporal clustering structures can provide us with a more accurate and interpretable result. Overall, the proposed method provides scientific insights for large-scale (counting) time series with elongated recording periods, and it can potentially have application beyond neuroscience.

POSTER-21: Spatio-Temporal Approximation: A Training-Free SNN Conversion for Transformers

Primary Area: applications to neuroscience & cognitive science

Keywords: Spiking Neural Network Spike Calibration Transformer

Scores: [ 8 8 6 6 ]

Spiking neural networks (SNNs) are energy-efficient and hold great potential for large-scale inference. Since training SNNs from scratch is costly and has limited performance, converting pretrained artificial neural networks (ANNs) to SNNs is an attractive approach that retains robust performance without additional training data and resources. However, while existing conversion methods work well on convolution networks, emerging Transformer models introduce unique mechanisms like self-attention and test-time normalization, leading to non-causal non-linear interactions unachievable by current SNNs. To address this, we approximate these operations in both temporal and spatial dimensions, thereby providing the first SNN conversion pipeline for Transformers. We propose \textit{Universal Group Operators} to approximate non-linear operations spatially and a \textit{Temporal-Corrective Self-Attention Layer} that approximates spike multiplications at inference through an estimation-correction approach. Our algorithm is implemented on a pretrained ViT-B/32 from CLIP, inheriting its zero-shot classification capabilities, while improving control over conversion losses. To our knowledge, this is the first direct training-free conversion of a pretrained Transformer to a purely event-driven SNN, promising for neuromorphic hardware deployment.

POSTER-22: Modeling state-dependent communication between brain regions with switching nonlinear dynamical systems

Primary Area: applications to neuroscience & cognitive science

Keywords: neuroscience neural dynamics dynamical systems decision making

Scores: [ 5 6 6 6 6 ]

Understanding how multiple brain regions interact to produce behavior is a major challenge in systems neuroscience, with many regions causally implicated in common tasks such as sensory processing and decision making. A precise description of interactions between regions remains an open problem. Moreover, neural dynamics are nonlinear and non-stationary. Here, we propose MR-SDS, a multiregion, switching nonlinear state space model that decomposes global dynamics into local and cross-communication components in the latent space. MR-SDS includes directed interactions between brain regions, allowing for estimation of state-dependent communication signals, and accounts for sensory inputs effects, history effects, and heterogeneity across days and animals. We show that our model accurately recovers latent trajectories, vector fields underlying switching nonlinear dynamics, and cross-region communication profiles in three simulations. We then apply our method to two large-scale, multi-region neural datasets involving mouse decision making. The first includes hundreds of neurons per region, recorded simultaneously at single-cell-resolution across 3 distant cortical regions. The second is a mesoscale widefield dataset of 8 adjacent cortical regions imaged across both hemispheres. On these multi-region datasets, our model outperforms existing piece-wise linear multi-region models and reveals multiple distinct dynamical states and a rich set of cross-region communication profiles.

POSTER-23: Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data

Primary Area: applications to neuroscience & cognitive science

Keywords: GPT Pretraining Transformers Multimodal Multitask Neuron Neuroscience Neuronal Spikes Brain Cortex Human Mice Biology

Scores: [ 6 8 6 5 ]

State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these data sets require new tools for analysis. Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data into an auto-regressive spatiotemporal generation problem. Neuroformer is a multimodal, multitask generative pre-trained transformer (GPT) model that is specifically designed to handle the intricacies of data in systems neuroscience. It scales linearly with feature size, can process an arbitrary number of modalities, and is adaptable to downstream tasks, such as predicting behavior. We first trained Neuroformer on simulated datasets, and found that it both accurately predicted simulated neuronal circuit activity, and also intrinsically inferred the underlying neural circuit connectivity, including direction. When pretrained to decode neural responses, the model predicted the behavior of a mouse with only few-shot fine-tuning, suggesting that the model begins learning how to do so directly from the neural representations themselves, without any explicit supervision. We used an ablation study to show that joint training on neuronal responses and behavior boosted performance, highlighting the model's ability to associate behavioral and neural representations in an unsupervised manner. These findings show that Neuroformer can analyze neural datasets and their emergent properties, informing the development of models and hypotheses associated with the brain.

POSTER-24: Emergent Communication with Conversational Repair

Primary Area: applications to neuroscience & cognitive science

Keywords: emergent communication conversation signaling games language evolution

Scores: [ 8 6 5 ]

Research on conversation has put emphasis on the importance of a multi-level communication system, in which the interlocutors aim to establish and maintain common ground. In natural conversations, repair mechanisms such as clarification requests are frequently used to improve mutual understanding.Here we explore the effects of conversational repair on languages emerging in signaling games. We extend the basic Lewis signaling game setup with a feedback channel that allows for the transmission of messages backwards from the receiver to the sender. Further, we add noise to the communication channel so that repair mechanisms become necessary for optimal performance.We find that for models that were trained with a feedback channel the sender agents produce less compositional messages. However, they still achieve a substantially higher generalization performance, putting to question the role of compositionality for generalization.These findings generalize also to a more realistic case involving naturalistic images in a guessing game setup.More broadly, this study provides an important step towards the creation of signaling games that more closely resemble the conditions under which human languages emerged.

SPOTLIGHT-3: Complex priors and flexible inference in recurrent circuits with dendritic nonlinearities

Primary Area: applications to neuroscience & cognitive science

Keywords: computational neuroscience probabilistic coding neural sampling priors

Scores: [ 6 5 6 6 ]

Despite many successful examples in which probabilistic inference can account for perception, we have little understanding of how the brain represents and uses structured priors that capture the complexity of natural input statistics. Here we construct a recurrent circuit model that can implicitly represent priors over latent variables, and combine them with sensory and contextual sources of information to encode task-specific posteriors. Inspired by the recent success of diffusion models as means of learning and using priors over images, our model uses dendritic nonlinearities optimized for denoising, and stochastic somatic integration with the degree of noise modulated by an oscillating global signal. Combining these elements into a recurrent network yields a dynamical system that samples from the prior at a rate prescribed by the period of the global oscillator. Additional inputs reflecting sensory or top-down contextual information alter these dynamics to generate samples from the corresponding posterior, with different input gating patterns selecting different inference tasks. We demonstrate that this architecture can sample from low dimensional nonlinear manifolds and multimodal posteriors. Overall, the model provides a new framework for circuit-level representation of probabilistic information, in a format that facilitates flexible inference.

POSTER-25: A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks

Primary Area: applications to neuroscience & cognitive science

Keywords: Cognitive Science parallelization

Scores: [ 3 6 6 6 ]

Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience. Training such models, however, is quite inefficient and unstable. In this work, we show how by changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one, and has theoretical guarantees in terms of convergence. The proposed algorithm, that we call incremental predictive coding (iPC) is also more biologically plausible than the original one, as it it fully automatic. In an extensive set of experiments, we show that iPC constantly outperforms the original formulation on a large number of benchmarks for image classification, as well as for the training of both conditional and masked language models, in terms of test accuracy, efficiency, and convergence with respect to a large set of hyperparameters.

POSTER-26: BrainLM: A foundation model for brain activity recordings

Primary Area: applications to neuroscience & cognitive science

Keywords: foundation model fMRI

Scores: [ 6 6 6 6 ]

We introduce the Brain Language Model (BrainLM), a foundation model for brain activity dynamics trained on 6,700 hours of fMRI recordings. Utilizing self-supervised masked-prediction training, BrainLM demonstrates proficiency in both fine-tuning and zero-shot inference tasks. Fine-tuning allows for the accurate prediction of clinical variables like age, anxiety, and PTSD as well as forecasting of future brain states. Critically, the model generalizes well to entirely new external cohorts not seen during training. In zero-shot inference mode, BrainLM can identify intrinsic functional networks directly from raw fMRI data without any network-based supervision during training. The model also generates interpretable latent representations that reveal relationships between brain activity patterns and cognitive states. Overall, BrainLM offers a versatile and interpretable framework for elucidating the complex spatiotemporal dynamics of human brain activity. It serves as a powerful "lens" through which massive repositories of fMRI data can be analyzed in new ways, enabling more effective interpretation and utilization at scale. The work demonstrates the potential of foundation models to advance computational neuroscience research.

POSTER-27: Sufficient conditions for offline reactivation in recurrent neural networks

Primary Area: applications to neuroscience & cognitive science

Keywords: computational neuroscience offline reactivation replay recurrent neural networks path integration noise

Scores: [ 6 6 6 5 ]

During periods of quiescence, such as sleep, neural activity in many brain circuits resembles that observed during periods of task engagement. However, the precise conditions under which task-optimized networks can autonomously reactivate the same network states responsible for online behavior is poorly understood. In this study, we develop a mathematical framework that outlines sufficient conditions for the emergence of neural reactivation in circuits that encode features of smoothly varying stimuli. We demonstrate mathematically that noisy recurrent networks optimized to track environmental state variables using change-based sensory information naturally develop denoising dynamics, which, in the absence of input, cause the network to revisit state configurations observed during periods of online activity. We validate our findings using numerical experiments on two canonical neuroscience tasks: spatial position estimation based on self-motion cues, and head direction estimation based on angular velocity cues. Overall, our work provides theoretical support for modeling offline reactivation as an emergent consequence of task optimization in noisy neural circuits.

SPOTLIGHT-4: Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI

Primary Area: applications to neuroscience & cognitive science

Keywords: EEG brain-computer interface representation learning

Scores: [ 8 8 6 ]

The current electroencephalogram (EEG) based deep learning models are typically designed for specific datasets and applications in brain-computer interaction (BCI), limiting the scale of the models and thus diminishing their perceptual capabilities and generalizability. Recently, Large Language Models (LLMs) have achieved unprecedented success in text processing, prompting us to explore the capabilities of Large EEG Models (LEMs). We hope that LEMs can break through the limitations of different task types of EEG datasets, and obtain universal perceptual capabilities of EEG signals through unsupervised pre-training. Then the models can be fine-tuned for different downstream tasks. However, compared to text data, the volume of EEG datasets is generally small and the format varies widely. For example, there can be mismatched numbers of electrodes, unequal length data samples, varied task designs, and low signal-to-noise ratio. To overcome these challenges, we propose a unified foundation model for EEG called Large Brain Model (LaBraM). LaBraM enables cross-dataset learning by segmenting the EEG signals into EEG channel patches. Vector-quantized neural spectrum prediction is used to train a semantically rich neural tokenizer that encodes continuous raw EEG channel patches into compact neural codes. We then pre-train neural Transformers by predicting the original neural codes for the masked EEG channel patches. The LaBraMs were pre-trained on about 2,500 hours of various types of EEG signals from around 20 datasets and validated on multiple different types of downstream tasks. Experiments on abnormal detection, event type classification, emotion recognition, and gait prediction show that our LaBraM outperforms all compared SOTA methods in their respective fields. Our code will be released.

SPOTLIGHT-5: Deep Geodesic Canonical Correlation Analysis for Covariance-Based Neuroimaging Data

Primary Area: applications to neuroscience & cognitive science

Keywords: Brain-Computer Interfaces Geometric Deep Learning Self-Supervised Learning Neuroscience

Scores: [ 8 6 8 ]

In human neuroimaging, multi-modal imaging techniques are frequently combined to enhance our comprehension of whole-brain dynamics and improve diagnosis in clinical practice. Modalities like electroencephalography and functional magnetic resonance imaging provide distinct views of brain dynamics due to diametral spatiotemporal sensitivities and underlying neurophysiological coupling mechanisms. These distinct views pose a considerable challenge to learning a shared representation space, especially when dealing with covariance-based data characterized by their geometric structure. To capitalize on the geometric structure, we introduce a measure called geodesic correlation, which expands traditional correlation consistency to covariance-based data on the symmetric positive definite (SPD) manifold. This measure is derived from classical canonical correlation analysis and serves to evaluate the consistency of latent representations obtained from paired views. For multi-view/-modal, self-supervised learning where one or both latent views are SPD, we propose an innovative geometric deep learning framework termed DeepGeoCCA. Its primary objective is to enhance the geodesic correlation of unlabeled, paired data, thereby generating novel representations while retaining the geometric structures. In simulations and experiments with multi-view and multi-modal human neuroimaging data, we find that DeepGeoCCA learns latent representations with high geodesic consistency for unseen data while retaining relevant information for downstream tasks.

SPOTLIGHT-6: Predictive, scalable and interpretable knowledge tracing on structured domains

Primary Area: applications to neuroscience & cognitive science

Keywords: knowledge tracing interpretable representations knowledge graphs probabilistic models variational inference continual learning

Scores: [ 6 5 8 8 ]

Intelligent tutoring systems optimize the selection and timing of learning materials to enhance understanding and long-term retention. This requires estimates of both the learner's progress ("knowledge tracing"; KT), and the prerequisite structure of the learning domain ("knowledge mapping"). While recent deep learning models achieve high KT accuracy, they do so at the expense of the interpretability of psychologically-inspired models. In this work, we present a solution to this trade-off. PSI-KT is a hierarchical generative approach that explicitly models how both individual cognitive traits and the prerequisite structure of knowledge influence learning dynamics, thus achieving interpretability by design. Moreover, by using scalable Bayesian inference, PSI-KT targets the real-world need for efficient personalization even with a growing body of learners and interaction data. Evaluated on three datasets from online learning platforms, PSI-KT achieves superior multi-step predictive accuracy and scalable inference in continual-learning settings, all while providing interpretable representations of learner-specific traits and the prerequisite structure of knowledge that causally supports learning. In sum, predictive, scalable and interpretable knowledge tracing with solid knowledge mapping lays a key foundation for effective personalized learning to make education accessible to a broad, global audience.

POSTER-28: Forward Learning with Top-Down Feedback: Empirical and Analytical Characterization

Primary Area: applications to neuroscience & cognitive science

Keywords: Forward-only learning Biologically inspired learning Artificial neural networks Analytical characterization

Scores: [ 5 5 8 8 ]

POSTER-29: COCO-Periph: Bridging the Gap Between Human and Machine Perception in the Periphery

Primary Area: applications to neuroscience & cognitive science

Keywords: peripheral vision object detection dataset foveation psychophysics

Scores: [ 3 6 8 ]

Evaluating deep neural networks (DNNs) as models of human perception has given rich insights into both human visual processing and representational properties of DNNs. We extend this work by analyzing how well DNNs perform compared to humans when constrained by peripheral vision -- which limits human performance on a variety of tasks, but also benefits the visual system significantly. We evaluate this by (1) modifying the Texture Tiling Model (TTM), a well tested model of peripheral vision to be more flexibly used with DNNs, (2) generating a large dataset which we call COCO-Periph that contains images transformed to capture the information available in human peripheral vision, and (3) comparing DNNs to humans at peripheral object detection using a psychophysics experiment. Our results show that common DNNs underperform at object detection compared to humans when simulating peripheral vision with TTM. Training on COCO-Periph begins to reduce the gap between human and DNN performance and leads to small increases in corruption robustness, but DNNs still struggle to capture human-like sensitivity to peripheral clutter. Our work brings us closer to accurately modeling human vision, and paves the way for DNNs to mimic and sometimes benefit from properties of human visual processing.

POSTER-30: One-hot Generalized Linear Model for Switching Brain State Discovery

Primary Area: applications to neuroscience & cognitive science

Keywords: Hidden markov models generalized linear models functional connectivity inference

Scores: [ 8 3 8 ]

Exposing meaningful and interpretable neural interactions is critical to understanding neural circuits. Inferred neural interactions from neural signals primarily reflect functional interactions. In a long experiment, subject animals may experience different stages defined by the experiment, stimuli, or behavioral states, and hence functional interactions can change over time. To model dynamically changing functional interactions, prior work employs state-switching generalized linear models with hidden Markov models (i.e., HMM-GLMs). However, we argue they lack biological plausibility, as functional interactions are shaped and confined by the underlying anatomical connectome. Here, we propose a novel prior-informed state-switching GLM. We introduce both a Gaussian prior and a one-hot prior over the GLM in each state. The priors are learnable. We will show that the learned prior should capture the state-constant interaction, shedding light on the underlying anatomical connectome and revealing more likely physical neuron interactions. The state-dependent interaction modeled by each GLM offers traceability to capture functional variations across multiple brain states. Our methods effectively recover true interaction structures in simulated data, achieve the highest predictive likelihood with real neural datasets, and render interaction structures and hidden states more interpretable when applied to real neural data.

POSTER-31: Learning interpretable control inputs and dynamics underlying animal locomotion

Primary Area: applications to neuroscience & cognitive science

Keywords: computational neuroscience interpretable dynamics motor control animal behavior dynamical systems system identification unsupervised learning zebrafish

Scores: [ 6 6 5 6 ]

A central objective in neuroscience is to understand how the brain orchestrates movement. Recent advances in automated tracking technologies have made it possible to document behavior with unprecedented temporal resolution and scale, generating rich datasets which can be exploited to gain insights into the neural control of movement. One common approach is to identify stereotypical motor primitives using cluster analysis. However, this categorical description can limit our ability to model the effect of more continuous control schemes. Here we take a control theoretic approach to behavioral modeling and argue that movements can be understood as the output of a controlled dynamical system. Previously, models of movement dynamics, trained solely on behavioral data, have been effective in reproducing observed features of neural activity. These models addressed specific scenarios where animals were trained to execute particular movements upon receiving a prompt. In this study, we extend this approach to analyze the full natural locomotor repertoire of an animal: the zebrafish larva. Our findings demonstrate that this repertoire can be effectively generated through a sparse control signal driving a latent Recurrent Neural Network (RNN). Our model's learned latent space preserves key kinematic features and disentangles different categories of movements. To further interpret the latent dynamics, we used balanced model reduction to yield a simplified model. Collectively, our methods serve as a case study for interpretable system identification, and offer a novel framework for understanding neural activity in relation to movement.

POSTER-32: A Cognitive Model for Learning Abstract Relational Structures from Memory-based Decision-Making Tasks

Primary Area: applications to neuroscience & cognitive science

Keywords: Hippocampal formation; episodic memory; relational learning

Scores: [ 6 8 8 5 ]

Motivated by a recent neuroscientific hypothesis, some theoretical studies have accounted for neural cognitive maps in the rodent hippocampal formation as a representation of the general relational structure across task environments. However, despite their remarkable results, it is unclear whether their account can be extended to more general settings beyond spatial random-walk tasks in 2D environments. To address this question, we construct a novel cognitive model that performs memory-based decision-making tasks, inspired by previous human studies, for learning abstract relational structures. Building on previous approaches of modular architecture, we develop a learning algorithm that performs reward-guided relational inference across different entity domains, while maintaining dynamic binding between abstract relations and concrete entities using our specific memory mechanism enabling content replacement. Our experiments show (i) the capability of our model to capture relational structures that can generalize over new domains with unseen entities, (ii) the difficulty of our task that leads previous models, including Neural Turing Machine and vanilla Transformer, to complete failure, and (iii) the similarity of performance and internal representations of our model to recent human behavioral and fMRI experimental data, a cognitive map representation in the human hippocampal formation.

POSTER-33: VBH-GNN: Variational Bayesian Heterogeneous Graph Neural Networks for Cross-subject Emotion Recognition

Primary Area: applications to neuroscience & cognitive science

Keywords: EEG Emotion Recognition Multi-modal Domain Adaptation cross-subject

Scores: [ 8 8 6 6 ]

The research on human emotion under electroencephalogram (EEG) is an emerging field in which cross-subject emotion recognition (ER) is a promising but challenging task. Many approaches attempt to find emotionally relevant domain-invariant features using domain adaptation (DA) to improve the accuracy of cross-subject ER. Two problems exist with these methods. First, only single-modal data (EEG) are utilized, ignoring the complementarity between multi-modal physiological signals. Second, these methods aim to completely match the signal feature between different domains, which is difficult due to the extreme individual differences of EEG. To solve these problems, we introduced the complementarity of multi-modal physiological signals and proposed a new method for cross-subject ER that does not align the distribution of signal features but rather the distribution of spatio-temporal relationships between features. We design a Variational Bayesian Heterogeneous Graph Neural Network (VBH-GNN) with Relationship Distribution Adaptation (RDA). The RDA first aligns the domains by expressing the model space as a posterior distribution of a heterogeneous graph (HetG) for a given source domain through Bayesian graph inference. Then RDA transforms the HetG into an emotion-specific graph to further align the domains for the downstream ER task. Extensive experiments on two public datasets, DEAP and Dreamer, show that our VBH-GNN outperforms state-of-the-art methods.

SPOTLIGHT-7: Online Stabilization of Spiking Neural Networks

Primary Area: applications to neuroscience & cognitive science

Keywords: spiking neural networks online training

Scores: [ 6 8 8 6 ]

Spiking neural networks (SNNs), attributed to the binary, event-driven nature of spikes, possess heightened biological plausibility and enhanced energy efficiency on neuromorphic hardware compared to analog neural networks (ANNs). Mainstream SNN training schemes apply backpropagation-through-time (BPTT) with surrogate gradients to replace the non-differentiable spike emitting process during backpropagation. While achieving competitive performance, the requirement for storing intermediate information at all time-steps incurs higher memory consumption and fails to fulfill the online property crucial to biological brains. Our work focuses on online training techniques, aiming for memory efficiency while preserving biological plausibility. The limitation of not having access to future information in early time steps in online training has constrained previous efforts to incorporate advantageous modules such as batch normalization. To address this problem, we propose Online Spiking Renormalization (OSR) to ensure consistent parameters between testing and training, and Online Threshold Stabilizer (OTS) to stabilize neuron firing rates across time steps. Furthermore, we design a novel online approach to compute the variable mean and variance over time for OSR. Experiments conducted on various datasets demonstrate the proposed method's superior performance among SNN online training algorithms.

SPOTLIGHT-8: Critical Learning Periods Emerge Even in Deep Linear Networks

Primary Area: applications to neuroscience & cognitive science

Keywords: critical learning periods deep neural networks gradient descent linear networks

Scores: [ 8 6 10 5 ]

Critical learning periods are periods early in development where temporary sensory deficits can have a permanent effect on behavior and learned representations. Despite the radical differences between biological and artificial networks, critical learning periods have been empirically observed in both systems. This suggests that critical periods may be fundamental to learning and not an accident of biology.Yet, why exactly critical periods emerge in deep networks is still an open question, and in particular it is unclear whether the critical periods observed in both systems depend on particular architectural or optimization details. To isolate the key underlying factors, we focus on deep linear network models, and show that, surprisingly, such networks also display much of the behavior seen in biology and artificial networks, while being amenable to analytical treatment. We show that critical periods depend on the depth of the model and structure of the data distribution. We also show analytically and in simulations that the learning of features is tied to competition between sources. Finally, we extend our analysis to multi-task learning to show that pre-training on certain tasks can damage the transfer performance on new tasks, and show how this depends on the relationship between tasks and the duration of the pre-training stage. To the best of our knowledge, our work provides the first analytically tractable model that sheds light into why critical learning periods emerge in biological and artificial networks.

POSTER-34: BrainPy: a differentiable brain simulator bridging brain simulation and brain-inspired computing

Primary Area: applications to neuroscience & cognitive science

Keywords: brain simulator brain simulation computational neuroscience brain-inspired computing

Scores: [ 6 10 8 6 6 ]

Brain simulation builds dynamical models to mimic the structure and functions of the brain, while brain-inspired computing (BIC) develops intelligent systems by learning from the structure and functions of the brain. The two fields are intertwined and should share a common programming framework to facilitate each other's development. However, none of the existing software in the fields can achieve this goal, because traditional brain simulators lack differentiability for training, while existing deep learning (DL) frameworks fail to capture the biophysical realism and complexity of brain dynamics. In this paper, we introduce BrainPy, a differentiable brain simulator developed using JAX and XLA, with the aim of bridging the gap between brain simulation and BIC. BrainPy expands upon the functionalities of JAX, a powerful AI framework, by introducing complete capabilities for flexible, efficient, and scalable brain simulation. It offers a range of sparse and event-driven operators for efficient and scalable brain simulation, an abstraction for managing the intricacies of synaptic computations, a modular and flexible interface for constructing multi-scale brain models, and an object-oriented just-in-time compilation approach to handle the memory-intensive nature of brain dynamics. We showcase the efficiency and scalability of BrainPy on benchmark tasks, highlight its differentiable simulation for biologically plausible spiking models, and discuss its potential to support research at the intersection of brain simulation and BIC.

POSTER-35: Manipulating dropout reveals an optimal balance of efficiency and robustness in biological and machine visual systems

Primary Area: applications to neuroscience & cognitive science

Keywords: Efficient coding object representation dropout robustness human fMRI occipitotemporal cortex cognitive neuroscience distributed coding

Scores: [ 6 6 6 6 ]

According to the efficient coding hypothesis, neural populations encode information optimally when representations are high-dimensional and uncorrelated. However, such codes may carry a cost in terms of generalization and robustness. Past empirical studies of early visual cortex (V1) in rodents have suggested that this tradeoff indeed constrains sensory representations. However, it remains unclear whether these insights generalize across the hierarchy of the human visual system, and particularly to object representations in high-level occipitotemporal cortex (OTC). To gain new empirical clarity, here we develop a family of object recognition models with parametrically varying dropout proportion p, which induces systematically varying dimensionality of internal responses (while controlling all other inductive biases). We find that increasing dropout produces an increasingly smooth, low-dimensional representational space. Optimal robustness to lesioning is observed at around 70% dropout, after which both accuracy and robustness decline. Representational comparison to large-scale 7T fMRI data from occipitotemporal cortex in the Natural Scenes Dataset reveals that this optimal degree of dropout is also associated with maximal emergent neural predictivity. Finally, using new techniques for achieving denoised estimates of the eigenspectrum of human fMRI responses, we compare the rate of eigenspectrum decay between model and brain feature spaces. We observe that the match between model and brain representations is associated with a common balance between efficiency and robustness in the representational space. These results suggest that varying dropout may reveal an optimal point of balance between the efficiency of high-dimensional codes and the robustness of low dimensional codes in hierarchical vision systems.

POSTER-36: Leveraging Generative Models for Unsupervised Alignment of Neural Time Series Data

Primary Area: applications to neuroscience & cognitive science

Keywords: neural dynamics transfer learning distribution alignment neuroscience few-shot learning

Scores: [ 6 8 6 6 ]

Large scale inference models are widely used in neuroscience to extract latent representations from high-dimensional neural recordings. Due to the statistical heterogeneities between sessions and animals, a new model is trained from scratch to infer the underlying dynamics for each new dataset. This is computationally expensive and does not fully leverage all the available data. Moreover, as these models get more complex, they can be challenging to train. In parallel, it is becoming common to use pre-trained models in the machine learning community for few shot and transfer learning. One major hurdle that prevents the re-use of generative models in neuroscience is the complex spatio-temporal structure of neural dynamics within and across animals. Interestingly, the underlying dynamics identified from different datasets on the same task are qualitatively similar. In this work, we exploit this observation and propose a source-free and unsupervised alignment approach that utilizes the learnt dynamics and enables the re-use of trained generative models. We validate our approach on simulations and show the efficacy of the alignment on neural recordings from the motor cortex obtained during a reaching task.

POSTER-37: Maximally discriminative stimuli for functional cell type identification

Primary Area: applications to neuroscience & cognitive science

Keywords: clustering discriminative stimuli interpretable optimization expectation-maximization functional cell types digital twins feature visualization pre-image search maximally exciting image

Scores: [ 5 6 5 6 ]

Identifying cell types and understanding their functional properties is crucial for unraveling the mechanisms underlying perception and cognition. For example in the retina, functional types can be identified by a carefully selected and manually curated battery of stimuli. However, this requires expert domain knowledge and biases the procedure towards previously known cell types. In the visual cortex, it is still unknown what functional types exist and how to identify them. Thus, for unbiased identification of the functional cell types in retina and visual cortex, new approaches are needed. Here we propose an optimization-based clustering approach using deep predictive models to obtain functional clusters of neurons using Maximally Discriminative Stimuli (MDS). Our approach alternates between stimulus optimization with cluster reassignment akin to an expectation-maximization algorithm. The algorithm recovers functional clusters in mouse retina, marmoset retina and macaque visual area V4. This demonstrates that our approach can successfully find discriminative stimuli across species, stages of the visual system and recording techniques. Presenting maximally discriminative stimuli during data acquisition allows for on-the-fly assignment to functional cell types, and paves the way for experiments that were previously limited by experimental time. Crucially, MDS are interpretable: they visualize the distinctive stimulus patterns that most unambiguously identify a specific type of neuron. We will make our code avail- able online upon publication.

POSTER-38: Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings

Primary Area: applications to neuroscience & cognitive science

Keywords: Spiking Neural Networks Delays Neuromorphic Computing Speech Recognition

Scores: [ 6 8 8 6 ]

Spiking Neural Networks (SNNs) are a promising research direction for building power-efficient information processing systems, especially for temporal tasks such as speech recognition. In SNNs, delays refer to the time needed for one spike to travel from one neuron to another. These delays matter because they influence the spike arrival times, and it is well-known that spiking neurons respond more strongly to coincident input spikes. More formally, it has been shown theoretically that plastic delays greatly increase the expressivity in SNNs. Yet, efficient algorithms to learn these delays have been lacking. Here, we propose a new discrete-time algorithm that addresses this issue in deep feedforward SNNs using backpropagation, in an offline manner. To simulate delays between consecutive layers, we use 1D convolutions across time. The kernels contain only a few non-zero weights – one per synapse – whose positions correspond to the delays. These positions are learned together with the weights using the recently proposed Dilated Convolution with Learnable Spacings (DCLS). We evaluated our method on three datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC) and its non spiking version Google Speech Commands v0.02 (GSC) benchmarks, which require detecting temporal patterns. We used feedforward SNNs with two or three hidden fully connected layers, and vanilla leaky integrate-and-fire neurons. We showed that fixed random delays help and that learning them helps even more. Furthermore, our method outperformed the state-of-the-art in the three datasets without using recurrent connections and with substantially fewer parameters. Our work demonstrates the potential of delay learning in developing accurate and precise models for temporal data processing. Our code is based on PyTorch / SpikingJelly and available at: https://anonymous.4open.science/r/SNN-delays-6DD1/

POSTER-39: Brain decoding: toward real-time reconstruction of visual perception

Primary Area: applications to neuroscience & cognitive science

Keywords: brain decoding neuroimaging image generation visual perception

Scores: [ 6 8 8 6 ]

In the past five years, the use of generative and foundational AI systems has greatly improved the decoding of brain activity. Visual perception, in particular, can now be decoded from functional Magnetic Resonance Imaging (fMRI) with remarkable fidelity. This neuroimaging technique, however, suffers from a limited temporal resolution ($\approx$0.5,Hz) and thus fundamentally constrains its real-time usage. Here, we propose an alternative approach based on magnetoencephalography (MEG), a neuroimaging device capable of measuring brain activity with high temporal resolution ($\approx$5,000 Hz). For this, we develop an MEG decoding model trained with both contrastive and regression objectives and consisting of three modules: i) pretrained embeddings obtained from the image, ii) an MEG module trained end-to-end and iii) a pretrained image generator. Our results are threefold: Firstly, our MEG decoder shows a 7X improvement of image-retrieval over classic linear decoders. Second, late brain responses to images are best decoded with DINOv2, a recent foundational image model. Third, image retrievals and generations both suggest that MEG signals primarily contain high-level visual features, whereas the same approach applied to 7T fMRI also recovers low-level features. Overall, these results provide an important step towards the decoding - in real time - of the visual processes continuously unfolding within the human brain.

POSTER-40: Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips

Primary Area: applications to neuroscience & cognitive science

Keywords: Spiking neural networks Spiking transformer Transformer-based SNNs Neuromorphic chips Spike-driven self-attention

Scores: [ 5 6 6 ]

Neuromorphic computing, which exploits Spiking Neural Networks (SNNs) on neuromorphic chips, is a promising energy-efficient alternative to traditional AI. CNN-based SNNs are the current mainstream of neuromorphic computing. By contrast, no neuromorphic chips are designed especially for Transformer-based SNNs, which have just emerged, and their performance is only on par with CNN-based SNNs, offering no distinct advantage. In this work, we propose a general Transformer-based SNN architecture, termed as Meta-SpikeFormer", whose goals are: (1) Lower-power, supports the spike-driven paradigm that there is only sparse addition in the network; (2) Versatility, handles various vision tasks; (3) High-performance, shows overwhelming performance advantages over CNN-based SNNs; (4) Meta-architecture, provides inspiration for future next-generation Transformer-based neuromorphic chip designs. Specifically, we extend the Spike-driven Transformer in \citet{yao2023spike} into a meta architecture, and explore the impact of structure, spike-driven self-attention, and skip connection on its performance. On ImageNet-1K, Meta-SpikeFormer achieves 80.0% top-1 accuracy (55M), surpassing the current state-of-the-art (SOTA) SNN baselines (66M) by 3.7%. This is the first direct training SNN backbone that can simultaneously supports classification, detection, and segmentation, obtaining SOTA results in SNNs. Finally, we discuss the inspiration of the meta SNN architecture for neuromorphic chip design.

POSTER-41: Enhancing Small Medical Learners with Privacy-preserving Contextual Prompting

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: natural language processing large language model prompting application in healthcare privacy

Scores: [ 6 6 6 6 ]

Large language models (LLMs) demonstrate remarkable medical expertise, but data privacy concerns impede their direct use in healthcare environments. Although offering improved data privacy protection, domain-specific small language models (SLMs) often underperform LLMs, emphasizing the need for methods that reduce this performance gap while alleviating privacy concerns. In this paper, we present a simple yet effective method that harnesses LLMs' medical proficiency to boost SLM performance in medical tasks under privacyrestricted scenarios. Specifically, we mitigate patient privacy issues by extracting keywords from medical data and prompting the LLM to generate a medical knowledge-intensive context by simulating clinicians' thought processes. This context serves as additional input for SLMs, augmenting their decision-making capabilities. Our method significantly enhances performance in both few-shot and full training settings across three medical knowledge-intensive tasks, achieving up to a 22.57% increase in absolute accuracy compared to SLM fine-tuning without context, and sets new state-of-the-art results in two medical tasks within privacy-restricted scenarios. Further out-of-domain testing and experiments in two general domain datasets showcase its generalizability and broad applicability.

SPOTLIGHT-9: Long-Term Typhoon Trajectory Prediction: A Physics-Conditioned Approach Without Reanalysis Data

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Weather Forecasting Typhoon Trajectory Forecasting Tropical Cyclone Climate Change

Scores: [ 8 8 6 3 ]

In the face of escalating climate changes, typhoon intensities and their ensuing damage have surged. Accurate trajectory prediction is crucial for effective damage control. Traditional physics-based models, while comprehensive, are computationally intensive and rely heavily on the expertise of forecasters. Contemporary data-driven methods often rely on reanalysis data, which can be considered to be the closest to the true representation of weather conditions. However, reanalysis data is not produced in real-time and requires time for adjustment since prediction models are calibrated with observational data. This reanalysis data, such as ERA5, falls short in challenging real-world situations. Optimal preparedness necessitates predictions at least 72 hours in advance, beyond the capabilities of standard physics models. In response to these constraints, we present an approach that harnesses real-time Unified Model (UM) data, sidestepping the limitations of reanalysis data. Our model provides predictions at 6-hour intervals for up to 72 hours in advance and outperforms both state-of-the-art data-driven methods and numerical weather prediction models. In line with our efforts to mitigate adversities inflicted by typhoon, we release our preprocessed PHYSICS TRACK dataset, which includes ERA5 reanalysis data, typhoon best-track, and UM forecast data. We also submit our source code and pretrained models.

POSTER-42: Rigid Protein-Protein Docking via Equivariant Elliptic-Paraboloid Interface Prediction

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Equivariant Graph Neural Network rigid body protein-protein docking interface fitting

Scores: [ 5 8 8 3 ]

The study of rigid protein-protein docking plays an essential role in a variety of tasks such as drug design and protein engineering. Recently, several learning-based methods have been proposed for the task, exhibiting much faster docking speed than those computational methods. In this paper, we propose a novel learning-based method called ElliDock, which predicts an elliptic paraboloid to represent the protein-protein docking interface. To be specific, our model estimates elliptic paraboloid interfaces for the two input proteins respectively, and obtains the roto-translation transformation for docking by making two interfaces coincide. By its design, ElliDock is independently equivariant with respect to arbitrary rotations/translations of the proteins, which is an indispensable property to ensure the generalization of the docking process. Experimental evaluations show that ElliDock achieves the fastest inference time among all compared methods, and outperforms state-of-the-art learning-based methods, like DiffDock-PP and Alphafold-Multimer, for particularly antibody-antigen docking.

ORAL-2: Protein Discovery with Discrete Walk-Jump Sampling

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: generative modeling langevin mcmc energy-based models score-based models protein design protein discovery

Scores: [ 8 8 8 ]

We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.

SPOTLIGHT-10: Equivariant Matrix Function Neural Networks

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: equivariance graph neural networks long range

Scores: [ 6 8 5 ]

Graph Neural Networks (GNNs), especially message-passing neural networks (MPNNs), have emerged as powerful architectures for learning on graphs in diverse applications. However, MPNNs face challenges when modeling non-local interactions in systems such as large conjugated molecules, metals, or amorphous materials.Although Spectral GNNs and traditional neural networks such as recurrent neural networks and transformers mitigate these challenges, they often lack extensivity, adaptability, generalizability, computational efficiency, or fail to capture detailed structural relationships or symmetries in the data. To address these concerns, we introduce Matrix Function Neural Networks (MFNs), a novel architecture that parameterizes non-local interactions through analytic matrix equivariant functions. Employing resolvent expansions offers a straightforward implementation and the potential for linear scaling with system size.The MFN architecture achieves state-of-the-art performance in standard graph benchmarks, such as the ZINC and TU datasets, and is able to capture intricate non-local interactions in quantum systems. The code and the datasets will be made public.

POSTER-43: Conversational Drug Editing Using Retrieval and Domain Feedback

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Large Language Models prompt retrieval domain feedback conversation drug editing drug optimization controllable generation small molecule peptide protein

Scores: [ 6 6 6 ]

Recent advancements in conversational large language models (LLMs), such as ChatGPT, have demonstrated remarkable promise in various domains, including drug discovery. However, existing works mainly focus on investigating the capabilities of conversational LLMs on chemical reactions and retrosynthesis. While drug editing, a critical task in the drug discovery pipeline, remains largely unexplored. To bridge this gap, we propose ChatDrug, a framework to facilitate the systematic investigation of drug editing using LLMs. ChatDrug jointly leverages a prompt module, a retrieval and domain feedback module, and a conversation module to streamline effective drug editing. We empirically show that ChatDrug reaches the best performance on all 39 drug editing tasks, encompassing small molecules, peptides, and proteins. We further demonstrate, through 10 case studies, that ChatDrug can successfully identify the key substructures for manipulation, generating diverse and valid suggestions for drug editing. Promisingly, we also show that ChatDrug can offer insightful explanations from a domain-specific perspective, enhancing interpretability and enabling informed decision-making.

ORAL-3: ClimODE: Climate Forecasting With Physics-informed Neural ODEs

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: neural ODE time-series forecasting climate prediction physics-informed ML

Scores: [ 8 8 8 8 ]

Climate prediction traditionally relies on complex numerical simulations of atmospheric physics. Deep learning approaches, such as transformers, have recently challenged the simulation paradigm with complex network forecasts. However, they often act as data-driven black-box models that neglect the underlying physics and lack uncertainty quantification. We address these limitations with ClimODE, a spatiotemporal continuous-time process that implements a key principle of advection from statistical mechanics, namely, weather changes due to a spatial movement of quantities over time. ClimODE models precise weather evolution with value-conserving dynamics, learning global weather transport as a neural flow, which also enables estimating the uncertainty in predictions. Our approach outperforms existing data-driven methods in global and regional forecasting with an order of magnitude smaller parameterization, establishing a new state of the art.

POSTER-44: Robust Model-Based Optimization for Challenging Fitness Landscapes

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Model based optimization protein engineering

Scores: [ 6 8 5 6 ]

Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of “separation” in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://anonymous.4open.science/r/PPGVAE-F83E.

POSTER-45: Learning to design protein-protein interactions with enhanced generalization

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: protein-protein interactions protein design generalization self-supervised learning equivariant 3D representations

Scores: [ 3 6 6 8 6 ]

Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics. While machine learning approaches have substantially advanced the field, they often struggle to generalize beyond training data in practical scenarios. The contributions of this work are three-fold. First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage PPIRef to pre-train PPIformer, a new SE(3)-equivariant model, generalizing across diverse protein-binder variants. We fine-tune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on the new non-leaking splits of the standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing staphylokinase thrombolytic activity.

SPOTLIGHT-11: Compositional Generative Inverse Design

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: inverse design generative design PDE physical simulation compositional

Scores: [ 8 8 6 6 ]

Inverse design, where we seek to design input variables in order to optimize an underlying objective function, is an important problem that arises across fields such as mechanical engineering to aerospace engineering. Inverse design is typically formulated as an optimization problem, with recent works leveraging optimization across learned dynamics models. However, as models are optimized they tend to fall into adversarial modes, preventing effective sampling. We illustrate that by instead optimizing over the learned energy function captured by the diffusion model, we can avoid such adversarial examples and significantly improve design performance. We further illustrate how such a design system is compositional, enabling us to combine multiple different diffusion models representing subcomponents of our desired system to design systems with every specified component. In an N-body interaction task and a challenging 2D multi-airfoil design task, we demonstrate that by composing the learned diffusion model at test time, our method allows us to design initial states and boundary shapes that are more complex than those in the training data. Our method outperforms state-of-the-art neural inverse design method by an average of 41.5% in prediction MAE and 14.3% in design objective for the N-body dataset and discovers formation flying to minimize drag in the multi-airfoil design task.

POSTER-46: Scalable Diffusion for Materials Generation

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Generative models for materials diffusion models density function theory

Scores: [ 6 8 5 6 ]

​​​​Generative models trained on internet-scale data are capable of generating novel and realistic texts, images, and videos. A natural next question is whether these models can advance science, for example by generating novel stable materials. Traditionally, models with explicit structures (e.g., graphs) have been used in modeling structural relationships in scientific data (e.g., atoms and bonds in crystals), but generating structures can be difficult to scale to large and complex systems. Another challenge in generating materials is the mismatch between standard generative modeling metrics and downstream applications. For instance, common metrics such as the reconstruction error do not correlate well with the downstream goal of discovering novel stable materials. In this work, we tackle the scalability challenge by developing a unified crystal representation that can represent any crystal structure (UniMat), followed by training a diffusion probabilistic model on these UniMat representations. Our empirical results suggest that despite the lack of explicit structure modeling, UniMat can generate high fidelity crystal structures from larger and more complex chemical systems, outperforming previous graph-based approaches under various generative modeling metrics. To better connect the generation quality of materials to downstream applications, such as discovering novel stable materials, we propose additional metrics for evaluating generative models of materials, including per-composition formation energy and stability with respect to convex hulls through decomposition energy from Density Function Theory (DFT). Lastly, we show that conditional generation with UniMat can scale to previously established crystal datasets with up to millions of crystals structures, outperforming random structure search (the current leading method for structure discovery) in discovering new stable materials.

SPOTLIGHT-12: Q-TAPE: A Task-Agnostic Pre-Trained Approach for Quantum Properties Estimation

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: quantum properties estimation pre-training fine-tuning

Scores: [ 8 8 8 8 ]

Properties estimation for quantum systems is crucial for addressing quantum many-body problems in physics and chemistry. Recently, task-specific deep learning models have exhibited an enhanced capacity to estimate the properties, surpassing the performance of conventional statistical approaches. However, with rapid escalation of quantum computers, existing learning-based models fall short in learning from explosion of quantum data generated by the systems under different physical conditions. Inspired by the triumphs of Large Language Models in Natural Language Processing and Computer Vision, we introduce Q-TAPE, a task-agnostic pre-trained model that 1) facilitates learning of the rich information from diverse quantum systems with different physical conditions in a fully unsupervised fashion; 2) delivers high performance with limited training data, mitigating the cost for quantum data collection and reducing the time for convergence for different supervised tasks. Extensive experiments demonstrate the promising efficacy of Q-TAPE in various tasks including classifying quantum phases of matter on Rydberg atom model and predicting two-body correlation function on anisotropic Heisenberg model. Source code will be made publicly available.

POSTER-47: Fine-Tuned Language Models Generate Stable Inorganic Materials as Text

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: generative model large language model stable materials AI for science

Scores: [ 6 5 6 ]

POSTER-48: Encoding Unitig-level Assembly Graphs with Heterophilous Constraints for Metagenomic Contigs Binning

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Metagenomics Binning Computational Genomics Graph Neural Networks

Scores: [ 5 3 5 ]

Metagenomics studies genomic material derived from mixed microbial communities in diverse environments, holding considerable significance for both human health and environmental sustainability. Metagenomic binning refers to the clustering of genomic subsequences obtained from high-throughput DNA sequencing into distinct bins, each representing a constituent organism within the community. Mainstream binning methods primarily rely on sequence features such as composition and abundance, making them unable to effectively handle sequences shorter than 1,000 bp and inherent noise within sequences. Several binning tools have emerged, aiming to enhance binning outcomes by using the assembly graph generated by assemblers, which encodes valuable overlapping information among genomic sequences. However, existing assembly graph-based binners mainly focus on simplified contig-level assembly graphs that are recreated from assembler’s original graphs, unitig-level assembly graphs. The simplification reduces the resolution of the connectivity information in original graphs. In this paper, we design a novel binning tool named UnitigBin, which leverages representation learning on unitig-level assembly graphs while adhering to heterophilious constraints imposed by single-copy marker genes, ensuring that constrained contigs cannot be grouped together. Extensive experiments conducted on synthetic and real datasets demonstrate that UnitigBin significantly surpasses state-of-the-art binning tools.

POSTER-49: Scaling physics-informed hard constraints with mixture-of-experts

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Physics-Informed Machine Learning PDEs differentiable optimization neural networks mixture of experts constrained optimization

Scores: [ 6 6 6 6 6 ]

Imposing known physical constraints, such as conservation laws, during the training of neural networks introduces an inductive bias that can improve accuracy, convergence, and data efficiency for physical problems. While such constraints can be softly imposed via loss function penalties, recent advancements in differentiable physics and optimization improve accuracy by integrating constrained optimization within neural network training, enabling a stricter adherence to physical constraints. However, imposing hard constraints significantly increases computational and memory costs, especially for complex dynamical systems, as it requires solving the differentiable optimization problem over a large number of points in the spatiotemporal domain. To address this challenge, we develop a scalable approach to enforce hard physical constraints using Mixture-of-Experts (MoE). Our approach imposes the constraint over smaller decomposed domains, each of which is solved by an expert'' through differentiable optimization. During training, each expert independently performs a localized backpropagation step by leveraging the implicit function theorem; the independence of each expert allows for parallelization across multiple GPUs. Compared to standard differentiable optimization, our scalable approach achieves greater accuracy on challenging non-linear systems, concurrently improving training stability and requiring significantly less computation time during both training and inference stages.

POSTER-50: GraphCare: Enhancing Healthcare Predictions with Personalized Knowledge Graphs

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: EHR Prediction Personalized Knowledge Graph Graph Neural Network

Scores: [ 6 6 6 ]

Clinical predictive models often rely on patients’ electronic health records (EHR), but integrating medical knowledge to enhance predictions and decision-making is challenging. This is because personalized predictions require personalized knowledgegraphs (KGs), which are difficult to generate from patient EHR data. To address this, we propose GraphCare, an open-world framework that uses external KGs to improve EHR-based predictions. Our method extracts knowledge from large language models (LLMs) and external biomedical KGs to build patient-specific KGs, which are then used to train our proposed Bi-attention AugmenTed(BAT) graph neural network (GNN) for healthcare predictions. On two public datasets, MIMIC-III and MIMIC-IV, GraphCare surpasses baselines in four vital healthcare prediction tasks: mortality, readmission, length of stay (LOS), and drug recommendation. On MIMIC-III, it boosts AUROC by 17.6% and 6.6% for mortality and readmission, and F1-score by 7.9% and 10.8% for LOS and drug recommendation, respectively. Notably, GraphCare demonstrates a substantial edge in scenarios with limited data availability. Our findings highlight the potential of using external KGs in healthcare prediction tasks and demonstrate the promise of GraphCare in generating personalized KGs for promoting personalized medicine.

POSTER-51: Energy-conserving equivariant GNN for elasticity of lattice architected metamaterials

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: mechanical metamaterials lattices elasticity GNN equivariant positive definite energy conservation

Scores: [ 6 5 8 5 ]

Lattices are architected metamaterials whose properties strongly depend on their geometrical design.The analogy between lattices and graphs enables the use of graph neural networks (GNNs) as a faster surrogate model compared to traditional methods such as finite element modelling.In this work we present a higher-order GNN model trained to predict the fourth-order stiffness tensor of periodic strut-based lattices.The key features of the model are (i) SE(3) equivariance, and (ii) consistency with the thermodynamic law of conservation of energy.We compare the model to non-equivariant models based on a number of error metrics and demonstrate the benefits of the encoded equivariance and energy conservation in terms of predictive performance and reduced training requirements.

POSTER-52: Evaluating Representation Learning on the Protein Structure Universe

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Protein Representation Learning Protein Structure

Scores: [ 8 6 5 6 ]

Protein structure representation learning is the foundation for promising applications in drug discovery, protein design, and protein function prediction. However, there remains a need for a robust, standardised benchmark to track the progress of new and established methods with greater granularity and relevance to downstream applications. In this work, we introduce a comprehensive and open benchmark suite for evaluating protein structure representation learning methods.We provide several pre-training methods, downstream tasks and pre-training corpora comprised of both experimental and predicted structures, offering a balanced challenge to representation learning algorithms. These tasks enable the systematic evaluation of the quality of the learned embeddings, the structural and functional relationships captured, and their usefulness in downstream tasks. We benchmark state-of-the-art protein-specific and generic geometric Graph Neural Networks and the extent to which they benefit from different types of pre-training. We find that pre-training consistently improves the performance of both rotation-invariant and equivariant models, and that equivariant models seem to benefit even more from pre-training compared to invariant models.We aim to establish a common ground for the machine learning and computational biology communities to collaborate, compare, and advance protein structure representation learning. By providing a standardised and rigorous evaluation platform, we expect to accelerate the development of novel methodologies and improve our understanding of protein structures and their functions. The codebase incorporates several engineering contributions which considerably reduces the barrier to entry for pre-training and working with large structure-based datasets. Our benchmark is available at: https://anonymous.4open.science/r/ProteinWorkshop-B8F5/

SPOTLIGHT-13: CAMIL: Context-Aware Multiple Instance Learning for Cancer Detection and Subtyping in Whole Slide Images

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Multiple Instance Learning Histopathology Nearest Neighbors Graph Representation

Scores: [ 8 6 8 8 ]

The visual examination of tissue biopsy sections is fundamental for cancer diagnosis, with pathologists analyzing sections at multiple magnifications to discern tumor cells and their subtypes. However, existing attention-based multiple instance learning (MIL) models, used for analyzing Whole Slide Images (WSIs) in cancer diagnostics, often overlook the contextual information of tumor and neighboring tiles, leading to misclassifications. To address this, we propose the Context-Aware Multiple Instance Learning (CAMIL) architecture. CAMIL incorporates neighbor-constrained attention to consider dependencies among tiles within a WSI and integrates contextual constraints as prior knowledge into the MIL model. We evaluated CAMIL on subtyping non-small cell lung cancer (TCGA-NSCLC) and detecting lymph node (CAMELYON16) metastasis, achieving test AUCs of 0.959% and 0.975%, respectively, outperforming other state-of-the-art methods. Additionally, CAMIL enhances model interpretability by identifying regions of high diagnostic value

POSTER-53: Improving protein optimization with smoothed fitness landscapes

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: protein design discrete optimization protein engineering markov chain monte carlo graph signal processing

Scores: [ 6 6 6 3 ]

The ability to engineer novel proteins with higher fitness on a given task would be revolutionary for biotechnology and medicine. Modeling the combinatorially large space of sequences is infeasible; prior methods often constrain optimization to a small mutational radius, but this drastically limits the design space. Instead of heuristics, we propose smoothing the fitness landscape to facilitate protein optimization. First, we formulate protein fitness as a graph signal then use Tikunov regularization to smooth the fitness landscape. We find optimizing in this smoothed landscape leads to improved performance across multiple methods in the GFP and AAV benchmarks. Second, we achieve state-of-the-art results utilizing discrete energy-based models and MCMC in the smoothed landscape. Our method, called Gibbs sampling with Graph-based Smoothing (GGS), demonstrates a unique ability to achieve 2.5 fold fitness improvement (using an in-silico oracle) over its training set. GGS demonstrates potential to optimize proteins in the limited data regime. Code is included as part of the supplementary material.

POSTER-54: Long-Short-Range Message-Passing: A Fragmentation-Based Framework to Capture Non-Local Atomistic Interactions

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Molecular Modeling Quantum Chemistry Fragmentation Non-Local Interactions

Scores: [ 8 5 6 6 ]

Computational simulation of chemical and biological systems using ab initio molecular dynamics has been a challenge over decades. Researchers have attempted to address the problem with machine learning and fragmentation-based methods. However, the two approaches fail to give a satisfactory description of long-range and many-body interactions, respectively. Inspired by fragmentation-based methods, we propose the Long-Short-Range Message-Passing (LSR-MP) framework as a generalization of the existing equivariant graph neural networks (EGNNs) with the intent to incorporate long-range interactions efficiently and effectively. We apply the LSR-MP framework to the recently proposed ViSNet and demonstrate the state-of-the-art results with up to 40% MAE reduction for molecules in MD22 and Chignolin datasets. Consistent improvements to various EGNNs will also be discussed to illustrate the general applicability and robustness of our LSR-MP framework. The code for our experiments and trained model weights could be found at https://anonymous.4open.science/r/LSRM-760E/.

POSTER-55: Curriculum reinforcement learning for quantum architecture search under hardware errors

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Quantum Computing Reinforcement Learning Quantum Chemistry Quantum Architecture Search Optimization

Scores: [ 5 5 6 6 6 ]

The key challenge in the noisy intermediate-scale quantum era is finding useful circuits compatible with current device limitations. Variational quantum algorithms (VQAs) offer a solution where a circuit architecture is first fixed, and then the individual gate parameters are optimized in an external loop to solve a task. However, the performance optimization can be intractable, and the overall performance, as well as the optimization, highly depends on the initially fixed circuit's architecture. Several quantum architecture search (QAS) algorithms have been developed to automatically select the best circuit architecture. In the case of parameter optimization, it has been observed that noise effects dramatically influence the optimizer performance and final outcomes, and this is a key line of study. However, the effects of noise on the architecture search, which could be just as critical, are poorly understood. In this work, we tackle this issue. To do so, we first significantly improve the computational time to simulate realistic quantum circuits by employing pauli transfer matrix formalism in the Pauli-Liouville basis by fusing gates with their respective noise models and values. Then, we devise a curriculum-based reinforcement learning QAS (CRLQAS) algorithm optimized to tackle the challenges of realistic VQA deployment by introducing (i) a 3-D architecture encoding and restrictions on environment dynamics to explore the search space of possible circuits efficiently, (ii) an episode halting scheme to steer the agent to find shorter circuits, and (iii) a novel variant of simultaneous perturbation stochastic approximation algorithm as an optimizer for faster convergence. Numerical experiments focusing on quantum chemistry tasks demonstrate that CRLQAS outperforms existing QAS algorithms across noiseless and noisy environments.

POSTER-56: Neural Probabilistic Protein-Protein Docking via a Differentiable Energy Model

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: protein-protein docking energy-based model geometric deep learning energy-function

Scores: [ 6 8 3 6 ]

Protein complex formation, a pivotal challenge in contemporary biology, has recently gained interest from the machine learning community, particularly concerning protein-ligand docking tasks. In this paper, we delve into the equally crucial but comparatively under-investigated domain of protein-protein docking. Specifically, we propose a geometric deep learning framework, termed EBMDock, which employs statistical potential as its energy function. This approach produces a probability distribution over docking poses, such that the identified docking pose aligns with a minimum point in the energy landscape. We employ a differential algorithm grounded in Langevin dynamics to efficiently sample from the docking pose distribution. Additionally, we incorporate energy-based training using contrastive divergence, enhancing both performance and stability. Empirical results demonstrate that our approach achieves superior performance on two benchmark datasets DIPS and DB5.5. Furthermore, the results suggest EBMDock can serve as an orthogonal enhancement to existing methods. Source code will be released.

POSTER-57: Solving High Frequency and Multi-Scale PDEs with Gaussian Processes

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: ML PDE solver gaussian process PINN

Scores: [ 6 5 6 6 ]

Machine learning based solvers have garnered much attention in physical simulation and scientific computing, with a prominent example, physics-informed neural networks (PINNs). However, PINNs often struggle to solve high-frequency and multi-scale PDEs, which can be due to the spectral bias during neural network training. To address this problem, we resort to the Gaussian process (GP) framework. To flexibly capture the dominant frequencies, we model the power spectrum of the PDE solution with a student t mixture or Gaussian mixture. We then apply inverse Fourier transform to obtain the covariance function (according to the Wiener-Khinchin theorem). The covariance derived from the Gaussian mixture spectrum corresponds to the known spectral mixture kernel. We are the first to discover its rationale and effectiveness for PDE solving. Next, we estimate the mixture weights in the log domain, which we show is equivalent to placing a Jeffreys prior. It automatically induces sparsity, prunes excessive frequencies, and adjusts the remaining toward the ground truth. Third, to enable efficient and scalable computation on massive collocation points, which are critical to capture high frequencies, we place the collocation points on a grid, and multiply our covariance function at each input dimension. We use the GP conditional mean to predict the solution and its derivatives so as to fit the boundary condition and the equation itself. As a result, we can derive a Kronecker product structure in the covariance matrix. We use Kronecker product properties and multilinear algebra to greatly promote computational efficiency and scalability, without any low-rank approximations. We show the advantage of our method in systematic experiments.

POSTER-58: Graph Transformers on EHRs: Better Representation Improves Downstream Performance

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: transformers graph neural networks electronic health records

Scores: [ 5 8 5 6 ]

Following the success of transformer-based methods across various machine learning applications, their adoption to healthcare predictive tasks using electronic health records (EHR) has also expanded extensively. Similarly, graph-based methods have been shown to be very effective in capturing inherent graph-type relationships in EHRs, leading to improved downstream performance. Although integrating these two families of approaches seems like a natural next step, in practice, creating such a design is challenging and has not been done. This is partly due to known EHR problems, such as high sparsity, making extracting meaningful temporal representations of medical visits challenging. In this study, we propose GT-BEHRT, a new approach that leverages temporal visit embeddings extracted from a graph transformer and uses a BERT-based model to obtain more robust patient representations, especially on longer EHR sequences. The graph-based approach allows GT-BEHRT to implicitly capture the intrinsic graphical relationships between medical observations, while the BERT model extracts the temporal relationships between visits, loosely mimicking the clinicians' decision-making process. As part of our method, we also present a two-step pre-training strategy for learning better graphical and temporal representations. Our proposed method achieves state-of-the-art performance in a variety of standard medical predictive tasks, demonstrating the versatility of our approach.

POSTER-59: Large-scale training of foundation models for wearable biosignals

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Self-supervised learning Representation learning Foundation models Biosignals Wearable devices Health Photoplethysmography PPG Electrocardiogram ECG

Scores: [ 8 5 8 6 ]

POSTER-60: DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: DNA Genome Language Model Foundation Model Benchmark

Scores: [ 6 8 6 6 ]

Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundational models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mer, fixed-length permutations of A, T, C, and G, as the token of the genome language due to its simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segment in the corpus. We demonstrate that BPE not only overcomes the limitations of k-mer tokenization but also benefits from the computational efficiency of non-overlapping tokenization.Based on these insights, we introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints, reduce time and memory expenditure, and enhance model capability. Furthermore, we identify the absence of a comprehensive and standardized benchmark for genome understanding as another significant impediment to fair comparative analysis. In response, we propose the Genome Understanding Evaluation (GUE), a comprehensive multi-species genome classification dataset that amalgamates 28 distinct datasets across 7 tasks, with input lengths ranging from 70 to 1000. Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves comparable performance to the state-of-the-art model with 21× fewer parameters and approximately 92× less GPU time in pre-training. Compared to DNABERT, while being 3× more efficient, DNABERT-2 outperforms it on 23 out of 28 datasets, with an average improvement of 6 absolute scores on GUE.The code, data, and pre-trained model will be publicly available.

POSTER-61: Multimodal Molecular Pretraining via Modality Blending

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: multimodal molecular pretraining molecular representation learning modality blending

Scores: [ 6 5 8 5 ]

Self-supervised learning has recently gained growing interest in molecular modeling for scientific tasks such as AI-assisted drug discovery. Current studies consider leveraging both 2D and 3D molecular structures for representation learning. However, relying on straightforward alignment strategies that treat each modality separately, these methods fail to exploit the intrinsic correlation between 2D and 3D representations that reflect the underlying structural characteristics of molecules, and only perform coarse-grained molecule-level alignment. To derive fine-grained alignment and promote structural molecule understanding, we introduce an atomic-relation level "blend-then-predict" self-supervised learning approach, MoleBLEND, which first blends atom relations represented by different modalities into one unified relation matrix for joint encoding, then recovers modality-specific information for 2D and 3D structures individually. By treating atom relationships as anchors, MoleBLEND organically aligns and integrates visually dissimilar 2D and 3D modalities of the same molecule at fine-grained atomic level, painting a more comprehensive depiction of each molecule. Extensive experiments show that MoleBLEND achieves state-of-the-art performance across major 2D/3D molecular benchmarks. We further provide theoretical insights from the perspective of mutual-information maximization, demonstrating that our method unifies contrastive, generative (cross-modality prediction) and mask-then-predict (single-modality prediction) objectives into one single cohesive framework.

POSTER-62: Searching for High-Value Molecules Using Reinforcement Learning and Transformers

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: chemistry reinforcement learning language models

Scores: [ 8 6 8 6 ]

Reinforcement learning (RL) over text representations can be effective for finding high-value policies that can search over graphs. However, RL requires careful structuring of the search space and algorithm design to be effective in this challenge. Through extensive experiments, we explore how different design choices for text grammar and algorithmic choices for training can affect an RL policy's ability to generate molecules with desired properties. We arrive at a new RL-based molecular design algorithm (ChemRLformer) and perform a thorough analysis using 25 molecule design tasks, including computationally complex protein docking simulations. From this analysis, we discover unique insights in this problem space and show that ChemRLformer achieves state-of-the-art performance while being more straightforward than prior work by demystifying which design choices are actually helpful for text-based molecule design.

POSTER-63: Interpretable Meta-Learning of Physical Systems

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: meta-learning physical systems multi-task learning interpretable deep learning identifiability electrostatics robotics control reinforcement learning scientific discovery

Scores: [ 5 6 8 ]

Machine learning methods can be a valuable aid in the scientific process, but they need to face challenging settings where data come from inhomogeneous experimental conditions. Recent meta-learning methods have made significant progress in multi-task learning, but they rely on black-box neural networks, resulting in high computational costs and limited interpretability. We introduce CAMEL, a new meta-learning architecture capable of learning efficiently from multiple environments, with an affine structure with respect to the learning task. We prove that CAMEL can identify the physical parameters of the system, enabling interpreable learning. We demonstrate the competitive generalization performance and the low computational cost of our method by comparing it to state-of-the-art algorithms on physical systems, ranging from toy models to complex, non-analytical systems. The interpretability of our method is illustrated with original applications to physical-parameter-induced adaptation and to adaptive control and system identification.

POSTER-64: KW-Design: Pushing the Limit of Protein Deign via Knowledge Refinement

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Protein Deign Graph Finetuning

Scores: [ 6 6 6 ]

POSTER-65: EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: equivariant neural networks graph neural networks computational physics transformer networks

Scores: [ 6 6 6 ]

Equivariant Transformers such as Equiformer have demonstrated the efficacy of applying Transformers to the domain of 3D atomistic systems. However, they are limited to small degrees of equivariant representations due to their computational complexity. In this paper, we investigate whether these architectures can scale well to higher degrees. Starting from Equiformer, we first replace SO(3) convolutionswith eSCN convolutions to efficiently incorporate higher-degree tensors. Then, to better leverage the power of higher degrees, we propose three architectural improvements – attention re-normalization, separable S2 activation and separable layer normalization. Putting these all together, we propose EquiformerV2, which outperforms previous state-of-the-art methods on large-scale OC20 dataset by up to 9% on forces, 4% on energies, offers better speed-accuracy trade-offs, and 2$\times$ reduction in DFT calculations needed for computing adsorption energies. Additionally, EquiformerV2 trained on only OC22 dataset outperforms GemNet-OC trained on both OC20 and OC22 datasets, achieving much better data efficiency. Finally, we compare EquiformerV2 with Equiformer on QM9 and OC20 S2EF-2Mdatasets to better understand the performance gain brought by higher degrees.

POSTER-66: Continuous Field Reconstruction from Sparse Observations with Implicit Neural Networks

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: implicit neural representations field reconstruction sparse observation

Scores: [ 6 6 6 6 ]

Reliably reconstructing physical fields from sparse sensor data is a challenge that frequenty arises in many scientific domains. In practice, the process generating the data is often not known to sufficient accuracy. Therefore, there is a growing interest in the deep neural network route to the problem. In this work, we present a novel approach that learns a continuous representation of the field using implicit neural representations (INR). Specifically, after factorizing spatiotemporal variability into spatial and temporal components using the technique of separation of variables, the method learns relevant basis functions from sparsely sampled irregular data points to thus develop a continuous representation of the data. In experimental evaluations, the proposed model outperforms recent INR methods, offering superior reconstruction quality on simulation data from a state of the art climate model and on a second dataset that comprises of ultra-high resolution satellite-based sea surface temperature field.

POSTER-67: Space Group Constrained Crystal Generation

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: material generation spacegroup diffusion generative models

Scores: [ 6 8 8 ]

POSTER-68: Dynamics-Informed Protein Design with Structure Conditioning

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Diffusion Models Generative Modeling Protein Design Normal Mode Analysis

Scores: [ 5 5 6 6 ]

Current protein generative models are able to design novel backbones with desired shapes or functional motifs. However, despite the importance of a protein’s dynamical properties for its function, conditioning on dynamical properties remains elusive. We present a new approach to protein generative modeling by leveraging Normal Mode Analysis that enables us to capture dynamical properties too. We introduce a method for conditioning the diffusion probabilistic models on protein dynamics, specifically on the lowest non-trivial normal mode of oscillation. Our method, similar to the classifier guidance conditioning, formulates the sampling process as being driven by conditional and unconditional terms. However, unlike previous works, we approximate the conditional term with a simple analytical function rather than an external neural network, thus making the eigenvector calculations approachable. We present the corresponding SDE theory as a formal justification of our approach. We extend our framework to conditioning on structure and dynamics at the same time, enabling scaffolding of the dynamical motifs. We demonstrate the empirical effectiveness of our method by turning the open-source unconditional protein diffusion model Genie into the conditional model with no retraining. Generated proteins exhibit the desired dynamical and structural properties while still being biologically plausible. Our work represents a first step towards incorporating dynamical behaviour in protein design and may open the door to designing more flexible and functional proteins in the future.

POSTER-69: BioBridge: Bridging Biomedical Foundation Models via Knowledge Graph

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: drug discovery foundation model multi-modal learning knowledge graph

Scores: [ 8 6 6 8 ]

Foundation models (FMs) are able to leverage large volumes of unlabeled data to demonstrate superior performance across a wide range of tasks. However, FMs developed for biomedical domains have largely remained unimodal, i.e., independently trained and used for tasks on protein sequences alone, small molecule structures alone, or clinical data alone.To overcome this limitation of biomedical FMs, we present BioBridge, a novel parameter-efficient learning framework, to bridge independently trained unimodal FMs to establish multimodal behavior. BioBridge achieves it by utilizing Knowledge Graphs (KG) to learn transformations between one unimodal FM and another without fine-tuning any underlying unimodal FMs.Our empirical results demonstrate that BioBridge canbeat the best baseline KG embedding methods (on average by 76.3%) in cross-modal retrieval tasks. We also identify BioBridge demonstrates out-of-domain generalization ability by extrapolating to unseen modalities or relations. Additionally, we also show that BioBridge presents itself as a general-purpose retriever that can aid biomedical multimodal question answering as well as enhance the guided generation of novel drugs.

SPOTLIGHT-14: MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Bioinformatics Protein-Protein Interaction Protein Sequence-Structure Co-Modeling

Scores: [ 8 6 3 ]

Protein-Protein Interactions (PPIs) are fundamental in various biological processes and play a key role in life activities. The growing demand and cost of experimental PPI assays require computational methods for efficient PPI prediction. While existing methods rely heavily on protein sequence for PPI prediction, it is the protein structure that is the key to determine the interactions. However, the complexity of protein structure modeling hinders its application to large-scale PPI prediction. To address this challenge and take both protein modalities into account, we define the microenvironment of an amino acid residue by its sequence and structural contexts, which describe the surrounding chemical properties and geometric features. In addition, microenvironments defined in previous work are largely based on experimentally assayed physicochemical properties, for which the "vocabulary" is usually extremely small. This makes it difficult to cover the diversity and complexity of microenvironments. In this paper, we propose Microenvironment-Aware Protein Embedding for PPI prediction (MPAE-PPI), which encodes microenvironments into chemically meaningful discrete codes via a sufficiently large microenvironment vocabulary" (i.e., codebook). Moreover, we propose a novel pre-training strategy, namely Masked Codebook Modeling (MCM), to capture the dependencies between different microenvironments by randomly masking the codebook and reconstructing the input. With the learned microenvironment codebook, we can reuse it as an off-the-shelf tool to efficiently and effectively encode proteins of different sizes and functions for large-scale PPI prediction. Extensive experiments show that MAPE-PPI can scale to PPI prediction with millions of PPIs with superior trade-offs between effectiveness and computational efficiency than the state-of-the-art competitors.

POSTER-70: Better Neural PDE Solvers Through Data-Free Mesh Movers

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: neural PDE solvers adaptive moving mesh neural operators Monge-Ampère equation

Scores: [ 6 6 6 ]

Recently, neural networks have been extensively employed to solve partial differential equations (PDEs) in physical system modeling. While major studies focus on learning system evolution on predefined static mesh discretizations, some methods utilize reinforcement learning or supervised learning techniques to create adaptive and dynamic meshes, due to the dynamic nature of these systems. However, these approaches face two primary challenges: (1) the need for expensive optimal mesh data, and (2) the change of the solution space's degree of freedom and topology during mesh refinement. To address these challenges, this paper proposes a neural PDE solver with a neural mesh adapter. To begin with, we introduce a novel data-free neural mesh adaptor, called Data-free Mesh Mover (DMM), with two main innovations. Firstly, it is an operator that maps the solution to adaptive meshes and is trained using the Monge-Ampère equation without optimal mesh data. Besides, it dynamically changes the mesh by moving existing nodes rather than adding or deleting nodes and edges. Theoretical analysis shows that meshes generated by DMM have the lowest interpolation error bound. Based on DMM, to efficiently and accurately model dynamic systems, we develop a moving mesh based neural PDE solver (MM-PDE) that embeds the moving mesh with a two-branch architecture and a learnable interpolation framework to preserve information within the data. Empirical experiments demonstrate that our method generates suitable meshes and considerably enhances accuracy when modeling widely considered PDE systems.

POSTER-71: PhyloGFN: Phylogenetic inference with generative flow networks

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Phylogenetic Inference GFlowNets Bayesian Inference Deep Generative Modeling

Scores: [ 6 8 8 5 ]

Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods.

POSTER-72: Active Retrosynthetic Planning Aware of Route Quality

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Retrosynthetic planning route evaluation reinforcement learning

Scores: [ 6 8 6 6 ]

Retrosynthetic planning is a sequential decision-making process of identifying synthetic routes from the available building block materials to reach a desired target molecule.Though existing planning approaches show promisingly high solving rates and low costs, the trivial route cost evaluation via pre-trained forward reaction prediction models certainly falls short of real-world chemical practice.An alternative option is to annotate the actual cost of a route, such as yield, through chemical experiments or input from chemists, while this often leads to substantial query costs.In order to strike the balance between query costs and route quality evaluation, we propose an Active Retrosynthetic Planning (ARP) framework that remains compatible with the established retrosynthetic planners.On one hand, the proposed ARP trains an actor that decides whether to query the cost of a reaction; on the other hand, it resorts to a critic to estimate the value of a molecule with its preceding reaction cost as input. Those molecules with low reaction costs are preferred to expand first.We apply our framework to different existing approaches on both the benchmark and an expert dataset and demonstrate that it outperforms the existing state-of-the-art approach by 6.2% in route quality while reducing the query cost by 12.8%.In addition, ARP consistently plans high-quality routes with either abundant or sparse annotations.

POSTER-73: Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Materials Science Attention Networks Transformer Physics-Informed ML

Scores: [ 5 8 8 8 ]

Predicting physical properties of materials from their crystal structures is a fundamental problem in materials science. In peripheral areas such as the prediction of molecular properties, fully connected attention networks have been shown to be successful. However, unlike these finite atom arrangements, crystal structures are infinitely repeating, periodic arrangements of atoms, whose fully connected attention results in infinitely connected attention. In this work, we show that this infinitely connected attention can lead to a computationally tractable and physically interpretable formulation. We then propose a simple yet effective transformer-based encoder architecture for crystal structures called Crystalformer. Compared with an existing transformer-based model, the proposed model requires only 38% of number of parameters per attention block. Despite the architectural simplicity, the proposed method outperforms state-of-the-art methods for various property regression tasks on the Materials Project and JARVIS-DFT datasets.

POSTER-74: Retro-fallback: retrosynthetic planning in an uncertain world

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Retrosynthesis planning chemistry search

Scores: [ 6 6 6 ]

SPOTLIGHT-15: Learning the greatest common divisor: explaining transformer predictions

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: mathematics arithmetic transformers explainability

Scores: [ 5 6 8 5 ]

We train small transformers to calculate the greatest common divisor (GCD) of two positive integers, and show that their predictions are fully explainable. During training, models learn a list D of divisors, and predict the largest element of D that divides both inputs. We also show that training distributions have a large impact on performance. Models trained from uniform operands only learn a handful of GCD (up to 38 out of 100). Training from log-uniform operands boosts performance to 73 correct GCD, and balancing the distribution of GCD, from inverse square to log-uniform, to 91. On the other hand, a uniform distribution of GCD in the training set breaks model explainability.

POSTER-75: Stabilizing Backpropagation Through Time to Learn Complex Physics

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: backpropagation through time unrolling differentiable physics differentiable simulators optimization

Scores: [ 3 8 8 8 ]

Of all the vector fields surrounding the minima of recurrent learning setups, the gradient field with its exploding and vanishing updates appears a poor choice for optimization, offering little beyond efficient computability. We seek to improve this suboptimal practice in the context of physics simulations, where backpropagating feedback through many unrolled time steps is considered crucial to acquiring temporally coherent behavior. The alternative vector field we propose follows from two principles: physics simulators, unlike neural networks, have a balanced gradient flow and certain modifications to the backpropagation pass leave the positions of the original minima unchanged. As any modification of backpropagation decouples forward and backward pass, the rotation-free character of the gradient field is lost. Therefore, we discuss the negative implications of using such a rotational vector field for optimization and how to counteract them. Our final procedure is easily implementable via a sequence of gradient stopping and component-wise comparison operations, which do not negatively affect scalability. Our experiments on three control problems show that especially as we increase the complexity of each task, the unbalanced updates from the gradient can no longer provide the precise control signals necessary while our method still solves the tasks.

POSTER-76: From Zero to Turbulence: Generative Modeling for 3D Flow Simulation

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: pde generative diffusion navier-stokes cfd

Scores: [ 8 8 5 6 ]

Simulations of turbulent flows in 3D are one of the most expensive simulations in computational fluid dynamics (CFD). Many works have been written on surrogate models to replace numerical solvers for fluid flows with faster, learned, autoregressive models. However, the intricacies of turbulence in three dimensions necessitate training these models with very small time steps, while generating realistic flow states requires either long roll-outs with many steps and significant error accumulation or starting from a known, realistic flow state—something we aimed to avoid in the first place. Instead, we propose to approach turbulent flow simulation as a generative task directly learning the manifold of all possible turbulent flow states without relying on any initial flow state. For our experiments, we introduce a challenging 3D turbulence dataset of high-resolution flows and detailed vortex structures caused by various objects and derive two novel sample evaluation metrics for turbulent flows. On this dataset, we show that our generative model captures the distribution of turbulent flows caused by unseen objects and generates high-quality, realistic samples amenable for downstream applications without access to any initial state.

POSTER-77: BENO: Boundary-embedded Neural Operators for Elliptic PDEs

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: AI for PDEs; physical simulation neural operators boundary-embedded

Scores: [ 6 8 5 6 8 ]

Elliptic partial differential equations (PDEs) are a major class of time-independent PDEs that play a key role in many scientific and engineering domains such as fluid dynamics, plasma physics, and solid mechanics. Recently, neural operators have emerged as a promising technique to solve elliptic PDEs more efficiently by directly mapping the input to solutions. However, existing networks typically neglect complex geometries and inhomogeneous boundary values present in the real world. Here we introduce Boundary-Embedded Neural Operators (BENO), a novel neural operator architecture that embeds the complex geometries and inhomogeneous boundary values into the solving of elliptic PDEs. Inspired by classical Green's function, BENO consists of two Graph Neural Networks (GNNs) for interior source term and boundary values, respectively. Furthermore, a Transformer encoder maps the global boundary geometry into a latent vector which influences each message passing layer of the GNNs. We test our model and strong baselines extensively in elliptic PDEs with complex boundary conditions. We show that all existing baseline methods fail to learn the solution operator. In contrast, our model, endowed with boundary-embedded architecture, outperforms state-of-the-art neural operators and strong baselines by an average of 60.96%.

POSTER-78: Learning semilinear neural operators: A unified recursive framework for prediction and data assimilation.

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Neural operator PDEs semi-linear evolution sequential learning filtering data assimilation

Scores: [ 5 8 5 8 ]

Recent advances in the theory of Neural Operators (NOs) have enabled fast and accurate computation of the solutions to complex systems described by partial differential equations (PDEs). Despite their great success, current NO-based solutions face important challenges when dealing with spatio-temporal PDEs over long time scales. Specifically, the current theory of NOs does not present a systematic framework to perform data assimilation and efficiently correct the evolution of PDE solutions over time based on sparsely sampled noisy measurements. In this paper, we propose a learning-based state-space approach to compute the solution operators to infinite-dimensional semilinear PDEs. Exploiting the structure of semilinear PDEs and the theory of nonlinear observers in function spaces, we develop a flexible recursive method that allows for both prediction and data assimilation by combining prediction and correction operations. The proposed framework is capable of producing fast and accurate predictions over long time horizons, dealing with irregularly sampled noisy measurements to correct the solution, and benefits from the decoupling between the spatial and temporal dynamics of this class of PDEs. We show through experiments on the Kuramoto-Sivashinsky, Navier-Stokes and Korteweg-de Vries equations that the proposed model is robust to noise and can leverage arbitrary amounts of measurements to correct its prediction over a long time horizon with little computational overhead.

POSTER-79: DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Structure-based drug design molecule optimization diffusion models 3D molecule generation

Scores: [ 6 5 6 8 6 ]

Recently, 3D generative models have shown promising performances in structure-based drug design by learning to generate ligands given target binding sites. However, only modeling the target-ligand distribution can hardly fulfill one of the main goals in drug discovery -- designing novel ligands with desired properties, e.g., high binding affinity, easily synthesizable, etc. This challenge becomes particularly pronounced when the target-ligand pairs used for training do not align with these desired properties. Moreover, most existing methods aim at solving de novo design task, while many generative scenarios requiring flexible controllability, such as R-group optimization and scaffold hopping, have received little attention. In this work, we propose DecompOpt, a structure-based molecular optimization method based on a controllable and decomposed diffusion model. DecompOpt presents a new generation paradigm which combines optimization with conditional diffusion models to achieve desired properties while adhering to the molecular grammar. Additionally, DecompOpt offers a unified framework covering both de novo design and controllable generation. To achieve so, ligands are decomposed into substructures which allows fine-grained control and local optimization. Experiments show that DecompOpt can efficiently generate molecules with improved properties than strong de novo baselines, and demonstrate great potential in controllable generation tasks.

POSTER-80: Training-free Multi-objective Diffusion Model for 3D Molecule Generation

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Multi-objective Diffusion Model 3D Molecule

Scores: [ 8 6 6 6 ]

Searching for novel and diverse molecular candidates is a critical undertaking in drug and material discovery. Existing approaches have successfully adapted the diffusion model, the most effective generative model in image generation, to create 1D SMILES strings, 2D chemical graphs, or 3D molecular conformers. However, these methods are not efficient and flexible enough to generate 3D molecules with multiple desired properties, as they require additional training for the models for each new property or even a new combination of existing properties. Moreover, some properties may potentially conflict, making it impossible to find a molecule that satisfies all of them simultaneously. To address these challenges, we present a training-free conditional 3D molecular generation algorithm based on off-the-shelf unconditional diffusion models and property prediction models. The key techniques include modeling the loss of property prediction models as energy functions, considering the property relation between multiple conditions as a probabilistic graph, and developing a stable posterior estimation for computing the conditional score function. We conducted experiments on both single-objective and multi-objective 3D molecule generation, focusing on quantum properties, and compared our approach with the trained or fine-tuned diffusion models. Our proposed model achieves superior performance in generating molecules that meet the conditions, without any additional training cost.

POSTER-81: An operator preconditioning perspective on training in physics-informed machine learning

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: physics-informed machine learning operator preconditioning deep learning neural network training

Scores: [ 5 6 5 6 8 8 ]

SPOTLIGHT-16: Accelerating Data Generation for Neural Operators via Krylov Subspace Recycling

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: AI4PDE; Neural Operator; Data Generation; Krylov Subspace

Scores: [ 8 6 6 8 ]

Learning neural operators for solving partial differential equations (PDEs) has attracted great attention due to its high inference efficiency.However, training such operators requires generating a substantial amount of labeled data, i.e., PDE problems together with their solutions.The data generation process is exceptionally time-consuming, as it involves solving numerous systems of linear equations to obtain numerical solutions to the PDEs.Many existing methods solve these systems independently without considering their inherent similarities, resulting in extremely redundant computations.To tackle this problem, we propose a novel method, namely Sorting Krylov Recycling (SKR), to boost the efficiency of solving these systems, thus significantly accelerating data generation for neural operators training.To the best of our knowledge, SKR is the first attempt to address the time-consuming nature of data generation for learning neural operators.The working horse of SKR is Krylov subspace recycling, a powerful technique for solving a series of interrelated systems by leveraging their inherent similarities.Specifically, SKR employs a sorting algorithm to arrange these systems in a sequence, where adjacent systems exhibit high similarities.Then it equips a solver with Krylov subspace recycling to solve the systems sequentially instead of independently, thus effectively enhancing the solving efficiency.Both theoretical analysis and extensive experiments demonstrate that SKR can significantly accelerate neural operator data generation, achieving a remarkable speedup of up to 13.9 times.

POSTER-82: The Discovery of Binding Modes Requires Rethinking Docking Generalization

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: generalization molecular docking protein-ligand binding diffusion models benchmark bootstrapping self-training

Scores: [ 6 5 5 8 6 ]

Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, it is critical that docking methods generalize well across the proteome. However, existing benchmarks fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that machine learning-based docking models have very weak generalization abilities even when combined with various data augmentation strategies. Instead, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between a diffusion and a confidence model. Unlike previous self-training methods from other domains, we directly exploit the multi-resolution generation process of diffusion models using rollouts and confidence scores to reduce the generalization gap. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.

POSTER-83: Structural Inference with Dynamics Encoding and Partial Correlation Coefficients

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Structural Inference AI4Science

Scores: [ 10 6 6 ]

This paper introduces a novel approach to structural inference, combining a variational dynamics encoder with partial correlation coefficients. In contrast to prior methods, our approach leverages variational inference to encode node dynamics within latent variables, and structural reconstruction relies on the calculation of partial correlation coefficients derived from these latent variables.This unique design endows our method with scalability and extends its applicability to both one-dimensional and multi-dimensional feature spaces.Furthermore, by reorganizing latent variables according to temporal steps, our approach can effectively reconstruct directed graph structures. We validate our method through extensive experimentation on twenty datasets from a benchmark dataset and biological networks. Our results showcase the superior scalability, accuracy, and versatility of our proposed approach compared to existing methods.Moreover, experiments conducted on noisy data affirm the robustness of our method.

POSTER-84: Rethinking the symmetry-preserving circuits for constrained variational quantum algorithms

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: variational quantum algorithm quantum machine learning symmetry-preserving circuit

Scores: [ 5 6 8 8 ]

POSTER-85: Weakly Supervised Virus Capsid Detection with Image-Level Annotations in Electron Microscopy Images

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Weakly Supervised Object Detection Limited Annotation Time Bounding Box Regression Electron Microscopy

Scores: [ 5 6 5 6 ]

Current state-of-the-art methods for object detection rely on annotated bounding boxes of large data sets for training. However, obtaining such annotations is expensive and can require up to hundreds of hours of manual labor. This poses a challenge, especially since such annotations can only be provided by experts, as they require knowledge about the scientific domain. To tackle this challenge, we propose a domain-specific weakly supervised object detection algorithm that only relies on image-level annotations, which are significantly easier to acquire. Our method distills the knowledge of a pre-trained model, on the task of predicting the presence or absence of a virus in an image, to obtain a set of pseudo-labels that can be used to later train a state-of-the-art object detection model. To do so, we use an optimization approach with a shrinking receptive field to extract virus particles directly without specific network architectures. Through a set of extensive studies, we show how the proposed pseudo-labels are easier to obtain, and, more importantly, are able to outperform other existing weak labeling methods, and even ground truth labels, in cases where the time to obtain the annotation is limited.

POSTER-86: Guaranteed Approximation Bounds for Mixed-Precision Neural Operators

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: neural operators

Scores: [ 6 8 5 8 ]

Neural operators, such as Fourier Neural Operators (FNO), form a principled approach for learning solution operators for partial differential equations (PDE) and other mappings between function spaces. However, many real-world problems require high-resolution training data, and the training time and limited GPU memory pose big barriers. One solution is to train neural operators in mixed precision to reduce the memory requirement and increase training speed. However, existing mixed-precision training techniques are designed for standard neural networks, and we find that their direct application to FNO leads to numerical overflow and poor memory efficiency. Further, at first glance, it may appear that mixed precision in FNO will lead to drastic accuracy degradation since reducing the precision of the Fourier transform yields poor results in classical numerical solvers. We show that this is not the case; in fact, we prove that reducing the precision in FNO still guarantees a good approximation bound, when done in a targeted manner. Specifically, we build on the intuition that neural operator learning inherently induces an approximation error, arising from discretizing the infinite-dimensional ground-truth input function, implying that training in full precision is not needed. We formalize this intuition by rigorously characterizing the approximation and precision errors of FNO and bounding these errors for general input functions. We prove that the precision error is asymptotically comparable to the approximation error. Based on this, we design a simple method to optimize the memory-intensive half-precision tensor contractions by greedily finding the optimal contraction order. Through extensive experiments on different state-of-the-art neural operators, datasets, and GPUs, we demonstrate that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.

POSTER-87: From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: atomic property prediction pre-training 3D atomic pre-training graph neural networks multi-task learning molecules materials

Scores: [ 5 8 5 5 ]

The role of machine learning in computing atomic properties is expanding rapidly for a wide range of applications from healthcare to climate change. One important ingredient that has enabled this development is the creation of large and diverse molecular datasets. Given the extreme computational cost of these datasets, an important question moving forward is: Can we limit the need for exhaustive large dataset creation by pre-training a foundation style model over multiple chemical domains to generate transferable atomic representations for downstream fine-tuning tasks? Generalization across the entire molecular space is challenging due to the range and complexity of atomic interactions that exist. In this paper, we present Joint Multi-domain Pre-training (JMP), a robust supervised pre-training strategy that utilizes data from multiple chemical domains, $\sim$120 million examples in total. We demonstrate state-of-the-art results across many targets of the rMD17, QM9, MatBench, QMOF, SPICE, and MD22 datasets. Finally, we conduct ablations to study the impact of different components of JMP on downstream performance.

POSTER-88: Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Sharpness-Aware Minimization Molecular Graph

Scores: [ 6 6 6 6 ]

Sharpness-aware minimization (SAM) has received increasing attention in computer vision since it can effectively eliminate the sharp local minima from the training trajectory and mitigate generalization degradation. However, SAM requires two sequential gradient computations during the optimization of each step: one to obtain the perturbation gradient and the other to obtain the updating gradient. Compared with the base optimizer (e.g., Adam), SAM doubles the time overhead due to the additional perturbation gradient. By dissecting the theory of SAM and observing the training gradient of the molecular graph transformer, we propose a new algorithm named GraphSAM, which reduces the training cost of SAM and improves the generalization performance of graph transformer models. There are two key factors that contribute to this result: (i) \textit{gradient approximation}: we use the updating gradient of the previous step to approximate the perturbation gradient at the intermediate steps smoothly (\textbf{increases efficiency}); (ii) \textit{loss landscape approximation}: we theoretically prove that the loss landscape of GraphSAM is limited to a small range centered on the expected loss of SAM (\textbf{guarantees generalization performance}). The extensive experiments on six datasets with different tasks demonstrate the superiority of GraphSAM, especially in optimizing the model update process.

POSTER-89: Protein Multimer Structure Prediction via PPI-guided Prompt Learning

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: docking path prediction protein complex structure prompt learning

Scores: [ 8 5 5 3 ]

Understanding the 3D structures of protein multimers is crucial, as they play a vital role in regulating various cellular processes. It has been empirically confirmed that the multimer structure prediction (MSP) can be well handled in a step-wise assembly fashion using provided dimer structures and predicted protein-protein interactions (PPIs). However, due to the biological gap in the formation of dimers and larger multimers, directly applying PPI prediction techniques can often cause a poor generalization to the MSP task. To address this challenge, we aim to extend the PPI knowledge to multimers of different scales (i.e., chain numbers). Specifically, we propose PromptMSP, a pre-training and Prompt tuning framework for Multimer Structure Prediction. First, we tailor the source and target tasks for effective PPI knowledge learning and efficient inference, respectively. We design PPI-inspired prompt learning to narrow the gaps of two task formats and generalize the PPI knowledge to multimers of different scales. We utilize the meta-learning approach to learn a reliable initialization of the prompt model, enabling our prompting framework to effectively adapt to limited data for large-scale multimers. Empirically, we achieve both significant accuracy (RMSD and TM-Score) and efficiency improvements compared to advanced MSP models. For instance, when both methods utilize AlphaFold-Multimer to prepare dimers, PromptMSP achieves a 21.43% improvement in TM-Score with only 0.5% of the running time compared to the competitive MoLPC baseline.

SPOTLIGHT-17: Lagrangian Flow Networks for Conservation Laws

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Physics-informed Neural Network Fluid Dynamics Conservation Law Partial Differential Equation Conditional Normalizing Flows Bird-Migration

Scores: [ 6 8 8 ]

We introduce Lagrangian Flow Networks (LFlows) for modeling fluid densities and velocities continuously in space and time.By construction, the proposed LFlows satisfy the continuity equation,a PDE describing mass conservation in its differentiable form. Our model is based on the insight that solutions to the continuity equation can be expressed astime-dependent density transformations via differentiable and invertible maps.This follows from classical theory of the existence and uniqueness of Lagrangian flows for smooth vector fields.Hence, we model fluid densities by transforming a base density with parameterized diffeomorphisms conditioned on time.The key benefit compared to methods relying on numerical ODE solvers or PINNs is that the analytic expression of the velocity is always consistent with changes in density.Furthermore, we require neither expensive numerical solvers, nor additional penalties to enforce the PDE.LFlows show higher predictive accuracy in density modeling tasks compared to competing models in 2D and 3D,while being computationally efficient.As a real-world application, we model bird migration based on sparse weather radar measurements.

ORAL-4: Unified Generative Modeling of 3D Molecules with Bayesian Flow Networks

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Drug Design Molecule Generation Deep Learning Computational Biology

Scores: [ 8 8 8 8 ]

Advanced generative model (\textit{e.g.}, diffusion model) derived from simplified continuity assumptions of data distribution, though showing promising progress, has been difficult to apply directly to geometry generation applications due to the \textit{multi-modality} and \textit{noise-sensitive} nature of molecule geometry. This work introduces Geometric Bayesian Flow Networks (GeoBFN), which naturally fits molecule geometry by modeling diverse modalities in the differentiable parameter space of distributions. GeoBFN maintains the SE-(3) invariant density modeling property by incorporating equivariant inter-dependency modeling on parameters of distributions and unifying the probabilistic modeling of different modalities. Through optimized training and sampling techniques, we demonstrate that GeoBFN achieves state-of-the-art performance on multiple 3D molecule generation benchmarks in terms of generation quality (90.87% molecule stability in QM9 and 85.6% atom stability in GEOM-DRUG\footnote{The scores are reported at 1k sampling steps for fair comparison, and our scores could be further improved if sampling sufficiently longer steps.}). GeoBFN can also conduct sampling with any number of steps to reach an optimal trade-off between efficiency and quality (\textit{e.g.}, 20$\times$ speedup without sacrificing performance).

POSTER-90: Experimental Design for Multi-Channel Imaging via Task-Driven Feature Selection

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Experimental Design Supervised Feature Selection Multi-Channel Imaging Hyperspectral Imaging Magnetic Resonance Imaging (MRI)

Scores: [ 6 6 6 6 ]

This paper presents a data-driven, task-specific paradigm for experimental design, to shorten acquisition time, reduce costs, and accelerate the deployment of imaging devices. Current standard approaches in experimental design focus on model-parameter estimation and require specification of a particular model, whereas in imaging, other tasks may drive the design. Furthermore, such approaches are often lead to intractable optimisation problems in real-world imaging applications. Here we put forward a new paradigm for experimental design that simultaneously optimizes the design (set of image channels) and trains a machine-learning model to execute a user-specified image-analysis task. The approach obtains data densely-sampled over the measurement space (many image channels) for a small number of acquisitions, then identifies a subset of channels of pre-specified size that best supports the task. We propose a method: TADRED for TAsk-DRiven experimental design in imaging, to identify the most informative channel-subset whilst simultaneously training a network to execute the task given the subset. Experiments demonstrate the potential of TADRED in diverse imaging applications: several clinically-relevant tasks in magnetic resonance imaging; and remote sensing and physiological applications of hyperspectral imaging. Results show substantial improvement over classical experimental design, two recent application-specific methods within the new paradigm we explore, and state-of-the-art approaches in supervised feature selection. We anticipate further applications of our approach; code (for reviewers) is available: \cite{ouranonymouscode}.

POSTER-91: Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: molecule spherical harmonics equivariant symmetry generation

Scores: [ 6 6 6 8 ]

We present Symphony, an E(3) equivariant autoregressive generative model for 3D molecular geometriesthat iteratively builds a molecule from molecular fragments.Existing autoregressive models such as G-SchNet and G-SphereNet for molecules utilize rotationally invariant features to respect the 3D symmetries of molecules.In contrast, Symphony uses message-passing with higher-degree E(3)-equivariant features.This allows a novel representation of probability distributions via spherical harmonic signals to efficiently model the 3D geometry ofmolecules. We show that Symphony is able to accurately generate small molecules from the QM9 dataset, outperforming existingautoregressive models and approaching the performance of diffusion models.

POSTER-92: SineNet: Learning Temporal Dynamics in Time-Dependent Partial Differential Equations

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Partial differential equations Physics simulation Dynamics learning

Scores: [ 8 6 6 6 ]

We consider using deep neural networks to solve time-dependent partial differential equations (PDEs), where multi-scale processing is crucial for modeling complex, time-evolving dynamics. While the U-Net architecture with skip connections is commonly used by prior studies to enable multi-scale processing, our analysis shows that the need for features to evolve across layers results in temporally misaligned features in skip connections, which limits the model’s performance. To address this limitation, we propose SineNet, consisting of multiple sequentially connected U-shaped network blocks, referred to as waves. In SineNet, high-resolution features are evolved progressively through multiple stages, thereby reducing the amount of misalignment within each stage. We furthermore analyze the role of skip connections in enabling both parallel and sequential processing of multi-scale information. Our method is rigorously tested on multiple PDE datasets, including the Navier-Stokes equations and shallow water equations, showcasing the advantages of our proposed approach over conventional U-Nets with a comparable parameter budget. We further demonstrate that increasing the number of waves in SineNet while maintaining the same number of parameters leads to a monotonically improved performance. The results highlight the effectiveness of SineNet and the potential of our approach in advancing the state-of-the-art in neural PDE solver design.

SPOTLIGHT-18: SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Symbolic Mathematics Pre-training Transformers Symbolic Regression Deep Learning

Scores: [ 8 6 8 8 ]

In an era where symbolic mathematical equations are indispensable for modeling complex natural phenomena, scientific inquiry often involves collecting observations and translating them into mathematical expressions. Recently, deep learning has emerged as a powerful tool for extracting insights from data. However, existing models typically specialize in either numeric or symbolic domains, and are usually trained in a supervised manner tailored to specific tasks. This approach neglects the substantial benefits that could arise from a task-agnostic unified understanding between symbolic equations and their numeric counterparts. To bridge the gap, we introduce SNIP, a Symbolic-Numeric Integrated Pre-training, which employs joint contrastive learning between symbolic and numeric domains, enhancing their mutual similarities in the pre-trained embeddings. By performing latent space analysis, we observe that SNIP provides cross-domain insights into the representations, revealing that symbolic supervision enhances the embeddings of numeric data and vice versa. We evaluate SNIP across diverse tasks, including symbolic-to-numeric mathematical property prediction and numeric-to-symbolic equation discovery, commonly known as symbolic regression. Results show that SNIP effectively transfers to various tasks, consistently outperforming fully supervised baselines and competing strongly with established task-specific methods, especially in few-shot learning scenarios where available data is limited.

POSTER-93: Symmetric Basis Convolutions for Learning Lagrangian Fluid Mechanics

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: CConv GNN SPH Lagrangian Fluids Fourier Methods

Scores: [ 6 6 5 6 ]

Learning physical simulations has been an essential and central aspect of many recent research efforts in machine learning, particularly for Navier-Stokes-based fluid mechanics. Classic numerical solvers have traditionally been computationally expensive and challenging to use in inverse problems, whereas Neural solvers aim to address both concerns through machine learning. We propose a general formulation for continuous convolutions using separable basis functions as a superset of existing methods and evaluate a large set of basis functions in the context of (a) a compressible 1D SPH simulation, (b) a weakly compressible 2D SPH simulation, and (c) an incompressible 2D SPH Simulation. We demonstrate that even and odd symmetries included in the basis functions are key aspects of stability and accuracy.Our broad evaluation shows that Fourier-based continuous convolutions outperform all other architectures regarding accuracy and generalization. Finally, using these Fourier-based networks, we show that prior inductive biases, such as window functions, are no longer necessary. An implementation of our approach, as well as complete datasets and solver implementations, is available at REDACTED FOR DOUBLE-BLIND REVIEW.

POSTER-94: Let's do the time-warp-attend: Learning topological invariants of dynamical systems

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: dynamical systems bifurcations topological invariance Hopf bifurcation physics-informed machine learning augmentation RNA velocity repressilator pancreatic endocrinogenesis

Scores: [ 6 3 6 6 ]

Dynamical systems across the sciences, from electrical circuits to ecological networks, undergo qualitative and often catastrophic changes in behavior called \textit{bifurcations} when their underlying parameters cross a threshold. Existing methods predict oncoming catastrophes from time-series in individual systems but struggle both to categorize qualitative dynamical regimes across diverse systems and to generalize to real data. To address this challenge, we propose a data-driven, physically-informed deep-learning framework for classifying dynamical regimes and characterizing bifurcation boundaries based on the extraction of topologically invariant features. We focus on the paradigmatic case of the supercritical Hopf bifurcation, which is used to model periodic dynamics across a wide range of applications. Our convolutional attention method is trained with data augmentations that encourage the learning of topological invariants which can be used to detect bifurcation boundaries in unseen systems and to design models of biological systems like oscillatory gene regulatory networks. We further demonstrate our method's use in analyzing real data, recovering distinct proliferation and differentiation dynamics along pancreatic endocrinogenesis trajectory in gene expression space based on single-cell data. Our method provides valuable insights into the qualitative, long-term behavior of a wide range of dynamical systems as well as detect bifurcations or catastrophic transitions in large-scale physical and biological systems.

POSTER-95: Gradual Optimization Learning for Conformational Energy Minimization

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: energy minimization conformational optimization

Scores: [ 8 6 6 6 ]

Molecular conformation optimization is crucial to computer-aided drug discovery and materials design.Traditional energy minimization techniques rely on iterative optimization methods that use molecular forces calculated by a physical simulator (oracle) as anti-gradients.However, this is a computationally expensive approach that requires many interactions with a physical simulator.One way to accelerate this procedure is to replace the physical simulator with a neural network.Despite recent progress in neural networks for molecular conformation energy prediction, such models are prone to distribution shift, leading to inaccurate energy minimization.We find that the quality of energy minimization with neural networks can be improved by providing optimization trajectories as additional training data.Still, it takes around 5×105 additional conformations to match the physical simulator's optimization quality.In this work, we present the Gradual Optimization Learning Framework (GOLF) for energy minimization with neural networks that significantly reduces the required additional data.The framework consists of an efficient data-collecting scheme and an external optimizer.The external optimizer utilizes gradients from the energy prediction model to generate optimization trajectories, and the data-collecting scheme selects additional training data to be processed by the physical simulator. Our results demonstrate that the neural network trained with GOLF performs \textit{on par} with the oracle on a benchmark of diverse drug-like molecules using $50$x less additional data.

SPOTLIGHT-19: GTMGC: Using Graph Transformer to Predict Molecule’s Ground-State Conformation

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: molecular conformation prediction molecule modeling graph neural network graph transformer

Scores: [ 3 8 8 ]

The ground-state conformation of a molecule is often decisive for its properties. However, experimental or computational methods, such as density functional theory (DFT), are time-consuming and labor-intensive for obtaining this conformation. Deep learning (DL) based molecular representation learning (MRL) has made significant advancements in molecular modeling and has achieved remarkable results in various tasks. Consequently, it has emerged as a promising approach for directly predicting the ground-state conformation of molecules. In this regard, we introduce GTMGC, a novel network based on Graph-Transformer (GT) that seamlessly predicts the spatial configuration of molecules in a 3D space from their 2D topological architecture in an end-to-end manner. Moreover, we propose a novel self-attention mechanism called Molecule Structural Residual Self-Attention (MSRSA) for molecular structure modeling. This mechanism not only guarantees high model performance and easy implementation but also lends itself well to other molecular modeling tasks. Our method has been evaluated on the Molecule3D benchmark dataset and the QM9 dataset. Experimental results demonstrate that our approach achieves remarkable performance and outperforms current state-of-the-art methods as well as the widely used open-source software RDkit.

POSTER-96: DATS: Difficulty-Aware Task Sampler for Meta-Learning Physics-Informed Neural Networks

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Meta Learning Physics Informed Neural Network (PINN) Adaptive Sampler Partial Differential Equations

Scores: [ 6 6 6 ]

Advancements in deep learning have led to the development of physics-informed neural networks (PINNs) for solving partial differential equations (PDEs) without being supervised by PDE solutions. While vanilla PINNs require training one network per PDE configuration, recent works have showed the potential to meta-learn PINNs across a range of PDE configurations. It is however known that PINN training is associated with different levels of difficulty, depending on the underlying PDE configurations or the number of residual sampling points available. Existing meta-learning approaches, however, treat all PINN tasks equally. We address this gap by introducing a novel difficulty-aware task sampler (DATS) for meta-learning of PINNs. We derive an optimal analytical solution to optimize the probability for sampling individual PINN tasks in order to minimize their validation loss across tasks. We further present two alternative strategies to utilize this sampling probability to either adaptively weigh PINN tasks, or dynamically allocate optimal residual points across tasks. We evaluated DATS against uniform and self-paced task-sampling baselines on two representative meta-PINN models, across four benchmark PDEs as well as three different residual point sampling strategies. The results demonstrated that DATS was able to improve the accuracy of meta-learned PINN solutions when reducing performance disparity across PDE configurations, at only a fraction of residual sampling budgets required by its baselines.

POSTER-97: Invariance-based Learning of Latent Dynamics

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Latent Dynamical Model Invariant Decomposition Neural Network Spatio-Temporal Attention

Scores: [ 5 6 6 6 ]

We propose a new model class aimed at predicting dynamical trajectories from high-dimensional empirical data. This is done by combining variational autoencoders and spatio-temporal attention within a framework designed to enforce certain scientifically-motivated invariances.The models allow inference of systembehaviour at any continuous time and generalization well beyond the data distributions seen during training.Furthermore, the models do not require anexplicit neural ODE formulation, making them efficient and highly scalable in practice.We study behaviour through simple theoretical analyses and extensive experiments on synthetic and real-world datasets. The latter investigate the ability to predict the trajectories of very complicated systems based on finite data and show that the proposed approaches can outperform existing neural-dynamical models.We study also more general inductive bias in the context of transfer to data obtained under entirely novel system interventions. Overall, our results provide a new framework for efficiently learning complicated dynamics in a data-driven manner, with potential applications in a wide range of fields including physics, biology, and engineering.

POSTER-98: PINNsFormer: A Transformer-Based Framework For Physics-Informed Neural Networks

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Physics-Informed Neural Networks Transformer Self-Attention

Scores: [ 5 5 8 8 ]

Physics-Informed Neural Networks (PINNs) have emerged as a promising deep learning framework for approximating numerical solutions to partial differential equations (PDEs). However, conventional PINNs, relying on multilayer perceptrons (MLP), neglect the crucial temporal dependencies inherent in practical physics systems and thus fail to propagate the initial condition constraints globally and accurately capture the true solutions under various scenarios. In this paper, we introduce a novel Transformer-based framework, termed PINNsFormer, designed to address this limitation. PINNsFormer can accurately approximate PDE solutions by utilizing multi-head attention mechanisms to capture temporal dependencies. PINNsFormer transforms point-wise inputs into pseudo sequences and replaces point-wise PINNs loss with a sequential loss. Additionally, it incorporates a novel activation function, \texttt{Wavelet}, which anticipates Fourier decomposition through deep neural networks. Empirical results demonstrate that PINNsFormer achieves superior generalization ability and accuracy across various scenarios, including PINNs failure modes and high-dimensional PDEs. Moreover, PINNsFormer offers flexibility in integrating existing learning schemes for PINNs, further enhancing its performance.

POSTER-99: A Plug-and-Play Image Registration Network

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: deformable image registration plug-and-play priors deep equilibrium models iterative algorithms

Scores: [ 8 8 6 6 ]

Deformable image registration (DIR) is an active research topic in biomedical imaging. There is a growing interest in developing DIR methods based on deep learning (DL). A traditional DL approach to DIR is based on training a convolutional neural network (CNN) to estimate the registration field between two input images. While conceptually simple, this approach comes with a limitation that it exclusively relies on a pre-trained CNN without explicitly enforcing fidelity between the registered image and the reference. We present plug-and-play image registration network (PIRATE) as a new DIR method that addresses this issue by integrating an explicit data-fidelity penalty and a CNN prior. PIRATE pre-trains a CNN denoiser on the registration field and "plugs" it into an iterative method as a regularizer. We additionally present PIRATE+ that fine-tunes the CNN prior in PIRATE using deep equilibrium models (DEQ). PIRATE+ interprets the fixed-point iteration of PIRATE as a network with effectively infinite layers and then trains the resulting network end-to-end, enabling it to learn more task-specific information and boosting its performance. Our numerical results on OASIS and CANDI datasets show that our methods achieve state-of-the-art performance on DIR.

POSTER-100: Long-range Neural Atom Learning for Molecular Graphs

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Long Range Interaction; Molecular Graph;

Scores: [ 6 8 6 ]

POSTER-101: Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: proteins conformational sampling diffusion models score-based models generative modeling equivariant network

Scores: [ 6 6 6 6 ]

The dynamic nature of proteins is crucial for determining their biological functions and properties, for which Monte Carlo (MC) and molecular dynamics (MD) simulations stand as predominant tools to study such phenomena. By utilizing empirically derived force fields, MC or MD simulations explore the conformational space through numerically evolving the system via Markov chain or Newtonian mechanics. However, the high-energy barrier of the force fields can hamper the exploration of both methods by the rare event, resulting in inadequately sampled ensemble without exhaustive running. Existing learning-based approaches perform direct sampling yet heavily rely on target-specific simulation data for training, which suffers from high data acquisition cost and poor generalizability. Inspired by simulated annealing, we propose Str2Str, a novel structure-to-structure translation framework capable of zero-shot conformation sampling with roto-translation equivariant property. Our method leverages an amortized denoising score matching objective trained on general crystal structures and has no reliance on simulation data during both training and inference. Experimental results across several benchmarking protein systems demonstrate that Str2Str outperforms previous state-of-the-art generative structure prediction models and can be orders of magnitude faster compared with long MD simulations.

POSTER-102: LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: large language model multimodal medical imaging chest X-ray bidirectional instruction-tuning vision-question answering

Scores: [ 5 8 6 6 6 ]

Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual input/output. This direction of research is particularly relevant to medical imaging because accurate medical image analysis and generation consist of a combination of reasoning based on visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing (encoding or generating) networks and LLMs; but presumably, in order to achieve maximum reasoning potential of LLMs on visual information as well, visual and language features should be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) require not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work on the transformer and VQ-GAN combination for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM’s existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text-generation and text-based image-generation. We show that our LLM-CXR trained in this approach shows better image-text alignment in both CXR understanding and generation tasks while being smaller in size compared to previously developed models that perform a narrower range of tasks.

POSTER-103: Complete and Efficient Graph Transformers for Crystal Material Property Prediction

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: crystal material property prediction geometric graph representation learning

Scores: [ 6 5 3 6 ]

Crystal structures are characterized by atomic bases within a primitive unit cell that repeats along a regular lattice throughout 3D space. The periodic and infinite nature of crystals poses unique challenges for geometric graph representation learning. Specifically, constructing graphs that effectively capture the complete geometric information of crystals and handle chiral crystals remains an unsolved and challenging problem. In this paper, we introduce a novel approach that utilizes the periodic patterns of unit cells to establish the lattice-based representation for each atom, enabling efficient and expressive graph representations of crystals. Furthermore, we propose ComFormer, a SE(3) transformer designed specifically for crystalline materials. ComFormer includes two variants; namely, iComFormer that employs invariant geometric descriptors of Euclidean distances and angles, and eComFormer that utilizes equivariant vector representations. Experimental results demonstrate the state-of-the-art predictive accuracy of ComFormer variants on various tasks across three widely-used crystal benchmarks.

POSTER-104: CellPLM: Pre-training of Cell Language Model Beyond Single Cells

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Single-cell analysis Pretrained models AI for science

Scores: [ 8 6 6 6 ]

The current state-of-the-art single-cell pre-trained models are greatly inspired by the success of large language models. They trained transformers by treating genes as tokens and cells as sentences. However, three fundamental differences between single-cell data and natural language data are overlooked: (1) scRNA-seq data are presented as bag-of-genes instead of sequences of RNAs; (2) Cell-cell relations are more intricate and important than inter-sentence relations; and (3) The quantity of single-cell data is considerably inferior to text data, and they are very noisy. In light of these characteristics, we propose a new pre-trained model, CellPLM, which takes cells as tokens and tissues as sentences. In addition, we leverage spatially-resolved transcriptomic data in pre-training to facilitate learning cell-cell relationships and introduce a Gaussian prior distribution as an additional inductive bias to overcome data limitations. CellPLM is the first single-cell pre-trained transformer that encodes cell-cell relations and it consistently achieves state-of-the-art performance across distinct downstream tasks.

POSTER-105: Learning Scalar Fields for Molecular Docking with Fast Fourier Transforms

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: protein structure structural biology drug discovery molecular docking

Scores: [ 8 5 6 ]

POSTER-106: Pre-training Sequence, Structure, and Surface Features for Comprehensive Protein Representation Learning

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Protein representation learning self-supervised learning implicit neural representation

Scores: [ 5 6 6 6 ]

Proteins can be represented in various ways, including their sequences, 3D structures, and surfaces. While recent studies have successfully employed sequence- or structure-based representations to address multiple tasks in protein science, there has been significant oversight in incorporating protein surface information, a critical factor for protein function. In this paper, we present a pre-training strategy that incorporates information from protein sequences, 3D structures, and surfaces to improve protein representation learning. Specifically, we utilize Implicit Neural Representations (INRs) for learning surface characteristics, and name it ProteinINR. We confirm that ProteinINR successfully reconstructs protein surfaces, and integrate this surface learning into the existing pre-training strategy of sequences and structures. Our results demonstrate that our approach can enhance performance in various downstream tasks, thereby underscoring the importance of including surface attributes in protein representation learning. These findings underline the importance of understanding protein surfaces for generating effective protein representations.

POSTER-107: Towards Cross Domain Generalization of Hamiltonian Representation via Meta Learning

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: hamiltonian dynamics cross domain generalization learning physics meta learning

Scores: [ 5 6 6 6 ]

Recent advancements in deep learning for physics have focused on discovering shared representations of target systems by incorporating physics priors or inductive biases into neural networks. While effective, these methods are confined to the system domain in which the type of system remains consistent and thus cannot ensure the adaptation to new, or unseen physical systems governed by different laws. For example, a neural network trained on a mass-spring system cannot guarantee the accurate prediction of the behavior of a two-body system or any other system with different physical laws. In this work, we take a significant leap forward by targeting cross domain generalization within the field of Hamiltonian dynamics. We model our system with a graph neural network and employ a meta learning algorithm to enable the model to gain experience over a distribution of tasks and make it adapt to new physics. Our approach aims to learn a unified Hamiltonian representation that is generalizable across multiple system domains, thereby overcoming the limitations of system-specific models. We validate our approach on a dataset comprising various physical systems and evaluate its adaptability to a new type of dynamical system with previously unseen physics. Our results demonstrate that the meta trained model not only adapts effectively to new systems but also captures a generalized Hamiltonian representation that is consistent across different physical domains.Overall, through the use of meta learning, we offer a framework that achieves cross domain generalization, providing a step towards a unified model for understanding a wide array of dynamical systems via deep learning.

SPOTLIGHT-20: De novo Protein Design Using Geometric Vector Field Networks

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Protein design Protein structure encoder Inverse folding Protein diffusion

Scores: [ 8 6 8 ]

Advances like protein diffusion have marked revolutionary progress in de novo protein design, a central topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Only a few basic encoders, like IPA, have been proposed for this scenario, exposing the frame modeling as a bottleneck. In this work, we introduce the Vector Field Network (VFN), that enables network layers to perform learnable vector computations between coordinates of frame-anchored virtual atoms, thus achieving a higher capability for modeling frames. The vector computation operates in a manner similar to a linear layer, with each input channel receiving 3D virtual atom coordinates instead of scalar values. The multiple feature vectors output by the vector computation are then used to update the residue representations and virtual atom coordinates via attention aggregation. Remarkably, VFN also excels in modeling both frames and atoms, as the real atoms can be treated as the virtual atoms for modeling, positioning VFN as a potential universal encoder. In protein diffusion (frame modeling), VFN exhibits a impressive performance advantage over IPA, excelling in terms of both designability (67.04% vs. 53.58%) and diversity (66.54% vs. 51.98%). In inverse folding(frame and atom modeling), VFN outperforms the previous SoTA model, PiFold (54.7% vs. 51.66%), on sequence recovery rate; we also propose a method of equipping VFN with the ESM model, which significantly surpasses the previous ESM-based SoTA (62.67% vs. 55.65%), LM-Design, by a substantial margin.

POSTER-108: Learning Flexible Body Collision Dynamics with Hierarchical Contact Mesh Transformer

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: graph transformer physics-based simulation mesh collision flexible

Scores: [ 6 6 6 6 ]

Recently, many mesh-based graph neural network (GNN) models have been proposed for modeling complex high-dimensional physical systems. Remarkable achievements have been made in significantly reducing the solving time compared to traditional numerical solvers. These methods are typically designed to i) reduce the computational cost in solving physical dynamics and/or ii) propose techniques to enhance the solution accuracy in fluid and rigid body dynamics. However, it remains under-explored whether they are effective in addressing the challenges of flexible body dynamics, where instantaneous collisions occur within a very short timeframe.In this paper, we present Hierarchical Contact Mesh Transformer (HCMT), which uses hierarchical mesh structures and can learn long-range dependencies (occurred by collisions) among spatially distant positions of a body --- two close positions in a higher-level mesh corresponds to two distant positions in a lower-level mesh. HCMT enables long-range interactions, and the hierarchical mesh structure quickly propagates collision effects to faraway positions. To this end, it consists of a contact mesh Transformer and a hierarchical mesh Transformer (CMT and HMT, respectively). Lastly, we propose a unique flexible body dynamics dataset, which is commonly used for product designs. We also compare the performance of several baselines using well-known benchmark datasets. Our results show that HCMT provides significant performance improvements over existing methods.

POSTER-109: MgNO: Efficient Parameterization of Linear Operators via Multigrid

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: neural operator; multigrid; universal approximation; boundary condition

Scores: [ 6 6 6 8 ]

In this work, we propose a concise neural operator architecture for operator learning. Drawing an analogy with a conventional fully connected neural network, we define the neural operator as follows: the output of the i-th neuron in a nonlinear operator layer is defined by Oi(u)=σ(jWiju+Bij). Here, Wij denotes the bounded linear operator connecting j-th input neuron to i-th output neuron, and the bias Bij takes the form of a function rather than a scalar. Given its new universal approximation property, the efficient parameterization of the bounded linear operators between two neurons (Banach spaces) plays a critical role. As a result, we introduce MgNO, utilizing multigrid structures to parameterize these linear operators between neurons. This approach offers both mathematical rigor and practical expressivity. Additionally, MgNO obviates the need for conventional lifting and projecting operators typically required in previous neural operators. Moreover, it seamlessly accommodates diverse boundary conditions. Our empirical observations reveal that MgNO exhibits superior ease of training compared to CNN-based models, while also displaying a reduced susceptibility to overfitting when contrasted with spectral-type neural operators. We demonstrate the efficiency and accuracy of our method with consistently state-of-the-art performance on different types of partial differential equations (PDEs).

POSTER-110: Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Molecular generative models reinforcement learning natural language processing drug discovery sample-efficiency explainability

Scores: [ 3 8 8 8 ]

Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. The combined algorithm generates more high reward molecules and faster, given a fixed oracle budget. Beam Enumeration is the first method to jointly address explainability and sample efficiency for molecular design.

POSTER-111: Removing Biases from Molecular Representations via Information Maximization

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Molecular Representation Batch Effect Contrastive Learning Information Maximization Drug Discovery

Scores: [ 6 8 6 6 ]

High-throughput drug screening -- using cell imaging or gene expression measurements as readouts of drug effect -- is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweights samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE's superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show results for how InfoCORE offers a versatile framework and resolves general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes.

POSTER-112: A Foundation Model for Error Correction Codes

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Error Correction Codes Foundation Model

Scores: [ 8 3 6 8 ]

In recent years, Artificial Intelligence has undergone a paradigm shift with the rise of foundation models, which are trained on large amounts of data, typically in a self-supervised way, and can then be adapted to a wide range of downstream tasks. In this work, we propose the first foundation model for Error Correction Codes. This model is trained on multiple codes and can then be applied to an unseen code. To enable this, we extend the Transformer architecture in multiple ways: (1) a code-invariant initial embedding, which is also position- and length-invariant, (2) a learned modulation of the attention maps that is conditioned on the Tanner graph, and (3) a length-invariant code-aware noise prediction module that is based on the parity-check matrix. The proposed architecture is trained on multiple short- and medium-length codes and is able to generalize to unseen codes. Its performance on these codes matches and even outperforms the state of the art, despite having a smaller capacity than the leading code-specific transformers. The suggested framework therefore demonstrates, for the first time, the benefits of learning a universal decoder rather than a neural decoder optimized for a given code.

SPOTLIGHT-21: RetroBridge: Modeling Retrosynthesis with Markov Bridges

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Retrosynthesis Reactions Chemistry Drug Discovery Markov Bridge

Scores: [ 8 6 8 6 ]

Retrosynthesis planning is a fundamental challenge in chemistry which aims at designing multi-step reaction pathways from commercially available starting materials to a target molecule. Each step in multi-step retrosynthesis planning requires accurate prediction of possible precursor molecules given the target molecule and confidence estimates to guide heuristic search algorithms. We model single-step retrosynthesis as a distribution learning problem in a discrete state space. First, we introduce the Markov Bridge Model, a generative framework aimed to approximate the dependency between two intractable discrete distributions accessible via a finite sample of coupled data points. Our framework is based on the concept of a Markov bridge, a Markov process pinned at its endpoints. Unlike diffusion-based methods, our Markov Bridge Model does not need a tractable noise distribution as a sampling proxy and directly operates on the input product molecules as samples from the intractable prior distribution. We then address the retrosynthesis planning problem with our novel framework and introduce RetroBridge, a template-free retrosynthesis modeling approach that achieves state-of-the-art results on standard evaluation benchmarks.

SPOTLIGHT-22: Space and time continuous physics simulation from partial observations

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Physics simulation interpolation

Scores: [ 8 8 8 6 8 ]

Modern techniques for physical simulations rely on numerical schemes and mesh-refinement methods to address trade-offs between precision and complexity, but these handcrafted solutions are tedious and require high computational power. Data-driven methods based on large-scale machine learning promise high adaptivity by integrating long-range dependencies more directly and efficiently. In this work, we focus on computational fluid dynamics and address the shortcomings of a large part of the literature, which are based on fixed support for computations and predictions in the form of regular or irregular grids. We propose a novel setup to perform predictions in a continuous spatial and temporal domain while being trained on sparse observations. We formulate the task as a double observation problem and propose a solution with two interlinked dynamical systems defined on, respectively, the sparse positions and the continuous domain, which allows to forecast and interpolate a solution from the initial condition. Our practical implementation involves recurrent GNNs and a spatio-temporal attention observer capable of interpolating the solution at arbitrary locations. Our model not only generalizes to new initial conditions (as standard auto-regressive models do) but also performs evaluation at arbitrary space and time locations. We evaluate on three standard datasets in fluid dynamics and compare to strong baselines, which are outperformed in classical settings and the extended new task requiring continuous predictions.

POSTER-113: Generative Adversarial Policy Network for Modelling Protein Complexes

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: protein complex structure prediction docking path prediction policy network reinforcement learning

Scores: [ 6 6 6 6 ]

Structure prediction of large protein complexes (a.k.a., protein multimer mod-elling, PMM) can be achieved through the one-by-one assembly using provideddimer structures and predicted docking paths. However, existing PMM methodsstruggle with vast search spaces and generalization challenges: (1) The assemblyof a N -chain multimer can be depicted using graph structured data, with eachchain represented as a node and assembly actions as edges. Thus the assemblygraph can be arbitrary acyclic undirected connected graph, leading to the com-binatorial optimization space of N^(N −2) for the PMM problem. (2) Knowledgetransfer in the PMM task is non-trivial. The gradually limited data availability asthe chain number increases necessitates PMM models that can generalize acrossmultimers of various chains. To address these challenges, we propose GAPN, aGenerative Adversarial Policy Network powered by domain-specific rewards andadversarial loss through policy gradient for automatic PMM prediction. Specifi-cally, GAPN learns to efficiently search through the immense assembly space andoptimize the direct docking reward through policy gradient. Importantly, we de-sign a adversarial reward function to enhance the receptive field of our model. Inthis way, GAPN will simultaneously focus on a specific batch of multimers andthe global assembly rules learned from multimers with varying chain numbers.Empirically, we have achieved both significant accuracy (measured by RMSDand TM-Score) and efficiency improvements compared to leading complex mod-eling software. GAPN outperforms the state-of-the-art method (MoLPC) with upto 27% improvement in TM-Score, with a speed-up of 600×.

POSTER-114: BroGNet: Momentum-Conserving Graph Neural Stochastic Differential Equation for Learning Brownian Dynamics

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Brownian dynamics stochastic differential equation graph neural network scientific machine learning

Scores: [ 6 8 6 6 ]

Neural networks (NNs) that exploit strong inductive biases based on physical laws and symmetries have shown remarkable success in learning the dynamics of physical systems directly from their trajectory. However, these works focus only on the systems that follow deterministic dynamics, such as Newtonian or Hamiltonian. Here, we propose a framework, namely Brownian graph neural networks (BroGNet), combining stochastic differential equations (SDEs) and GNNs to learn Brownian dynamics directly from the trajectory. We modify the architecture of BroGNet to enforce linear momentum conservation of the system, which, in turn, provides superior performance on learning dynamics as revealed empirically. We demonstrate this approach on several systems, namely, linear spring, linear spring with binary particle types, and non-linear spring systems, all following Brownian dynamics at finite temperatures. We show that BroGNet significantly outperforms proposed baselines across all the benchmarked Brownian systems. In addition, we demonstrate zero-shot generalizability of BroGNet to simulate unseen system sizes that are two orders of magnitude larger and to different temperatures than those used during training. Finally, we show that BroGNet conserves the momentum of the system resulting in superior performance and data efficiency. Altogether, our study contributes to advancing the understanding of the intricate dynamics of Brownian motion and demonstrates the effectiveness of graph neural networks in modeling such complex systems.

POSTER-115: Neural Spectral Methods

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Machine learning for PDE spectral methods neural network differentiation spectral loss

Scores: [ 8 8 8 3 ]

We present Neural Spectral Methods, a technique to solve parametric Partial Differential Equations (PDEs), grounded in classical spectral methods. In contrast to current machine learning approaches which enforce PDE constraints by minimizing the numerical quadrature of the residuals in the spatiotemporal domain, our method uses orthogonal bases to learn PDE solutions as mappings between spectral coefficients. By leveraging Parseval's Identity, we introduce a new training strategy through a \textit{spectral loss}. This enables more efficient differentiation through the neural network, and substantially reduces training complexity. During inference time, the computational cost of our method remains constant, regardless of the spatiotemporal resolution of the domain. Our experimental results demonstrate that our method significantly outperforms previous machine learning approaches in terms of speed and accuracy by several orders of magnitude. When compared to numerical solvers of the same accuracy, our method demonstrates a 10× increase in performance speed.

POSTER-116: MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Keywords: Materials design diffusion model metal-organic framework carbon capture generative model AI for Science

Scores: [ 8 8 8 8 ]

Metal-organic frameworks (MOFs) are of immense interest in applications such as gas storage and carbon capture due to their exceptional porosity and tunable chemistry. Their modular nature has enabled the use of template-based methods to generate hypothetical MOFs by combining molecular building blocks in accordance with known network topologies. However, the ability of these methods to identify top-performing MOFs is often hindered by the limited diversity of the resulting chemical space. In this work, we propose MOFDiff: a coarse-grained (CG) diffusion model that generates CG MOF structures through a denoising diffusion process over the coordinates and identities of the building blocks. The all-atom MOF structure is then determined through a novel assembly algorithm. As the diffusion model generates 3D MOF structures by predicting scores in E(3), we employ equivariant graph neural networks that respect the permutational and roto-translational symmetries. We comprehensively evaluate our model's capability to generate valid and novel MOF structures and its effectiveness in designing outstanding MOF materials for carbon capture applications with molecular simulations.

POSTER-117: GRAPH-CONSTRAINED DIFFUSION FOR END-TO-END PATH PLANNING

Primary Area: applications to robotics, autonomy, planning

Keywords: path planning diffusion model

Scores: [ 6 6 8 6 6 ]

Path planning underpins various applications such as transportation, logistics, and robotics.Conventionally, path planning is formulated with explicit optimization objectives such as distance or time.However, real-world data reveals that user intentions are hard-to-model, suggesting a need for data-driven path planning that implicitly incorporates the complex user intentions.In this paper, we propose GDP, a diffusion-based model for end-to-end data-driven path planning.It effectively learns path patterns via a novel diffusion process that incorporates constraints from road networks, and plans paths as conditional path generation given the origin and destination as prior evidence.GDP is the first solution that bypasses the traditional search-based frameworks, a long-standing performance bottleneck in path planning.We validate the efficacy of GDP on two real-world datasets.Our GDP beats strong baselines by 14.2% ~ 43.5% and achieves state-of-the-art performances.

POSTER-118: Large Language Models as Generalizable Policies for Embodied Tasks

Primary Area: applications to robotics, autonomy, planning

Keywords: Embodied AI Reinforcement Learning Large Language Models Foundational Models

Scores: [ 6 6 6 5 ]

We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language conditioned, massively multi-task, embodied AI problems we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement.

ORAL-5: Learning Interactive Real-World Simulators

Primary Area: applications to robotics, autonomy, planning

Keywords: Generative simulator simulating real-world interactions planning reinforcement learning vision language models video generation

Scores: [ 8 8 8 6 ]

Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different axes (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, UniSim can emulate how humans and agents interact with the world by simulating the visual outcome of both high-level instructions such as “open the drawer” and low-level controls such as “move by x,y” from otherwise static scenes and objects. There are numerous use cases for such a real-world simulator. As an example, we use UniSim to train both high-level vision-language planners and low-level reinforcement learning policies, each of which exhibit zero-shot real-world transfer after training purely in a learned real-world simulator. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience in UniSim, opening up even wider applications.

SPOTLIGHT-23: Grounding Language Plans in Demonstrations Through Counter-Factual Perturbations

Primary Area: applications to robotics, autonomy, planning

Keywords: Grounding LLM Learning Mode Abstractions for Manipulation Learning from Demonstration Robotics Task and Motion Planning

Scores: [ 8 6 8 ]

Grounding the abstract knowledge captured by Large Language Models (LLMs) inphysical domains remains a pivotal yet unsolved problem. Whereas prior workshave largely focused on leveraging LLMs for generating abstract plans in symbolicspaces, this work uses LLMs to guide the learning for structures and constraintsin robot manipulation tasks. Specifically, we borrow from manipulation plan-ning literature the concept of mode families, defining specific types of motionconstraints among sets of objects, to serve as an intermediate layer that connectshigh-level language representations with low-level physical trajectories. By lo-cally perturbing a small set of successful human demonstrations, we augment thedataset with additional successful executions as well as counterfactuals that failthe task. Our explanation-based learning framework trains neural network-basedclassifiers to differentiate success task executions from failures and as a by-productlearns classifiers that ground low-level states into mode families without denselabeling. This further enables us to learn structured policies for the target task.Experimental validation in both 2D continuous-space and robotic manipulationenvironments demonstrates the robustness of our mode-basedimitation methods under external perturbations.

POSTER-119: Learning with Language-Guided State Abstractions

Primary Area: applications to robotics, autonomy, planning

Keywords: human-ai interaction state abstractions

Scores: [ 5 6 6 ]

We describe a framework for using natural language to design state abstractions for imitation learning. Generalizable policy learning in high-dimensional observation spaces is facilitated by well-designed state representations, which can surface important features of an environment and hide irrelevant ones. Today, these state representations must often be manually specified, or derived from other labor-intensive labeling procedures. Our method, LGA (\textit{language-guided abstraction}), uses a combination of natural language supervision and background knowledge from language models (LMs) to automatically build state representations tailored to unseen tasks. In LGA, a user first provides a (possibly incomplete) description of a target task in natural language; next, a pre-trained LM translates this task description into a state abstraction function that masks out irrelevant features; finally, an imitation policy is trained using a small number of demonstrations and LGA-generated abstract states. Experiments on simulated robotic tasks show that LGA yields state abstractions similar to human-designed ones, but in a fraction of the time, and that these abstractions improve generalization and robustness in the presence of spurious correlations and ambiguous specifications. We illustrate the utility of the learned abstractions on mobile manipulation tasks with a Spot robot.

POSTER-120: Open the Black Box: Step-based Policy Updates for Temporally-Correlated Episodic Reinforcement Learning

Primary Area: applications to robotics, autonomy, planning

Keywords: Reinforcement Learning Robot Manipulation Movement Primitives Trajectory Correlation Modeling

Scores: [ 6 6 6 6 ]

In applying Reinforcement Learning (RL) to robot trajectory generation, two key challenges commonly emerge. First, the stochastic exploration strategies of step-based RL are unable to produce high-order smooth trajectories. Second, existing methods struggle with effectively modeling movement correlations among different time steps and degrees of freedom, which are crucial for complex tasks and safety measures. Episodic RL methods address these challenges by employing parameterized trajectory generators like Movement Primitives (MP), framing the problem as contextual optimization. While effective in generating smooth trajectories and capturing some movement correlations, these methods lack in utilizing temporal structure within trajectories, resulting in suboptimal sample efficiency. We introduce the Temporally-Correlated Episodic RL (TCE) method to address these shortcomings. TCE enhances exploration efficiency by sampling multi-second trajectories in a parameterized space, thereby assuring high-order trajectory smoothness and the capture of movement correlations. For policy updates, TCE breaks down the entire trajectory into smaller segments, evaluating each segment on its distinct advantages. This nuanced approach allows us to utilize temporal information obtained during trajectory execution.We validate the effectiveness of TCE through experiments on various robotic manipulation tasks, showing its advantages over step-based and episodic RL approaches.

POSTER-121: DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation

Primary Area: applications to robotics, autonomy, planning

Keywords: 3D Lane Detection Multi-modal

Scores: [ 10 3 6 6 6 ]

Accurate 3D lane estimation is pivotal for autonomous driving safety. However, prevalent monocular techniques suffer from depth loss and lighting variations, hampering accurate 3D lane detection. Conversely, LiDAR points offer geometric cues and enable precise localization. In this paper, we present DV-3DLane, a novel end-to-end Dual-View multi-modal 3D Lane detection framework that synergizes the strengths of both images and LiDAR points. Technically, we propose to learn multi-modal features in dual-view spaces, i.e., perspective view (PV) and bird's-eye-view (BEV), effectively tapping the modal-specific information. To achieve this, we propose three designs: 1) We design a bidirectional feature fusion strategy that integrates multi-modal features into each view space, exploiting the unique strengths of each space. 2) We propose a unified query generation approach that offers queries lane-aware prior knowledge from both views. 3) We introduce a 3D dual-view deformable attention mechanism, aggregating discriminative features from both PV and BEV into queries for accurate 3D lane detection. Extensive experiments on the public benchmark, OpenLane, demonstrate the efficacy and efficiency of DV-3DLane, i.e., it achieves state-of-the-art performance, with a remarkable 9.5 gain in F1 score and a substantial 49.8% reduction in errors.

ORAL-6: ASID: Active Exploration for System Identification and Reconstruction in Robotic Manipulation

Primary Area: applications to robotics, autonomy, planning

Keywords: sim2real system identification exploration

Scores: [ 6 8 5 8 ]

Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, real-world RL can be unsafe and sample inefficient, making it impractical in many safety-critical domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model, and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer.

POSTER-122: TASK PLANNING FOR VISUAL ROOM REARRANGEMENT UNDER PARTIAL OBSERVABILITY

Primary Area: applications to robotics, autonomy, planning

Keywords: Task Planning Object Search Deep-RL

Scores: [ 6 6 8 ]

This paper presents a novel hierarchical task planner under partial observabilitythat empowers an embodied agent to use visual input to efficiently plan a sequenceof actions for simultaneous object search and rearrangement in an untidy room,to achieve a desired tidy state. The paper introduces (i) a novel Search Networkthat utilizes commonsense knowledge from large language models to find unseenobjects, (ii) a Deep RL network trained with proxy reward, along with (iii) a novelgraph-based state representation to produce a scalable and effective planner thatinterleaves object search and rearrangement to minimize the number of steps takenand overall traversal of the agent, as well as to resolve blocked goal and swapcases, and (iv) a sample-efficient cluster-biased sampling for simultaneous trainingof the proxy reward network along with the Deep RL network. Furthermore,the paper presents new metrics and a benchmark dataset - RoPOR, to measurethe effectiveness of rearrangement planning. Experimental results show that ourmethod significantly outperforms the state-of-the-art rearrangement methods Weihset al. (2021a); Gadre et al. (2022); Sarch et al. (2022); Ghosh et al. (2022).

SPOTLIGHT-24: Neural Contractive Dynamical Systems

Primary Area: applications to robotics, autonomy, planning

Keywords: learning from demonstration dynamical systems contraction theory

Scores: [ 8 6 5 ]

Stability guarantees are crucial when ensuring a fully autonomous robot does not take undesirable or potentially harmful actions.Unfortunately, global stability guarantees are hard to provide in dynamical systems learned from data, especially when the learned dynamics are governed by neural networks. We propose a novel methodology to learn neural contractive dynamical systems, where our neural architecture ensures contraction, and hence, global stability. To efficiently scale the method to high-dimensional dynamical systems, we develop a variant of the variational autoencoder that learns dynamics in a low-dimensional latent representation space while retaining contractive stability after decoding. We further extend our approach to learning contractive systems on the Lie group of rotations to account for full-pose end-effector dynamic motions. The result is the first highly flexible learning architecture that provides contractive stability guarantees with capability to perform obstacle avoidance. Empirically, we demonstrate that our approach encodes the desired dynamics more accurately than the current state-of-the-art, which provides less strong stability guarantees.

POSTER-123: Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks

Primary Area: applications to robotics, autonomy, planning

Keywords: Long-horizon robot learning reinforcement learning LLMs

Scores: [ 6 8 6 ]

Large Language Models (LLMs) are highly capable of performing planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL is capable of solving 20+ challenging single and multi-stage robotics tasks on four benchmarks at success rates of over 80% from raw visual input, out-performing language-based, classical, and end-to-end approaches. Video results and code at https://planseqlearn.github.io/

POSTER-124: SEPT: Towards Efficient Scene Representation Learning for Motion Prediction

Primary Area: applications to robotics, autonomy, planning

Keywords: motion prediction autonomous driving self-supervised learning

Scores: [ 6 8 6 8 ]

Motion prediction is crucial for autonomous vehicles to operate safely in complex traffic environments. Extracting effective spatiotemporal relationships among traffic elements is key to accurate forecasting. Inspired by the successful practice of pretrained large language models, this paper presents SEPT, a modeling framework that leverages self-supervised learning to develop powerful spatiotemporal understanding for complex traffic scenes. Specifically, our approach involves three masking-reconstruction modeling tasks on scene inputs including agents' trajectories and road network, pretraining the scene encoder to capture kinematics within trajectory, spatial structure of road network, and interactions among roads and agents. The pretrained encoder is then finetuned on the downstream forecasting task. Extensive experiments demonstrate that SEPT, without elaborate architectural design or manual feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.

POSTER-125: Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Primary Area: applications to robotics, autonomy, planning

Keywords: Web Navigation Foundation Models Large Language Models Instruction Finetuning Decision Making Multimodal Document Understanding

Scores: [ 5 8 5 ]

The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data.In this work, we study data-driven offline training for web agents with vision-language foundation models.We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type.WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations.We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B.Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web.We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.

POSTER-126: DIFFTACTILE: A Physics-based Differentiable Tactile Simulator for Contact-rich Robotic Manipulation

Primary Area: applications to robotics, autonomy, planning

Keywords: Tactile sensing Simulation Robotic manipulation

Scores: [ 8 6 6 6 ]

We introduce DIFFTACTILE, a physics-based and fully differentiable tactile simulation system designed to enhance robotic manipulation with dense and physically-accurate tactile feedback. In contrast to prior tactile simulators which primarily focus on manipulating rigid bodies and often rely on simplified approximations to model stress and deformations of materials in contact, DIFFTACTILE emphasizes physics-based contact modeling with high fidelity, supporting simulations of diverse contact modes and interactions with objects possessing a wide range of material properties. Our system incorporates several key components, including a Finite Element Method (FEM) -based soft body model for simulating the sensing elastomer, a multi-material simulator for modeling diverse object types (such as elastic, plastic, cables) under manipulation, a penalty-based contact model for handling contact dynamics. The differentiable nature of our system facilitates gradient-based optimization for both 1) refining physical properties in simulation using real-world data, hence narrowing the sim-to-real gap, and 2) efficient learning of tactile-assisted grasping and contact-rich manipulation skills. Additionally, we introduce a method to infer the optical response of our tactile sensor to contact using an efficient pixel-based neural module. We anticipate that DIFFTACTILE will serve as a useful platform for studying contact-rich manipulations, leveraging the benefits of dense tactile feedback and differentiable physics. The source codes of DIFFTACTILE will be publicly available.

SPOTLIGHT-25: Generalized Policy Iteration using Tensor Approximation for Hybrid Control

Primary Area: applications to robotics, autonomy, planning

Keywords: Optimal Control Hybrid Actions Robotics Approximate Dynamic Programming Tensor Approximation

Scores: [ 8 8 8 6 ]

Control of dynamic systems involving hybrid actions is a challenging task in robotics. To address this, we present a novel algorithm called Generalized Policy Iteration using Tensor Train (TTPI) that belongs to the class of Approximate Dynamic Programming (ADP). We use a low-rank tensor approximation technique called Tensor Train (TT) to approximate the state-value and advantage function which enables us to efficiently handle hybrid systems. We demonstrate the superiority of our approach over previous baselines for some benchmark problems with hybrid action spaces. Additionally, the robustness and generalization of the policy for hybrid systems are showcased through a real-world robotics experiment involving a non-prehensile manipulation task which is considered to be a highly challenging control problem.

POSTER-127: End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon

Primary Area: applications to robotics, autonomy, planning

Keywords: Navigation Embodied AI Perception

Scores: [ 8 6 6 8 ]

Most recent work in goal oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy requiring to solve an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottleneck in perception, extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and SOTA performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.

POSTER-128: Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models

Primary Area: applications to robotics, autonomy, planning

Keywords: robot learning diffusion model

Scores: [ 6 6 8 5 ]

If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller attains. Specifically, we fine-tune InstructPix2Pix on robot data such that it outputs a hypothetical future observation given the robot's current observation and a language command. We then use the same robot data to train a low-level goal-conditioned policy to reach a given image observation. We find that when these components are combined, the resulting system exhibits robust generalization capabilities. The high-level planner utilizes its Internet-scale pre-training and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization than conventional language-conditioned policies. We demonstrate that this approach solves real robot control tasks involving novel objects, distractors, and even environments, both in the real world and in simulation. The project website can be found at http://subgoal-image-editing.github.io.

POSTER-129: UC-NERF: Neural Radiance Field for under-calibrated multi-view cameras

Primary Area: applications to robotics, autonomy, planning

Keywords: Neural radiance field Multi-cameras under-calibration color inconsistency

Scores: [ 6 6 5 6 6 ]

Multi-camera setups find widespread use across various applications, such as autonomous driving, as they greatly expand sensing capabilities. Despite the fast development of Neural radiance field (NeRF) techniques and their wide applications in both indoor and outdoor scenes, applying NeRF to multi-camera systems remains very challenging. This is primarily due to the inherent under-calibration issues in multi-camera setup, including inconsistent imaging effects stemming from separately calibrated image signal processing units in diverse cameras, and system errors arising from mechanical vibrations during driving that affect relative camera poses.In this paper, we present UC-NeRF, a novel method tailored for novel view synthesis in under-calibrated multi-view camera systems.Firstly, we propose a layer-based color correction to rectify the color inconsistency in different image regions. Second, we propose virtual warping to generate more viewpoint-diverse but color-consistent virtual views for color correction and 3D recovery. Finally, a spatiotemporally constrained pose refinement is designed for more robust and accurate pose calibration in multi-camera systems.Our method not only achieves state-of-the-art performance of novel view synthesis in multi-camera setups, but also effectively facilitates depth estimation in large-scale outdoor scenes with the synthesized novel views.

POSTER-130: Trajeglish: Learning the Language of Driving Scenarios

Primary Area: applications to robotics, autonomy, planning

Keywords: self-driving traffic modeling autonomous vehicles simulation tracking transformer

Scores: [ 6 6 6 ]

A longstanding challenge for self-driving development is the ability to simulate dynamic driving scenarios seeded from recorded driving logs. Given an initial scene observed during a test drive, we seek the ability to sample plausible scene-consistent future trajectories for all agents in the scene, even when the actions for a subset of agents are chosen by an external source, such as a black-box self-driving planner. In order to model the complicated spatial and temporal interaction across agents in driving scenarios, we propose to tokenize the motion of dynamic agents and use tools from language modeling to model the full sequence of multi-agent actions. Our traffic model explicitly captures intra-timestep dependence between agents, which we show is essential for simulation given only a single frame of historical scene context, as well as enabling improvements when provided longer historical context. We demonstrate competitive results sampling scenarios given initializations from the Waymo Open Dataset with full autonomy as well as partial autonomy, and show that the representations learned by our model can quickly be adapted to improve performance on nuScenes, a considerably smaller dataset. We additionally use density estimates from our model to quantify the saliency of context length and intra-timestep interaction for the traffic modeling task.

SPOTLIGHT-26: Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data

Primary Area: applications to robotics, autonomy, planning

Keywords: reinforcement learning self-supervised learning contrastive learning goal-conditioned RL offline RL robotics

Scores: [ 5 8 8 8 ]

Robotic systems that rely primarily on self-supervised learning have the potential to decrease the amount of human annotation and engineering effort required to learn control strategies. In the same way that prior robotic systems have leveraged self-supervised techniques from computer vision (CV) and natural language processing (NLP), our work builds on prior work showing that the reinforcement learning (RL) itself can be cast as a self-supervised problem: learning to reach any goal without human-specified rewards or labels. Despite the seeming appeal, little (if any) prior work has demonstrated how self-supervised RL methods can be practically deployed on robotic systems. By first studying a challenging simulated version of this task, we discover design decisions about architectures and hyperparameters that increase the success rate by 2×. These findings lay the groundwork for our main result: we demonstrate that a self-supervised RL algorithm based on contrastive learning can solve real-world, image-based robotic manipulation tasks, with tasks being specified by a single goal image provided after training.

ORAL-7: MetaGPT: Meta Programming for Multi-Agent Collaborative Framework

Primary Area: applications to robotics, autonomy, planning

Keywords: Autonomous Agent Meta Programming Multi-Agent Society Group Intelligence

Scores: [ 8 8 3 ]

Recently, remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Previous LLM-based multi-agent systems can already solve simple dialogue tasks. More complex tasks, however, face challenges through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems.

POSTER-131: Fourier Transporter: Bi-Equivariant Robotic Manipulation in 3D

Primary Area: applications to robotics, autonomy, planning

Keywords: Robot Learning Geometric Deep Learning Robotic Manipulation Equivariant deep learning

Scores: [ 6 6 8 5 ]

Many complex robotic manipulation tasks can be decomposed as a sequence of pick and place actions. Training a robotic agent to learn this sequence over many different starting conditions typically requires many iterations or demonstrations, especially in 3D environments. In this work, we propose Fourier Transporter (FourTran) which leverages the two-fold SE(d)×mathrmSE(d) symmetry in the pick-place problem to achieve much higher sample efficiency. FourTran is an open-loop behavior cloning method trained using expert demonstrations to predict pick-place actions on new environments. FourTran is constrained to incorporate symmetries of the pick and place actions independently. Our method utilizes a fiber space Fourier transformation that allows for memory-efficient construction. We test our proposed network on the RLbench benchmark and achieve state-of-the-art results across various tasks.

POSTER-132: CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning

Primary Area: applications to robotics, autonomy, planning

Keywords: Human Motion Driven Control Legged Locomotion Unsupervised Reinforcement Learning

Scores: [ 6 6 8 ]

Human motion driven control (HMDC) is an effective approach for generating natural and compelling robot motions while preserving high-level semantics. However, establishing the correspondence between humans and robots with different body structures is not straightforward due to the mismatches in kinematics and dynamics properties, which causes intrinsic ambiguity to the problem. Many previous algorithms approach this motion retargeting problem with unsupervised learning, which requires the prerequisite skill sets. However, it will be extremely costly to learn all the skills without understanding the given human motions, particularly for high-dimensional robots. In this work, we introduce CrossLoco, a guided unsupervised reinforcement learning framework that simultaneously learns robot skills and their correspondence to human motions. Our key innovation is to introduce a cycle-consistency-based reward term designed to maximize the mutual information between human motions and robot states. We demonstrate that the proposed framework can generate compelling robot motions by translating diverse human motions, such as running, hopping, and dancing. We quantitatively compare our CrossLoco against the manually engineered and unsupervised baseline algorithms along with the ablated versions of our framework and demonstrate that our method translates human motions with better accuracy, diversity, and user preference. We also showcase its utility in other applications, such as synthesizing robot movements from

POSTER-133: Enhancing Human-AI Collaboration Through Logic-Guided Reasoning

Primary Area: applications to robotics, autonomy, planning

Keywords: Logic rule human-robot interaction and collaboration

Scores: [ 8 5 6 6 ]

We present a principled framework aimed at utilizing logical rules to enhance human-robot perception and collaboration effectively. Logical rules not only offer interpretable explanations for predictions but also possess the ability to generalize to various tasks, making them invaluable for learning. In this paper, we introduce a set of logic rules as knowledge, automatically derived from observational data, to deduce human goals and guide human-like agents in successful collaboration. We treat logic rules as latent variables and simultaneously trains a rule generator and a reasoning predictor utilizing these rules. In the first stage, we evaluate the posterior distribution over the latent rule content by learning rule embeddings from observed triplets, jointly representing entities and relations in a unified embedding space. These learned embeddings enable us to calculate a confidence score for each rule, indicating its consistency with observed triplets. In the second stage, we jointly optimize the rule generator and model parameters by maximizing the current expected log-likelihood. Extensive experiments validate each component of our framework, and results on multiple benchmarks demonstrate that our model outperforms the majority of existing approaches.

POSTER-134: OpenNerf: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

Primary Area: applications to robotics, autonomy, planning

Keywords: 3D open set segmentation neural radiance fields VLM features

Scores: [ 6 5 8 5 ]

Large visual-language models (VLMs), like CLIP, enable open-set image segmentation to segment arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, i.e., where models can only segment classes from a pre-defined training set. More recently, first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of the visual-language models. Indeed, point cloud and 3D meshes typically have a lower resolution than images and the reconstructed 3D scene geometry might not project well to the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF, however our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. Our OpenNeRF further leverages NeRF’s ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.

POSTER-135: Social-Transmotion: Promptable Human Trajectory Prediction

Primary Area: applications to robotics, autonomy, planning

Keywords: human trajectory prediction robot navigation autonomous driving attention mechanism

Scores: [ 6 5 5 3 ]

Accurate human trajectory prediction is crucial for applications such as autonomous vehicles, robotics, and surveillance systems. Yet, existing models often fail to fully leverage the non-verbal social cues human subconsciously communicate when navigating the space. To address this, we introduce \textit{Social-Transmotion}, a generic model that exploits the power of transformers to handle diverse and numerous visual cues, capturing the multi-modal nature of human behavior. We translate the idea of a prompt from Natural Language Processing (NLP) to the task of human trajectory prediction, where a prompt can be a sequence of x-y coordinates on the ground, bounding boxes or body poses. This, in turn, augments trajectory data, leading to enhanced human trajectory prediction.Our model exhibits flexibility and adaptability by capturing spatiotemporal interactions between pedestrians based on the available visual cues, whether they are poses, bounding boxes, or a combination thereof.By the masking technique, we ensure our model's effectiveness even when certain visual cues are unavailable, although performance is further boosted with the presence of comprehensive visual data.We delve into the merits of using 2d versus 3d poses, and a limited set of poses. Additionally, we investigate the spatial and temporal attention map to identify which keypoints and frames of poses are vital for optimizing human trajectory prediction.Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.

POSTER-136: THOUGHT PROPAGATION: AN ANALOGICAL APPROACH TO COMPLEX REASONING WITH LARGE LANGUAGE MODELS

Primary Area: applications to robotics, autonomy, planning

Keywords: LLM Complex Reasoning Language Model Reasoning

Scores: [ 6 6 8 ]

POSTER-137: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

Primary Area: applications to robotics, autonomy, planning

Keywords: discrete diffusion; world model; autonomous driving

Scores: [ 6 8 6 10 6 ]

Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer into the discrete diffusion framework with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, our model reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotic agents.

POSTER-138: DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models

Primary Area: applications to robotics, autonomy, planning

Keywords: Autonmous Driving Large Language Model Embodied AI

Scores: [ 6 6 8 5 ]

Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability. Drawing inspiration from the knowledge-driven nature of human driving, we explore the question of how to instill similar capabilities into autonomous driving systems and summarize a paradigm that integrates an interactive environment, a driver agent, as well as a memory component to address this question. Leveraging large language models (LLMs) with emergent abilities, we propose the DiLu framework, which combines a Reasoning and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and evolve continuously. Extensive experiments prove DiLu’s capability to accumulate experience and demonstrate a significant advantage in generalization ability over reinforcement learning-based methods. Moreover, DiLu is able to directly acquire experiences from real-world datasets which highlights its potential to be deployed on practical autonomous driving systems. To the best of our knowledge, we are the first to instill knowledge-driven capability into autonomous driving systems from the perspective of how humans drive. The demonstration video of the DiLu pipeline is attached in the supplementary material and we will open source the DiLu framework upon publication.

SPOTLIGHT-27: GenSim: Generating Robotic Simulation Tasks via Large Language Models

Primary Area: applications to robotics, autonomy, planning

Keywords: LLM Code Generation Robotic Simulation Multi-task Policy Learning

Scores: [ 8 8 8 8 ]

Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language models' (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs including finetuned GPTs and Code Llama on code generation for robotic simulation tasks. Furthermore, we observe that LLMs-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See our project website (https://gen-sim.github.io) and demo (https://huggingface.co/spaces/Gen-Sim/Gen-Sim) for visualizations and open-source models and datasets.

POSTER-139: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Primary Area: applications to robotics, autonomy, planning

Keywords: Visual Robot Manipulation Video Generative Pre-Training Causal Transformer

Scores: [ 8 6 5 3 ]

Generative pre-trained models have demonstrated remarkable effectiveness in language and vision domains by learning useful representations. In this paper, we extend the scope of this effectiveness by showing that visual robot manipulation can significantly benefit from large-scale video generative pre-training. We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly finetuned on robot data after pre-trained on a large-scale video dataset. We perform extensive experiments on the challenging CALVIN benchmark and a real robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%. When trained on 10% data of the full dataset, GR-1 achieves a success rate of 77.8%, while the best baseline method achieves 66.8%. In the zero-shot generalization setting, GR-1 improves the success rate from 53.3% to 85.4%. In real robot experiments, GR-1 also outperforms the comparing baseline method. We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation. Code will be made available.

SPOTLIGHT-28: GIM: Learning Generalizable Image Matcher From Internet Videos

Primary Area: applications to robotics, autonomy, planning

Keywords: Image Matching Pose Estimation 3D Reconstruction

Scores: [ 6 8 10 8 ]

Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types (e.g., indoor vs. outdoor) and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel videos. These labels are filtered by robust fitting, and then enhanced by propagating them to distant frames. The final model is trained on propagated data with strong augmentations. Not relying on complex 3D reconstruction makes GIM much more efficient and less likely to fail than standard SfM-and-MVS based frameworks. We also propose ZEB, the first zero-shot evaluation benchmark for image matching. By mixing data from diverse domains, ZEB can thoroughly assess the cross-domain generalization performance of different methods. Experiments demonstrate the effectiveness and generality of GIM. Applying GIM consistently improves the zero-shot performance of 3 state-of-the-art image matching architectures as the number of downloaded videos increases (Fig. 1 (a)); with 50 hours of YouTube videos, the relative zero-shot performance improves by 8.4% − 18.1%. GIM also enables generalization to extreme cross-domain data such as Bird Eye View (BEV) images of projected 3D point clouds (Fig. 1 (c)). More importantly, our single zero-shot model consistently outperforms domain-specific baselines when evaluated on downstream tasks inherent to their respective domains. The code will be released upon acceptance.

SPOTLIGHT-29: Learning to Act from Actionless Videos through Dense Correspondences

Primary Area: applications to robotics, autonomy, planning

Keywords: Video-Based Policy Video Dense Correspondence

Scores: [ 6 8 10 6 ]

In this work, we present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments from few video demonstrations without using any action annotations. Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals. By synthesizing videos that “hallucinate” robot executing actions and in combination with dense correspondences between frames, our approach can infer the closed-formed action to execute to an environment without the need of any explicit action labels. This unique capability allows us to train the policy solely based on RGB videos and deploy learned policies to various robotic tasks. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day.

POSTER-140: LaneSegNet: Map Learning with Lane Segment Perception for Autonomous Driving

Primary Area: applications to robotics, autonomy, planning

Keywords: Laneline detection autonomous driving topology reasoning

Scores: [ 6 5 5 8 ]

A map, as crucial information for downstream applications of an autonomous driving system, is usually represented in lanelines or centerlines. However, existing literature on map learning primarily focuses on either detecting geometry-based lanelines or perceiving topology relationships of centerlines. Both of these methods ignore the intrinsic relationship of lanelines and centerlines, that lanelines bind centerlines. While simply predicting both types of lane in one model is mutually excluded in learning objective, we advocate lane segment as a new representation that seamlessly incorporates both geometry and topology information. Thus, we introduce LaneSegNet, the first end-to-end mapping network generating lane segments to obtain a complete representation of the road structure. Our algorithm features two key modifications. One is a lane attention module to capture pivotal region details within the long-range feature space. Another is an identical initialization strategy for reference points, which enhances the learning of positional priors for lane attention. On the OpenLane-V2 dataset, LaneSegNet outperforms previous counterparts by a substantial gain across three tasks, i.e., map element detection (+4.8 mAP), centerline perception (+6.9 DET_l), and the newly defined one, lane segment perception (+5.6 mAP). Furthermore, it obtains a real-time inference speed of 14.7 FPS. Code would be released to the public.

SPOTLIGHT-30: Thin-Shell Object Manipulations With Differentiable Physics Simulations

Primary Area: applications to robotics, autonomy, planning

Keywords: differentiable physics simulation thin-shell object manipulation

Scores: [ 8 8 8 8 8 ]

In this work, we aim to teach robots to manipulate various thin-shell materials. Prior works studying thin-shell object manipulation mostly rely on heuristic policies or learn policies from real-world video demonstrations, and only focus on limited material types and tasks (e.g., cloth unfolding). However, these approaches face significant challenges when extended to a wider variety of thin-shell materials and a diverse range of tasks.On the other hand, while virtual simulations are shown to be effective in diverse robot skill learning and evaluation, prior thin-shell simulation environments only support a subset of thin-shell materials, which also limits their supported range of tasks. To fill in this gap, we introduce ThinShellLab - a fully differentiable simulation platform tailored for robotic interactions with diverse thin-shell materials possessing varying material properties, enabling flexible thin-shell manipulation skill learning and evaluation. Building on top of our developed simulation engine, we design a diverse set of manipulation tasks centered around different thin-shell objects. Our experiments suggest that manipulating thin-shell objects presents several unique challenges: 1) thin-shell manipulation relies heavily on frictional forces due to the objects' co-dimensional nature, 2) the materials being manipulated are highly sensitive to minimal variations in interaction actions, and 3) the constant and frequent alteration in contact pairs makes trajectory optimization methods susceptible to local optima, and neither standard reinforcement learning algorithms nor trajectory optimization methods (either gradient-based or gradient-free) are able to solve the tasks alone. To overcome these challenges, we present an optimization scheme that couples sampling-based trajectory optimization and gradient-based optimization, boosting both learning efficiency and converged performance across various proposed tasks. In addition, the differentiable nature of our platform facilitates a smooth sim-to-real transition. By tuning simulation parameters with a minimal set of real-world data, we demonstrate successful deployment of the learned skills to real-robot settings. ThinShellLab will be publicly available. Video demonstration and more information can be found on the project website https://thinshelllab.github.io.

POSTER-141: CORN: Contact-based Object Representation for Nonprehensile Manipulation of General Unseen Objects

Primary Area: applications to robotics, autonomy, planning

Keywords: pretraining robotics manipulation object representation representation learning

Scores: [ 8 5 5 10 ]

Nonprehensile manipulation is essential for manipulating objects that are too thin, large, or otherwise ungraspable in the wild. To sidestep the difficulty of contact modeling in conventional modeling-based approaches, reinforcement learning (RL) has recently emerged as a promising alternative. However, previous RL approaches either lack the ability to generalize over diverse object shapes, or use simple action primitives that limit the diversity of robot motions. Furthermore, using RL over diverse object geometry is challenging due to the high cost of training a policy that takes in high-dimensional sensory inputs. We propose a novel contact-based object representation and pretraining pipeline to tackle this. To enable massively parallel training, we leverage a lightweight patch-based transformer architecture for our encoder that processes point clouds, thus scaling our training across thousands of environments. Compared to learning from scratch, or other shape representation baselines, our representation facilitates both time- and data-efficient learning. We validate the efficacy of our overall system by zero-shot transferring the trained policy to novel real-world objects. We highly recommend the video attached in the supplementary material. Code and videos are available at \url{https://sites.google.com/view/contact-non-prehensile}.

SPOTLIGHT-31: Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

Primary Area: applications to robotics, autonomy, planning

Keywords: Language Agent AI Agent Reinforcement Learning

Scores: [ 6 8 3 ]

Recent months have seen the emergence of a powerful new trend in which large language models (LLMs) are augmented to become autonomous language agents capable of performing objective oriented multi-step tasks on their own, rather than merely responding to queries from human users. Most existing language agents, however, are not optimized using environment-specific rewards. Although some agents enable iterative refinement through verbal feedback, they do not reason and plan in ways that are compatible with gradient-based learning from rewards. This paper introduces a principled framework for reinforcing large language agents by learning a retrospective model, which automatically tunes the language agent prompts from environment feedback through policy gradient. Specifically, our proposed agent architecture learns from rewards across multiple environments and tasks, for fine-tuning a pre-trained language model which refines the language agent prompt by summarizing the root cause of prior failed attempts and proposing action plans. Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment.

POSTER-142: Fleet Policy Learning via Weight Merging and An Application to Robotic Tool-Use

Primary Area: applications to robotics, autonomy, planning

Keywords: Fleet Learning Weight Merging Multi-task Policy Learning

Scores: [ 8 5 8 ]

Fleets of robots ingest massive amounts of streaming data generated by interacting with their environments, far more than those that can be stored or transmitted with ease. At the same time, we hope that teams of robots can co-acquire diverse skills through their experiences in varied settings. How can we enable such fleet-level learning without having to transmit or centralize fleet-scale data? In this paper, we investigate distributed learning of policies as a potential solution. To efficiently merge policies in the distributed setting, we propose fleet-merge, an instantiation of distributed learning that accounts for the symmetries that can arise in learning policies that are parameterized by recurrent neural networks. We show that fleet-merge consolidates the behavior of policies trained on 50 tasks in the Meta-World environment, with the merged policy achieving good performance on nearly all training tasks at test time. Moreover, we introduce a novel robotic tool-use benchmark, fleet-tools, for fleet policy learning in compositional and contact-rich robot manipulation tasks, which might be of broader interest, and validate the efficacy of fleet-merge on the benchmark.

POSTER-143: SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation

Primary Area: applications to robotics, autonomy, planning

Keywords: dexterous manipulaiton one-shot manipulation transfer distilled feature field implicit field

Scores: [ 6 6 6 6 ]

Humans excel at transferring manipulation skills across diverse object shapes, poses, and appearances due to their understanding of semantic correspondences between different instances. To endow robots with a similar high-level understanding, we develop a DFF for 3D scenes, leveraging large 2D vision models to distill semantic features from multiview images. While current research demonstrates advanced performance in reconstructing DFF from dense views, the development of learning a DFF from sparse views is relatively nascent, despite its prevalence in numerous manipulation tasks with fixed cameras. In this work, we introduce \method, a novel method for acquiring view-consistent 3D Distilled Feature Field from sparse RGBD observations, enabling one-shot learning of dexterous manipulations that are transferable to novel scenes. Specifically, we map the image features to the 3D point cloud, allowing for propagation across the 3D space to establish a dense feature field. At the core of SparseDFF is a lightweight feature refinement network, optimized with a contrastive loss between pairwise views after back-projecting the image features onto the 3D point cloud. Additionally, we implement a point-pruning mechanism to augment feature continuity within each local neighborhood. By establishing coherent feature fields on both source and target scenes, we devise an energy function that facilitates the minimization of feature discrepancies w.r.t. the end-effector parameters between the demonstration and the target manipulation. We evaluate our approach using a dexterous hand, mastering real-world manipulations on both rigid and deformable objects, and showcase robust generalization in the face of object and scene-context variations.

POSTER-144: Tree-Planner: Efficient Close-loop Task Planning with Large Language Models

Primary Area: applications to robotics, autonomy, planning

Keywords: Embodied Agents Large Language Models Task Planning

Scores: [ 5 8 5 3 ]

This paper studies close-loop task planning, which refers to the process of generating a sequence of skills (a plan) to accomplish a specific goal while adapting the plan based on real-time observations.Recently, prompting Large Language Models (LLMs) to generate actions iteratively has become a prevalent paradigm due to its superior performance and user-friendliness.However, this paradigm is plagued by two inefficiencies: high token consumption and redundant error correction, both of which hinder its scalability for large-scale testing and applications.To address these issues, we propose Tree-Planner, which reframes task planning with LLMs into three distinct phases: plan sampling, action tree construction, and grounded deciding.Tree-Planner starts by using an LLM to sample a set of potential plans before execution, followed by the aggregation of them to form an action tree.Finally, the LLM performs a top-down decision-making process on the tree, taking into account real-time environmental information.Experiments show that Tree-Planner achieves state-of-the-art performance while maintaining high efficiency.By decomposing LLM queries into a single plan-sampling call and multiple grounded-deciding calls,a considerable partof the prompt are less likely to be repeatedly consumed. As a result, token consumption is reduced by 92.2% compared to the previously best-performing model.Additionally, by enabling backtracking on the action tree as needed, the correction process becomes more flexible, leading to a 40.5% decrease in error corrections.

POSTER-145: ZeroFlow: Scalable Scene Flow via Distillation

Primary Area: applications to robotics, autonomy, planning

Keywords: Scene Flow Distillation Scaling

Scores: [ 5 5 8 ]

Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. State-of-the-art methods use strong priors and test-time optimization techniques, but require on the order of tens of seconds to process full-size point clouds, making them unusable as computer vision primitives for real-time applications such as open world object detection. Feedforward methods are considerably faster, running on the order of tens to hundreds of milliseconds for full-size point clouds, but require expensive human supervision. To address both limitations, we propose Scene Flow via Distillation, a simple, scalable distillation framework that uses a label-free optimization method to produce pseudo-labels to supervise a feedforward model. Our instantiation of this framework, ZeroFlow, achieves state-of-the-art performance on the Argoverse 2 Self-Supervised Scene Flow Challenge while using zero human labels by simply training on large-scale, diverse unlabeled data. At test-time, ZeroFlow is over 1000$\times$ faster than label-free state-of-the-art optimization-based methods on full-size point clouds (34 FPS vs 0.028 FPS) and over 1000$\times$ cheaper to train on unlabeled data compared to the cost of human annotation ($394 vs ~$750,000). To facilitate further research, we will release our code, trained model weights, and high quality pseudo-labels for the Argoverse 2 and Waymo Open datasets.

SPOTLIGHT-32: Robotic Task Generalization via Hindsight Trajectory Sketches

Primary Area: applications to robotics, autonomy, planning

Keywords: robotics robot learning robot manipulation task representation behavior cloning multitask imitation learning goal conditioning

Scores: [ 5 8 8 ]

Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform new tasks that would otherwise be challenging to perform. We find that trajectory sketches strike a balance between being detailed enough to express low-level motion-centric guidance while being coarse enough to allow the learned policy to interpret the trajectory sketch in the context of situational visual observations. In addition, we show how trajectory sketches can provide a useful interface to communicate with robotic policies -- they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image-generating or waypoint-generating methods. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies, when provided the same training data.

POSTER-146: Building Cooperative Embodied Agents Modularly with Large Language Models

Primary Area: applications to robotics, autonomy, planning

Keywords: Large Language Models Embodied Intelligence Multi-Agent Cooperation Human-AI Interaction Communication

Scores: [ 6 6 6 8 ]

In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution. Thus building a Cooperative Embodied Language Agent CoELA, who can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current Open LMs like LLAMA-2 still underperform, we fine-tune a CoLLAMA with data collected with our agents and show how they can achieve promising performance. We also conducted a user study for human-agent interaction and discovered that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation. Videos can be found on the project website https://llm-co.github.io/CoELA/ .

POSTER-147: What Matters to You? Towards Visual Representation Alignment for Robot Learning

Primary Area: applications to robotics, autonomy, planning

Keywords: Robot learning Preference learning Visual reward learning Representation alignment

Scores: [ 6 3 6 6 ]

When operating in service of people, robots need to optimize rewards aligned with end-user preferences. Since robots will rely on raw perceptual inputs like RGB images, their rewards will inevitably use visual representations. Recently there has been excitement in using representations from pre-trained visual models, but key to making these work in robotics is fine-tuning, which is typically done via proxytasks like dynamics prediction or enforcing temporal cycle-consistency. However, all these proxy tasks bypass the human’s input on what matters to them, exacerbating spurious correlations and ultimately leading to robot behaviors that are misaligned with user preferences. In this work, we propose that robots should leverage human feedback to align their visual representations with the end user and disentangle what matters for the task. We propose Representation-Aligned Preference-based Learning (RAPL), a method for solving the visual representation alignment problem and visual reward learning problem through the lens of preference-based learning and optimal transport. Across experiments in X-MAGICAL and in robotic manipulation, we find that RAPL’s reward consistently generates preferred robot behaviors with high sample efficiency, and shows strong zero-shot generalization when the visual representation is learned from a different embodiment than the robot’s.

POSTER-148: Course Correcting Koopman Representations

Primary Area: applications to robotics, autonomy, planning

Keywords: Koopman Autoencoders Dynamical Systems Sequence Modeling Inference-time Methods Planning Unsupervised Learning Representation Learning Robotics

Scores: [ 6 8 6 6 8 ]

Koopman representations aim to learn features of nonlinear dynamical systems (NLDS) which lead to linear dynamics in the latent space. Theoretically, such features can be used to simplify many problems in modeling and control of NLDS. In this work we study autoencoder formulations of this problem, and different ways they can be used to model dynamics, specifically for future state prediction over long horizons. We discover several limitations of predicting future states in the latent space and propose an inference-time mechanism, which we refer to as Periodic Reencoding, for faithfully capturing long term dynamics. We justify this method both analytically and empirically via experiments in low and high dimensional NLDS.

POSTER-149: Video Language Planning

Primary Area: applications to robotics, autonomy, planning

Keywords: Planning Hierarchical Planning Language Models Video Models Long-Horizon Planning

Scores: [ 6 8 8 6 ]

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains -- from multi-object rearrangement, to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).

POSTER-150: AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation

Primary Area: applications to robotics, autonomy, planning

Keywords: interactive segmentation 3D instance segmentation point cloud

Scores: [ 6 5 8 3 ]

During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. Moreover, we also verify its practicality in real-world setups with real user studies.

ORAL-8: A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Primary Area: applications to robotics, autonomy, planning

Keywords: Web Navigation Web Automation Large Language Models Language Model Agents Tool Use Program Synthesis

Scores: [ 5 8 8 8 ]

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation.However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML.We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions.WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those.We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization.We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.

POSTER-151: The HIM Solution for Legged Locomotion: Minimal Sensors, Efficient Learning, and Substantial Agility

Primary Area: applications to robotics, autonomy, planning

Keywords: Reinforcement Learning Quadrupedal Locomotion Internal Model

Scores: [ 5 8 6 6 ]

This paper presents a Hybrid Internal Model (HIM) based method for legged locomotion control in quadruped robots. The method aims to address the limitations of existing learning-based locomotion control paradigms, which suffer from information losses, noisy observations, sample efficiency, and difficulties in developing general locomotion policies for robots with different sensor configurations. The proposed HIM method leverages joint encoders and an Inertial Measurement Unit (IMU) as the only sensors for predicting robot states. Considering the prediction frequency is higher than 50 Hz, the method infers current robot states upon the previous trajectory. The framework consists of two components: the information extractor HIM and the policy network. Unlike previous methods that explicitly model environmental observations such as base velocity and ground elevation, HIM only explicitly estimates velocity and encodes other environment dynamics as an implicit latent embedding. The latent dynamics are learned through contrastive learning, which enhances robustness and adaptability in disturbed and unpredictable environments. The proposed method is validated through simulations in different terrains and real-world experiments on the Unitree Go1 robot. The results demonstrate that HIM achieves substantial agility over challenging terrains with minimal sensors and fast convergence. The method shows promise for broader applications in locomotion control.

POSTER-152: Learning with a Mole: Transferable latent spatial representations for navigation without reconstruction

Primary Area: applications to robotics, autonomy, planning

Keywords: Navigation Embodied AI Perception

Scores: [ 6 8 6 ]

Agents navigating in 3D environments require some form of memory, which should hold a compact and actionable representation of the history of observations useful for decision taking and planning. In most end-to-end learning approaches the representation is latent and usually does not have a clearly defined interpretation, whereas classical robotics addresses this with scene reconstruction resulting in some form of map, usually estimated with geometry and sensor models and/or learning. In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task and without explicitly optimizing reconstruction. The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub episodes branching out from a waypoint and, most importantly, without any direct visual observation. We argue and show that the blindness property is important and forces the (trained) latent representation to be the only means for planning. With probing experiments we show that the learned representation optimizes navigability and not reconstruction. On downstream tasks we show that it is robust to changes in distribution, in particular the sim2real gap, which we evaluate with a real physical robot in a real office building, significantly improving performance.

POSTER-153: Online Continual Learning for Interactive Instruction Following Agents

Primary Area: applications to robotics, autonomy, planning

Keywords: Embodied AI Continual Learning

Scores: [ 6 6 6 6 ]

In learning an embodied agent executing daily tasks via language directives, the literature largely assumes that the agent learns all training data at the beginning. We argue that such a learning scenario is less realistic, since a robotic agent is supposed to learn the world continuously as it explores and perceives it. To take a step towards a more realistic embodied agent learning scenario, we propose two continual learning setups for embodied agents; learning new behaviors (Behavior Incremental Learning, Behavior-IL) and new environments (Environment Incremental Learning, Environment-IL) For the tasks, previous ‘data prior’ based continual learning methods maintain logits for the past tasks. However, the stored information is often insufficiently learned information and requires task boundary information, which might not always be available. Here, we propose to update them based on confidence scores without task boundary information (i.e., task-free) in a moving average fashion, named Confidence-Aware Moving Average (CAMA). In the proposed challenging Behavior-IL and Environment-IL setups, our simple CAMA outperforms prior arts in our empirical validations by noticeable margins.

POSTER-154: LRR: Language-Driven Resamplable Continuous Representation against Adversarial Tracking Attacks

Primary Area: applications to robotics, autonomy, planning

Keywords: Tracking defence spatial-temporal implicit representation languange-image model

Scores: [ 5 6 6 ]

Visual object tracking plays a critical role in visual-based autonomous systems, as it aims to estimate the position and size of the object of interest within a live video. Despite significant progress made in this field, state-of-the-art (SOTA) trackers often fail when faced with adversarial perturbations in the incoming frames. This can lead to significant robustness and security issues when these trackers are deployed in the real world. To achieve high accuracy on both clean and adversarial data, we propose building a spatial-temporal continuous representation using the semantic text guidance of the object of interest. This novel continuous representation enables us to reconstruct incoming frames to maintain semantic and appearance consistency with the object of interest and its clean counterparts. As a result, our proposed method successfully defends against different SOTA adversarial tracking attacks while maintaining high accuracy on clean data. In particular, our method significantly increases tracking accuracy under adversarial attacks with around 90% relative improvement on UAV123, which is even higher than the accuracy on clean data.

POSTER-155: Learning Planning Abstractions from Language

Primary Area: applications to robotics, autonomy, planning

Keywords: Planning and Learning Learning Abstractions Compositional Generalization Robotic Manipulation

Scores: [ 6 8 5 3 ]

This paper presents a framework for learning state and action abstractions in sequential decision-making domains. Our framework, planning abstraction from language (PARL), utilizes language-annotated demonstrations to automatically discover a symbolic and abstract action space and induce a latent state abstraction based on it. PARL consists of three stages: 1) recovering object-level and action concepts, 2) learning state abstractions, abstract action feasibility, and transition models, and 3) applying low-level policies for abstract actions. During inference, given the task description, PARL first makes abstract action plans using the latent transition and feasibility functions, then refines the high-level plan using low-level policies. PARL generalizes across scenarios involving novel object instances and environments, unseen concept compositions, and tasks that require longer planning horizons than settings it is trained on.

POSTER-156: Neural Neighborhood Search for Multi-agent Path Finding

Primary Area: applications to robotics, autonomy, planning

Keywords: Combinatorial Optimization Neural Architecture Multi-agent Path Planning

Scores: [ 6 3 6 6 ]

Multi-agent path finding (MAPF) is the combinatorial problem of planning optimal collision-avoiding paths for multiple agents, with application to robotics, logistics, and transportation. Though many recent learning-based works have focused on large-scale combinatorial problems by guiding their decomposition into sequences of smaller subproblems, the combined spatiotemporal and time-restricted nature of MAPF poses a particular challenge for learning-based guidance of iterative approaches like large neighborhood search (LNS), which is already a state-of-the-art approach for MAPF even without learning. We address this challenge of neural-guided LNS for MAPF by designing an architecture which interleaves convolution and attention to efficiently represent MAPF subproblems, enabling practical guidance of LNS in benchmark settings. We demonstrate the speedup of our method over existing state-of-the-art LNS-based methods for MAPF as well as the robustness of our method to unseen settings. Our proposed method expands the horizon of effective deep learning-guided LNS methods into multi-path planning problems, and our proposed representation may be more broadly applicable for representing path-wise interactions.

SPOTLIGHT-33: Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Primary Area: applications to robotics, autonomy, planning

Keywords: Human-Scene Interaction Chain-of-Contacts Unified LLM

Scores: [ 10 5 8 6 ]

Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality. Despite advancements in motion quality and physical plausibility, two pivotal factors, versatile interaction control and the development of a user-friendly interface, require further exploration before the practical application of HSI. This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands. This framework is built upon the definition of interaction as Chain of Contacts (CoC): steps of human joint-object part pairs, which is inspired by the strong correlation between interaction types and human-object contact regions.Based on the definition, UniHSI constitutes a Large Language Model (LLM) Planner to translate language prompts into task plans in the form of CoC, and a Unified Controller that turns CoC into uniform task execution. To facilitate training and evaluation, we collect a new dataset named ScenePlan that encompasses thousands of task plans generated by LLMs based on diverse scenarios. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and generalizability to real scanned scenes.

POSTER-157: ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation

Primary Area: applications to robotics, autonomy, planning

Keywords: Autonomous Driving 3D Domain Transfer Zero-shot 3D Detection

Scores: [ 6 5 6 6 ]

POSTER-158: LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection

Primary Area: applications to robotics, autonomy, planning

Keywords: Quantization 3D Object Detection Autonomous Driving

Scores: [ 6 6 6 6 ]

Due to highly constrained computing power and memory, deploying 3D lidar-based detectors on edge devices equipped in autonomous vehicles and robots poses a crucial challenge. Being a convenient and straightforward model compression approach, Post-Training Quantization (PTQ) has been widely adopted in 2D vision tasks. However, applying it directly to 3D lidar-based tasks inevitably leads to performance degradation. As a remedy, we propose an effective PTQ method called LiDAR-PTQ, which is particularly curated for 3D lidar detection (both SPConv-based and SPConv-free). Our LiDAR-PTQ features three main components, (1) a sparsity-based calibration method to determine the initialization of quantization parameters, (2) an adaptive rounding-to-nearest operation to minimize the layerwise reconstruction error, (3) a Task-guided Global Positive Loss (TGPL) to reduce the disparity between the final predictions before and after quantization. Extensive experiments demonstrate that our LiDAR-PTQ can achieve state-of-the-art quantization performance when applied to CenterPoint (both Pillar-based and Voxel-based). To our knowledge, for the very first time in lidar-based 3D detection tasks, the PTQ INT8 model's accuracy is almost the same as the FP32 model while enjoying 3X inference speedup. Moreover, our LiDAR-PTQ is cost-effective being 6X faster than the quantization-aware training method. The code will be released.

SPOTLIGHT-34: Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation

Primary Area: causal reasoning

Keywords: Heterogeneous Treatment Effect Estimation Conditional Average Treatment Effect Causal Inference Model Selection

Scores: [ 5 8 8 8 ]

We study the problem of model selection in causal inference, specifically for the case of conditional average treatment effect (CATE) estimation under binary treatments. Unlike model selection in machine learning, there is no perfect analogue of cross-validation as we do not observe the counterfactual potential outcome for any data point. Towards this, there have been a variety of proxy metrics proposed in the literature, that depend on auxiliary nuisance models estimated from the observed data (propensity score model, outcome regression model). However, the effectiveness of these metrics has only been studied on synthetic datasets as we can access the counterfactual data for them. We conduct an extensive empirical analysis to judge the performance of these metrics introduced in the literature, and novel ones introduced in this work, where we utilize the latest advances in generative modeling to incorporate multiple realistic datasets. Our analysis suggests novel model selection strategies based on careful hyperparameter tuning of CATE estimators and causal ensembling.

POSTER-159: Counterfactual Density Estimation using Kernel Stein Discrepancies

Primary Area: causal reasoning

Keywords: counterfactual density estimation kernel Stein discrepancy causal inference kernel methods

Scores: [ 6 8 5 6 ]

Causal effects are usually studied in terms of the means of counterfactual distributions, which may be insufficient in many scenarios. Given a class of densities known up to normalizing constants, we propose to model counterfactual distributions by minimizing kernel Stein discrepancies in a doubly robust manner. This enables the estimation of counterfactuals over large classes of distributions while exploiting the desired double robustness. We present a theoretical analysis of the proposed estimator, providing sufficient conditions for consistency and asymptotic normality, as well as an examination of its empirical performance.

POSTER-160: Causal Inference with Conditional Front-Door Adjustment and Identifiable Variational Autoencoder

Primary Area: causal reasoning

Keywords: Causal Inference Variational AutoEncoder Front Door Criterion

Scores: [ 6 6 6 5 ]

An essential and challenging problem in causal inference is causal effect estimation from observational data. The problem becomes more difficult with the presence of unobserved confounding variables. The front-door adjustment is a practical approach for dealing with unobserved confounding variables. However, the restriction for the standard front-door adjustment is difficult to satisfy in practice. In this paper, we relax some of the restrictions by proposing the concept of conditional front-door (CFD) adjustment and develop the theorem that guarantees the causal effect identifiability of CFD adjustment. Furthermore, as it is often impossible for a CFD variable to be given in practice, it is desirable to learn it from data. By leveraging the ability of deep generative models, we propose CFDiVAE to learn the representation of the CFD adjustment variable directly from data with the identifiable Variational AutoEncoder and formally prove the model identifiability. Extensive experiments on synthetic datasets validate the effectiveness of CFDiVAE and its superiority over existing methods. The experiments also show that the performance of CFDiVAE is less sensitive to the causal strength of unobserved confounding variables. We further apply CFDiVAE to a real-world dataset to demonstrate its potential application.

POSTER-161: Independent-Set Design of Experiments for Estimating Treatment and Spillover Effects under Network Interference

Primary Area: causal reasoning

Keywords: Causal inference Design of experiments Interference Random graph Spillover effects Treatment effects Potential outcomes

Scores: [ 5 6 5 6 ]

Interference is ubiquitous when conducting causal experiments over social networks. Except for certain network structures, causal inference on the network in the presence of interference is difficult due to the entanglement between the treatment assignments and the interference levels. In this article, we conduct causal inference under interference on an observed, sparse but connected network, and we propose a novel design of experiments based on an independent set. Compared to conventional designs, the independent-set design focuses on an independent subset of data and controls their interference exposures through the assignments to the rest (auxiliary set). The independent-set design enhances the performance of causal estimators by trading sample quantity for sample quality. We show the capacity of our approach for various causal inference tasks, justify its superiority over conventional methods, and illustrate the empirical performance through simulations.

POSTER-162: Towards Characterizing Domain Counterfactuals for Invertible Latent Causal Models

Primary Area: causal reasoning

Keywords: counterfactual domain causal representation learning

Scores: [ 6 5 6 6 ]

Answering counterfactual queries has many important applications such as knowledge discovery and explainability, but is challenging when causal variables are unobserved and we only see a projection onto an observation space, for instance, image pixels. One approach is to recover the latent Structural Causal Model (SCM), but this typically needs unrealistic assumptions, such as linearity of the causal mechanisms. Another approach is to use naïve ML approximations, such as generative models, to generate counterfactual samples; however, these lack guarantees of accuracy. In this work, we strive to strike a balance between practicality and theoretical guarantees by focusing on a specific type of causal query called domain counterfactuals, which hypothesizes what a sample would have looked like if it had been generated in a different domain (or environment). Concretely, by only assuming invertibility, sparse domain interventions and access to observational data from different domains, we aim to improve domain counterfactual estimation both theoretically and practically with less restrictive assumptions. We define domain counterfactually equivalent models and prove necessary and sufficient properties for equivalent models that provide a tight characterization of the domain counterfactual equivalence classes. Building upon this result, we prove that every equivalence class contains a model where all intervened variables are at the end when topologically sorted by the causal DAG, i.e., all non-intervened variables have non-intervened ancestors. This surprising result suggests that a model design that only allows intervention in the last k latent variables may improve model estimation for counterfactuals. We then test this model design on extensive simulated and image-based experiments which show the sparse canonical model indeed improves counterfactual estimation over baseline non-sparse models.

POSTER-163: Bayesian Neural Controlled Differential Equations for Treatment Effect Estimation

Primary Area: causal reasoning

Keywords: treatment effect estimation neural differential equation variational Bayes medicine

Scores: [ 6 8 6 6 ]

Treatment effect estimation in continuous time is crucial for personalized medicine. However, existing methods for this task are limited to point estimates of the potential outcomes, whereas uncertainty estimates have been ignored. Needless to say, uncertainty quantification is crucial for reliable decision-making in medical applications. To fill this gap, we propose a novel Bayesian neural controlled differential equation (BNCDE) for treatment effect estimation in continuous time. In our BNCDE, the time dimension is modeled through a coupled system of neural controlled differential equations and neural stochastic differential equations, where the neural stochastic differential equations allow for tractable variational Bayesian inference. Thereby, for an assigned sequence of treatments, our BNCDE provides meaningful posterior predictive distributions of the potential outcomes. To the best of our knowledge, ours is the first tailored neural method to provide uncertainty estimates of treatment effects in continuous time. As such, our method is of direct practical value for promoting reliable decision-making in medicine.

POSTER-164: Conditional Instrumental Variable Regression with Representation Learning for Causal Inference

Primary Area: causal reasoning

Keywords: Causal Inference Variational AutoEncoder Instrumental Variable

Scores: [ 8 6 8 5 ]

This paper studies the challenging problem of estimating causal effects from observational data, in the presence of unobserved confounders. The two-stage least square (TSLS) method and its variants with a standard instrumental variable (IV) are commonly used to eliminate confounding bias, including the bias caused by unobserved confounders, but they rely on the linearity assumption. Besides, the strict condition of unconfounded instruments posed on a standard IV is too strong to be practical. To address these challenging and practical problems of the standard IV method (linearity assumption and the strict condition), in this paper, we use a conditional IV (CIV) to relax the unconfounded instrument condition of standard IV and propose a non-linear \underline{CIV} regression with \underline{C}onfounding \underline{B}alancing \underline{R}epresentation \underline{L}earning, CBRL.CIV, for jointly eliminating the confounding bias from unobserved confounders and balancing the observed confounders, without the linearity assumption. We theoretically demonstrate the soundness of CBRL.CIV. Extensive experiments on synthetic and two real-world datasets show the competitive performance of CBRL.CIV against state-of-the-art IV-based estimators and superiority in dealing with the non-linear situation.

SPOTLIGHT-35: ODE Discovery for Longitudinal Heterogeneous Treatment Effects Inference

Primary Area: causal reasoning

Keywords: Treatment Effects over Time

Scores: [ 5 8 5 8 8 ]

Inferring unbiased treatment effects has received widespread attention in the machine learning community. In recent years, our community has proposed numerous solutions in standard settings, high-dimensional treatment settings, and even longitudinal settings. While very diverse, the solution has almost always relied on neural networks for inference and simultaneous correction of assignment bias. New approaches typically build on top of previous approaches by proposing new (or refined) architectures and learning algorithms. However, the end result—a neural-network-based inference machine—remains unchallenged. In this paper, we introduce a different type of solution in the longitudinal setting: a closed-form and human-readable ordinary differential equation (ODE). While we still rely on continuous optimization to learn an ODE, the resulting inference machine is no longer a neural network. Doing so yields several advantages such as interpretability, irregular sampling, and a different set of identification assumptions. Above all, we consider the introduction of a completely new type of solution to be our most important contribution as it may spark entirely new innovations in treatment effects in general. We facilitate this by formulating our contribution as a framework that can transform any ODE discovery method into a treatment effects method.

ORAL-9: Robust agents learn causal world models

Primary Area: causal reasoning

Keywords: causality generalisation causal discovery domain adaptation out-of-distribution generalization

Scores: [ 8 6 10 8 ]

It has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise under distributional shifts, or if other inductive biases are sufficient. We answer this question, showing that any agent capable of satisfying a regret bound under a large set of distributional shifts must have learned an approximate causal model of the data generating process, which converges to the true causal model for optimal agents. We discuss the implications of this result for several research areas including transfer learning and causal inference.

POSTER-165: Causal Modelling Agents: Causal Graph Discovery through Synergising Metadata- and Data-driven Reasoning

Primary Area: causal reasoning

Keywords: Causal Reasoning Causal Discovery Structural Causal Models Large Language Models

Scores: [ 6 3 8 8 ]

Scientific discovery hinges on the effective integration of metadata, which refers to a set of 'cognitive' operations such as determining what information is relevant for inquiry, and data, which encompasses physical operations such as observation and experimentation. This paper introduces the Causal Modelling Agent (CMA), a novel framework that synergizes the metadata-based reasoning capabilities of Large Language Models (LLMs) with the data-driven modelling of Deep Structural Causal Models (DSCMs) for the task of causal discovery. We evaluate the CMA's performance on a number of benchmarks, as well as on the real-world task of modelling the clinical and radiological phenotype of Alzheimer's Disease (AD). Our experimental results indicate that the CMA can outperform previous data-driven or metadata-driven approaches to causal discovery. In our real-world application, we use the CMA to derive new insights into the causal relationships among biomarkers of AD.

POSTER-166: Treatment Effects Estimation By Uniform Transformer

Primary Area: causal reasoning

Keywords: Causal inference covariate balance nonparametric estimation observational studies weighting method

Scores: [ 8 6 5 ]

In observational studies, balancing covariates in different treatment groups is essential to estimate treatment effects. One of the most commonly used methods for such purposes is weighting. The performance of this class of methods usually depends on strong regularity conditions for the underlying model, which might not hold in practice. In this paper, we investigate weighting methods from a functional estimation perspective and argue that the weights needed for covariate balancing could differ from those needed for treatment effects estimation under low regularity conditions. Motivated by this observation, we introduce a new framework of weighting that directly targets the treatment effects estimation. Unlike existing methods, the resulting estimator for a treatment effect under this new framework is a simple kernel-based U-statistic after applying a data-driven transformation to the observed covariates. We characterize the theoretical properties of the new estimators of treatment effects under a nonparametric setting and show that they are able to work robustly under low regularity conditions. The new framework is also applied to several numerical examples to demonstrate its practical merits.

POSTER-167: Structural Estimation of Partially Observed Linear Non-Gaussian Acyclic Model: A Practical Approach with Identifiability

Primary Area: causal reasoning

Keywords: causal discovery latent variable model structure learning

Scores: [ 8 6 6 6 ]

Conventional causal discovery approaches, which seek to uncover causal relationships among measured variables, are typically fragile to the presence of latent variables. While various methods have been developed to address this confounding issue, they often rely on strong assumptions about the underlying causal structure. In this paper, we consider a general scenario where measured and latent variables collectively form a partially observed causally sufficient linear system and latent variables may be anywhere in the causal structure. We theoretically show that with the aid of high-order statistics, the causal graph is (almost) fully identifiable if, roughly speaking, each latent set has a sufficient number of pure children, which can be either latent or measured. Naturally, LiNGAM, a model without latent variables, is encompassed as a special case. Based on the identification theorem, we develop a principled algorithm to identify the causal graph by testing for statistical independence involving only measured variables in specific manners. Experimental results show that our method effectively recovers the causal structure, even when latent variables are influenced by measured variables.

POSTER-168: Federated Causal Discovery from Heterogeneous Data

Primary Area: causal reasoning

Keywords: Causal Discovery Structure Learning Federated Learning Heterogeneous Data

Scores: [ 5 5 8 5 8 ]

Conventional causal discovery methods rely on centralized data, which is inconsistent with the decentralized nature of data in many real-world situations. This discrepancy has motivated the development of federated causal discovery (FCD) approaches. However, existing FCD methods may be limited by their potentially restrictive assumptions of identifiable functional causal models or homogeneous data distributions, narrowing their applicability in diverse scenarios. In this paper, we propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data. We first utilize a surrogate variable corresponding to the client index to account for the data heterogeneity across different clients. We then develop a federated conditional independence test (FCIT) for causal skeleton discovery and establish a federated independent change principle (FICP) to determine causal directions. These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy. Owing to the nonparametric properties, FCIT and FICP make no assumption about particular functional forms, thereby facilitating the handling of arbitrary causal models. We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.

POSTER-169: A Dynamical View of the Question of Why

Primary Area: causal reasoning

Keywords: Dynamical Causality Reinforcement Learning

Scores: [ 5 6 3 8 ]

POSTER-170: A Neural Framework for Generalized Causal Sensitivity Analysis

Primary Area: causal reasoning

Keywords: Causal machine learning treatment effect estimation sensitivity analysis unobserved confounding uncertainty estimation

Scores: [ 6 6 8 6 ]

Unobserved confounding is common in many applications, making causal inference from observational data challenging. As a remedy, causal sensitivity analysis is an important tool to draw causal conclusions under unobserved confounding with mathematical guarantees. In this paper, we propose NeuralCSA, a neural framework for generalized causal sensitivity analysis. Unlike previous work, our framework is compatible with (i) a large class of sensitivity models, including the marginal sensitivity model, f-sensitivity models, and Rosenbaum's sensitivity model; (ii) different treatment types (i.e., binary and continuous); and (iii) different causal queries, including (conditional) average treatment effects and simultaneous effects on multiple outcomes. This generality is achieved by learning a latent distribution shift that corresponds to a treatment intervention using two conditional normalizing flows. We provide theoretical guarantees that NeuralCSA is able to infer valid bounds on the causal query of interest and also demonstrate this empirically using both simulated and real-world data.

POSTER-171: Causally Aligned Curriculum Learning

Primary Area: causal reasoning

Keywords: Causal Inference Curriculum Learning Reinforcement Learning

Scores: [ 8 6 3 6 ]

ORAL-10: Gene Regulatory Network Inference in the Presence of Dropouts: a Causal View

Primary Area: causal reasoning

Keywords: Gene regulatory network Single-cell RNA-sequencing Dropout Zero-inflated data Causal model Causal discovery Nonparametric

Scores: [ 8 8 6 ]

Gene regulatory network inference (GRNI) is a challenging problem, particularly owing to the presence of zeros in single-cell RNA sequencing data: some are biological zeros representing no gene expression, while some others are technical zeros arising from the sequencing procedure (aka dropouts), which may bias GRNI by distorting the joint distribution of the measured gene expressions. Existing approaches typically handle dropout error via imputation, which may introduce spurious relations as the true joint distribution is generally unidentifiable. To tackle this issue, we introduce a causal graphical model to characterize the dropout mechanism, namely, Causal Dropout Model. We provide a simple yet effective theoretical result: interestingly, the conditional independence (CI) relations in the data with dropouts, after deleting the samples with zero values (regardless if technical or not) for the conditioned variables, are asymptotically identical to the CI relations in the original data without dropouts. This particular test-wise deletion procedure, in which we perform CI tests on the samples without zeros for the conditioned variables, can be seamlessly integrated with existing structure learning approaches including constraint-based and greedy score-based methods, thus giving rise to a principled framework for GRNI in the presence of dropouts. We further show that the causal dropout model can be validated from data, and many existing statistical models to handle dropouts fit into our model as specific parametric instances. Empirical evaluation on synthetic, curated, and real-world experimental transcriptomic data comprehensively demonstrate the efficacy of our method.

SPOTLIGHT-36: Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation

Primary Area: causal reasoning

Keywords: causal inference representation learning individualized treatment effect estimation

Scores: [ 8 8 5 8 ]

State-of-the-art methods for conditional average treatment effect (CATE) estimation make widespread use of representation learning. Here, the idea is to reduce the variance of the low-sample CATE estimation by a (potentially constrained) low-dimensional representation. However, low-dimensional representations can lose information about the observed confounders and thus lead to bias, because of which the validity of representation learning for CATE estimation is typically violated. In this paper, we propose a new, representation-agnostic framework for estimating bounds on the representation-induced confounding bias that comes from dimensionality reduction (or other constraints on the representations) in CATE estimation. First, we establish theoretically under which conditions CATEs are non-identifiable given low-dimensional (constrained) representations. Second, as our remedy, we propose to perform partial identification of CATEs or, equivalently, aim at estimating of upper and lower bounds of the representation-induced confounding bias. We demonstrate the effectiveness of our bounds in a series of experiments. In sum, our framework is of direct relevance in practice where the validity of CATE estimation is of importance.

POSTER-172: Adversarial Causal Bayesian Optimization

Primary Area: causal reasoning

Keywords: causality bayesian optimization

Scores: [ 6 6 6 6 ]

POSTER-173: Neural structure learning with stochastic differential equations

Primary Area: causal reasoning

Keywords: Structure Learning Causal Discovery Generative Model Variational Inference Differential Equation

Scores: [ 6 6 8 8 ]

Discovering the underlying relationships among variables from temporal observations has been a longstanding challenge in numerous scientific disciplines, including biology, finance, and climate science. The dynamics of such systems are often best described using continuous-time stochastic processes. Unfortunately, most existing structure learning approaches assume that the underlying process evolves in discrete-time and/or observations occur at regular time intervals. These mismatched assumptions can often lead to incorrect learned structures and models. In this work, we introduce a novel structure learning method, SCOTCH, which combines neural stochastic differential equations (SDE) with variational inference to infer a posterior distribution over possible structures. This continuous-time approach can naturally handle both learning from and predicting observations at arbitrary time points. Theoretically, we establish sufficient conditions for an SDE and SCOTCH to be structurally identifiable, and prove its consistency under infinite data limits. Empirically, we demonstrate that our approach leads to improved structure learning performance on both synthetic and real-world datasets compared to relevant baselines under regular and irregular sampling intervals.

POSTER-174: Doubly Robust Proximal Causal Learning for Continuous Treatments

Primary Area: causal reasoning

Keywords: causal effect continuous treatment kernel function doubly robust

Scores: [ 6 6 8 ]

Proximal causal learning is a powerful framework for identifying the causal effect under the existence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatments can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can well handle continuous treatments for proximal causal learning. Equipped with its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications.

POSTER-175: A Variational Framework for Estimating Continuous Treatment Effects with Measurement Error

Primary Area: causal reasoning

Keywords: treatment effect estimation continuous treatment measurement error

Scores: [ 8 8 6 ]

Estimating treatment effects has numerous real-world applications in various fields, such as epidemiology and political science. While much attention has been devoted to addressing the challenge using fully observational data, there has been comparatively limited exploration of this issue in cases when the treatment is not directly observed. In this paper, we tackle this problem by developing a general variational framework, which is flexible to integrate with advanced neural network-based approaches, to identify the average dose-response function (ADRF) with the continuously valued error-contaminated treatment. Our approach begins with the formulation of a probabilistic data generation model, treating the unobserved treatment as a latent variable. In this model, we leverage a learnable density estimation neural network to derive its prior distribution conditioned on covariates. This module also doubles as a generalized propensity score estimator, effectively mitigating selection bias arising from observed confounding variables. Subsequently, we calculate the posterior distribution of the treatment, taking into account the observed measurement and outcome. To mitigate the impact of treatment error, we introduce a re-parametrized treatment value, replacing the error-affected one, to make more accurate predictions regarding the outcome. To demonstrate the adaptability of our framework, we incorporate two state-of-the-art ADRF estimation methods and rigorously assess its efficacy through extensive simulations and experiments using semi-synthetic data.

POSTER-176: A Linear Algebraic Framework for Counterfactual Generation

Primary Area: causal reasoning

Keywords: counterfactual generation individual treatment effect synthetic data Gaussian mixture model

Scores: [ 6 6 3 6 ]

Estimating time-varying causal effects from real-world data attracts growing attention due to the high cost of performing interventional experiments and the wide availability of observational data. However, counterfactual data are not accessible, and true calculation of causal effects cannot be performed at the individual level. This paper proposes a linear algebraic framework to generate synthetic counterfactual data that exactly matches pretreatment factual data. Receiving treatment at a time T0 cannot cause any difference reversely to counterfactual generation at t<T0. If we hold this strictness on counterfactual generation, our approach claims the first-ever counterfactual generative model to create personalized clinical trial digital twins. Moreover, using simulated ground truth counterfactual data, we show that our method greatly outperforms the most cited methods of counterfactual generation and individual treatment effect estimation. We also provide a formula that can estimate the time-varying variance of individual treatment effects, interpreted as a confidence of generated counterfactuals to true values.

POSTER-177: Causal Structure Recovery with Latent Variables under Milder Distributional and Graphical Assumptions

Primary Area: causal reasoning

Keywords: causality causal discovery latent variables

Scores: [ 5 6 8 5 ]

Most traditional causal discovery approaches typically assume the absence of latent variables, a simplification that often does not align with real-world situations. Recently, there has been a surge of causal discovery methods that explicitly consider latent variables. While causal discovery with latent variables aims to reveal causal relations between observed variables in the presence of latent variables, latent causal structure learning seeks to identify latent variables and infer their causal relations, typically entailing strong distributional and graphical assumptions. In this paper, we endeavor to recover the whole causal structure involving both latent and observed variables under relatively milder assumptions. Specifically, we introduce two sets of assumptions, one allows arbitrary distribution and requires only one pure child per latent variable, the other requires no pure child and imposes the non-Gaussianity requirement on only a small subset of variables, and both of them allow causal edges between observed variables. Under either of them, we prove that the whole causal structure of linear latent variable models is identifiable. Our proof is constructive, leading to both theoretically sound and computationally efficient algorithms, which first identify latent variables from observed data and then infer causal relations between any two variables.

Primary Area: causal reasoning

Keywords: Causal Discovery; Latent Variables

Scores: [ 8 6 8 8 ]

POSTER-179: Causal-StoNet: Causal Inference for High-Dimensional Complex Data

Primary Area: causal reasoning

Keywords: Causal Information Stochastic Neural Network Adaptive Stochastic Gradient MCMC Sparsity Missing Data

Scores: [ 8 6 6 ]

With the advancement of data science, the collection of increasingly complex datasets has become commonplace. In such datasets, the data dimension can be extremely high, and the underlying data generation process can be unknown and highly nonlinear. As a result, the task of making causal inference with high-dimensional complex data has become a fundamental problem in many disciplines, such as medicine, econometrics, and social science. However, the existing methods for causal inference are frequently developed under the assumption that the data dimension is low or that the underlying data generation process is linear or approximately linear. To address these challenges, this paper proposes a novel stochastic deep learning approach for conducting causal inference with high-dimensional complex data. The proposed approach is based on some deep learning techniques, including sparse deep learning theory and stochastic neural networks, that have been developed in recent literature. By using these techniques, the proposed approach can address both the high dimensionality and unknown data generation process in a coherent way. Furthermore, the proposed approach can also be used when missing values are present in the datasets. Extensive numerical studies indicate that the proposed approach outperforms existing ones.

POSTER-180: Adaptive Instrument Design for Indirect Experiments

Primary Area: causal reasoning

Keywords: instrument variable experiment design indirect experiments adaptive design

Scores: [ 6 6 8 6 ]

Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for \textit{direct} experiments, in this paper we take the initial steps towards enhancing sample efficiency for \textit{indirect} experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments.

POSTER-181: Identifying Representations for Intervention Extrapolation

Primary Area: causal reasoning

Keywords: causality extrapolation exogenous variables causal representation learning identifiable representation learning control functions instrumental variables invariance

Scores: [ 8 8 8 8 ]

The premise of identifiable and causal representation learning is to improve the current representation learning paradigm in terms of generalizability or robustness. Despite recent progress in questions of identifiability, more theoretical results demonstrating concrete advantages of these methods for downstream tasks are needed. In this paper, we consider the task of intervention extrapolation: predicting how interventions affect an outcome, even when those interventions are not observed at training time, and show that identifiable representations can provide an effective solution to this task even if the interventions affect the outcome non-linearly. Our setup includes an outcome variable Y, observed features X, which are generated as a non-linear transformation of latent features Z, and exogenous action variables A, which influence Z. The objective of intervention extrapolation is then to predict how interventions on A that lie outside the training support of A affect Y. Here, extrapolation becomes possible if the effect of A on Z is linear and the residual when regressing Z on A has full support. As Z is latent, we combine the task of intervention extrapolation with identifiable representation learning, which we call Rep4Ex: we aim to map the observed features X into a subspace that allows for non-linear extrapolation in A. We show using Wiener’s Tauberian theorem that the hidden representation is identifiable up to an affine transformation in Z-space, which, we prove, is sufficient for intervention extrapolation. The identifiability is characterized by a novel constraint describing the linearity assumption of A on Z. Based on this insight, we propose a flexible method that enforces the linear invariance constraint and can be combined with any type of autoencoder. We validate our theoretical findings through a series of synthetic experiments and show that our approach can indeed succeed in predicting the effects of unseen interventions.

POSTER-182: Defining Expertise: Applications to Treatment Effect Estimation

Primary Area: causal reasoning

Keywords: expertise model selection balancing representations treatment effect estimation

Scores: [ 8 8 5 6 ]

Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients,we can tell treatments prescribed more frequently are likely to be more effective. Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and “expertise” is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise—particularly the type of expertise the decision-makers of a domain are likely to have—can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection.

POSTER-183: AgentBench: Evaluating LLMs as Agents

Primary Area: datasets and benchmarks

Keywords: Large language models Autonomous agents Reasoning Evaluation Benchmark

Scores: [ 6 8 6 8 3 ]

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments.We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting.Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and OSS competitors.We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents.Training on code and high quality multi-turn alignment data could improve agent performance.Datasets, environments, and an integrated evaluation package for AgentBench are released.

POSTER-184: Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation

Primary Area: datasets and benchmarks

Keywords: off-policy evaluation offline reinforcement learning offline policy selection risk-return tradeoff

Scores: [ 8 5 8 5 ]

Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using offline logged data and is frequently utilized to identify the top-k promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection, neglecting risk-return tradeoff and efficiency in subsequent online policy deployment. To address this issue, we draw inspiration from portfolio evaluation in finance and develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff and efficiency of policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We first demonstrate, in two example scenarios, that our proposed metric can clearly distinguish between conservative and high-stakes OPE estimators and reliably identify the most efficient estimator capable of forming superior portfolios of candidate policies that maximize return with minimal risk during online deployment, while existing evaluation metrics produce only degenerate results. To facilitate a quick, accurate, and consistent evaluation of OPE via SharpeRatio@k, we have also implemented the proposed metric in an open-source software. Using SharpeRatio@k and the software, we conduct a benchmark experiment of various OPE estimators regarding their risk-return tradeoff, presenting several future directions for OPE research.

ORAL-11: ValUES: A Framework for Systematic Validation of Uncertainty Estimation in Semantic Segmentation

Primary Area: datasets and benchmarks

Keywords: uncertainty segmentation validation

Scores: [ 8 8 6 8 ]

POSTER-185: Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on HuggingFace

Primary Area: datasets and benchmarks

Keywords: dataset documentation data-centric AI computational social science

Scores: [ 5 8 5 8 ]

Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face - one of the largest platforms for sharing and collaborating on ML models and datasets - as a prominent case study. By analyzing all 7,433 dataset documentation on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity: While 86.0% of the top 100 downloaded dataset cards fill out all sections suggested by Hugging Face community, only 7.9% of dataset cards with no downloads complete all these sections. (2) A granular examination of each section within the dataset card reveals that the practitioners seem to prioritize Dataset Description and Dataset Structure sections, accounting for 36.2% and 33.6% of the total card length, respectively, for the most downloaded datasets. In contrast, the Considerations for Using the Data section receives the lowest proportion of content, accounting for just 2.1% of the text. (3) By analyzing the subsections within each section and utilizing topic modeling to identify key topics, we uncover what is discussed in each section, and underscore significant themes encompassing both technical and social impacts, as well as limitations within the Considerations for Using the Data section. (4) Our findings also highlight the need for improved accessibility and reproducibility of datasets in the Usage sections. (5) In addition, our human annotation evaluation emphasizes the pivotal role of comprehensive dataset content in shaping individuals' perceptions of a dataset card's overall quality. Overall, our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis and underlines the need for more thorough dataset documentation in machine learning research.

POSTER-186: ADoPD: A Large-Scale Document Page Decomposition Dataset

Primary Area: datasets and benchmarks

Keywords: Document Understanding Dataset Segmentation Detection OCR Captioning

Scores: [ 8 6 6 6 ]

We introduce ADoPD, a large-scale document page decomposition dataset for document understanding, encompassing document entity segmentation, text detection, tagging, and captioning. ADoPD stands out with its novel document taxonomy, meticulously crafted through a data-driven approach enriched by both large-scale pretrained models and human expertise. Our dataset achieves diversity by combining outlier detection with a human-in-the-loop approach. This significant contribution advances the field of document analysis, deepening our insights into document structures and substantially enhancing document processing and analysis techniques. The amalgamation of data-driven exploration, thorough annotation, and the human-in-the-loop methodology paves the way for innovative improvements in document analysis capabilities and the advancement of document processing applications. We conduct a comprehensive evaluation of ADoPD using various methods and demonstrate its effectiveness.

SPOTLIGHT-37: Benchmarking Algorithms for Federated Domain Generalization

Primary Area: datasets and benchmarks

Keywords: federated learning distributed learning domain generalization out-of-distribution generalization benchmarking data paritioning.

Scores: [ 8 6 6 ]

While prior federated learning (FL) methods mainly consider client heterogeneity, we focus on the Federated Domain Generalization (DG) task, which introduces train-test heterogeneity in the FL context.Existing evaluations in this field are limited in terms of the scale of the clients and dataset diversity.Thus, we propose a Federated DG benchmark that aim to test the limits of current methods with high client heterogeneity, large numbers of clients, and diverse datasets. Towards this objective, we introduce a novel data partitioning method that allows us to distribute any domain dataset among few or many clients while controlling client heterogeneity. We then introduce and apply our methodology to evaluate 13 Federated DG methods, which include centralized DG methods adapted to the FL context, FL methods that handle client heterogeneity, and methods designed specifically for Federated DG on 7 datasets.Our results suggest that, despite some progress, significant performance gaps remain in Federated DG, especially when evaluating with a large number of clients, high client heterogeneity, or more realistic datasets. Furthermore, our extendable benchmark code will be publicly released to aid in benchmarking future Federated DG approaches.

POSTER-187: Can Large Language Models Infer Causation from Correlation?

Primary Area: datasets and benchmarks

Keywords: Large Language Models Natural Language Inference Causal Reasoning Inferring Causation from Correlation Benchmark Dataset

Scores: [ 6 6 6 ]

Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in the recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g. commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 400K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task. This shortcoming is somewhat mitigated when we try to re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize – they can only perform causal inference in in-distribution settings when variable names and textual expressions used in the queries are similar to those in the training set, but fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs, and would be helpful in guiding future research on improving LLMs’ pure reasoning ability and generalizability.

POSTER-188: Alice Benchmarks: Connecting Real World Object Re-Identification with the Synthetic

Primary Area: datasets and benchmarks

Keywords: Synthetic Data Object re-ID Benchmarks

Scores: [ 6 6 8 6 ]

For object re-identification (re-ID), learning from synthetic data has become a promising strategy to cheaply acquire large-scale annotated datasets and effective models, with few privacy concerns. Many interesting research problems arise from this strategy, e.g., how to reduce the domain gap between synthetic source and real-world target. To facilitate developing more new approaches in learning from synthetic data, we introduce the Alice benchmarks, large-scale datasets providing benchmarks as well as evaluation protocols to the research community. Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID. We collected and annotated two challenging real-world target datasets: AlicePerson and AliceVehicle, captured under various illuminations, image resolutions, etc. As an important feature of our real target, the clusterability of its training set is not manually guaranteed to make it closer to a real domain adaptation test scenario. Correspondingly, we reuse existing PersonX and VehicleX as synthetic source domains. The primary goal is to train models from synthetic data that can work effectively in the real world. In this paper, we detail the settings of Alice benchmarks, provide an analysis of existing commonly-used domain adaptation methods, and discuss some interesting future directions. An online server will be set up for the community to evaluate methods conveniently and fairly.

POSTER-189: BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks

Primary Area: datasets and benchmarks

Keywords: Biological sequence analysis enhancer annotation gene finding gene annotation Language model genome modelling benchmark LLM embeddings representations DNA

Scores: [ 3 6 6 5 ]

The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a BENchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://anonymous.4open.science/r/BEND-8C42/README.md

POSTER-190: Evaluating Large Language Models at Evaluating Instruction Following

Primary Area: datasets and benchmarks

Keywords: large language models instruction tuning evaluation benchmark instruction following

Scores: [ 8 6 8 ]

As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever-increasing list of models. This paper investigates the efficacy of these “LLM evaluators”, particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to theinstructions. We introduce a challenging meta-evaluation benchmark, LLMBAR, designed to test the ability of an LLM evaluator to discern instruction-following outputs. The authors curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that could mislead an LLM evaluator. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBAR and even the highest-scoring LLM evaluators have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBAR, we hope to offer more insight into the behavior of LLM evaluators and foster research in developing better instruction-following models.

SPOTLIGHT-38: Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community

Primary Area: datasets and benchmarks

Keywords: human feedback text to image generative AI image quality scoring

Scores: [ 5 8 8 6 ]

SPOTLIGHT-39: A Benchmark for Learning to Translate a New Language from One Grammar Book

Primary Area: datasets and benchmarks

Keywords: low-resource languages indigenous languages endangered languages long context field linguistics unseen tasks large language models machine translation benchmark

Scores: [ 6 8 8 ]

Large language models (LLMs) can perform impressive feats with in-context learning or lightweight finetuning. It is natural to wonder how well these models adapt to genuinely new tasks, but how does one find tasks that are unseen in internet-scale training sets? We turn to a field that is explicitly motivated and bottlenecked by a scarcity of web data: low-resource languages. In this paper, we introduce MTOB (Machine Translation from One Book), a benchmark for learning to translate between English and Kalamang—a language with less than 200 speakers and therefore virtually no presence on the web—using several hundred pages of field linguistics reference materials. This task framing is novel in that it asks a model to learn a language from a single human-readable book of grammar explanations, rather than a large mined corpus of in-domain data, more akin to L2 language learning than L1 language acquisition. We demonstrate that baselines using current LLMs are promising but fall short of human performance, achieving 44.7 chrF on Kalamang to English translation and 45.8 chrF on English to Kalamang translation, compared to 51.6 and 57.0 chrF by a human who learned Kalamang from the same reference materials. We hope that MTOB will help measure LLM capabilities along a new dimension, and that the methods developed to solve it could help expand access to language technology for underserved communities by leveraging qualitatively different kinds of data than traditional machine translation.

POSTER-191: VFLAIR: A Research Library and Benchmark for Vertical Federated Learning

Primary Area: datasets and benchmarks

Keywords: VFL Attacks and Defenses Benchmark Platform

Scores: [ 8 3 8 ]

Vertical Federated Learning (VFL) has emerged as a collaborative training paradigm that allows participants with different features of the same group of users to accomplish cooperative training without exposing their raw data or model parameters. VFL has gained significant attention for its research potential and real-world applications in recent years, but still faces substantial challenges, such as in defending various kinds of data inference and backdoor attacks. Moreover, most of existing VFL projects are industry-facing and not easily used for keeping track of the current research progress. To address this need, we present an extensible and lightweight VFL framework VFLAIR (available at https://anonymous.4open.science/r/VFLAIR-1ABD), which supports VFL training with a variety of models, datasets and protocols, along with standardized modules for comprehensive evaluations of attacks and defense strategies. We also benchmark 11 attacks and 8 defenses performance under different communication and model partition settings and draw concrete insights and recommendations on the choice of defense strategies for different practical VFL deployment scenarios.

POSTER-192: Measuring Vision-Language STEM Skills of Neural Models

Primary Area: datasets and benchmarks

Keywords: Benchmark STEM Multimodal Vision-language models Language models

Scores: [ 6 6 6 ]

We introduce a new challenge to test the STEM skills of neural models. Unlike existing datasets, our dataset requires the understanding of multimodal vision-language information. Our dataset features one of the largest and most comprehensive datasets for the challenge. It includes 448 skills and 1,073,146 questions spanning all STEM (science, technology, engineering, math) subjects. Compared to existing datasets that often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and ChatGPT to our dataset. Results show that the recent model advances only help master a very limited number of lower grade-level skills (2.5% in the third grade) in our dataset. In fact, these models are still well below (averaging 54.7%) the performance of elementary students, not to mention near expert-level performance. To understand and increase the performance on our dataset, we teach the models on a training split of our dataset. Even though we observe improved performance, the model performance remains relatively low compared to average elementary students. To solve STEM problems, we will need novel algorithmic innovations from the community. The code and dataset are available at https://anonymous.4open.science/r/STEM-Dataset-ICLR-2024 and will be made publicly available.

SPOTLIGHT-40: SocioDojo: Building Lifelong Analytical Agents with Real-world Text and Time Series

Primary Area: datasets and benchmarks

Keywords: Large Langauge Models Agent Prompt Tunning Time series forcasting

Scores: [ 8 8 5 ]

We introduce SocioDojo, an open-ended lifelong learning environment for developing ready-to-deploy autonomous agents capable of performing human-like analysis and decision-making on societal topics such as economics, finance, politics, and culture. It consists of (1) information sources from news, social media, reports, etc., (2) a knowledge base built from books, journals, and encyclopedias, plus a toolbox of Internet and knowledge graph search interfaces, (3) 30K high-quality time series in finance, economy, society, and polls, which support a novel task called "hyperportfolio", that can reliably and scalably evaluate societal analysis and decision-making power of agents, inspired by portfolio optimization with time series as assets to "invest". We also propose a novel Analyst-Assistant-Actuator architecture for the hyperportfolio task, and a Hypothesis & Proof prompting for producing in-depth analyses on input news, articles, etc. to assist decision-making. We perform experiments and ablation studies to explore the factors that impact performance. The results show that our proposed method achieves improvements of 32.4% and 30.4% compared to the state-of-the-art method in the two experimental settings.

POSTER-193: A Newborn Embodied Turing Test for Comparing Object Segmentation Across Animals and Machines

Primary Area: datasets and benchmarks

Keywords: newborn controlled rearing object recognition object segmentation reinforcement learning benchmark Turing test development

Scores: [ 6 8 8 3 ]

Newborn brains rapidly learn to solve challenging object recognition tasks, including segmenting objects from backgrounds and recognizing objects across novel backgrounds and viewpoints. Conversely, modern machine-learning (ML) algorithms are "data hungry," requiring more training data than brains to reach similar performance levels. How do we close this learning gap between brains and machines? Here we introduce a new benchmark—a Newborn Embodied Turing Test (NETT) for object segmentation—in which newborn animals and machines are raised in the same environments and tested with the same tasks, permitting direct comparison of their learning abilities. First, we raised newborn chicks in controlled environments containing a single object rotating on a single background, then tested their ability to recognize that object across new backgrounds and viewpoints. Second, we performed “digital twin” experiments in which we reared and tested artificial chicks in virtual environments that mimicked the rearing and testing conditions of the biological chicks. We inserted a variety of ML “brains” into the artificial chicks and measured whether those algorithms learned common object recognition behavior as biological chicks. All biological chicks solved this one-shot object segmentation task, successfully learning background-invariant object representations that generalized across new backgrounds and viewpoints. In contrast, none of the artificial chicks solved this object segmentation task, instead learning background-dependent representations that failed to generalize across new backgrounds and viewpoints. This digital twin design exposes core limitations in current ML algorithms in achieving brain-like object perception. Our NETT is publicly available for comparing ML algorithms with newborn chicks. Ultimately, we anticipate that NETT benchmarks will allow researchers to build embodied AI systems that learn as efficiently and robustly as newborn brains.

POSTER-194: Zero and Few-shot Semantic Parsing with Ambiguous Inputs

Primary Area: datasets and benchmarks

Keywords: semantic parsing text-to-code ambiguity NLP calibration

Scores: [ 8 6 6 8 ]

Despite the frequent challenges posed by ambiguity when representing meaning via natural language, it is often ignored or deliberately removed in tasks mapping language to formally-designed representations, which generally assume a one-to-one mapping between linguistic and formal representations. We attempt to address this shortcoming by introducing AmP, a framework, dataset, and challenge for translating ambiguous natural language to formal representations like logic and code. We define templates and generate data for five well-documented linguistic ambiguities.Using AmP, we investigate how several few-shot text-to-code systems handle ambiguity, introducing three new metrics.We find that large pre-trained models perform poorly at capturing the distribution of possible meanings without deliberate instruction.However, models are able to capture the distribution well when ambiguity is attested in their inputs. These results motivate a call for including ambiguity explicitly in datasets and promote considering the distribution of possible outputs when evaluating systems. We release our data and code.

POSTER-195: CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis

Primary Area: datasets and benchmarks

Keywords: doamin generalization model accuracy prediction testbeds

Scores: [ 6 6 6 8 ]

Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. We also discuss other fields that would benefit from CIFAR-10-W.

POSTER-196: RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Primary Area: datasets and benchmarks

Keywords: large language model code completion benchmark

Scores: [ 5 8 6 ]

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems.

POSTER-197: Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML

Primary Area: datasets and benchmarks

Keywords: ICU Intensive Care Unit EHR ML Time Series Patient Monitoring Clinical ML Benchmark Multi-Center MIMIC eICU HiRID AmsterdamUMCdb

Scores: [ 6 8 5 6 6 ]

Medical applications of machine learning (ML) have experienced a surge in popularity in recent years. Given the abundance of available data from electronic health records, the intensive care unit (ICU) is a natural habitat for ML. Models have been proposed to address numerous ICU prediction tasks like the early detection of complications. While authors frequently report state-of-the-art performance, it is challenging to verify claims of superiority. Datasets and code are not always published, and cohort definitions, preprocessing pipelines, and training setups are difficult to reproduce. This work introduces Yet Another ICU Benchmark (YAIB), a modular framework that allows researchers to define reproducible and comparable clinical ML experiments; we offer an end-to-end solution from cohort definition to model evaluation. The framework natively supports most open-access ICU datasets (MIMIC III/IV, eICU, HiRID, AUMCdb) and is easily adaptable to future ICU datasets. Combined with a transparent preprocessing pipeline and extensible training code for multiple ML and deep learning models, YAIB enables unified model development, transfer, and evaluation. Our benchmark comes with five predefined established prediction tasks (mortality, acute kidney injury, sepsis, kidney function, and length of stay) developed in collaboration with clinicians. Adding further tasks is straightforward by design. Using YAIB, we demonstrate that the choice of dataset, cohort definition, and preprocessing have a major impact on the prediction performance — often more so than model class — indicating an urgent need for YAIB as a holistic benchmarking tool. We provide our work to the clinical ML community to accelerate method development and enable real-world clinical implementations.

POSTER-198: WebArena: A Realistic Web Environment for Building Autonomous Agents

Primary Area: datasets and benchmarks

Keywords: language-guided agents; web automation; benchmark

Scores: [ 5 6 8 ]

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that \ours can be used to measure such progress.\footnote{Code, data, environment reproduction instructions, video demonstrations are available in the supplementary.}

POSTER-199: MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations

Primary Area: datasets and benchmarks

Keywords: intent recognition multimodal dataset out-of-scope detection multi-turn conversations benchmark framework benchmark evaluation

Scores: [ 6 8 6 6 ]

POSTER-200: CircuitNet 2.0: An Advanced Dataset for Promoting Machine Learning Innovations in Realistic Chip Design Environment

Primary Area: datasets and benchmarks

Keywords: Chip Design Machine Learning Dataset

Scores: [ 6 5 6 6 ]

Integrated circuits or chips are key to enable computing in modern industry. Designing a chip relies on human experts to produce chip data through professional electronic design automation (EDA) software and complicated procedures. Nowadays, prompted by the wide variety of machine learning (ML) datasets, we have witnessed great advancement of ML algorithms in computer vision, natural language processing, and other fields. However, in chip design, high human workload and data sensitivity cause the lack of public datasets, which hinders the progress of ML development for EDA. To this end, we introduce an advanced large-scale dataset, CircuitNet 2.0, which targets promoting ML innovations in a realistic chip design environment. In order to approach the realistic chip design space, we collect more than 10,000 samples with a variety of chip designs (e.g., CPU, GPU, and AI Chip). All the designs are conducted through complete commercial design flows in a widely-used technology node, 14nm FinFET. We collect comprehensive data, including routability, timing, and power, from the design flow to support versatile ML tasks in EDA. Besides, we also introduce some realistic ML tasks with CircuitNet 2.0 to verify the potential for boosting innovations.

POSTER-201: HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments

Primary Area: datasets and benchmarks

Keywords: embodied AI decision making

Scores: [ 8 8 6 5 ]

Recent advances in high-fidelity virtual environments serve as one of the major driving forces for building intelligent embodied agents to perceive, reason and interact with the physical world. Typically, these environments remain unchanged unless agents interact with them. However, in real-world scenarios, agents might also face dynamically changing environments characterized by unexpected events and need to rapidly take action accordingly. To remedy this gap, we propose a new simulated embodied benchmark, called HAZARD, specifically designed to assess the decision-making abilities of embodied agents in dynamic situations. HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind, and specifically supports the utilization of large language models (LLMs) to assist common sense reasoning and decision-making. This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines, including reinforcement learning (RL), rule-based, and search-based methods in dynamically changing environments. As a first step toward addressing this challenge using large language models, we further develop an LLM-based agent and perform an in-depth analysis of its promise and challenge of solving these challenging tasks.

SPOTLIGHT-41: SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

Primary Area: datasets and benchmarks

Keywords: Social Interaction Agent Social intelligence Large Language Models Evaluation Theory of Mind

Scores: [ 8 6 6 ]

Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.

POSTER-202: To the Cutoff... and Beyond? A Longitudinal Perspective on LLM Data Contamination

Primary Area: datasets and benchmarks

Keywords: contamination memorization llm codeforces project euler datasets benchmarks training cutoff

Scores: [ 8 8 6 5 ]

POSTER-203: ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models

Primary Area: datasets and benchmarks

Keywords: vision-language benchmark spatio-temporal grounding video-language models benchmark evaluation zero-shot

Scores: [ 6 6 6 6 ]

With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs’ grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.

POSTER-204: COLLIE: Systematic Construction of Constrained Text Generation Tasks

Primary Area: datasets and benchmarks

Keywords: constrained text generation large language models compositional benchmark

Scores: [ 6 8 6 10 ]

Text generation under constraints have seen increasing interests in natural language processing, especially with the rapidly improving capabilities of large language models. However, existing benchmarks for constrained generation usually focus on fixed constraint types (e.g. generate a sentence containing certain words) that have proved to be easy for state-of-the-art models like GPT-4. We present COLLIE, a grammar-based framework that allows the specification of rich, compositional constraints with diverse generation levels (word, sentence, paragraph, passage) and modeling challenges (e.g. language understanding, logical reasoning, counting, semantic planning). We also develop tools for automatic extraction of task instances given a constraint structure and a raw text corpus. Using COLLIE, we compile the COLLIE-v1 dataset with 1,132 instances comprising 13 constraint structures. We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performances to reveal shortcomings. COLLIE is designed to be extensible and lightweight, and we hope the community finds it useful to develop more complex constraints and evaluations in the future.

POSTER-205: Large Language Models as Automated Aligners for benchmarking Vision-Language Models

Primary Area: datasets and benchmarks

Keywords: LLMs VLMs Benchmark

Scores: [ 5 5 8 ]

With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets via prompting on visual symbolic representations (e.g., captions, object locations, instance relationships, and etc. The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We subsequently engage LLMs like GPT-3.5 to serve as judges, implementing the quantitative and qualitative automated assessments to facilitate a comprehensive evaluation of VLMs. Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating the evolving sophisticated VLMs.

POSTER-206: ETGraph: A Pioneering Dataset Bridging Ethereum and Twitter

Primary Area: datasets and benchmarks

Keywords: Ethereum Twitter GNN

Scores: [ 8 5 6 ]

POSTER-207: MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

Primary Area: datasets and benchmarks

Keywords: large language model multi-turn interaction learning from feedback reinforcement learning from human feedback instruction tuning

Scores: [ 5 8 8 6 ]

To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools.However, current evaluation protocols often emphasize benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, while also underestimating the importance of natural language feedback from users. These oversights contribute to discrepancies between research benchmark evaluations and real-world use cases.We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback.To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive users' natural language feedback simulated by GPT-4.We repurpose a diverse set of established evaluation datasets focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset for efficient evaluation.Our analysis of 20 open- and closed-source LLMs offers intriguing findings.(a) LLMs generally benefit from tools and language feedback, with performance gains (absolute, same below) of 1--8% for each turn of tool use and 2--17% with natural language feedback.(b) Better single-turn performance does not guarantee better multi-turn performance.(c) Surprisingly, on the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.We expect MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation can be less accessible compared to commercial LLMs with a larger user base.

POSTER-208: OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

Primary Area: datasets and benchmarks

Keywords: web-scale dataset natural language processing large language model reasoning AI for math

Scores: [ 5 5 8 ]

There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitive web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, open-sourced and released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.

POSTER-209: ContextRef: Evaluating Referenceless Metrics for Image Description Generation

Primary Area: datasets and benchmarks

Keywords: image description evaluation referenceless metrics multimodal evaluation

Scores: [ 6 6 6 ]

Referenceless metrics (e.g., CLIPScore) use pretrained vision--language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence.

SPOTLIGHT-42: Learning Performance-Improving Code Edits

Primary Area: datasets and benchmarks

Keywords: Large Language Models Retrieval Augmented Generation Program Synthesis Program Optimization Fine-Tuning Goal-Conditioning Data Augmentation

Scores: [ 8 5 8 8 ]

With the waning of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code.Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks.To that end, we introduce a framework for adapting LLMs to high-level program optimization.First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs, accompanied by extensive unit tests.A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements".To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry.Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play.A combination of these techniques achieves an average speedup of 5.65 times on CodeLlama-13B and 6.86 times on GPT-3.5, surpassing the best human performance (4.06 times).We find our proposed performance-conditioned generation is particularly effective at improving performance as well as increasing the fraction of optimized programs.

POSTER-210: Dissecting sample hardness: Fine-grained analysis of Hardness Characterization Methods

Primary Area: datasets and benchmarks

Keywords: data-centric AI hardness characterization data-centric ML DMLR benchmarking

Scores: [ 8 6 8 1 8 ]

Characterizing samples that are difficult to learn from is crucial to developing highly performant ML models. This has led to numerous Hardness Characterization Methods (HCMs) that aim to identify ''hard'' samples. However, there is a lack of consensus regarding the definition and evaluation of ''hardness''. Unfortunately, current HCMs have only been evaluated on specific types of hardness and often only qualitatively or with respect to downstream performance, overlooking the fundamental quantitative identification task. We address this gap by presenting a fine-grained taxonomy of various types of hardness. Additionally, we propose the Hardness Characterization Analysis Toolkit (H-CAT), which supports comprehensive and quantitative benchmarking of HCMs across the hardness taxonomy and can easily be extended to new HCMs, hardness types, and datasets. We use H-CAT to evaluate 13 different HCMs across 8 hardness types. This comprehensive evaluation encompassing over 14K setups uncovers various strengths and weaknesses of different HCMs, leading to practical tips to guide HCM selection and future development. Our findings highlight the need for more comprehensive HCM evaluation, while we hope our hardness taxonomy and toolkit will advance the principled evaluation and uptake of data-centric AI methods.

POSTER-211: CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery

Primary Area: datasets and benchmarks

Keywords: Time series Causal discovery Neural networks Benchmark Dataset

Scores: [ 6 8 8 5 ]

Time-series causal discovery (TSCD) is a fundamental problem of machine learning. However, existing synthetic datasets cannot properly evaluate or predict the algorithms' performance on real data. This study introduces the CausalTime pipeline to generate time-series that highly resemble the real data and with ground truth causal graphs for quantitative performance evaluation. The pipeline starts from real observations in a specific scenario and produces a matching benchmark dataset. Firstly, we harness deep neural networks along with normalizing flow to accurately capture realistic dynamics. Secondly, we extract hypothesized causal graphs by performing importance analysis on the neural network or leveraging prior knowledge. Thirdly, we derive the ground truth causal graphs by splitting the causal model into causal term, residual term, and noise term. Lastly, using the fitted network and the derived causal graph, we generate corresponding versatile time-series proper for algorithm assessment. In the experiments, we validate the fidelity of the generated data through qualitative and quantitative experiments, followed by a benchmarking of existing TSCD algorithms using these generated datasets. CausalTime offers a feasible solution to evaluating TSCD algorithms in real applications and can be generalized to a wide range of fields. For easy use of the proposed approach, we also provide a user-friendly website, hosted on www.causaltime.cc.

POSTER-212: InstructDET: Diversifying Referring Object Detection with Generalized Instructions

Primary Area: datasets and benchmarks

Keywords: Referring Object Detection; Visual Grounding; Foundation Models

Scores: [ 5 6 6 6 ]

POSTER-213: Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

Primary Area: datasets and benchmarks

Keywords: Few-shot learning Visual reasoning Open world learning

Scores: [ 8 8 5 6 ]

We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning for machine vision. It originates from the classical Bongard Problems (BPs): Given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concept, which is exclusively depicted by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding the two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already imposes a significant challenge to current few-shot reasoning algorithms. We further investigate to which extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs, and combining VLMs and LLMs in an interactive reasoning scheme. We even designed a neuro-symbolic reasoning approach that reconciles LLMs & VLMs with logical reasoning to emulate the human problem-solving process for Bongard problems. However, none of these approaches manage to close the human-machine gap, as the best learner achieves 64% accuracy while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities. All implementation details and reproduction code, including Bongard-OpenWorld dataset, are available in an anonymous github repository https://github.com/Bongard-OpenWorld.

POSTER-214: VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks

Primary Area: datasets and benchmarks

Keywords: Vertical federated learning benchmark feature correlation feature importance

Scores: [ 6 8 6 6 ]

Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.

SPOTLIGHT-43: DyVal: Graph-informed Dynamic Evaluation of Large Language Models

Primary Area: datasets and benchmarks

Keywords: Large Language Models Evaluation Data Contamination

Scores: [ 8 6 6 6 ]

Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns about their performance are raised on potential data contamination in their considerable volume of training corpus. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a novel, general, and flexible evaluation protocol for dynamic evaluation of LLMs. Based on our proposed dynamic evaluation framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to ChatGPT and GPT4. Experiments demonstrate that LLMs perform worse in DyVal-generated evaluation samples with different complexities, emphasizing the significance of dynamic evaluation. We also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples are not only evaluation sets, but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on the future evaluation research of LLMs.

POSTER-215: GAIA: a benchmark for General AI Assistants

Primary Area: datasets and benchmarks

Keywords: Large Language Models Benchmark

Scores: [ 8 8 8 3 ]

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. Our questions allow simple, fast, and factual verification. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answer. We release our questions while retaining answers to 300 of them to power a leader-board \href{https://huggingface.co/xxx}{hereby accessible}.

POSTER-216: Robust NAS benchmark under adversarial training: assessment, theory, and beyond

Primary Area: datasets and benchmarks

Keywords: neural architecture search robustness benchmark generalization theory

Scores: [ 6 6 6 ]

Recent developments in neural architecture search (NAS) emphasize the significance of considering robust architectures against malicious data. However, there is a notable absence of benchmark evaluations and theoretical guarantees for searching these robust architectures, especially when adversarial training is considered. In this work, we aim to address these two challenges, making twofold contributions. First, we release a comprehensive data set that encompasses both clean accuracy and robust accuracy for a vast array of adversarially trained networks from the NAS-Bench-201 search space on image datasets. Then, leveraging the neural tangent kernel (NTK) tool from deep learning theory, we establish a generalization theory for searching architecture in terms of clean accuracy and robust accuracy under multi-objective adversarial training. We firmly believe that our benchmark and theoretical insights will significantly benefit the NAS community through reliable reproducibility, efficient assessment, and theoretical foundation, particularly in the pursuit of robust architectures.

POSTER-217: GlucoBench: Curated List of Continuous Glucose Monitoring Datasets with Prediction Benchmarks

Primary Area: datasets and benchmarks

Keywords: diabetes management continuous glucose monitors (CGM) glucose trajectory prediction artificial pancreas systems public datasets standardized tasks benchmark models glycemic control

Scores: [ 5 6 8 ]

The rising rates of diabetes necessitate innovative methods for its management. Continuous glucose monitors (CGM) are small medical devices that measure blood glucose levels at regular intervals providing insights into daily patterns of glucose variation. Forecasting of glucose trajectories based on CGM data holds the potential to substantially improve diabetes management, by both refining artificial pancreas systems and enabling individuals to make adjustments based on predictions to maintain optimal glycemic range. Despite numerous methods proposed for CGM-based glucose trajectory prediction, these methods are typically evaluated on small, private datasets, impeding reproducibility, further research, and practical adoption. The absence of standardized prediction tasks and systematic comparisons between methods has led to uncoordinated research efforts, obstructing the identification of optimal tools for tackling specific challenges. As a result, only a limited number of prediction methods have been implemented in clinical practice.

POSTER-218: KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

Primary Area: datasets and benchmarks

Keywords: language model evaluation benchmark constraint satisfaction information retrieval retrieval-augmented architecture

Scores: [ 8 6 6 5 ]

POSTER-219: ImageNet-OOD: Deciphering Modern Out-of-Distribution Detection Algorithms

Primary Area: datasets and benchmarks

Keywords: Out-of-distribution Detection Failure Detection Object Discovery Novelty Detection Robustness

Scores: [ 8 6 6 6 ]

The task of out-of-distribution (OOD) detection is notoriously ill-defined. Earlier works focused on new-class detection, aiming to identify label-altering data distribution shifts, also known as "semantic shift." However, recent works argue for a focus on failure detection, expanding the OOD evaluation framework to account for label-preserving data distribution shifts, also known as "covariate shift.” Intriguingly, under this new framework, complex OOD detectors that were previously considered state-of-the-art now perform similarly to, or even worse than the simple maximum softmax probability baseline. This raises the question: what are the latest OOD detectors actually detecting? Deciphering the behavior of OOD detection algorithms requires evaluation datasets that decouples semantic shift and covariate shift. To aid our investigations, we present ImageNet-OOD, a clean semantic shift dataset that minimizes the interference of covariate shift. Through comprehensive experiments, we show that OOD detectors are more sensitive to covariate shift than to semantic shift, and the benefits of recent OOD detection algorithms on semantic shift detection is minimal. Our dataset and analyses provide important insights for guiding the design of future OOD detectors.

ORAL-12: SWE-bench: Can Language Models Resolve Real-world Github Issues?

Primary Area: datasets and benchmarks

Keywords: Language models Natural language processing Software engineering

Scores: [ 5 6 8 6 ]

POSTER-220: GnnX-Bench: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking

Primary Area: datasets and benchmarks

Keywords: graph neural network explanations factual explanations counterfactual explanations

Scores: [ 6 6 8 3 ]

Numerous explainability methods have been proposed to shed light on the inner workings of GNNs. Despite the inclusion of empirical evaluations in all the proposed algorithms, the interrogative aspects of these evaluations lack diversity. As a result, various facets of explainability pertaining to GNNs, such as a comparative analysis of counterfactual reasoners, their stability to variational factors such as different GNN architectures, noise, stochasticity in non-convex loss surfaces, feasibility amidst domain constraints, and so forth, have yet to be formally investigated. Motivated by this need, we present a benchmarking study on perturbation-based explainability methods for GNNs, aiming to systematically evaluate and compare a wide range of explainability techniques. Among the key findings of our study, we identify the Pareto-optimal methods that exhibit superior efficacy and stability in the presence of noise. Nonetheless, our study reveals thatall algorithms are affected by stability issues when faced with noisy data. Furthermore, we have established that the current generation of counterfactual explainers often fails to provide feasible recourses due to violations of topological constraints encoded by domain-specific considerations. Overall, this benchmarking study empowers stakeholders in the field of GNNs with a comprehensive understanding of the state-of-the-art explainability methods, potential research problems for further enhancement, and the implications of their application in real-world scenarios.

SPOTLIGHT-44: CivRealm: A Learning and Reasoning Odyssey for Decision-Making Agents

Primary Area: datasets and benchmarks

Keywords: Interactive Environments Benchmark Reinforcement Learning Language Agent Multi-agent

Scores: [ 6 8 8 ]

The generalization of decision-making agents encompasses two fundamental elements: learning from past experiences and reasoning in novel contexts. However, the predominant emphasis in most interactive environments is on learning, often at the expense of complexity in reasoning. In this paper, we introduce CivRealm, an environment inspired by the Civilization game. Civilization's profound alignment with human history and society necessitates extensive learning, while its ever-changing situations demand strong reasoning to generalize. Particularly, CivRealm sets up an imperfect-information general-sum game with a changing number of players; it presents a plethora of complex features, challenging the agent to deal with open-ended stochastic environments that require diplomacy and negotiation skills. Within CivRealm, we provide interfaces for two typical agent types: tensor-based agents that focus on learning, and language-based agents that emphasize reasoning. To catalyze further research, we present initial results for both paradigms. The canonical RL-based agents exhibit reasonable performance in mini-games, whereas both RL- and LLM-based agents struggle to make substantial progress in the full game. Overall, CivRealm stands as a unique challenge to explore the frontiers of both learning and reasoning for decision-making agents.

POSTER-221: FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods

Primary Area: datasets and benchmarks

Keywords: fairness benchmark

Scores: [ 8 5 8 6 ]

POSTER-222: TiC-CLIP: Continual Training of CLIP Models

Primary Area: datasets and benchmarks

Keywords: CLIP Continual learning distribution shift benchmark

Scores: [ 6 5 8 6 ]

Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TIC-DataComp, TIC-YFCC, and TIC-RedCaps with over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI’s CLIP (trained on data up to 2020) loses 8% zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by 2.5× when compared to the standard practice of retraining from scratch.

POSTER-223: OWL: A Large Language Model for IT Operations

Primary Area: datasets and benchmarks

Keywords: Large language model IT operations

Scores: [ 5 5 6 6 ]

POSTER-224: SmartPlay : A Benchmark for LLMs as Intelligent Agents

Primary Area: datasets and benchmarks

Keywords: Large Language Models

Scores: [ 8 5 6 8 ]

Recent large language models (LLMs) have demonstrated great potential toward intelligent agents and next-gen automation, but there currently lacks a systematic benchmark for evaluating LLMs' abilities as agents. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, Minecraft. Each game features a unique setting, providing up to 20 evaluation settings and infinite environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. The distinction between the set of capabilities each game test allows us to analyze each capability separately.SmartPlay serves not only as a rigorous testing ground for evaluating the overall performance of LLM agents but also as a road-map for identifying gaps in current methodologies. We release our benchmark at github.com/LLMsmartplay/SmartPlay

SPOTLIGHT-45: What's In My Big Data?

Primary Area: datasets and benchmarks

Keywords: nlp dataset analaysis data-statistics data-quality PII

Scores: [ 10 5 8 5 ]

POSTER-225: MetaTool Benchmark: Deciding Whether to Use Tools and Which to Use

Primary Area: datasets and benchmarks

Keywords: Tool usage large language model benchmark dataset

Scores: [ 6 6 8 5 ]

POSTER-226: EasyTPP: Towards Open Benchmarking Temporal Point Processes

Primary Area: datasets and benchmarks

Keywords: Event sequence Temporal point process open benchmarking

Scores: [ 8 3 3 10 ]

POSTER-227: ImagenHub: Standardizing the evaluation of conditional image generation models

Primary Area: datasets and benchmarks

Keywords: Image Generation Image Editing Evaluation Benchmark Diffusion Model

Scores: [ 6 8 8 5 ]

Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image generation, etc. However, we observe huge inconsistencies in experimental conditions: datasets, inference, and evaluation metrics -- render fair comparisons difficult. This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all the conditional image generation models. Firstly, we define seven prominent tasks and curate high-quality evaluation datasets for them. Secondly, we built a unified inference pipeline to ensure fair comparison. Thirdly, we design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves a high inter-worker agreement of Krippendorff’s alpha on 76% models with a value higher than 0.4. We comprehensively evaluated a total of around 30 models and observed three key takeaways: (1) the existing models’ performance is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation, with 74% models achieving an overall score lower than 0.5. (2) we examined the claims from published papers and found 83% of them hold with a few exceptions. (3) None of the existing automatic metrics has a Spearman's correlation higher than 0.2 except subject-driven image generation. Moving forward, we will continue our efforts to evaluate newly published models and update our leaderboard to keep track of the progress in conditional image generation.

POSTER-228: Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks

Primary Area: datasets and benchmarks

Keywords: conformer ensembles geometric learning

Scores: [ 8 5 6 ]

Molecular Representation Learning (MRL) has proven impactful in numerous biochemical applications such as drug discovery and enzyme design. While Graph Neural Networks (GNNs) are effective at learning molecular representations from a 2D molecular graph or a single 3D structure, existing works often overlook the flexible nature of molecules, which continuously interconvert across conformations via chemical bond rotations and minor vibrational perturbations. To better account for molecular flexibility, some recent works formulate MRL as an ensemble learning problem, focusing on explicitly learning from a set of conformer structures. However, most of these studies have limited datasets, tasks, and models. In this work, we introduce the first MoleculAR Conformer Ensemble Learning (MARCEL) benchmark to thoroughly evaluate the potential of learning on conformer ensembles and suggest promising research directions. MARCEL includes four datasets covering diverse molecule- and reaction-level properties of chemically diverse molecules including organocatalysts and transition-metal catalysts, extending beyond the scope of common GNN benchmarks that are confined to drug-like molecules. In addition, we conduct a comprehensive empirical study, which benchmarks representative 1D, 2D, and 3D molecular representation learning models, along with two strategies that explicitly incorporate conformer ensembles into 3D MRL models. Our findings reveal that direct learning from an accessible conformer space can improve performance on a variety of tasks and models.

SPOTLIGHT-46: InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Primary Area: datasets and benchmarks

Keywords: video-language dataset video understanding video generation multimodal understanding

Scores: [ 6 8 6 8 ]

ORAL-13: MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Primary Area: datasets and benchmarks

Keywords: large language models large multimodal models mathematical reasoning vision-language reasoning foundation models and their evaluations

Scores: [ 8 8 8 5 ]

POSTER-229: Data Filtering Networks

Primary Area: datasets and benchmarks

Keywords: datasets dataset filtering dataset curation large scale machine learning

Scores: [ 6 5 8 8 ]

Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing larger models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI’s WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.

POSTER-230: Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning

Primary Area: datasets and benchmarks

Keywords: data pruning dataset distillation random sampling corset selection data-efficient learning

Scores: [ 6 3 3 8 6 ]

Methods for carefully selecting or generating a small set of training data to learn from, i.e., data pruning, coreset selection, and dataset distillation, have been shown to be effective in reducing the ever-increasing cost of training neural networks. Behind this success are rigorously designed, yet expensive, strategies for identifying the most informative training examples out of large datasets. In this work, we revisit these methods to understand if the additional computational costs associated with such strategies are justified from the perspective of time-to-accuracy, which has become a critical efficiency measure of deep neural network training over large datasets. Surprisingly, we find that many of the recently proposed methods underperform what we call Repeated Sampling of Random Subsets (RSRS or RS2), a powerful yet overlooked extension of the standard random baseline that learns from repeatedly sampled data throughout training instead of a fixed random subset. We test RS2 against thirty-two state-of-the-art data pruning and distillation methods across four datasets including ImageNet. Our results demonstrate that RS2 significantly reduces time-to-accuracy, particularly in practical regimes where accuracy, but not runtime, is similar to that of training on full dataset. For example, when training ResNet-18 on ImageNet, with 10% of the dataset each epoch RS2 reaches an accuracy of 66% versus 69% when training with the full dataset. The best competing method achieves only 55% while training 1.6$\times$ slower than RS2. Beyond the above meta-study, we discuss the theoretical properties of RS2 such as its convergence rate and generalization error. Our primary goal is to highlight that future works that aim to minimize total training cost by using subset selection, need to consider 1) the total computation cost (including preparing the subset) and 2) should aim to outperform a simple extension of random sampling (i.e., RS2).

POSTER-231: Skill-Mix: a Flexible and Expandable Family of Evaluations for AI Models

Primary Area: datasets and benchmarks

Keywords: Large language model skill evaluation LLM benchmark emergence

Scores: [ 5 8 6 ]

POSTER-232: Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Primary Area: datasets and benchmarks

Keywords: instruction tuning multimodal large language model hallucination datasets

Scores: [ 8 8 8 6 ]

Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visualinstructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several publicdatasets compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data will be released upon publication.

POSTER-233: A Benchmark Study on Calibration

Primary Area: datasets and benchmarks

Keywords: Calibration

Scores: [ 6 3 6 ]

Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS.

SPOTLIGHT-47: Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

Primary Area: datasets and benchmarks

Keywords: Open-ended VQA benchmark Vision-Language VL Vision-Text VLM Vision-Language models Image classification Visual question answering Text-generating VLM

Scores: [ 8 8 5 ]

The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models’ capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.

POSTER-234: MetaCoCo: A New Few-Shot Classification Benchmark with Spurious Correlation

Primary Area: datasets and benchmarks

Keywords: Benchmark; Few-shot classification; Spurious-correlation shifts

Scores: [ 8 5 8 ]

Out-of-distribution (OOD) problems in few-shot classification (FSC) occur when novel classes sampled from testing distributions differ from base classes drawn from training distributions, which considerably degrades the performance of deep learning models deployed in real-world applications. Recent studies suggest that the OOD problems in FSC mainly including: (a) cross-domain few-shot classification (CD-FSC) and (b) spurious-correlation few-shot classification (SC-FSC). Specifically, CD-FSC occurs when a classifier learns transferring knowledge from base classes drawn from \underline{seen} training distributions but recognizes novel classes sampled from unseen testing distributions. In contrast, SC-FSC arises when a classifier relies on non-causal features (or contexts) that happen to be correlated with the labels (or concepts) in base classes but such relationships no longer hold during the model deployment. Despite CD-FSC has been extensively studied, SC-FSC remains understudied due to lack of the corresponding evaluation benchmarks. To this end, we present Meta Concept Context (MetaCoCo), a benchmark with spurious-correlation shifts collected from real-world scenarios. Moreover, to quantify the extent of spurious-correlation shifts of the presented MetaCoCo, we further propose a metric by using CLIP as a pre-trained vision-language model. Extensive experiments on the proposed benchmark are performed to evaluate the state-of-the-art methods in FSC, cross-domain shifts, and self-supervised learning. The experimental results show that the performance of the existing methods degrades significantly in the presence of spurious-correlation shifts. We open-source all codes of our benchmark and hope that the proposed MetaCoCo can facilitate future research on spurious-correlation shifts problems in FSC.

POSTER-235: SpaCE: The Spatial Confounding Environment

Primary Area: datasets and benchmarks

Keywords: causal inference datasets benchmarks spatial confounding public health

Scores: [ 8 6 6 ]

Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozens of datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.

SPOTLIGHT-48: FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Primary Area: datasets and benchmarks

Keywords: large language models language model evaluation natural language processing

Scores: [ 8 8 6 ]

Evaluation of Large Language Models (LLMs) is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction. However, previous studies have mainly focused on coarse-grained evaluation (i.e. overall preference-based evaluation), which limits interpretability since it does not consider the nature of user instructions that require instance-wise skill composition. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets), a fine-grained evaluation protocol for both human-based and model-based evaluation which decomposes coarse-level scoring to a skill set-level scoring for each instruction. We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations.

POSTER-236: Jumanji: a Diverse Suite of Scalable Reinforcement Learning Environments in JAX

Primary Area: datasets and benchmarks

Keywords: reinforcement learning jax combinatorial research

Scores: [ 5 8 6 6 ]

Open-source reinforcement learning (RL) environments have played a crucial role in driving progress in the development of AI algorithms.In modern RL research, there is a need for simulated environments that are performant, scalable, and modular to enable their utilization in a wider range of potential real-world applications.Therefore, we present Jumanji, a suite of diverse RL environments specifically designed to be fast, flexible, and scalable.Jumanji provides a suite of environments focusing on combinatorial problems frequently encountered in industry, as well as challenging general decision-making tasks.By leveraging the efficiency of JAX and hardware accelerators like GPUs and TPUs, Jumanji enables rapid iteration of research ideas and large-scale experimentation, ultimately empowering more capable agents.Unlike existing RL environment suites, Jumanji is highly customizable, allowing users to tailor the initial state distribution and problem complexity to their needs.Furthermore, we provide actor-critic baselines for each environment, accompanied by preliminary findings on scaling and generalization scenarios.Jumanji aims to set a new standard for speed, adaptability, and scalability of RL environments.

SPOTLIGHT-49: (InThe)WildChat: 570K ChatGPT Interaction Logs In The Wild

Primary Area: datasets and benchmarks

Keywords: new corpus user study dialogues conversations chatGPT instruction fine-tuning toxicity safe guarding AI safety

Scores: [ 5 6 6 8 ]

SPOTLIGHT-50: RealChat-1M: A Large-Scale Real-World LLM Conversation Dataset

Primary Area: datasets and benchmarks

Keywords: large language models dataset conversation safety benchmark

Scores: [ 6 8 8 8 ]

Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications.In this paper, we introduce RealChat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs.This dataset is collected from 210K unique IP addresses in the wild on our chat demo website.We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale.We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities.The dataset will be publicly available.

POSTER-237: Compressing LLMs: The Truth is Rarely Pure and Never Simple

Primary Area: datasets and benchmarks

Keywords: Compression Large Language Models Pruning Quantization

Scores: [ 6 8 5 8 ]

POSTER-238: KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Primary Area: datasets and benchmarks

Keywords: Large Language Model World Knowledge Evolving Benchmark

Scores: [ 5 6 8 8 ]

ORAL-14: How Well Do Supervised Models Transfer to 3D Image Segmentation?

Primary Area: datasets and benchmarks

Keywords: Transfer Learning Medical Image Analysis Organ Segmentation

Scores: [ 6 8 6 6 8 ]

The pre-training and fine-tuning paradigm has become prominent in transfer learning. For example, if the model is pre-trained on ImageNet and then fine-tuned to PASCAL, it can significantly outperform that trained directly on PASCAL. While ImageNet pre-training has shown enormous success, it is formed in 2D and the learned features are for classification tasks. Therefore, when transferring to more diverse tasks, like 3D image segmentation, its performance is inevitably compromised due to the deviation from the original ImageNet context. A significant challenge lies in the lack of large, annotated 3D datasets rivaling the scale of ImageNet for model pre-training. To overcome this challenge, we make two contributions. Firstly, we construct ImageNetCT-9K that comprises 9,262 three-dimensional computed tomography (CT) volumes with high-quality, per-voxel annotations. Secondly, we develop a suite of models that is supervised pre-trained on our ImageNetCT-9K. Our preliminary analyses indicate that the model trained only with 20 CT volumes, 640 masks, and 40 GPU hours has a transfer learning ability similar to the model trained with 5,050 CT volumes and 1,152 GPU hours. More importantly, the transfer learning ability of supervised models can further scale up with larger annotated datasets (i.e., SPT), achieving significantly better performance than all existing 3D models, irrespective of their pre-training methodologies or sources. We hope this study can facilitate collective efforts in constructing larger 3D vision datasets and more releases of supervised pre-trained models. Our code is attached as supplementary and will be publicly available.

POSTER-239: LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

Primary Area: datasets and benchmarks

Keywords: task planning language models benchmarking embodied agents home robots

Scores: [ 6 6 6 6 ]

Large language models (LLMs) have recently received considerable attention as alternative solutions for task planning. However, comparing the performance of language-oriented task planners becomes difficult, and there exists a dearth of detailed exploration regarding the effects of various factors such as pre-trained model selection and prompt construction. To address this, we propose a benchmark system for automatically quantifying performance of task planning for home-service embodied agents. Task planners are tested on two pairs of datasets and simulators: 1) ALFRED and AI2-THOR, 2) an extension of Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several enhancements of the baseline planner. We expect that the proposed benchmark tool would accelerate the development of language-oriented task planners.

SPOTLIGHT-51: Do Large Language Models Know about Facts?

Primary Area: datasets and benchmarks

Keywords: Large Language Models Resource and Evaluation Interpretability NLP Application

Scores: [ 8 6 5 8 ]

Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks. The factual knowledge acquired during pretraining and instruction tuning can be useful in various downstream tasks, such as question answering, and language generation. Unlike conventional Knowledge Bases (KBs) that explicitly store factual knowledge, LLMs implicitly store facts in their parameters. Content generated by the LLMs can often exhibit inaccuracies or deviations from the truth, due to facts that can be incorrectly induced or become obsolete over time. To this end, we aim to comprehensively evaluate the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio. Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages. Furthermore, we investigate whether LLMs are able to compose multiple facts, update factual knowledge temporally, reason over multiple pieces of facts, identify subtle factual differences, and resist adversarial examples. Extensive experiments on different sizes and types of LLMs show that existing LLMs still lack factual knowledge and suffer from various spurious correlations. We believe this is a critical bottleneck for realizing trustworthy artificial intelligence. The dataset Pinocchio and our codes will be publicly available.

SPOTLIGHT-52: MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data

Primary Area: datasets and benchmarks

Keywords: theorem proving math word problem mathematical reasoning benchmark

Scores: [ 8 6 8 ]

Recent large language models (LLMs) have witnessed significant advancement in various tasks, including mathematical reasoning and theorem proving. As these two tasks require strict and formal multi-step inference, they are appealing domains for exploring the reasoning ability of LLMs but still formidable challenges. Previous studies such as Chain-of-Thought (CoT) have revealed the effectiveness of intermediate steps guidance. However, such step-wise annotation requires heavy labor, leading to insufficient training steps for current benchmarks. To fill this gap, this work introduces MUSTARD, a data generation framework that masters uniform synthesis of theorem and proof data of high quality and diversity. MUSTARD synthesizes data in three stages: (1) It samples a few mathematical concept seeds as the problem category. (2) Then it prompts a generative language model with the sampled concepts to obtain both the problems and their step-wise formal solutions. (3) Lastly, the framework utilizes a proof assistant (e.g., Lean Prover) to filter the valid proofs. With the proposed MUSTARD, we present a theorem-and-proof benchmark MUSTARDSAUCE with 7,335 valid data points. Each data point contains an informal statement, an informal proof, and a translated formal proof that passes the prover validation. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data. We further apply the MUSTARDSAUCE for fine-tuning smaller language models. The fine-tuned Llama 2-7B achieves improvements by 20.9% on zero-shot inference on GSM8K and achieves 8.7 of pass@1 on mathlib, which again demonstrates the data quality and their effectiveness on mathematical tasks.

POSTER-240: CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

Primary Area: datasets and benchmarks

Keywords: audio audio-language compositional reasoning

Scores: [ 6 6 8 6 ]

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose \textbf{CompA}, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.

SPOTLIGHT-53: EQA-MX: Embodied Question Answering using Multimodal Expression

Primary Area: datasets and benchmarks

Keywords: multimodal representation learning visual-language models embodied question answering

Scores: [ 8 8 8 8 ]

Humans predominantly use verbal utterances and nonverbal gestures (e.g., eye gaze and pointing gestures) in their natural interactions. For instance, pointing gestures and verbal information is often required to comprehend questions such as "what object is that?" Thus, this question-answering (QA) task involves complex reasoning of multimodal expressions (verbal utterances and nonverbal gestures). However, prior works have explored QA tasks in non-embodied settings, where questions solely contain verbal utterances from a single verbal and visual perspective. In this paper, we have introduced 8 novel embodied question answering (EQA) tasks to develop learning models to comprehend embodied questions with multimodal expressions. We have developed a novel large-scale dataset, EQA-MX, with over 8 million diverse embodied QA data samples involving multimodal expressions from multiple visual and verbal perspectives. To learn salient multimodal representations from discrete verbal embeddings and continuous wrapping of multiview visual representations, we propose a vector-quantization (VQ) based multimodal representation learning model, VQ-Fusion, for the EQA tasks. Our extensive experimental results suggest that VQ-Fusion can improve the performance of existing state-of-the-art visual-language models up to 13% across EQA tasks.

ORAL-15: BooookScore: A systematic exploration of book-length summarization in the era of LLMs

Primary Area: datasets and benchmarks

Keywords: summarization evaluation long context prompting LLM

Scores: [ 8 10 8 8 ]

Summarizing book-length documents ($>$100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than the oft-repetitive ones generated by LLaMA 2. Incremental updating yields lower BooookScore but higher level of detail than hierarchical merging, a trade-off sometimes preferred by human annotators. We release code and annotations after blind review to spur more principled research on book-length summarization.

POSTER-241: Towards Diverse Behaviors: A Benchmark for Imitation Learning with Human Demonstrations

Primary Area: datasets and benchmarks

Keywords: Imitation Learning Benchmark Datasets Diverse Behaviors

Scores: [ 8 6 8 ]

Imitation learning with human data has demonstrated remarkable success in teaching robots in a wide range of skills. However, the inherent diversity in human behavior leads to the emergence of multi-modal data distributions, thereby presenting a formidable challenge for existing imitation learning algorithms. Quantifying a model's capacity to capture and replicate this diversity effectively is still an open problem. In this work, we introduce simulation benchmark environments and the corresponding Datasets with Diverse human Demonstrations for Imitation Learning (D3IL), designed explicitly to evaluate a model's ability to learn multi-modal behavior. Our environments are designed to involve multiple sub-tasks that need to be solved, consider manipulation of multiple objects which increases the diversity of the behavior and can only be solved by policies that rely on closed loop sensory feedback. Other available datasets are missing at least one of these challenging properties.To address the challenge of diversity quantification, we introduce tractable metrics that provide valuable insights into a model's ability to acquire and reproduce diverse behaviors. These metrics offer a practical means to assess the robustness and versatility of imitation learning algorithms. Furthermore, we conduct a thorough evaluation of state-of-the-art methods on the proposed task suite. This evaluation serves as a benchmark for assessing their capability to learn diverse behaviors. Our findings shed light on the effectiveness of these methods in tackling the intricate problem of capturing and generalizing multi-modal human behaviors, offering a valuable reference for the design of future imitation learning algorithms.

POSTER-242: Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models

Primary Area: datasets and benchmarks

Keywords: Label errors dataset cleaning AI safety toxicity harmless language models

Scores: [ 6 6 6 5 ]

POSTER-243: PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

Primary Area: datasets and benchmarks

Keywords: LLM evaluation

Scores: [ 5 8 8 ]

Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our findings reveal that PandaLM-7B offers a performance comparable to both GPT-3.5 and GPT-4. Impressively, PandaLM-70B surpasses their performance. PandaLM enables the evaluation of LLM to be fairer but with less cost, evidenced by significant improvements achieved by models tuned through PandaLM compared to their counterparts trained with default Alpaca's hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage.

POSTER-244: Habitat 3.0: A Co-Habitat for Humans, Avatars, and Robots

Primary Area: datasets and benchmarks

Keywords: Embodied AI Simulation

Scores: [ 8 6 6 ]

We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real human interaction with simulated robots via mouse/keyboard or a VR interface, facilitating evaluation of robot policies with human input. (3) Collaborative tasks: studying two collaborative tasks, Social Navigation and Social Rearrangement. Social Navigation investigates a robot's ability to locate and follow humanoid avatars in unseen environments, whereas Social Rearrangement addresses collaboration between a humanoid and robot while rearranging a scene. These contributions allow us to study end-to-end learned and heuristic baselines for human-robot collaboration in-depth, as well as evaluate them with humans in the loop. Our experiments demonstrate that learned robot policies lead to efficient task completion when collaborating with unseen humanoid agents and human partners that might exhibit behaviors that the robot has not seen before. Additionally, we observe emergent behaviors during collaborative task execution, such as the robot yielding space when obstructing a humanoid agent, thereby allowing the effective completion of the task by the humanoid agent. Furthermore, our experiments using the human-in-the-loop tool demonstrate that our automated evaluation with humanoids can provide an indication of the relative ordering of different policies when evaluated with real human collaborators. Habitat 3.0 unlocks interesting new features in simulators for Embodied AI, and we hope it paves the way for a new frontier of embodied human-AI interaction capabilities.

POSTER-245: Evaluating Language Models Through Negotiations

Primary Area: datasets and benchmarks

Keywords: language model evaluation dynamic evaluation alignment cooperative AI agency

Scores: [ 5 3 8 ]

Commercial interests are racing to exploit language models' remarkable capability to display agent-like behavior. Indeed, a future where personal LM-based agents are widely adopted to perform complicated tasks involving planning and negotiating appears increasingly plausible. Current, predominantly static evaluation methods are ill-suited to evaluate such dynamic, multi-step applications. In this work, we therefore propose jointly evaluating LM performance and alignment through the lens of negotiation games. We argue that this common real-world task provides scalable, difficult-to-hack performance metrics while offering non-trivial insights into model decision-making. Crucially, negotiation games allow us to study both competitive and cooperative performance, modulate complexity, and side-step accidental evaluation data leakage. Using our evaluation setup, we report results for publicly accessible LMs from all major providers on a variety of negotiation games. Noteworthy takeaways include: (i) open-source models are currently unable to complete this task, (ii) cooperative bargaining games prove challenging, and (iii) the most powerful models do not always 'win'. Evaluation through negotiations complements existing evaluation efforts by providing a novel evaluation paradigm to study evolving language model agency. We release an open-source library to accelerate research in this critical direction and lower the technical boundaries for researchers outside of the machine learning field to contribute.

POSTER-246: MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction Following

Primary Area: datasets and benchmarks

Keywords: Instruction Tuning Large Language Models Automatic Data Generation

Scores: [ 6 6 5 8 ]

In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes: i) Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence. ii) Scaling Input-Free Tasks: Enlarging tasks, each composed of an (instruction, output) pair (without requiring a separate input anymore). However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in instruction following when dealing with instances in Scaling-Inputs. This work introduces MUFFIN, a new scheme of instruction-following dataset curation. Specifically, we automatically Scale Tasks per Input by diversifying these tasks with various input facets. Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.

POSTER-247: I-PHYRE: Interactive Physical Reasoning

Primary Area: datasets and benchmarks

Keywords: Intuitive physics physical reasoning

Scores: [ 6 6 8 6 ]

Current evaluation protocols predominantly assess physical reasoning in stationary scenes, creating a gap in evaluating agents' abilities to interact with dynamic events. While contemporary methods allow agents to modify initial scene configurations and observe consequences, they lack the capability to interact with events in real time. To address this, we introduce I-PHYRE, a framework that challenges agents to simultaneously exhibit intuitive physical reasoning, multi-step planning, and in-situ intervention. Here, intuitive physical reasoning refers to a quick, approximate understanding of physics to address complex problems; multi-step denotes the need for extensive sequence planning in I-PHYRE, considering each intervention can significantly alter subsequent choices; and in-situ implies the necessity for timely object manipulation within a scene, where minor timing deviations can result in task failure. We formulate four game splits to scrutinize agents' learning and generalization of essential principles of interactive physical reasoning, fostering learning through interaction with representative scenarios. Our exploration involves three planning strategies and examines several supervised and reinforcement agents' zero-shot generalization proficiency on I-PHYRE. The outcomes highlight a notable gap between existing learning algorithms and human performance, emphasizing the imperative for more research in enhancing agents with interactive physical reasoning capabilities. The environment and baselines will be made publicly available.

SPOTLIGHT-54: Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

Primary Area: datasets and benchmarks

Keywords: Benchmark Vision-Language Large Language Models Low-level Vision Image Quality Assessment

Scores: [ 8 8 6 ]

The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs.

POSTER-248: Learning Mean Field Games on Sparse Graphs: A Hybrid Graphex Approach

Primary Area: general machine learning (i.e., none of the above)

Keywords: Mean Field Games Equilibrium Learning Networks Large Graphs Multi Agent Systems

Scores: [ 6 8 6 ]

Learning the behavior of large agent populations is an important task for numerous research areas. Although the field of multi-agent reinforcement learning (MARL) has made significant progress towards solving these systems, solutions for many agents often remain computationally infeasible and lack theoretical guarantees. Mean Field Games (MFGs) address both of these issues and can be extended to Graphon MFGs (GMFGs) to include network structures between agents. Despite their merits, the real world applicability of GMFGs is limited by the fact that graphons only capture dense graphs. Since most empirically observed networks show some degree of sparsity, such as power law graphs, the GMFG framework is insufficient for capturing these network topologies. Thus, we introduce the novel concept of Graphex MFGs (GXMFGs) which builds on the graph theoretical concept of graphexes. Graphexes are the limiting objects to sparse graph sequences that also have other desirable features such as the small world property. Learning equilibria in these games is challenging due to the rich and sparse structure of the underlying graphs. To tackle these challenges, we design a new learning algorithm tailored to the GXMFG setup. This hybrid graphex learning approach leverages that the system mainly consists of a highly connected core and a sparse periphery. After defining the system and providing a theoretical analysis, we state our learning approach and demonstrate its learning capabilities on both synthetic graphs and real-world networks. This comparison shows that our GXMFG learning algorithm successfully extends MFGs to a highly relevant class of hard, realistic learning problems that are not accurately addressed by current MARL and MFG methods.

POSTER-249: In defense of parameter sharing for model-compression

Primary Area: general machine learning (i.e., none of the above)

Keywords: parameter sharing model compression pruning

Scores: [ 3 5 8 6 ]

POSTER-250: Safe Collaborative Filtering

Primary Area: general machine learning (i.e., none of the above)

Keywords: recommender systems collaborative filtering scalability

Scores: [ 8 8 6 ]

Excellent tail performance is crucial for modern machine learning tasks, such as algorithmic fairness, class imbalance, and risk-sensitive decision making, as it ensures the effective handling of challenging samples within a dataset. Tail performance is also a vital determinant of success for personalized recommender systems to reduce the risk of losing users with low satisfaction. This study introduces a "safe" collaborative filtering method that prioritizes recommendation quality for less-satisfied users rather than focusing on the average performance. Our approach minimizes the conditional value at risk (CVaR), which represents the average risk over the tails of users' loss. To overcome computational challenges for web-scale recommender systems, we develop a robust yet practical algorithm that extends the most scalable method, implicit alternating least squares (iALS). Empirical evaluation on real-world datasets demonstrates the excellent tail performance of our approach while maintaining competitive computational efficiency.

POSTER-251: Sparse MoE with Language Guided Routing for Multilingual Machine Translation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Sparse Mixture-of-Experts Multilingual Machine Translation Language Guided Routing

Scores: [ 8 8 6 8 ]

Sparse Mixture-of-Experts (SMoE) has gained increasing popularity as a promising framework for scaling up multilingual machine translation (MMT) models with negligible extra computational overheads. However, current SMoE solutions neglect the intrinsic structures of the MMT problem: (a) Linguistics Hierarchy. Languages are naturally grouped according to their lingual properties like genetic families, phonological characteristics, etc; (b) Language Complexity. The learning difficulties are varied for diverse languages due to their grammar complexity, available resources, etc. Therefore, routing a fixed number of experts (e.g., 1 or 2 experts in usual) only at the word level leads to inferior performance. To fill in the missing puzzle, we propose \texttt{Lingual-SMoE} by equipping the SMoE with adaptive and linguistic-guided routing policies. Specifically, it (1) extracts language representations to incorporate linguistic knowledge and uses them to allocate experts into different groups; (2) determines the number of activated experts for each target language in an adaptive and automatic manner, according to their translation difficulties, which aims to mitigate the potential over-/under-fitting issues of learning simple/challenges translations. Sufficient experimental studies on MMT benchmarks with {16, 50, 100} language pairs and various network architectures, consistently validate the superior performance of our proposals. For instance, Lingual-SMoE outperforms its dense counterpart by over 5% BLEU scores on OPUS-100 dataset. Codes are included in the supplement.

SPOTLIGHT-55: Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Explainability Surrogate Models Neural Tangent Kernel Deep Learning Attribution

Scores: [ 8 8 8 5 8 ]

POSTER-252: LQ-LoRA: Low-rank plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Low-rank plus Quantized Matrix Decomposition Efficient Language Model Finetuning

Scores: [ 5 6 8 8 ]

We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component, which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Across standard NLP benchmarks, our low-rank plus quantized matrix decomposition approach (LQ-LoRA) is found to perform well against strong baselines.

POSTER-253: Federated Recommendation with Additive Personalization

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated Learning Federated Recommendation System

Scores: [ 6 8 8 ]

Building recommendation systems via federated learning (FL) is a new emerging challenge for next-generation Internet service. Existing FL models share item embedding across clients while keeping the user embedding private and local on the client side. However, identical item embedding cannot capture users' individual differences in perceiving the same item and may lead to poor personalization. Moreover, dense item embedding in FL results in expensive communication costs and latency. To address these challenges, we propose \underline{\textbf{Fed}}erated \underline{\textbf{R}}ecommendation with \underline{\textbf{A}}dditive \underline{\textbf{P}}ersonalization (\textbf{FedRAP}), which learns a global view of items via FL and a personalized view locally on each user. FedRAP encourages a sparse global view to save FL's communication cost and enforces the two views to be complementary via two regularizers. We propose an effective curriculum to learn the local and global views progressively with increasing regularization weights. To produce recommendations for a user, FedRAP adds the two views together to obtain a personalized item embedding. FedRAP achieves the best performance in FL setting on multiple benchmarks. It outperforms recent federated recommendation methods and several ablation study baselines. \looseness-1

POSTER-254: Statistical Rejection Sampling Improves Preference Optimization

Primary Area: general machine learning (i.e., none of the above)

Keywords: large language model preference optimization language model alignment

Scores: [ 6 6 5 ]

Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized online Reinforcement Learning from Human Feedback (RLHF). Recently, offline methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have emerged as attractive alternatives, offering improvements in stability and scalability while maintaining competitive performance. SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. The absence of a reward model in DPO constrains its ability to sample preference pairs from the optimal policy. Meanwhile, SLiC can only sample preference pairs from the SFT policy. To address these limitations, we introduce a novel approach called Statistical Rejection Sampling Optimization (RSO) designed to source preference data from the target optimal policy using rejection sampling, enabling a more accurate estimation of the optimal policy. We also propose a unified framework that enhances the loss functions used in both SLiC and DPO from a preference modeling standpoint. Through extensive experiments across diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO as evaluated by both Large Language Models (LLMs) and human raters.

POSTER-255: On the Stability of Expressive Positional Encodings for Graph Neural Networks

Primary Area: general machine learning (i.e., none of the above)

Keywords: graph neural networks positional encoding stability

Scores: [ 6 5 6 5 8 ]

Designing effective positional encodings for graphs is key to building powerful graph transformers and enhancing message-passing graph neural networks. Although widespread, using Laplacian eigenvectors as positional encodings faces two fundamental challenges: (1) Non-uniqueness: there are many different eigendecompositions of the same Laplacian, and (2) Instability: small perturbations to the Laplacian could result in completely different eigenspaces, leading to unpredictable changes in positional encoding. Despite many attempts to address non-uniqueness, most methods overlook stability, leading to poor generalization on unseen graph structures. We identify the cause of instability to be the use of "hard partition'' of eigenspaces. Hence, we introduce Stable and Expressive Positional Encodings (SPE), an architecture for processing eigenvectors that uses eigenvalues to softly partition'' eigenspaces. SPE is the first architecture that is (1) provably stable, and (2) universally expressive for basis invariant functions whilst respecting all symmetries of eigenvectors. Besides guaranteed stability, we prove that SPE is at least as expressive as existing methods, and highly capable of counting graph structures. Finally, we evaluate the effectiveness of our method on molecular property prediction, and out-of-distribution generalization tasks, finding improved generalization compared to existing positional encoding methods.

POSTER-256: LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition

Primary Area: general machine learning (i.e., none of the above)

Keywords: hyperspectral imaging optical modulation real-time detection vision transformer pre-acquisition modulation learnable mask weight binarization

Scores: [ 6 6 6 ]

Bandwidth constraints during signal acquisition frequently impede real-time detection applications. Hyperspectral data is a notable example, whose vast volume compromises real-time hyperspectral detection. To tackle this hurdle, we introduce a novel approach leveraging pre-acquisition modulation to reduce the acquisition volume. This modulation process is governed by a deep learning model, utilizing prior information. Central to our approach is LUM-ViT, a Vision Transformer variant. Uniquely, LUM-ViT incorporates a learnable under-sampling mask tailored for pre-acquisition modulation. To further optimize for optical calculations, we propose a kernel-level weight binarization technique and a three-stage fine-tuning strategy. Our evaluations reveal that, by sampling a mere 10% of the original image pixels, LUM-ViT maintains the accuracy loss within 1.8% on the ImageNet classification task. The method sustains near-original accuracy when implemented on real-world optical hardware, demonstrating its practicality.

POSTER-257: MVSFormer++: Revealing the Devil in the Transformer's Details for Multi-View Stereo

Primary Area: general machine learning (i.e., none of the above)

Keywords: Multi-View Stereo Transformer Depth Estimation

Scores: [ 6 5 5 8 6 ]

SPOTLIGHT-56: Learning No-Regret Sparse Generalized Linear Models with Varying Observation(s)

Primary Area: general machine learning (i.e., none of the above)

Keywords: Generalized Linear Models Learning with Varying Data Differential Equations

Scores: [ 8 6 8 ]

Generalized Linear Models (GLMs) encompass a wide array of regression and classification models, where prediction is a function of a linear combination of the input variables. Often in real-world scenarios, a number of observations would be added into or removed from the existing training dataset, necessitating the development of learning systems that can efficiently train optimal models with varying observations in an online (sequential) manner instead of retraining from scratch. Despite the significance of data-varying scenarios, most existing approaches to sparse GLMs concentrate on offline batch updates, leaving online solutions largely underexplored. In this work, we present the first algorithm without compromising accuracy for GLMs regularized by sparsity-enforcing penalties trained on varying observations. Our methodology is capable of handling the addition and deletion of observations simultaneously, while adaptively updating data-dependent regularization parameters to ensure the best statistical performance. Specifically, we recast sparse GLMs as a bilevel optimization objective upon varying observations and characterize it as an explicit gradient flow in the underlying space for the inner and outer subproblems we are optimizing over, respectively. We further derive a set of rules to ensure a proper transition at regions of non-smoothness, and establish the guarantees of theoretical consistency and finite convergence. Encouraging results are exhibited on real-world benchmarks.

POSTER-258: CoRe-GD: A Hierarchical Framework for Scalable Graph Visualization with GNNs

Primary Area: general machine learning (i.e., none of the above)

Keywords: Graph Visualization Optimization Scalability Graph Neural Networks

Scores: [ 6 6 6 5 ]

Graph Visualization, also known as Graph Drawing, aims to find geometric embeddings of graphs that optimize certain criteria. Stress is a widely used metric; stress is minimized when every pair of nodes is positioned at their shortest path distance. However, stress optimization presents computational challenges due to its inherent complexity and is usually solved using heuristics in practice. We introduce a scalable Graph Neural Network (GNN) based Graph Drawing framework with sub-quadratic runtime that can learn to optimize stress. Inspired by classical stress optimization techniques and force-directed layout algorithms, we create a coarsening hierarchy for the input graph. Beginning at the coarsest level, we iteratively refine and un-coarsen the layout, until we generate an embedding for the original graph. To enhance information propagation within the network, we propose a novel positional rewiring technique based on intermediate node positions. Our empirical evaluation demonstrates that the framework achieves state-of-the-art performance while remaining scalable.

POSTER-259: SAS: Structured Activation Sparsification

Primary Area: general machine learning (i.e., none of the above)

Keywords: Structured Sparse Pruning Projection

Scores: [ 6 6 6 ]

Wide networks usually yield better accuracy than their narrower counterpart at the expense of the massive mult cost.To break this tradeoff, we advocate a novel concept of Structured Activation Sparsification, dubbed SAS, which boosts accuracy without increasing computation by utilizing the projected sparsity in activation maps with a specific structure. Concretely, the projected sparse activation is allowed to have N nonzero value among M consecutive activations.Owing to the local structure in sparsity, the wide matmul between a dense weight and the sparse activation is executed as an equivalent narrow matmul between a dense weight and dense activation, which is compatible with NVIDIA's SparseTensorCore developed for the N:M structured sparse weight.In extensive experiments, we demonstrate that increasing sparsity monotonically improves accuracy (up to 7% on CIFAR10) without increasing the mult count.Furthermore, we show that structured sparsification of activation scales better than that of weight given the same computational budget.

POSTER-260: Skip-Attention: Improving Vision Transformers by Paying Less Attention

Primary Area: general machine learning (i.e., none of the above)

Keywords: vision transformers efficiency redundacy in attention maps improved throughput

Scores: [ 6 8 6 ]

This work aims to improve the efficiency of vision transformers (ViTs). While ViTs use computationally expensive self-attention operations in every layer, we identify that these operations are highly correlated across layers -- a key redundancy that causes unnecessary computations. Based on this observation, we propose SkipAT a method to reuse self-attention computation from preceding layers to approximate attention at one or more subsequent layers. To ensure that reusing self-attention blocks across layers does not degrade the performance, we introduce a simple parametric function, which outperforms the baseline transformer's performance while running computationally faster. We show that SkipAT is agnostic to transformer architecture and is effective in image classification, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS. We achieve improved throughput at the same-or-higher accuracy levels in all these tasks.

POSTER-261: The Expressive Leaky Memory Neuron: an Efficient and Expressive Phenomenological Neuron Model Can Solve Long-Horizon Tasks.

Primary Area: general machine learning (i.e., none of the above)

Keywords: Computational Neuroscience Phenomenological Neuron Modeling Cortical Neurons Recurrent Neural Networks Machine Learning Biologically Inspired Modelling

Scores: [ 5 8 6 8 ]

Biological cortical neurons are remarkably sophisticated computational devices,temporally integrating their vast synaptic input over an intricate dendritic tree,subject to complex, nonlinearly interacting internal biological processes. A recentstudy proposed to characterize this complexity by fitting accurate surrogate modelsto replicate the input-output relationship of a detailed biophysical cortical pyramidalneuron model and discovered it needed temporal convolutional networks (TCN)with millions of parameters. Requiring these many parameters, however, couldbe the result of a misalignment between the inductive biases of the TCN andcortical neuron’s computations. In light of this, and with the aim to explorethe computational implications of leaky memory units and nonlinear dendriticprocessing, we introduce the Expressive Leaky Memory (ELM) neuron model, abiologically inspired phenomenological model of a cortical neuron. Remarkably, byexploiting a few such slowly decaying memory-like hidden states and two-layerednonlinear integration of synaptic input, our ELM neuron can accurately matchthe aforementioned input-output relationship with under ten-thousand trainableparameters. To further assess the computational ramifications of our neuron design,we evaluate on various tasks with demanding temporal structures, including theLong Range Arena (LRA) datasets, as well as a novel neuromorphic dataset basedon the Spiking Heidelberg Digits dataset (SHD-Adding). Leveraging a largernumber of memory units with sufficiently long timescales, and correspondinglysophisticated synaptic integration, the ELM neuron proves to be competitive onboth datasets, reliably outperforming the classic Transformer or Chrono-LSTMarchitectures on latter, even solving the Pathfinder-X task with over 70% accuracy(16k context length). These findings indicate the importance of inductive biasesfor efficient surrogate neuron models and the potential for biologically motivatedmodels to enhance performance in challenging machine learning tasks.

POSTER-262: Bayesian Coreset Optimization for Personalized Federated Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: federated learning personalized federated learning bayesian coreset submodularity variational inference coresets optimization

Scores: [ 8 5 6 ]

In a distributed machine learning setting like Federated Learning where there are multiple clients involved which update their individual weights to a single central server, often training on the entire individual client's dataset for each client becomes cumbersome. To address this issue we propose CORESET-PFEDBAYES: a personalized coreset weighted federated learning setup where the training updates for each individual clients are forwarded to the central server based on only individual client coreset based representative data points instead of the entire client data. Through theoretical analysis we present how the average generalization error is minimax optimal up to logarithm bounds O(nk2β2β+dlog2δ(nk)), where nk denotes the coreset size and how the approximation error on the data likelihood differs from a vanilla Federated Learning setup as a function G(w) of the coreset weights w. Our experiments on different benchmark datasets based on a variety of recent personalized federated learning architectures show significant gains (+4.87% on MNIST, +8.61% on FashionMNIST, +9.71% on CIFAR in terms of model accuracy across ) as compared to random sampling on the training data followed by federated learning, thereby indicating how intelligently selecting such training samples can help in performance. Additionally, through experiments on medical datasets our proposed method showcases some gains (e.g. +9.74% under COVID-19 dataset) as compared to other submodular optimization based approaches used for subset selection on client's data.

POSTER-263: DEEP NEURAL NETWORK INITIALIZATION WITH SPARSITY INDUCING ACTIVATIONS

Primary Area: general machine learning (i.e., none of the above)

Keywords: Deep neural network random initialisation sparsity gaussian process

Scores: [ 6 6 8 6 ]

Inducing and leveraging sparse activations during training and inference is a promising avenue for improving the computational efficiency of deep networks, which is increasingly important as network sizes continue to grow and their application becomes more widespread. Here we use the large width Gaussian process limit to analyze the behaviour, at random initialization, of nonlinear activations that induce sparsity in the hidden outputs. A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification; those being a shifted ReLU (ϕ(x)=max(0,xτ) for τ0) and soft thresholding (ϕ(x)=0 for |x|τ and xsign(x)τ for |x|>τ). We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map. Numerical experiments verify the theory and show that the proposed magnitude clipped sparsifying activations can be trained with training and test fractional sparsity as high as 85% while retaining close to full accuracy.

POSTER-264: Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization

Primary Area: general machine learning (i.e., none of the above)

Keywords: adversarial attacks adversarial training adversarial purification

Scores: [ 6 5 6 ]

The deep neural networks are known to be vulnerable to well-designed adversarial attacks. The most successful defense technique based on adversarial training (AT) can achieve optimal robustness against particular attacks but cannot generalize well to unseen attacks. Another effective defense technique based on adversarial purification (AP) can enhance generalization but cannot achieve optimal robustness. Meanwhile, both methods share one common limitation on the degraded standard accuracy. To mitigate these issues, we propose a novel framework called Adversarial Training on Purification (AToP), which comprises two components: perturbation destruction by random transforms (RT) and purifier model fine-tuned (FT) by adversarial loss. RT is essential to avoid overlearning to known attacks resulting in the robustness generalization to unseen attacks and FT is essential for the improvement of robustness. To evaluate our method in an efficient and scalable way, we conduct extensive experiments on CIFAR-10, CIFAR-100, and ImageNette to demonstrate that our method achieves state-of-the-art results and exhibits generalization ability against unseen attacks.

POSTER-265: D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning

Primary Area: general machine learning (i.e., none of the above)

Keywords: coreset selection data pruning data graph message passing data distillation

Scores: [ 6 6 6 5 5 ]

In recent years, data quality has emerged as an important factor for training massive models. Analytical theories suggest that higher-quality data can lead to lower test errors in models trained on a fixed data budget. Moreover, a model can be trained on a lower compute budget without compromising performance if a dataset can be stripped of its redundancies. Coreset selection (or data pruning) seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as coreset. There are two dominant approaches: (1) geometry-based data selection for maximizing data diversity in the coreset, and (2) functions that assign difficulty scores to samples based on training dynamics. Optimizing for data diversity leads to a coreset that is biased towards easier samples, whereas, selection by difficulty ranking omits easy samples that are necessary for the training of deep learning models. This demonstrates that data diversity and importance scores are two complementary factors that need to be jointly considered during coreset selection. In this work, we represent a dataset as an undirected graph and propose a novel pruning algorithm, D2 Pruning, that uses message passing over this dataset graph for coreset selection. D2 Pruning updates the difficulty scores of each example by incorporating the difficulty of its neighboring examples in the dataset graph. Then, these updated difficulty scores direct a graph-based sampling method to select a coreset that encapsulates both diverse and difficult regions of the dataset space. We evaluate supervised and self-supervised versions of our method on various vision and NLP datasets. Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at low-to-medium pruning rates. Additionally, we find that using D2 Pruning for filtering large multimodal datasets leads to increased diversity in the dataset and improved generalization of pretrained models. Our work shows that D2 Pruning is a versatile framework for understanding and processing datasets.

POSTER-266: Let Models Speak Ciphers: Multiagent Debate through Embeddings

Primary Area: general machine learning (i.e., none of the above)

Keywords: multiagent debate large language models inter-model communication embedding representation

Scores: [ 5 6 6 5 ]

Discussion and debate among Large Language Models (LLMs) have gained considerable attention due to their potential to enhance the reasoning ability of LLMs. Although natural language is an obvious choice for communication due to LLM's language understanding capability, the token sampling step needed when generating natural language poses a potential risk of information loss, as it uses only one token to represent the model's belief across the entire vocabulary. In this paper, we introduce a communication regime named CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue. Specifically, we remove the token sampling step from LLMs and let them communicate their beliefs across the vocabulary through the expectation of the raw transformer output embeddings. Remarkably, by deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights, outperforming the state-of-the-art LLM debate methods using natural language by 1-3.5% across five reasoning tasks and multiple open-source LLMs of varying sizes. This showcases the superiority and robustness of embeddings as an alternative language" for communication among LLMs.

SPOTLIGHT-57: Towards Meta-Pruning via Optimal Transport

Primary Area: general machine learning (i.e., none of the above)

Keywords: Pruning Fusion

Scores: [ 8 8 6 8 6 ]

Pruning is one of the mainstream methods to compress over-parameterized neural networks, resulting in significant practical benefits. Recently, another line of work has explored the direction of fusion, i.e. merging, independently trained neural networks. Here, we seek to marry the two approaches in a bid to combine their advantages into a single approach, which we term `Intra-Fusion'. Specifically, we implicitly utilize the pruning criteria to result in more informed fusion. Agnostic to the choice of a specific neuron-importance metric, Intra-Fusion can typically prune an additional considerable amount of the parameters while retaining the same accuracy as the standard pruning approach. Additionally, we explore how fusion can be added to the pruning process to significantly decrease the training time while maintaining competitive performance. We benchmark our results for various networks on commonly used datasets such as CIFAR10, CIFAR100, and ImageNet. More broadly, we hope that the proposed approach invigorates exploration into a fresh alternative to the predominant compression approaches.

POSTER-267: Transformers can optimally learn regression mixture models

Primary Area: general machine learning (i.e., none of the above)

Keywords: transformers mixture models linear regression

Scores: [ 6 8 6 6 8 ]

Mixture models arise in many regression problems, but most methods have seen limited adoption partly due to these algorithms' highly-tailored and model-specific nature. On the other hand, transformers are flexible, neural sequence models that present the intriguing possibility of providing general-purpose prediction methods, even in this mixture setting. In this work, we investigate the hypothesis that transformers can learn an optimal predictor for mixtures of regressions. We construct a generative process for a mixture of linear regressions for which the decision-theoretic optimal procedure is given by data-driven exponential weights on a finite set of parameters. We observe that transformers achieve low mean-squared error on data generated via this process. By probing the transformer's output at inference time, we also show that transformers typically make predictions that are close to the optimal predictor. Our experiments also demonstrate that transformers can learn mixtures of regressions in a sample-efficient fashion and are somewhat robust to distribution shifts. We complement our experimental observations by proving constructively that the decision-theoretic optimal procedure is indeed implementable by a transformer.

SPOTLIGHT-58: Selective Mixup Fine-Tuning for Optimizing Non-Decomposable Metrics

Primary Area: general machine learning (i.e., none of the above)

Keywords: Non-Decomposable Objectives Long-Tail Learning Semi-Supervised Learning

Scores: [ 6 8 8 6 ]

The rise in internet usage has led to the generation of massive amounts of data, resulting in the adoption of various supervised and semi-supervised machine learning algorithms, which can effectively utilize the colossal amount of data to train models. However, before deploying these models in the real world, these must be strictly evaluated on performance measures like worst-case recall and satisfy constraints such as fairness. We find that current state-of-the-art empirical techniques offer sub-optimal performance on these practical, non-decomposable performance objectives. On the other hand, the theoretical techniques necessitate training a new model from scratch for each performance objective. To bridge the gap, we propose SelMix, a selective mixup-based inexpensive fine-tuning technique for pre-trained models, to optimize for the desired objective. The core idea of our framework is to determine a sampling distribution to perform a mixup of features between samples from particular classes such that it optimizes the given objective. We comprehensively evaluate our technique against the existing empirical and theoretically principled methods on standard benchmark datasets for imbalanced classification. We find that proposed SelMix fine-tuning significantly improves the performance for various practical non-decomposable objectives across benchmarks.

POSTER-268: Federated Wasserstein Distance

Primary Area: general machine learning (i.e., none of the above)

Keywords: Wasserstein distance Federated Learning ; Triangle inequality

Scores: [ 8 6 6 6 ]

POSTER-269: TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023

Primary Area: general machine learning (i.e., none of the above)

Keywords: tabular tabular data architecture deep learning neural networks

Scores: [ 3 6 6 8 ]

Deep learning (DL) models for tabular data problems (e.g. classification, regression) are currently receiving increasingly more attention from researchers.However, despite the recent efforts, the non-DL algorithms based on gradient-boosted decision trees (GBDT) remain a strong go-to solution for these problems.One of the research directions aimed at improving the position of tabular DL involves designing so-called retrieval-augmented models.For a target object, such models retrieve other objects (e.g. the nearest neighbors) from the available training data and use their features and labels to make a better prediction.In this work, we present TabR -- essentially, a feed-forward network with a novel k-Nearest-Neighbors-like component in the middle. On a set of public benchmarks with datasets up to several million objects, TabR marks a big step forward for tabular DL: it demonstrates the best average performance among tabular DL models, becomes the new state-of-the-art on several datasets, and even outperforms GBDT models on the recently proposed "GBDT-friendly" benchmark (see Figure 1).Among the novel findings and technical details powering TabR, the main ones lie in the attention-like mechanism that is responsible for retrieving the nearest neighbors and extracting valuable signal from them.In addition to the much higher performance, TabR is simple and significantly more efficient compared to prior retrieval-based tabular DL models.

POSTER-270: Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching

Primary Area: general machine learning (i.e., none of the above)

Keywords: Dataset Distillation

Scores: [ 8 6 8 6 ]

The ultimate goal of Dataset Distillation is to synthesize a small synthetic dataset such that a model trained on this synthetic set will perform equally well as a model trained on the full, real dataset. Until now, no method of Dataset Distillation has reached this completely lossless goal, in part due to the fact that previous methods only remain effective when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, it seems that to achieve truly loss dataset distillation, we must develop a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory-matching, or optimizing the synthetic data to induce similar long-term training dynamics as the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set since there are fewer examples wherein to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets since there are now enough samples to represent the necessary complex patterns. Based on our findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets will be released.

POSTER-271: Masked Structural Growth for 2x Faster Language Model Pre-training

Primary Area: general machine learning (i.e., none of the above)

Keywords: Progressive Learning Large Language Models Model Growth

Scores: [ 8 6 6 ]

POSTER-272: Robust Training of Federated Models with Extremely Label Deficiency

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated Learning Label Deficiency

Scores: [ 6 6 6 6 ]

Federated semi-supervised learning (FSSL) has emerged as a powerful paradigm for collaboratively training machine learning models using distributed data with label deficiency. Advanced FSSL methods predominantly focus on training a single model on each client. However, this approach could lead to a discrepancy between the objective functions of labeled and unlabeled data, resulting in gradient conflicts. To alleviate gradient conflict, we propose a novel twin-model paradigm, called Twinsight, designed to enhance mutual guidance by providing insights from different perspectives of labeled and unlabeled data. In particular, Twinsight concurrently trains a supervised model with a supervised objective function while training an unsupervised model using an unsupervised objective function. To enhance the synergy between these two models, Twinsight introduces a neighborhood-preserving constraint, which encourages the preservation of the neighborhood relationship among data features extracted by both models. Our comprehensive experiments on four benchmark datasets provide substantial evidence that Twinsight can significantly outperform state-of-the-art methods across various experimental settings, demonstrating the efficacy of the proposed Twinsight.

POSTER-273: From Graphs to Hypergraphs: Hypergraph Projection and its Remediation

Primary Area: general machine learning (i.e., none of the above)

Keywords: hypergraphs hypergraph projection learning-based reconstruction

Scores: [ 6 8 8 8 ]

POSTER-274: Large Language Models as Tool Makers

Primary Area: general machine learning (i.e., none of the above)

Keywords: large language models tool making tool using serving efficiency

Scores: [ 5 5 6 3 ]

Recent research has highlighted the potential of large language models (LLMs)to improve their problem-solving capabilities with the aid of suitable externaltools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMscreate their own reusable tools for problem-solving. Our approach consists of twophases: 1) tool making: an LLM acts as the tool maker that crafts tools for a setof tasks, where a tool is implemented as a Python utility function. 2) tool using:another LLM acts as the tool user, which applies the tool built by the tool makerfor problem-solving. The tool user can be either the same or a different LLMfrom the tool maker. On the problem-solving server side, tool-making enablescontinual tool generation and caching as new requests emerge. This frameworkenables subsequent requests to access cached tools via their corresponding APIs,enhancing the efficiency of task resolution. Beyond enabling LLMs to create theirown tools, our framework also uncovers intriguing opportunities to optimize theserving cost of LLMs: Recognizing that tool-making requires more sophisticatedcapabilities, we assign this task to a powerful, albeit resource-intensive, model.Conversely, the simpler tool-using phase is delegated to a lightweight model. Thisstrategic division of labor allows the once-off cost of tool-making to be spreadover multiple instances of tool-using, significantly reducing average costs whilemaintaining strong performance. Furthermore, our method offers a functionalcache through the caching and reuse of tools, which stores the functionality ofa class of requests instead of the natural language responses from LLMs, thusextending the applicability of the conventional cache mechanism. We evaluateour approach across various complex reasoning tasks, including Big-Bench tasks.With GPT-4 as the tool maker and GPT-3.5 as the tool user, LATM demonstratesperformance equivalent to using GPT-4 for both roles, but with a significantlyreduced inference cost.

Primary Area: general machine learning (i.e., none of the above)

Keywords: Neural Architecture Search Training-free NAS Bayesian Optimization Precision@K Partial Monitoring

Scores: [ 5 3 8 ]

Neural architecture search (NAS) has become a key component of AutoML and a standard tool to automate the design of deep neural networks. Recently, training-free NAS as an emerging paradigm has successfully reduced the search costs of standard training-based NAS by estimating the true architecture performance with only training-free metrics. Nevertheless, the estimation ability of these metrics typically varies across different tasks, making it challenging to achieve robust and consistently good search performance on diverse tasks with only a single training-free metric. Meanwhile, the estimation gap between training-free metrics and the true architecture performances limits training-free NAS to achieve superior performance. To address these challenges, we propose the robustifying and boosting training-free NAS (RoBoT) algorithm which (a) employs the optimized combination of existing training-free metrics explored from Bayesian optimization to develop a robust and consistently better-performing metric on diverse tasks, and (b) applies greedy search, i.e., the exploitation, on the newly developed metric to bridge the aforementioned gap and consequently to boost the search performance of standard training-free NAS further. Remarkably, the expected performance of our RoBoT can be theoretically guaranteed, which improves over the existing training-free NAS under mild conditions with additional interesting insights. Our extensive experiments on various NAS benchmark tasks yield substantial empirical evidence to support our theoretical results.

POSTER-276: Benchmarking and Improving Generator-Validator Consistency of Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: consistency language models self-critique

Scores: [ 6 8 6 ]

As of September 2023, ChatGPT correctly answers “what is 7+8” with 15, but when asked “7+8=15, True or False” it responds with “False”. This inconsistency between generating and validating an answer is prevalent in language models (LMs) and erodes trust. In this paper, we propose a framework for measuring the consistency between generation and validation (which we call generator-validator consistency, or GV-consistency), finding that even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time. To improve the consistency of LMs, we propose to finetune on the filtered generator and validator responses that are GV-consistent, and call this approach consistency fine-tuning. We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%, and the improvement extrapolates to unseen tasks and domains (e.g., GV-consistency for positive style transfers extrapolates to unseen styles like humor). In addition to improving consistency, consistency fine-tuning improves both generator quality and validator accuracy without using any labeled data. Evaluated across 6 tasks, including math questions, knowledge-intensive QA, and instruction following, our method improves the generator quality by 16% and the validator accuracy by 6.3% across all tasks.

POSTER-277: Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks

Primary Area: general machine learning (i.e., none of the above)

Keywords: Structured Matrix Block Low Rank Low Rank Efficient Neural Network Transformer Fourier Dirichlet Kernel FFT Boxcar Pruning Compression

Scores: [ 6 5 6 ]

ORAL-16: ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large Language Models Sparsity Activation Function ReLU Activation Function

Scores: [ 6 8 8 ]

POSTER-278: AffineQuant: Affine Transformation Quantization for Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: post-training quantization large language model Affine Transformation

Scores: [ 8 5 8 8 ]

The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training.Existing PTQ methods for LLMs limit the optimization scope to scaling transformations between pre- and post-quantization weights. This constraint results in significant errors after quantization, particularly in low-bit configurations. In this paper, we advocate for the direct optimization using equivalent Affine transformations in PTQ (AffineQuant). This approach extends the optimization scope and thus significantly minimizing quantization errors. Additionally, by employing the corresponding inverse matrix, we can ensure equivalence between the pre- and post-quantization outputs of PTQ, thereby maintaining its efficiency and generalization capabilities. To ensure the invertibility of the transformation during optimization, we further introduce a gradual mask optimization method. This method initially focuses on optimizing the diagonal elements and gradually extends to the other elements. Such an approach aligns with the Levy-Desplanques theorem, theoretically ensuring invertibility of the transformation. As a result, significant performance improvements are evident across different LLMs on diverse datasets. Notably, these improvements are most pronounced when using very low-bit quantization, enabling the deployment of large models on edge devices. To illustrate, we attain a C4 perplexity of 14.89 ({ 10.00$\downarrow$} vs 24.89 in OmniQuant) on the LLaMA-$7$B model of W$2$A$16$ quantization. AffineQuant significantly outperforms OmniQuant on smaller models, achieving a perplexity of 42.29 ({ 33.14$\downarrow$} vs 75.43 in OmniQuant) when using 2-bit 128-group quantization for OPT-$125$M, which setting a new state-of-the-art benchmark for PTQ in LLMs. Codes are available in the supplementary materials.

POSTER-279: LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units

Primary Area: general machine learning (i.e., none of the above)

Keywords: Legendre Memory Unit Spiking Neural Network Recurrent Neural Network

Scores: [ 6 6 8 3 ]

Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily resource-constrained. Thus motivated, many researchers have proposed reformulating the transformer models as RNN modules which modify the self-attention computation with explicit states. However, these approaches often incur significant performance degradation.The ultimate goal is to develop a model that has the following properties: parallel training, streaming and low-cost inference, and state-of-the-art (SOTA) performance. In this paper, we propose a new direction to achieve this goal. We show how architectural modifications to a fully-sequential recurrent model can help push its performance toward Transformer models while retaining its sequential processing capability. Specifically, inspired by the recent success of Legendre Memory Units (LMU) in sequence learning tasks, we propose LMUFormer, which augments the LMU with convolutional patch embedding and convolutional channel mixer. Moreover, we present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules while simultaneously reducing the computing complexity. We evaluated our architectures on multiple sequence datasets. Of particular note is our performance on the Speech Commands V2 dataset (35 classes). In comparison to SOTA transformer-based models within the ANN domain, our LMUFormer demonstrates comparable performance while necessitating a remarkable 70× reduction in parameters and a substantial 140× decrement in FLOPs. Furthermore, when benchmarked against extant low-complexity SNN variants, our model establishes a new SOTA with an accuracy of 96.12%. Additionally, owing to our model's proficiency in real-time data processing, we are able to achieve a 32.03% reduction in sequence length, all while incurring an inconsequential decline in performance.

POSTER-280: Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Self-consistency Chain-of-Thoughts Multi-Step Reasoning Large Language Models

Scores: [ 8 5 5 5 ]

POSTER-281: FedCompass: Efficient Cross-Silo Federated Learning on Heterogeneous Client Devices Using a Computing Power-Aware Scheduler

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated Learning Device Heterogeneity Cross-silo Federated Learning

Scores: [ 8 6 6 5 ]

Cross-silo federated learning offers a promising solution to collaboratively train robust and generalized AI models without compromising the privacy of local datasets, e.g., healthcare, financial, as well as scientific projects that lack a centralized data facility. Nonetheless, because of the disparity of computing resources among different clients (i.e., device heterogeneity), synchronous federated learning algorithms suffer from degraded efficiency when waiting for straggler clients. Similarly, asynchronous federated learning algorithms experience degradation in the convergence rate and final model accuracy on non-identically and independently distributed (non-IID) heterogeneous datasets due to stale local models and client drift. To address these limitations in cross-silo federated learning with heterogeneous clients and data, we propose FedCompass, an innovative semi-asynchronous federated learning algorithm with a computing power-aware scheduler on the server side, which adaptively assigns varying amounts of training tasks to different clients using the knowledge of the computing power of individual clients. FedCompass ensures that multiple locally trained models from clients are received almost simultaneously as a group for aggregation, effectively reducing the staleness of local models. At the same time, the overall training process remains asynchronous, eliminating prolonged waiting periods from straggler clients. Using diverse non-IID heterogeneous distributed datasets, we demonstrate that FedCompass achieves faster convergence and higher accuracy than other asynchronous algorithms while remaining more efficient than synchronous algorithms when performing federated learning on heterogeneous clients.

POSTER-282: DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness

Primary Area: general machine learning (i.e., none of the above)

Keywords: machine learning adversarial malware smoothing robustness

Scores: [ 8 6 6 6 5 ]

Machine Learning (ML) models have been utilized for malware detection for over two decades. Consequently, this ignited an ongoing arms race between malware authors and antivirus systems, compelling researchers to propose defenses for malware-detection models against evasion attacks. However, most if not all existing defenses against evasion attacks suffer from sizable performance degradation and/or can defend against only specific attacks, which makes them less practical in real-world settings. In this work, we develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection. Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables. After showing how DRSM is theoretically robust against attacks with contiguous adversarial bytes, we verify its performance and certified robustness experimentally, where we observe only marginal accuracy drops as the cost of robustness. To our knowledge, we are the first to offer certified robustness in the realm of static detection of malware executables. More surprisingly, through evaluating DRSM against 9 empirical attacks of different types, we observe that the proposed defense is empirically robust to some extent against a diverse set of attacks, some of which even fall out of the scope of its original threat model. In addition, we collected 15.5K recent benign raw executables from diverse sources, which will be made public as a dataset called PACE (Publicly Accessible Collection(s) of Executables) to alleviate the scarcity of publicly available benign datasets for studying malware detection and provide future research with more representative data of the time.

POSTER-283: Faster Approximation of Probabilistic and Distributional Values via Least Squares

Primary Area: general machine learning (i.e., none of the above)

Keywords: data valuation probabilistic values approximation distributional Shapley

Scores: [ 8 6 8 8 ]

The family of probabilistic values, axiomatically-grounded and proposed in cooperative game theory, has recently received much attention in data valuation. However, it is often computationally expensive to compute exactly (exponential w.r.t. N, the number of data being valuated). Existing generic estimators cost O(N2ϵ2logNδ) utility evaluations to achieve an (ϵ, δ)-approximation under the 2-norm, while faster estimators have been developed recently for special cases (e.g., empirically for the Shapley value and theoretically for the Banzhaf value). In this work, based on a connection between probabilistic values and least square regressions, we propose two generic estimators for the whole family of probabilistic values that both cost O(Nϵ2logNδ) utility evaluations, largely extending the scope of this currently best complexity bound. Moreover, we show that each distributional value, proposed by Ghorbani et al. (2020) to alleviate the inconsistency of probabilistic values when using distinct databases, can also be cast as optimizing a similar least square regression. This observation makes it the first-time theoretically-grounded to train value estimators such that the distributional value of each unseen data point can be evaluated in a single forward pass. Our experiments verify the faster convergence of our proposed estimators, and demonstrate the effectiveness at learning distributional values.

POSTER-284: RobustTSF: Towards Theory and Design of Robust Time Series Forecasting with Anomalies

Primary Area: general machine learning (i.e., none of the above)

Keywords: Robust time series forecasting; learning with noisy labels

Scores: [ 6 5 5 6 ]

Time series forecasting is an important and forefront task whose techniques have been applied to electricity forecasting, trajectory prediction, labor planning, etc. However, most of time series forecasting techniques assume that the training data is clean without anomalies. This assumption is unrealistic since the collected time series data can be contaminated in practice. The forecasting model will be inferior if it is directly trained by time series with anomalies. Thus it is essential to develop methods to automatically learn a robust forecasting model from the contaminated data. In this paper, we first statistically define three types of anomalies, then theoretically and experimentally analyze the loss robustness and sample robustness when these anomalies exist. Based on our analyses, we propose a simple and efficient algorithm to learn a robust forecasting model. Extensive experiments show that our method is highly robust and outperforms all existing approaches.

POSTER-285: The Optimal Constant Solution: Predictable Extrapolation in Deep Neural Networks

Primary Area: general machine learning (i.e., none of the above)

Keywords: extrapolation OOD generalization deep neural networks decision-making

Scores: [ 8 6 6 8 ]

Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs. Our work reassesses this assumption for neural networks with high-dimensional inputs. Rather than extrapolating in arbitrary ways, we observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD. Moreover, we find that this value often closely approximates the optimal constant solution (OCS), i.e., the prediction that minimizes the average loss over the training data without observing the input. We present results showing this phenomenon across 8 datasets with different distributional shifts (including CIFAR10-C and ImageNet-R, S), different loss functions (cross entropy, MSE, and Gaussian NLL), and different architectures (CNNs and transformers). Furthermore, we present an explanation for this behavior, which we first validate empirically and then study theoretically in a simplified setting involving deep homogeneous networks with ReLU activations. Finally, we show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs.

SPOTLIGHT-59: Efficient ConvBN Blocks for Transfer Learning and Beyond

Primary Area: general machine learning (i.e., none of the above)

Keywords: transfer learning batch normalization efficient training

Scores: [ 8 6 8 8 ]

Convolution-BatchNorm (ConvBN) blocks are integral components in various computer vision tasks and other domains. A ConvBN block can operate in three modes: Train, Eval, and Deploy. While the Train mode is indispensable for training models from scratch, the Eval mode is suitable for transfer learning and beyond, and the Deploy mode is designed for the deployment of models. This paper focuses on the trade-off between stability and efficiency in ConvBN blocks: Deploy mode is efficient but suffers from training instability; Eval mode is widely used in transfer learning but lacks efficiency. To solve the dilemma, we theoretically reveal the reason behind the diminished training stability observed in the Deploy mode. Subsequently, we propose a novel Tune mode to bridge the gap between Eval mode and Deploy mode. The proposed Tune mode is as stable as Eval mode for transfer learning, and its computational efficiency closely matches that of the Deploy mode. Through extensive experiments in object detection, classification, and adversarial example generation across 5 datasets and 12 model architectures, we demonstrate that the proposed Tune mode retains the performance while significantly reducing GPU memory footprint and training time, thereby contributing efficient ConvBN blocks for transfer learning and beyond. Our method has been integrated into both PyTorch (general machine learning framework) and MMCV/MMEngine (computer vision framework). Practitioners just need one line of code to enjoy our efficient ConvBN blocks thanks to PyTorch's builtin machine learning compilers.

POSTER-286: Bridging the Gap between Binary Neural Networks and Spiking Neural Networks for Efficient Computer Vision

Primary Area: general machine learning (i.e., none of the above)

Keywords: BNN Hoyer regularizer gradient descent FLOPs object detection

Scores: [ 5 6 6 ]

Binary Neural networks (BNN) have emerged as an attractive computing paradigm for a wide range of low-power vision tasks. However, state-of-the-art (SOTA) BNNs do not yield any sparsity, and induce a significant number of non-binary operations. On the other hand, activation sparsity can be provided by spiking neural networks (SNN), that too have gained significant traction in recent times. Thanks to this sparsity, SNNs when implemented on neuromorphic hardware, have the potential to be significantly more power-efficient compared. However, SNNs incur multiple time steps to achieve close to SOTA accuracy. Ironically, this increases latency and energy---costs that SNNs were proposed to reduce---and presents itself as a major hurdle in realizing SNNs’ theoretical gains in practice. This raises an intriguing question: Can we obtain SNN-like sparsity and BNN-like accuracy and enjoy the energy-efficiency benefits of both? To answer this question, in this paper, we present a training framework for sparse binary activation neural networks (BANN) using a novel variant of the Hoyer regularizer. We estimate the threshold of each BANN layer as the Hoyer extremum of a clipped version of its activation map, where the clipping value is trained using gradient descent with our Hoyer regularizer. This approach shifts the activation values away from the threshold, thereby mitigating the effect of noise that can otherwisee degrade the BANN accuracy. Our approach outperforms existing BNNs, SNNs, and adder neural networks (that also avoid energy-expensive multiplication operations similar to BNNs and SNNs) in terms of the accuracy-FLOPs trade-off for complex image recognition tasks. Downstream experiments on object detection further demonstrate the efficacy of our approach. Lastly, we demonstrate the portability of our approach to SNNs with multiple time steps.

SPOTLIGHT-60: A Poincaré Inequality and Consistency Results for Signal Sampling on Large Graphs

Primary Area: general machine learning (i.e., none of the above)

Keywords: large-scale graphs signal sampling graphons

Scores: [ 8 8 8 6 ]

POSTER-287: FedDA: Faster Adaptive Gradient Methods for Federated Constrained Optimization

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated Learning Adaptive Gradient Methods

Scores: [ 6 6 6 ]

Federated learning (FL) is an emerging learning paradigm in which a set of distributed clients learns a task under the coordination of a central server. The FedAvg algorithm is one of the most widely used methods to solve FL problems. In FedAvg, the learning rate is a constant rather than changing adaptively. Adaptive gradient methods have demonstrated superior performance over the constant learning rate schedules in non-distributed settings, and they have recently been adapted to FL. However, the majority of these methods are designed for unconstrained settings. Meanwhile, many crucial FL applications, like disease diagnosis and biomarker identification, often rely on constrained formulations such as Lasso and group Lasso. It remains an open question as to whether adaptive gradient methods can be effectively applied to FL problems with constrains. In this work, we introduce \textbf{FedDA}, a novel adaptive gradient framework for FL. This framework utilizes a restarted dual averaging technique and is compatible with a range of gradient estimation methods and adaptive learning rate schedules. Specifically, an instantiation of our framework \textbf{FedDA-MVR} achieves gradient complexity O~(K1ϵ1.5) and communication complexity O~(K0.25ϵ1.25) for finding a stationary point ϵ in the constrained setting. We conduct experiments over both constrained and unconstrained tasks to confirm the effectiveness of our approach.

POSTER-288: Towards Universal Multi-Modal Personalization: A Language Model Empowered Generative Paradigm

Primary Area: general machine learning (i.e., none of the above)

Keywords: multimodal personalization user modeling generative modeling instruction tuning large language model

Scores: [ 8 6 5 6 ]

Developing a universal model that can effectively harness heterogeneous resources and respond to a wide range of personalized needs has been a longstanding community aspiration. Our daily choices, especially in domains like fashion and retail, are substantially shaped by multi-modal data, such as pictures and textual descriptions. These modalities not only offer intuitive guidance but also cater to personalized user preferences. However, the predominant personalization approaches mainly focus on ID or text-based recommendation problem, failing to comprehend the information spanning various tasks or modalities. In this paper, our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP), which effectively leverages multi-modal data while eliminating the complexities associated with task- and modality-specific customization. We argue that the advancements in foundational generative modeling have provided the flexibility and effectiveness necessary to achieve the objective. In light of this, we develop a generic and extensible personalization generative framework, that can handle a wide range of personalized needs including item recommendation, product search, preference prediction, explanation generation, and further user-guided image generation. Our methodology enhances the capabilities of foundational language models for personalized tasks by seamlessly ingesting interleaved cross-modal user history information, ensuring a more precise and customized experience for users. To train and evaluate the proposed multi-modal personalized tasks, we also introduce a novel and comprehensive benchmark covered a variety of user requirements. Our experiments on the real-world benchmark showcase the model's potential, outperforming competitive methods specialized for each task.

POSTER-289: Neuron-Enhanced AutoEncoder Matrix Completion: Theory and Practice

Primary Area: general machine learning (i.e., none of the above)

Keywords: autoencoder matrix completion collaborative filtering

Scores: [ 8 6 8 8 ]

POSTER-290: Language Modeling Is Compression

Primary Area: general machine learning (i.e., none of the above)

Keywords: lossless compression arithmetic coding language models scaling laws in-context learning

Scores: [ 6 6 6 6 ]

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

POSTER-291: What Makes a Good Prune? Optimal Unstructured Pruning for Maximal Cosine Similarity

Primary Area: general machine learning (i.e., none of the above)

Keywords: Pruning Deep Learning Neural Networks Interpretability Loss landscapes Optimization Kurtosis

Scores: [ 6 1 8 5 ]

Pruning is an effective method to reduce the size of deep neural network models, maintain accuracy, and, in some cases, improve the network's overall performance. However, the mechanisms underpinning pruning remain unclear. Why can different methods prune by different percentages yet achieve similar performance? Why can we not prune at the start of training? Why are some models more amenable to being pruned than others? Given a model, what is the maximum amount it can be pruned before significantly affecting the performance? This paper explores and answers these questions from the global unstructured magnitude pruning perspective with one epoch of fine-tuning. We develop the idea that cosine similarity is an effective proxy measure for functional similarity between the parent and the pruned network. We prove that the L1 pruning method is optimal when pruning by cosine similarity. We show that the higher the kurtosis of a model's parameter distribution, the more it can be pruned while maintaining performance. Finally, we present a simple method to determine the optimal amount by which a network can be L1-pruned based on its parameter distribution.

POSTER-292: How Does Wild Data Provably Help OOD Detection?

Primary Area: general machine learning (i.e., none of the above)

Keywords: out-of-distribution detection learnability

Scores: [ 6 6 8 6 ]

Using unlabeled data to regularize the machine learning models has demonstrated promise for improving safety and reliability in detecting out-of-distribution (OOD) data. Harnessing the power of unlabeled in-the-wild data is non-trivial due to the heterogeneity of both in-distribution (ID) and OOD data. This lack of a clean set of OOD samples poses significant challenges in learning an optimal OOD classifier. Currently, there is a lack of research on formally understanding how unlabeled data helps OOD detection. This paper bridges the gap by introducing a new learning framework SAL (Separate And Learn) that offers both strong theoretical guarantees and empirical effectiveness. The framework separates candidate outliers from the unlabeled data and then trains an OOD classifier using the candidate outliers and the labeled ID data. Theoretically, we provide rigorous error bounds from the lens of separability and learnability, formally justifying the two components in our algorithm. Our theory shows that SAL can separate the candidate outliers with small error rates, which leads to a generalization guarantee for the learned OOD classifier. Empirically, SAL achieves state-of-the-art performance on common benchmarks, reinforcing our theoretical insights.

POSTER-293: Jointly-Learned Exit and Inference for a Dynamic Neural Network

Primary Area: general machine learning (i.e., none of the above)

Keywords: early-exit dynamic networks efficient inference

Scores: [ 8 8 8 6 ]

Large pretrained models, coupled with fine-tuning, are slowly becoming established as the dominant architecture in machine learning. Even though these models offer impressive performance, their practical application is often limited by the prohibitive amount of resources required for every inference. Early-exiting dynamic neural networks (EDNN) circumvent this issue by allowing a model to make some of its predictions from intermediate layers (i.e., early-exit). Training an EDNN architecture is challenging as it consists of two intertwined components: the gating mechanism (GM) that controls early-exiting decisions and the intermediate inference modules (IMs) that perform inference from intermediate representations. As a result, most existing approaches rely on thresholding confidence metrics for the gating mechanism and strive to improve the underlying backbone network and the inference modules. Although successful, this approach has two fundamental shortcomings: 1) the GMs and the IMs are decoupled during training, leading to a train-test mismatch; and 2) the thresholding gating mechanism introduces a positive bias into the predictive probabilities, making it difficult to readily extract uncertainty information. We propose a novel architecture that connects these two modules. This leads to significant performance improvements on classification datasets and enables better uncertainty characterization capabilities.

POSTER-294: Rethinking Branching on Exact Combinatorial Optimization Solver: The First Deep Symbolic Discovery Framework

Primary Area: general machine learning (i.e., none of the above)

Keywords: Combinatorial Optimization Branch and Bound Deep Symbolic Optimization Learn to Optimize Machine Learning for Combinatorial Optimization

Scores: [ 6 8 6 ]

Machine learning (ML) have been shown to successfully accelerate solving NP-hard combinatorial optimization (CO) problems under the branch and bound framework. However, the high training and inference cost and limited interpretability of ML approaches severely limit their wide application to modern exact CO solvers. In contrast, human-designed policies---though widely integrated in modern CO solvers due to their compactness and reliability---can not capture data-driven patterns for higher performance. To combine the advantages of the two paradigms, we propose the first symbolic discovery framework---namely, deep symbolic discovery for exact combinatorial optimization solver (Symb4CO)---to learn high-performance symbolic policies on the branching task. Specifically, we show the potential existence of small symbolic policies empirically, employ a large neural network to search in the high-dimensional discrete space, and compile the learned symbolic policies directly for fast deployment. Experiments show that the Symb4CO learned purely CPU-based policies consistently achieve comparable performance to previous GPU-based state-of-the-art approaches. Furthermore, the appealing features of Symb4CO include its high training (ten training instances) and inference (one CPU core) efficiency and good interpretability (one-line expressions), making it simple and reliable for deployment. The results show encouraging potential for the wide deployment of ML to modern CO solvers.

POSTER-295: IRAD: Implicit Representation-driven Image Resampling against Adversarial Attacks

Primary Area: general machine learning (i.e., none of the above)

Keywords: adversarial defense resampling implicit representation

Scores: [ 5 6 6 8 ]

We introduce a novel approach to counter adversarial attacks, namely, image resampling. Image resampling transforms a discrete image into a new one, simulating the process of scene recapturing or rerendering as specified by a geometrical transformation. The underlying rationale behind our idea is that image resampling can alleviate the influence of adversarial perturbations while preserving essential semantic information, thereby conferring an inherent advantage in defending against adversarial attacks. To validate this concept, we present a comprehensive study on leveraging image resampling to defend against adversarial attacks. We have developed basic resampling methods that employ interpolation strategies and coordinate shifting magnitudes. Our analysis reveals that these basic methods can partially mitigate adversarial attacks. However, they come with apparent limitations: the accuracy of clean images noticeably decreases, while the improvement in accuracy on adversarial examples is not substantial.We propose implicit representation-driven image resampling (IRAD) to overcome these limitations. First, we construct an implicit continuous representation that enables us to represent any input image within a continuous coordinate space. Second, we introduce SampleNet, which automatically generates pixel-wise shifts for resampling in response to different inputs. Furthermore, we can extend our approach to the state-of-the-art diffusion-based method, accelerating it with fewer time steps while preserving its defense capability. Extensive experiments demonstrate that our method significantly enhances the adversarial robustness of diverse deep models against various attacks while maintaining high accuracy on clean images.

POSTER-296: Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM

Primary Area: general machine learning (i.e., none of the above)

Keywords: Speech Continuation Spoken Question Answering

Scores: [ 8 8 6 5 ]

We present a novel approach to adapting pre-trained large language models (LLMs) to perform question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text pairs, enabling a `cross-modal' chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM as demonstrated through spoken QA datasets.

POSTER-297: Ultra-sparse network advantage in deep learning via Cannistraci-Hebb brain-inspired training with hyperbolic meta-deep community-layered epitopology

Primary Area: general machine learning (i.e., none of the above)

Keywords: Sparse Training Network Science Epitopological Learning Network Automata Network Percolation Hyperbolic Network

Scores: [ 8 6 8 ]

SPOTLIGHT-61: On Bias-Variance Alignment in Deep Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: bias-variance decomposition ensemble deep learning

Scores: [ 8 6 6 8 ]

Classical wisdom in machine learning holds that the generalization error can be decomposed into bias and variance, and these two terms exhibit a \emph{trade-off}. However, in this paper, we show that for an ensemble of deep learning based classification models, bias and variance are \emph{aligned} at a sample level, where squared bias is approximately \emph{equal} to variance for correctly classified sample points. We present empirical evidence confirming this phenomenon in a variety of deep learning models and datasets. Moreover, we study this phenomenon from two theoretical perspectives: calibration and neural collapse. We first show theoretically that under the assumption that the models are well calibrated, we can observe the bias-variance alignment. Second, starting from the picture provided by the neural collapse theory, we show an approximate correlation between bias and variance.

POSTER-298: PolyVoice: Language Models for Speech to Speech Translation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Speech-to-Speecn Translatiom Audio Language Model Voice Clone

Scores: [ 8 5 3 8 ]

With the huge success of GPT models in natural language processing, there is a growing interest in applying language modeling approaches to speech tasks.Currently, the dominant architecture in speech-to-speech translation (S2ST) remains the encoder-decoder paradigm, creating a need to investigate the impact of language modeling approaches in this area. In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems. Our framework comprises three decoder-only language models: a translation language model, a duration language model, and a speech synthesis language model. These language models employ different types of prompts to extract learned information effectively. By utilizing unsupervised semantic units, our framework can transfer semantic information across these models, making it applicable even to unwritten languages. We evaluate our system on Chinese English and English Spanish language pairs. Experimental results demonstrate that PolyVoice outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality.Speech samples are available at \url{https://polyvoice.github.io}.

POSTER-299: FedWon: Triumphing Multi-domain Federated Learning Without Normalization

Primary Area: general machine learning (i.e., none of the above)

Keywords: federated learning multi-domain federated learning

Scores: [ 5 6 5 8 ]

Federated learning (FL) enhances data privacy with collaborative in-situ training on decentralized clients. Nevertheless, FL encounters challenges due to non-independent and identically distributed (non-i.i.d) data, leading to potential performance degradation and hindered convergence. While prior studies predominantly addressed the issue of skewed label distribution, our research addresses a crucial yet frequently overlooked problem known as multi-domain FL. In this scenario, clients' data originate from diverse domains with distinct feature distributions, instead of label distributions. To address the multi-domain problem in FL, we propose a novel method called Federated Learning Without Normalizations (FedWon). FedWon draws inspiration from the observation that batch normalization (BN) faces challenges in effectively modeling the statistics of multiple domains, while existing normalization techniques possess their own limitations. In order to address these issues, FedWon eliminates the normalization layers in FL and reparameterizes convolution layers with scaled weight standardization. Through extensive experimentation on five datasets and five models, our comprehensive experimental results demonstrate that FedWon surpasses both FedAvg and the current state-of-the-art method (FedBN) across all experimental setups, achieving notable accuracy improvements of more than 10% in certain domains. Furthermore, FedWon is versatile for both cross-silo and cross-device FL, exhibiting robust domain generalization capability, showcasing strong performance even with a batch size as small as 1, thereby catering to resource-constrained devices. Additionally, FedWon can also effectively tackle the challenge of skewed label distribution.

POSTER-300: Learning Polynomial Problems with SL(2,R)-Equivariance

Primary Area: general machine learning (i.e., none of the above)

Keywords: equivariance invariance polynomials non-compact special linear group data augmentation

Scores: [ 8 8 5 ]

Optimizing and certifying the positivity of polynomials are fundamental primitives across mathematics and engineering applications, from dynamical systems to operations research. However, solving these problems in practice requires large semidefinite programs, with poor scaling in dimension and degree. In this work, we demonstrate for the first time that neural networks can effectively solve such problems in a data-driven fashion, achieving tenfold speedups while retaining high accuracy. Moreover, we observe that these polynomial learning problems are equivariant to the non-compact group SL(2,R), which consists of area-preserving linear transformations. We therefore adapt our learning pipelines to accommodate this structure, including data augmentation, a new SL(2,R)-equivariant architecture, and an architecture equivariant with respect to its maximal compact subgroup, SO(2,R). Surprisingly, the most successful approaches in practice do not enforce equivariance to the entire group, which we prove arises from an unusual lack of architecture universality for SL(2,R) in particular. A consequence of this result, which is of independent interest, is that there exists an equivariant function for which there is no sequence of equivariant approximating polynomials. This is a rare example of a symmetric problem where data augmentation outperforms a fully equivariant architecture, and provides interesting lessons in both theory and practice for other problems with non-compact symmetries.

POSTER-301: Learning Large DAGs is Harder than you Think: Many Losses are Minimal for the Wrong DAG

Primary Area: general machine learning (i.e., none of the above)

Keywords: Directed Acyclic Graphs Structure Learning

Scores: [ 6 6 5 6 ]

Structure learning is a crucial task in science, especially in fields such as medicine and biology, where the wrong identification of (in)dependencies among random variables can have significant implications. The primary objective of structure learning is to learn a Directed Acyclic Graph (DAG) that represents the underlying probability distribution of the data. Many prominent DAG learners rely on least square losses or log-likelihood losses for optimization. It is well-known from regression models that least square losses are heavily influenced by the scale of the variables. Recently it has been demonstrated that the scale of data also affects performance of structure learning algorithms, though with a strong focus on linear 2-node systems and simulated data. Moving beyond these results, we provide conditions under which square-based losses are minimal for wrong DAGs in d-dimensional cases. Furthermore, we also show that scale can impair performance of structure learners if relations among variables are non-linear for both square based and log-likelihood based losses. We confirm our theoretical findings through extensive experiments on synthetic and real-world data.

POSTER-302: FedImpro: Measuring and Improving Client Update in Federated Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated Learning Deep Learning

Scores: [ 6 8 6 8 ]

Federated Learning (FL) models typically suffer from client drift caused by heterogeneous data, where data distributions vary with clients. To this end, advanced works mainly focus on manipulating exist gradients to obtain more similar client models. In this paper, we propose a different view of client drift and correct it by producing better local models. First, we analyze the generalization contribution of local training and conclude that the generalization contribution of local training is bounded by the conditional Wasserstein distance between clients' data distributions. Then, we propose FedImpro, to constructs similar conditional distributions for local training. Specifically, FedImpro decouples the model into high-level and low-level parts and trains the high-level part on reconstructed feature distributions, causing promoted generalization contribution and alleviated gradient dissimilarity of FL. Experimental results demonstrate that FedImpro can help FL defend against data heterogeneity and improve model generalization

POSTER-303: Neural Active Learning Beyond Bandits

Primary Area: general machine learning (i.e., none of the above)

Keywords: Active Learning Neural Networks

Scores: [ 6 6 8 6 ]

We study both stream-based and pool-based active learning with neural network approximations. A recent line of works proposed bandit-based approaches that transformed active learning into a bandit problem, achieving both theoretical and empirical success. However, the performance and computational costs of these methods may be susceptible to the number of classes, denoted as K, due to this transformation. Therefore, this paper seeks to answer the question: "How can we mitigate the adverse impacts of K while retaining the advantages of principled exploration and provable performance guarantees in active learning?" To tackle this challenge, we propose two algorithms based on the newly designed exploitation and exploration neural networks for stream-based and pool-based active learning. Subsequently, we provide theoretical performance guarantees for both algorithms in a non-parametric setting, demonstrating a slower error-growth rate concerning K for the proposed approaches. We use extensive experiments to evaluate the proposed algorithms, which consistently outperform state-of-the-art baselines.

POSTER-304: The Alignment Problem from a Deep Learning Perspective: A Position Paper

Primary Area: general machine learning (i.e., none of the above)

Keywords: Alignment Safety AGI position paper

Scores: [ 5 3 8 5 ]

AI systems based on deep learning have reached or surpassed human performance in a range of narrow domains. In coming decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks. In this position paper, we examine the technical difficulty of fine-tuning hypothetical AGI systems based on pretrained deep models to pursue goals that are aligned with human interests. We argue that, if trained like today's most capable models, AGI systems could learn to act deceptively to receive higher reward, learn internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. AGIs with these properties would be difficult to align and may appear aligned even when they are not.

POSTER-305: Nougat: Neural Optical Understanding for Academic Documents

Primary Area: general machine learning (i.e., none of the above)

Keywords: Visual Document Understanding Optical Character Recognition Mathematical Expression Recognition Information Extraction

Scores: [ 5 3 8 10 ]

Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human- readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.

POSTER-306: Magnitude Invariant Parametrizations Improve Hypernetwork Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: deep learning hypernetwork stability optimization

Scores: [ 5 8 6 6 ]

Hypernetworks, neural networks that predict the parameters of another neural network, are powerful models that have been successfully used in diverse applications from image generation to multi-task learning. Unfortunately, existing hypernetworks are often challenging to train. Training typically converges far more slowly than for non-hypernetwork models, and the rate of convergence can be very sensitive to hyperparameter choices. In this work, we identify a fundamental and previously unidentified problem that contributes to the challenge of training hypernetworks: a magnitude proportionality between the inputs and outputs of the hypernetwork. We demonstrate both analytically and empirically that this can lead to unstable optimization, thereby slowing down convergence, and sometimes even preventing any learning. We present a simple solution to this problem using a revised hypernetwork formulation that we call Magnitude Invariant Parametrizations (MIP). We demonstrate the proposed solution on several hypernetwork tasks, where it consistently stabilizes training and achieves faster convergence. Furthermore, we perform a comprehensive ablation study including choices of activation function, normalization strategies, input dimensionality, and hypernetwork architecture; and find that MIP improves training in all scenarios. We provide easy-to-use code that can turn existing networks into MIP-based hypernetworks.

POSTER-307: CoLiDE: Concomitant Linear DAG Estimation

Primary Area: general machine learning (i.e., none of the above)

Keywords: directed acyclic graph concomitant scale estimation causal discovery graph structure learning

Scores: [ 3 8 6 ]

We deal with the combinatorial problem of learning directed acyclic graph (DAG) structure from observational data adhering to a linear structural equation model (SEM). Leveraging advances in differentiable, nonconvex characterizations of acyclicity, recent efforts have advocated a continuous constrained optimization paradigm to efficiently explore the space of DAGs. Most existing methods employ lasso-type score functions to guide this search, which (i) require expensive penalty parameter retuning when the \emph{unknown} SEM noise variances change across problem instances; and (ii) implicitly rely on limiting homoscedasticity assumptions. In this work, we propose a new convex score function for sparsity-aware learning of linear DAGs, which incorporates concomitant estimation of scale and thus effectively decouples the sparsity parameter from noise levels. Regularization via a smooth, nonconvex acyclicity penalty term yields CoLiDE ($\textbf{Co}$ncomitant $\textbf{Li}$near $\textbf{D}$AG $\textbf{E}$stimation), a regression-based criterion amenable to efficient gradient computation and closed-form estimation of exogenous noise levels in heteroscedastic scenarios. Our algorithm outperforms state-of-the-art methods without incurring added complexity, especially when the DAGs are larger and the noise level profile is heterogeneous. We also find CoLiDE exhibits enhanced stability manifested via reduced standard deviations in several domain-specific metrics, underscoring the robustness of our novel linear DAG estimator.

ORAL-17: Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions

Primary Area: general machine learning (i.e., none of the above)

Keywords: In-context learning Transformers Large language models Boolean functions

Scores: [ 8 6 6 8 ]

In order to understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued functions. However, the limitations of Transformers in implementing learning algorithms, and their ability to learn other forms of algorithms are not well understood. Additionally, the degree to which these capabilities are confined to attention-based models is unclear. Furthermore, it remains to be seen whether the insights derived from these stylized settings can be extrapolated to pretrained Large Language Models (LLMs). In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks. Additionally, we find that certain attention-free models perform (almost) identically to Transformers on a range of tasks. (b) When provided a teaching sequence, i.e. a set of examples that uniquely identifies a function in a class, we show that Transformers learn more sample-efficiently. Interestingly, our results show that Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples. (c) Lastly, we show that extant LLMs, e.g. LLaMA-2, GPT-4, can compete with nearest-neighbor baselines on prediction tasks that are guaranteed to not be in their training set.

POSTER-308: G$^2$N$^2$ : Weisfeiler and Lehman go grammatical

Primary Area: general machine learning (i.e., none of the above)

Keywords: Machine learning GNN Graph Learning

Scores: [ 8 6 6 ]

This paper introduces a framework for formally establishing a connection between a portion of an algebraic language and a Graph Neural Network (GNN). The framework leverages Context-Free Grammars (CFG) to organize algebraic operations into generative rules that can be translated into a GNN layer model. As CFGs derived directly from a language tend to contain redundancies in their rules and variables, we present a grammar reduction scheme. By applying this strategy, we define a CFG that conforms to the third-order Weisfeiler-Lehman (3-WL) test using the matricial language MATLANG. From this 3-WL CFG, we derive a GNN model, named G$^2$N$^2$, which is provably 3-WL compliant. Through various experiments, we demonstrate the superior efficiency of G$^2$N$^2$ compared to other 3-WL GNNs across numerous downstream tasks. Specifically, one experiment highlights the benefits of grammar reduction within our framework.

SPOTLIGHT-62: Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy

Primary Area: general machine learning (i.e., none of the above)

Keywords: Sparse Mixture-of-Experts Efficiency Merging Compression

Scores: [ 8 8 3 ]

Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks, however, they have issues like: (a) High Memory Usage, due to duplication of the network layers into multiple copies as experts; and (b) Redundancy in Experts, as common learning-based routing policies suffer from representational collapse. Therefore, vanilla SMoE models are memory inefficient and non-scalable, especially for resource-constrained downstream scenarios. In this paper, we ask: Can we craft a compact SMoE model by consolidating expert information? What is the best recipe to merge multiple experts into fewer but more knowledgeable experts? Our pilot investigation reveals that conventional model merging methods fail to be effective in such expert merging for SMoE. The potential reasons are: (1) redundant information overshadows critical experts; (2) appropriate neuron permutation for each expert is missing to bring all of them in alignment. To address these challenges, we propose a novel merging algorithm for SMoE, i.e., M-SMoE, which leverages routing statistics to guide expert merging. Specifically, it starts with neuron permutation alignment for experts; then, dominant experts and their "group members" are formed based on routing policies; lastly, every expert group is merged into a single expert by utilizing each expert's activation frequency as their weight for merging, thus diminishing the impact of insignificant experts. Moreover, we draw an interesting observation that our proposed merging promotes a low dimensionality in the merged expert's weight space, naturally paving the way for additional compression. Hence, our final method, MC-SMoE (i.e., Merge, then Compress SMoE), further decomposes the merged experts into low-rank and structural sparse alternatives. Extensive experiments across 8 benchmarks validate the effectiveness of our proposals. For instance, our MC-SMoE achieves up to 80% memory and a 20% FLOPs reduction, with virtually no loss in performance. Our code is provided as supplementary material.

POSTER-309: ArchLock: Locking DNN Transferability at the Architecture Level with a Zero-Cost Binary Predictor

Primary Area: general machine learning (i.e., none of the above)

Keywords: Defense; DNN Transferability; Neural Architecture Search

Scores: [ 3 8 6 ]

Deep neural network (DNN) models, despite their impressive performance, are vulnerable to exploitation by attackers who attempt to adapt them to other tasks for their own benefit. Current defense strategies mainly address this vulnerability at the model parameter level, leaving the potential of architectural-level defense largely unexplored. This paper, for the first time, addresses the issue of model protection by reducing transferability at the architecture level. Specially, we present a novel neural architecture search (NAS)-enabled algorithm that employs zero-cost proxies and evolutionary search, to design model architectures with low transferability. Our method, namely ArchLock, aims to achieve high performance on the source task, while degrading the performance on target tasks, i.e., locking the transferability of a DNN model.To achieve efficient cross-task search without having access to the training data owned by the attackers, we utilize zero-cost proxies to speed up architecture evaluation and simulate potential target task embeddings to assist cross-task search with a binary performance predictor. Extensive experiments on NAS-Bench-201 and TransNAS-Bench-101 demonstrate that ArchLock reduces transferability by up to 30% and 50%, respectively, with negligible performance degradation on source tasks (<2%).

POSTER-310: DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Neural Architecture Search Diffusion Model Meta Learning

Scores: [ 6 6 6 5 ]

Existing NAS methods suffer from either an excessive amount of time for repetitive sampling and training of many task-irrelevant architectures. To tackle such limitations of existing NAS methods, we propose a paradigm shift from NAS to a novel conditional Neural Architecture Generation (NAG) framework based on diffusion models, dubbed DiffusionNAG. Specifically, we consider the neural architectures as directed graphs and propose a graph diffusion model for generating them. Moreover, with the guidance of parameterized predictors, DiffusionNAG can flexibly generate task-optimal architectures with the desired properties for diverse tasks, by sampling from a region that is more likely to satisfy the properties. This conditional NAG scheme is significantly more efficient than previous NAS schemes which sample the architectures and filter them using the property predictors. We validate the effectiveness of DiffusionNAG through extensive experiments in two predictor-based NAS scenarios: Transferable NAS and Bayesian Optimization (BO)-based NAS. DiffusionNAG achieves superior performance with speedups of up to 20× when compared to the baselines on Transferable NAS benchmarks. Furthermore, when integrated into a BO-based algorithm, DiffusionNAG outperforms existing BO-based NAS approaches, particularly in the large MobileNetV3 search space on the ImageNet 1K dataset.

POSTER-311: Teaching Arithmetic to Small Transformers

Primary Area: general machine learning (i.e., none of the above)

Keywords: Machine Learning Transformers Large Language Models

Scores: [ 6 5 6 5 8 ]

Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and parameter scaling. Additionally, we discuss the challenges associated with length generalization. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction loss for rapidly eliciting arithmetic capabilities.

POSTER-312: Learning to Reject with a Fixed Predictor: Application to Decontextualization

Primary Area: general machine learning (i.e., none of the above)

Keywords: Rejection abstention loss function consistency learning theory decontextualization natural language processing

Scores: [ 6 6 6 ]

We study the problem of classification with a reject option for a fixed predictor, crucial to natural language processing. We introduce a new problem formulation for this scenario, and an algorithm minimizing a new surrogate loss function. We provide a complete theoretical analysis of the surrogate loss function with a strong H-consistency guarantee. For evaluation, we choose the \textit{decontextualization} task, and provide a manually-labelled dataset of 2,000 examples. Our algorithm significantly outperforms the baselines considered, with a 25% improvement in coverage when halving the error rate, which is only 3% away from the theoretical limit.

POSTER-313: Spurious Feature Diversification Improves Out-of-distribution Generalization

Primary Area: general machine learning (i.e., none of the above)

Keywords: Out-of-distribution generalization

Scores: [ 8 8 5 ]

ORAL-18: Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization

Primary Area: general machine learning (i.e., none of the above)

Keywords: game theory stochastic optimization nash equilbrium normal-form game x-armed bandits

Scores: [ 8 8 8 ]

We propose the first, to our knowledge, loss function for approximate Nash equilibria of normal-form games that is amenable to unbiased Monte Carlo estimation. This construction allows us to deploy standard non-convex stochastic optimization techniques for approximating Nash equilibria, resulting in novel algorithms with provable guarantees. We complement our theoretical analysis with experiments demonstrating that stochastic gradient descent can outperform previous state-of-the-art approaches.

SPOTLIGHT-63: FITS: Modeling Time Series with 10k Parameters

Primary Area: general machine learning (i.e., none of the above)

Keywords: Time series analysis Time series forecasting Complex-valued neural network

Scores: [ 8 8 8 8 ]

In this paper, we introduce FITS, a lightweight yet powerful model for time series analysis. Unlike existing models that directly process raw time-domain data, FITS operates on the principle that time series can be manipulated through interpolation in the complex frequency domain, achieving performance comparable to state-of-the-art models for time series forecasting and anomaly detection tasks. Notably, FITS accomplishes this with a svelte profile of just about 10k parameters, making it ideally suited for edge devices and paving the way for a wide range of applications. The code is available for review at: \url{https://anonymous.4open.science/r/FITS}.

POSTER-314: Transferring Learning Trajectories of Neural Networks

Primary Area: general machine learning (i.e., none of the above)

Keywords: neural networks learning dynamics permutation symmetry loss landscape

Scores: [ 6 8 6 ]

Training deep neural networks (DNNs) is computationally expensive, which is problematic especially when performing duplicated or similar training runs in model ensemble or fine-tuning pre-trained models, for example. Once we have trained one DNN on some dataset, we have its learning trajectory (i.e., a sequence of intermediate parameters during training) which may potentially contain useful information for learning the dataset. However, there has been no attempt to utilize such information of a given learning trajectory for another training. In this paper, we formulate the problem of "transferring" a given learning trajectory from one initial parameter to another one (named learning transfer problem) and derive the first algorithm to approximately solve it by matching gradients successively along the trajectory via permutation symmetry. We empirically show that the transferred parameters achieve non-trivial accuracy before any direct training, and can be trained significantly faster than training from scratch.

ORAL-19: Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement

Primary Area: general machine learning (i.e., none of the above)

Keywords: language model natural language processing inductive reasoning

Scores: [ 8 8 8 8 ]

The ability to derive the underlying principles from a handful of observations and then generalize to novel situations---known as inductive reasoning---is central to human intelligence. Prior work suggests that language models (LMs) often fall short on inductive reasoning, despite achieving impressive success on research benchmarks. In this work, we conduct a systematic study of the inductive reasoning capabilities of LMs through iterative hypothesis refinement, a technique that more closely mirrors the human inductive process than standard input-output prompting. Iterative hypothesis refinement employs a three-step process: proposing, selecting, and refining hypotheses in the form of textual rules. By examining the intermediate rules, we observe that LMs are phenomenal hypothesis proposers (i.e., generating candidate rules), and when coupled with a (task-specific) symbolic interpreter that is able to systematically filter the proposed set of rules, this hybrid approach achieves strong results across inductive reasoning benchmarks that require inducing causal relations, language-like instructions, and symbolic concepts. However, they also behave as puzzling inductive reasoners, showing notable performance gaps in rule induction (i.e., identifying plausible rules) and rule application (i.e., applying proposed rules to instances), suggesting that LMs are proposing hypotheses without being able to actually apply the rules. Through extensive empirical and human analyses, we further reveal several discrepancies between the inductive reasoning processes of LMs and humans, shedding light on both the potentials and limitations of using LMs in inductive reasoning tasks.

POSTER-315: Understanding In-Context Learning from Repetitions

Primary Area: general machine learning (i.e., none of the above)

Keywords: In-context learning Large language models Repetition

Scores: [ 6 5 5 6 ]

This paper explores the elusive mechanism underpinning in-context learning in Large Language Models (LLMs). Our work provides a novel perspective by examining in-context learning via the lens of surface repetitions. We quantitatively investigate the role of surface features in text generation, and empirically establish the existence of token co-occurrence reinforcement, a principle that strengthens the relationship between two tokens based on their contextual co-occurrences. By investigating the dual impacts of these features, our research illuminates the internal workings of in-context learning and expounds on the reasons for its failures. This paper provides an essential contribution to the understanding of in-context learning and its potential limitations, providing a fresh perspective on this exciting capability.

POSTER-316: Learning model uncertainty as variance-minimizing instance weights

Primary Area: general machine learning (i.e., none of the above)

Keywords: loss reweighting epistemic uncertainty bi-level optimization model calibration bayesian neural networks

Scores: [ 6 6 8 ]

Predictive uncertainty--a model’s self-awareness regarding its accuracy on an input--is key for both building robust models via training interventions and for test-time applications such as selective classification. We propose a novel instance-conditional reweighting approach that captures predictive uncertainty using an auxiliary network, and unifies these train- and test-time applications. The auxiliary network is trained using a meta-objective in a bilevel optimization framework. A key contribution of our proposal is the meta-objective of minimizing dropout variance, an approximation of Bayesian predictive uncertainty, We show in controlled experiments that we effectively capture diverse specific notions of uncertainty through this meta-objective, while previous approaches only capture certain aspects. These results translate to significant gains in real-world settings–selective classification, label noise, domain adaptation, calibration–and across datasets–Imagenet, Cifar100, diabetic retinopathy, Camelyon, WILDs, Imagenet-C,-A,-R, Clothing-1.6M, etc. For Diabetic Retinopathy, we see upto 3.4%/3.3% accuracy & AUC gains over SOTA in selective classification. We also improve upon large-scale pretrained models such as PLEX.

POSTER-317: Supervised Knowledge Makes Large Language Models Better In-context Learners

Primary Area: general machine learning (i.e., none of the above)

Keywords: In-context Learning; Out-of-distribution Generalization; Reliability

Scores: [ 5 5 5 ]

Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The recent progress in large-scale generative models has further expanded their use in real-world language applications. However, the critical challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. While previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-specific fine-tuned Language Models (SLMs) to improve LLMs' in-context learning during the inference stage. Our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. Using our proposed plug-in method, enhanced versions of Llama 2 and ChatGPT surpass their original versions regarding generalizability and factuality. We offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and LLM outputs across 9 distinct tasks. Our empirical analysis sheds light on the advantages of incorporating discriminative models into LLMs and highlights the potential of our methodology in fostering more reliable LLMs.

ORAL-20: The mechanistic basis of data dependence and abrupt learning in an in-context classification task

Primary Area: general machine learning (i.e., none of the above)

Keywords: in-context learning mechanistic interpretability language models induction heads

Scores: [ 8 8 10 10 ]

Transformer models exhibit in-context learning: the ability to accurately predict the response to a novel query based on illustrative examples in the input sequence, which contrasts with traditional in-weights learning of query-output relationships. What aspects of the training data distribution and architecture favor in-context vs in-weights learning? Recent work has shown that specific distributional properties inherent in language, such as burstiness, large dictionaries and skewed rank-frequency distributions, control the trade-off or simultaneous appearance of these two forms of learning. We first show that these results are recapitulated in a minimal attention-only network trained on a simplified dataset. In-context learning (ICL) is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning. By identifying progress measures that precede in-context learning and targeted experiments, we construct a two-parameter model of an induction head which emulates the full data distributional dependencies displayed by the attention-based network. A phenomenological model of induction head formation traces its abrupt emergence to the sequential learning of three nested logits enabled by an intrinsic curriculum. We propose that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL, which is implemented by nested nonlinearities sequentially learned during training.

POSTER-318: Interpretable Sparse System Identification: Beyond Recent Deep Learning Techniques on Time-Series Prediction

Primary Area: general machine learning (i.e., none of the above)

Keywords: time series sparse system identification long term prediction

Scores: [ 8 8 6 ]

POSTER-319: Understanding prompt engineering may not require rethinking generalization

Primary Area: general machine learning (i.e., none of the above)

Keywords: generalization prompt engineering PAC-Bayes foundation models

Scores: [ 6 8 8 8 ]

Zero-shot learning in prompted vision-language models, the practice of crafting prompts to build classifiers without an explicit training process, has achieved impressive performance in many settings. This success presents a seemingly surprising observation: these methods suffer relatively little from overfitting, i.e., when a prompt is manually engineered to achieve low error on a given training set (thus rendering the method no longer actually zero-shot), the approach still performs well on held-out test data. In this paper, we show that we can explain such performance well via recourse to classical PAC-Bayes bounds. Specifically, we show that the discrete nature of prompts, combined with a PAC-Bayes prior given by a language model, results in generalization bounds that are remarkably tight by the standards of the literature: for instance, the generalization bound of an ImageNet classifier is often within a few percentage points of the true test error. We demonstrate empirically that this holds for existing handcrafted prompts and prompts generated through simple greedy search. Furthermore, the resulting bound is well-suited for model selection: the models with the best bound typically also have the best test performance. This work thus provides a possible justification for the widespread practice of "prompt engineering," even if it seems that such methods could potentially overfit the training data.

POSTER-320: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large Language Models Hallucination Detection LogDet of Covariance Matrix Eigenvalues Feature Clipping

Scores: [ 8 6 6 6 ]

POSTER-321: Effective Data Augmentation With Diffusion Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Deep Learning

Scores: [ 6 8 6 8 ]

Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those from classification, generative models, and representation learning. The standard approach to data augmentation combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along key semantic axes present in the data. Current augmentations cannot alter the high-level semantic attributes, such as animal species present in a scene, to enhance the diversity of data. We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains.

POSTER-322: A Stochastic Centering Framework for Improving Calibration in Graph Neural Networks

Primary Area: general machine learning (i.e., none of the above)

Keywords: Graph Neural Networks Uncertainity Quantification Calibration

Scores: [ 6 5 6 ]

While graph neural networks (GNNs) are widely used for node and graph representation learning tasks, the reliability of GNN uncertainty estimates under distribution shifts remains relatively under-explored. Indeed, while \textit{post-hoc} calibration strategies can be used to improve in-distribution calibration, they need not also improve calibration under distribution shift. However, techniques which produce GNNs with better \textit{intrinsic} uncertainty estimates are particularly valuable, as whey can always be combined with post-hoc strategies later. Therefore, in this work, we propose G-$\Delta$UQ, a novel training framework designed to improve intrinsic GNN uncertainty estimates. Our framework adapts the principle of stochastic data centering to graph data through novel graph anchoring strategies, and is able to support partially stochastic GNNs. While, the prevalent wisdom is that fully stochastic networks are necessary to obtain reliable estimates, we find that the structure induced by our anchoring strategies renders this unnecessary and allows us to support \gduq~ on pretrained models. Indeed, through extensive evaluation under covariate, concept and graph size shifts, we show that G-$\Delta$UQ leads to better calibrated GNNs for node and graph classification. Further, it also improves performance on other uncertainty-based tasks like out-of-distribution detection and generalization gap estimation. Overall, our work provides insights into uncertainty estimation for GNNs, and demonstrates the utility of G-$\Delta$UQ in obtaining reliable estimates.

POSTER-323: Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning

Primary Area: general machine learning (i.e., none of the above)

Keywords: large language models knowledge graphs reasoning

Scores: [ 8 6 8 8 ]

Large language models (LLMs) have demonstrated impressive reasoning abilities in complex tasks. However, they lack up-to-date knowledge and experience hallucinations during reasoning, which can lead to incorrect reasoning processes and diminish their performance and trustworthiness. Knowledge graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. Nevertheless, existing KG-based LLM reasoning methods only treat KGs as factual knowledge bases and overlook the importance of their structural information for reasoning. In this paper, we propose a novel method called reasoning on graphs (RoG) that synergizes LLMs with KGs to enable faithful and interpretable reasoning. Specifically, we present a planning-retrieval-reasoning framework, where RoG first generates relation paths grounded by KGs as faithful plans. These plans are then used to retrieve valid reasoning paths from the KGs for LLMs to conduct faithful reasoning. Furthermore, RoG not only distills knowledge from KGs to improve the reasoning ability of LLMs through training but also allows seamless integration with any arbitrary LLMs during inference. Extensive experiments on two benchmark KGQA datasets demonstrate that RoG achieves state-of-the-art performance on KG reasoning tasks and generates faithful and interpretable reasoning results.

POSTER-324: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Primary Area: general machine learning (i.e., none of the above)

Keywords: large language model evolutionary algorithm prompt engineering natural language processing

Scores: [ 6 6 8 6 ]

Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.

POSTER-325: Sparse Weight Averaging with Multiple Particles for Iterative Magnitude Pruning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Unstructured pruning Sparse neural networks Lottery ticket hypothesis Weight averaging Bayesian neural networks

Scores: [ 6 6 6 8 6 ]

Given the ever-increasing size of modern neural networks, the significance of sparse architectures has surged due to their accelerated inference speeds and minimal memory demands. When it comes to global pruning techniques, Iterative Magnitude Pruning (IMP) still stands as a state-of-the-art algorithm despite its simple nature, particularly in extremely sparse regimes. In light of the recent finding that the two successive matching IMP solutions are linearly connected without a loss barrier, we propose Sparse Weight Averaging with Multiple Particles (SWAMP), a straightforward modification of IMP that achieves performance comparable to an ensemble of two IMP solutions. For every iteration, we concurrently train multiple sparse models, referred to as particles, using different batch orders yet the same matching ticket, and then weight average such models to produce a single mask. We demonstrate that our method consistently outperforms existing baselines across different sparsities through extensive experiments on various neural network structures and data.

POSTER-326: Balancing Act: Sparse Models with Constrained Disparate Impact

Primary Area: general machine learning (i.e., none of the above)

Keywords: deep learning sparsity disparate impact constrained optimization pruning fairness

Scores: [ 6 8 6 5 ]

POSTER-327: Smooth ECE: Principled Reliability Diagrams via Kernel Smoothing

Primary Area: general machine learning (i.e., none of the above)

Keywords: calibration reliability ECE theory

Scores: [ 6 6 6 8 ]

Calibration measures and reliability diagrams are two fundamental tools for measuring and interpreting the calibration of probabilistic predictors. Calibration measures quantify the degree of miscalibration, and reliability diagrams visualize the structure of this miscalibration. However, the most common constructions of reliability diagrams and calibration measures --- binning and ECE --- both suffer from well-known flaws (e.g. discontinuity). We show that a simple modification fixes both constructions: first smooth the observations using an RBF kernel, then compute the Expected Calibration Error (ECE) of this smoothed function. We prove that with a careful choice of bandwidth, this method yields a calibration measure that is well-behaved in the sense of (Blasiok, Gopalan, Hu, and Nakkiran 2023) --- a consistent calibration measure. We call this measure the SmoothECE. Moreover, the reliability diagram obtained from this smoothed function visually encodes the SmoothECE, just as binned reliability diagrams encode the BinnedECE. We also provide a Python package with simple, hyperparameter-free methods for measuring and plotting calibration (not linked to preserve anonymity).

POSTER-328: Consistent algorithms for multi-label classification with macro-at-k metrics

Primary Area: general machine learning (i.e., none of the above)

Keywords: multi-label classification complex performance metrics macro-at-k extreme classification

Scores: [ 8 8 6 8 ]

We consider the optimization of complex performance metrics in multi-label classification under the population utility framework. We mainly focus on metrics linearly decomposable into a sum of binary classification utilities applied separately to each label with an additional requirement of exactly k labels predicted for each instance. These macro-at-k'' metrics possess desired properties for extreme classification problems with long tail labels. Unfortunately, the at-k constraint couples the otherwise independent binary classification tasks, leading to a much more challenging optimization problem than standard macro-averages. We provide a statistical framework to study this problem, prove the existence and the form of the optimal classifier, and propose a statistically consistent and practical learning algorithm based on the Frank-Wolfe method. Interestingly, our main results concern even more general metrics being non-linear functions of label-wise confusion matrices. Empirical results provide evidence for the competitive performance of the proposed approach.

POSTER-329: GRANDE: Gradient-Based Decision Tree Ensembles

Primary Area: general machine learning (i.e., none of the above)

Keywords: Tabular Data; Architectures; Ensembles; Decision Trees; Gradient Descent

Scores: [ 6 8 6 6 ]

Despite the success of deep learning for text and image data, tree-based ensemble models are still state-of-the-art for machine learning with heterogeneous tabular data. However, there is a significant need for tabular-specific gradient-based methods due to their high flexibility. In this paper, we propose GRANDE, $\text{GRA}die\text{N}$t-Based $\text{D}$ecision Tree $\text{E}$nsembles, a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent. GRANDE is based on a dense representation of tree ensembles, which affords to use backpropagation with a straight-through operator to jointly optimize all model parameters. Our method combines axis-aligned splits, which is a useful inductive bias for tabular data, with the flexibility of gradient-based optimization. Furthermore, we introduce an advanced instance-wise weighting that facilitates learning representations for both, simple and complex relations, within a single model. We conducted an extensive evaluation on a predefined benchmark with 19 classification datasets and demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets.

SPOTLIGHT-64: Quasi-Monte Carlo for 3D Sliced Wasserstein

Primary Area: general machine learning (i.e., none of the above)

Keywords: Sliced Wasserstein Monte Carlo Methods Point-Cloud Quasi-Monte Carlo Optimal Transport

Scores: [ 8 8 6 8 ]

Monte Carlo (MC) integration has been employed as the standard approximation method for the Sliced Wasserstein (SW) distance, whose analytical expression involves an intractable expectation. However, MC integration is not optimal in terms of absolute approximation error. To provide a better class of empirical SW, we propose quasi-sliced Wasserstein (QSW) approximations that rely on Quasi-Monte Carlo (QMC) methods. For a comprehensive investigation of QMC for SW, we focus on the 3D setting, specifically computing the SW between probability measures in three dimensions. In greater detail, we empirically evaluate various methods to construct QMC point sets on the 3D unit-hypersphere, including the Gaussian-based and equal area mappings, generalized spiral points, and optimizing discrepancy energies. Furthermore, to obtain an unbiased estimator for stochastic optimization, we extend QSW to Randomized Quasi-Sliced Wasserstein (RQSW) by introducing randomness in the discussed point sets. Theoretically, we prove the asymptotic convergence of QSW and the unbiasedness of RQSW. Finally, we conduct experiments on various 3D tasks, such as point-cloud comparison, point-cloud interpolation, image style transfer, and training deep point-cloud autoencoders, to demonstrate the favorable performance of the proposed QSW and RQSW variants.

POSTER-330: Fine-Tuning Language Models for Factuality

Primary Area: general machine learning (i.e., none of the above)

Keywords: factuality hallucination language model dpo

Scores: [ 6 6 5 6 ]

The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. However, language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations', which can harmfully perpetuate myths and misconceptions. Further, manual fact-checking of model responses is a time-consuming process, making human factuality labels expensive to acquire. In this work, we leverage two key recent innovations in NLP to fine-tune language models to be more factual without human labeling, targeting more open-ended generation settings than past work. First, several recent works have proposed methods for scoring the factuality of open-ended text derived from consistency with an external knowledge base or simply a large model's confidence scores. Second, the Direct Preference Optimization algorithm enables straightforward fine-tuning of language models on objectives other than supervised imitation, using a preference ranking over possible model responses. We show that learning from preference rankings generated by either automated criterion significantly improves the factuality of Llama-2 on held-out topics (percent of generated claims that are correct) compared with existing RLHF procedures or decoding strategies targeted at factuality, showing over 50% and 20-30% error reduction for biographies and medical questions respectively.

POSTER-331: Tree Cross Attention

Primary Area: general machine learning (i.e., none of the above)

Keywords: Attention Retrieval Tree

Scores: [ 5 5 8 8 ]

POSTER-332: Understanding Reconstruction Attacks with the Neural Tangent Kernel and Dataset Distillation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Dataset Distillation Reconstruction Attacks Neural Tangent Kernel

Scores: [ 6 8 6 ]

Modern deep learning requires large volumes of data, which could contain sensitive or private information that cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effective regime which datapoints are susceptible to reconstruction. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover the \emph{entire training set} in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. Next, we study the nature of easily-reconstructed images. We show that both theoretically and empirically, reconstructed images tend to outliers'' in the dataset, and that these reconstruction attacks can be used for \textit{dataset distillation}, that is, we can retrain on reconstructed images and obtain high predictive accuracy.

POSTER-333: Provable Out-of-Distribution Generalization in Hypersphere

Primary Area: general machine learning (i.e., none of the above)

Keywords: OOD Generalization Hypersphere

Scores: [ 3 6 6 ]

Out-of-distribution (OOD) generalization is critical for machine learning models deployed in the real world. However, achieving this can be fundamentally challenging, as it requires the ability to learn invariant features across different domains or environments. In this paper, we propose a novel framework for OOD generalization, which provably learns domain-invariant representations in a hyperspherical space. In particular, our hyperspherical learning algorithm is guided by intra-class variation and inter-class separation principles---ensuring that features from the same class (across different training domains) are closely aligned with their class prototypes, while different class prototypes are maximally separated. We also provide theoretical justifications on how our prototypical learning objective improves the OOD generalization bound. Through extensive experiments on challenging OOD benchmarks, we demonstrate that our approach outperforms competitive baselines and achieves superior performance.

POSTER-334: FedLoGe: Joint Local and Generic Federated Learning under Long-tailed Data

Primary Area: general machine learning (i.e., none of the above)

Keywords: Personalized federated learning Data heterogeneity Long-tailed Learning

Scores: [ 5 6 8 5 ]

POSTER-335: Exploring the Promise and Limits of Real-Time Recurrent Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: recurrent neural networks real-time recurrent learning online recurrent learning reinforcement learning actor-critic policy gradients

Scores: [ 6 6 6 8 ]

Real-time recurrent learning (RTRL) for sequence-processing recurrent neural networks (RNNs) offers certain conceptual advantages over backpropagation through time (BPTT). RTRL requires neither caching past activations nor truncating context, and enables online learning. However, RTRL's time and space complexity make it impractical. To overcome this problem, most recent work on RTRL focuses on approximation theories, while experiments are often limited to diagnostic settings. Here we explore the practical promise of RTRL in more realistic settings. We study actor-critic methods that combine RTRL and policy gradients, and test them in several subsets of DMLab-30, ProcGen, and Atari-2600 environments. On DMLab memory tasks, our system trained on fewer than 1.2B environmental frames is competitive with or outperforms well-known IMPALA and R2D2 baselines trained on 10B frames. To scale to such challenging tasks, we focus on certain well-known neural architectures with element-wise recurrence, allowing for tractable RTRL without approximation. Importantly, we also discuss rarely addressed limitations of RTRL in real-world applications, such as its complexity in the multi-layer case.

SPOTLIGHT-65: ODEFormer: Symbolic Regression of Dynamical Systems with Transformers

Primary Area: general machine learning (i.e., none of the above)

Keywords: symbolic regression dynamical systems differential equations transformer

Scores: [ 8 3 8 ]

We introduce ODEFormer, the first transformer able to infer multidimensional ordinary differential equation (ODE) systems in symbolic form from the observation of a single solution trajectory. We perform extensive evaluations on two datasets: (i) the existing ‘Strogatz’ dataset featuring two-dimensional systems; (ii) ODEBench, a collection of one- to four-dimensional systems that we carefully curated from the literature to provide a more holistic benchmark. ODEFormer consistently outperforms existing methods while displaying substantially improved robustness to noisy and irregularly sampled observations, as well as faster inference. We release our code, model and benchmark dataset publicly.

POSTER-336: TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series

Primary Area: general machine learning (i.e., none of the above)

Keywords: Time Series Embedding LLM

Scores: [ 6 8 5 5 ]

This work summarizes two ways to accomplish Time-Series (TS) tasks in today's Large Language Model (LLM) context: LLM-for-TS (model-centric) designs and trains a fundamental large model, or fine-tunes a pre-trained LLM for TS data; TS-for-LLM (data-centric) converts TS into a model-friendly representation to enable the pre-trained LLM to handle TS data. Given the lack of data, limited resources, semantic context requirements, and so on, this work focuses on TS-for-LLM, where we aim to activate LLM's ability for TS data by designing a TS embedding method suitable for LLM. The proposed method is named TEST. It first tokenizes TS, builds an encoder to embed TS via instance-wise, feature-wise, and text-prototype-aligned contrast, where the TS embedding space is aligned to LLM’s embedding layer space, then creates soft prompts to make LLM more open to that embeddings, and finally implements TS tasks using the frozen LLM. We also demonstrate the feasibility of TS-for-LLM through theory and experiments. Experiments are carried out on TS classification, forecasting, and representation tasks using eight frozen LLMs with various structures and sizes. The results show that the pre-trained LLM with TEST strategy can achieve better or comparable performance than today's SOTA TS models, and offers benefits for few-shot and generalization. By treating LLM as the pattern machine, TEST can endow LLM's ability to process TS data without compromising language ability. We hope that this study will serve as a foundation for future work to support TS+LLM progress.

SPOTLIGHT-66: Spatially-Aware Transformers for Embodied Agents

Primary Area: general machine learning (i.e., none of the above)

Keywords: Episodic Memory Spatial Inference Prediction Generation Reinforcement Learning

Scores: [ 6 8 6 8 ]

Episodic memory plays a crucial role in various cognitive processes, such as the ability to mentally recall past events. While cognitive science emphasizes the significance of spatial context in the formation and retrieval of episodic memory, the current primary approach to implementing episodic memory in AI systems is through transformers that store temporally ordered experiences, which overlooks the spatial dimension. As a result, it is unclear how the underlying structure could be extended to incorporate the spatial axis beyond temporal order alone and thereby what benefits can be obtained. To address this, this paper explores the use of Spatially-Aware Transformer models that incorporate spatial information. These models enable the creation of place-centric episodic memory that considers both temporal and spatial dimensions. Adopting this approach, we demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks. Additionally, we propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning that aims to optimize efficiency of memory utilization. Our experiments demonstrate the advantages of our proposed model in various environments and across multiple downstream tasks, including prediction, generation, reasoning, and reinforcement learning. The source code for our models and experiments will be available at \href{https://github.com/spatially_aware_transformer}{https://github.com/spatially_aware_transformer}.

POSTER-337: Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Post-Training Pruning Combinatorial Optimization Large Language Models Inference Acceleration

Scores: [ 8 6 6 6 ]

With the rapid growth of large language models (LLMs), there is increasing demand for memory and computation for LLMs. Recent efforts on post-training pruning of LLMs aim to reduce the model size and computation, yet the performance is still sub-optimal. In this paper, we present a plug-and-play solution for post-training pruning of LLMs.The proposed solution has two innovative components: 1) Relative Importance and Activations (RIA), a new pruning metric that jointly considers the weight and activations efficiently on LLMs; and 2) Channel Permutation, a new approach to maximally preserve important weights under N:M sparsity.The proposed two components can be readily combined to further enhance the N:M structuredly pruned LLMs.Our empirical experiments show that RIA alone can already surpass all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA ranging from 7B to 65B. Furthermore, N:M structured pruning with channel permutation can even outperform the original LLaMA2 70B on zero-shot tasks, together with practical speed-up on specific hardware.

POSTER-338: GeoLLM: Extracting Geospatial Knowledge from Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: geospatial knowledge extraction large language models pre-trained language models natural language processing

Scores: [ 6 5 8 6 3 ]

The application of machine learning (ML) in a range of geospatial tasks is increasingly common but often relies on globally available covariates such as satellite imagery that can either be expensive or lack predictive power.Here we explore the question of whether the vast amounts of knowledge found in Internet language corpora, now compressed within large language models (LLMs), can be leveraged for geospatial prediction tasks. We first demonstrate that LLMs embed remarkable spatial information about locations, but naively querying LLMs using geographic coordinates alone is ineffective in predicting key indicators like population density. We then present GeoLLM, a novel method that can effectively extract geospatial knowledge from LLMs with auxiliary map data from OpenStreetMap.We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.Across these tasks, our method demonstrates a 70% improvement in performance (measured using Pearson's r2) relative to baselines that use nearest neighbors or use information directly from the prompt, and performance equal to or exceeding satellite-based benchmarks in the literature.With GeoLLM, we observe that GPT-3.5 outperforms Llama 2 and RoBERTa by 19% and 51% respectively, suggesting that the performance of our method scales well with the size of the model and its pretraining dataset.Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.Crucially, GeoLLM shows promise in mitigating the limitations of existing geospatial covariates and complementing them well.

POSTER-339: Querying Easily Flip-flopped Samples for Deep Active Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: active learning uncertainty closeness disagree metric diversity

Scores: [ 6 6 8 10 ]

Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data. One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is. The sample's distance to the decision boundary is a natural measure of predictive uncertainty, but it is often intractable to compute, especially for complex decision boundaries formed in multiclass classification tasks.To address this issue, this paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label, and an estimator for LDM proven to be asymptotically consistent under mild assumptions. The estimator is computationally efficient and can be easily implemented for deep learning models using parameter perturbation. The LDM-based active learning is performed by querying unlabeled data with the smallest LDM. Experimental results show that our LDM-based active learning algorithm obtains state-of-the-art overall performance on all considered datasets and deep architectures.

POSTER-340: Sliced Wasserstein Estimation with Control Variates

Primary Area: general machine learning (i.e., none of the above)

Keywords: Sliced Wasserstein Monte Carlo Methods Point-Cloud Generative Models Optimal Transport

Scores: [ 6 6 6 8 ]

The sliced Wasserstein (SW) distances between two probability measures are defined as the expectation of the Wasserstein distance between two one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate the value of the SW distance. Despite having various variants, there has been no prior work that improves the Monte Carlo estimation scheme for the SW distance in terms of controlling its variance. To bridge the literature on variance reduction and the literature on the SW distance, we propose computationally efficient control variates to reduce the variance of the empirical estimation of the SW distance. The key idea is to first find Gaussian approximations of projected one-dimensional measures, then we utilize the closed-form of the Wasserstein-2 distance between two Gaussian distributions to design the control variates. In particular, we propose using a lower bound and an upper bound of the Wasserstein-2 distance between two fitted Gaussians as two computationally efficient control variates. We empirically show that the proposed control variate estimators can help to reduce the variance considerably when comparing measures over images and point-clouds. Finally, we demonstrate the favorable performance of the proposed control variate estimators in gradient flows to interpolate between two point-clouds and in deep generative modeling on standard image datasets, such as CIFAR10 and CelebA.

POSTER-341: IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: in-context learning large language models active learning

Scores: [ 8 6 3 5 8 ]

In-context learning is a promising paradigm that utilizes in-context examples as prompts for the predictions of large language models. These prompts are crucial for achieving strong performance. However, since the prompts need to be sampled from a large volume of annotated examples, finding the right prompt may result in high annotation costs. To address this challenge, this paper introduces an influence-driven selective annotation method that aims to minimize annotation costs while improving the quality of in-context examples. The essence of our method is to select a pivotal subset from a large-scale unlabeled data pool to annotate for the subsequent sampling of prompts. Specifically, a directed graph is first constructed to represent unlabeled data. Afterward, the influence of candidate unlabeled subsets is quantified with a diffusion process. A simple yet effective greedy algorithm for unlabeled data selection is lastly introduced. It iteratively selects the data if it provides a maximum marginal gain with respect to quantified influence. Compared with previous efforts on selective annotations, our influence-driven method works in an end-to-end manner, avoids an intractable explicit balance between data diversity and representativeness, and enjoys theoretical support. Experiments confirm the superiority of the proposed method on various benchmarks, achieving better performance under lower time consumption during subset selection.

POSTER-342: Efficient local linearity regularization to overcome catastrophic overfitting

Primary Area: general machine learning (i.e., none of the above)

Keywords: Fast Adversarial Training Catastrophic Overfitting Local Linearity

Scores: [ 5 6 6 ]

Catastrophic overfitting (CO) in single-step adversarial training (AT) results in abrupt drops in the adversarial test accuracy (even down to 0%). For models trained with multi-step AT, it has been observed that the loss function behaves locally linearly with respect to the input, this is however lost in single-step AT. To address CO in single-step AT, several methods have been proposed to enforce local linearity of the loss via regularization. However, these regularization terms considerably slow down training due to Double Backpropagation. Instead, in this work, we introduce a regularization term, called ELLE, to mitigate CO effectively and efficiently in classical AT evaluations, as well as some more difficult regimes, e.g., large adversarial perturbations and long training schedules. Our regularization term can be theoretically linked to curvature of the loss function and is computationally cheaper than previous methods by avoiding Double Backpropagation. Our thorough experimental validation demonstrates that our work does not suffer from CO, even in challenging settings where previous works suffer from it. We also notice that adapting our regularization parameter during training (ELLE-A) greatly improves the performance, specially in large ϵ setups.

POSTER-343: Kill Two Birds with One Stone: Rethinking Data Augmentation for Deep Long-tailed Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Long-tailed learning data augmentation imbalance fairness

Scores: [ 8 6 5 8 ]

Real-world tasks are universally associated with training samples that exhibit a long-tailed class distribution, and traditional deep learning models are not suitable for fitting this distribution, thus resulting in a biased trained model. To surmount this dilemma, massive deep long-tailed learning studies have been proposed to achieve inter-class fairness models by designing sophisticated sampling strategies or improving existing model structures and loss functions. Habitually, these studies tend to apply data augmentation strategies to improve the generalization performance of their models. However, this augmentation strategy applied to balanced distributions may not be the best option for long-tailed distributions. For a profound understanding of data augmentation, we first theoretically analyze the gains of traditional augmentation strategies in long-tailed learning, and observe that augmentation methods cause the long-tailed distribution to be imbalanced again, resulting in an intertwined imbalance: inherent data-wise imbalance and extrinsic augmentation-wise imbalance, i.e., two 'birds' co-exist in long-tailed learning. Motivated by this observation, we propose an adaptive Dynamic Optional Data Augmentation (DODA) to address this intertwined imbalance, i.e., one 'stone' simultaneously 'kills' two 'birds', which allows each class to choose appropriate augmentation methods by maintaining a corresponding augmentation probability distribution for each class during training. Extensive experiments across mainstream long-tailed recognition benchmarks (e.g., CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018) prove the effectiveness and flexibility of the DODA in overcoming the intertwined imbalance. The code is available inhttps://anonymous.4open.science/r/Code-for-DODA-FE12.

SPOTLIGHT-67: COMPARATOR: Reference-free machine translation evaluation by inter-system comparison

Primary Area: general machine learning (i.e., none of the above)

Keywords: Machine Translation Evaluation

Scores: [ 6 8 8 5 ]

Traditionally, Machine Translation (MT) Evaluation has been treated as a regression problem—producing an absolute translation-quality score. However, this approach has two limitations: i) the scores lack interpretability and human annotators struggle with giving consistent scores; ii) most scoring methods are based on(reference, translation) pairs, limiting their applicability in real-world scenarioswhere references are absent. In practice, we often care about whether a new MTsystem is better or worse than some competitors. In addition, reference-free MTevaluation is increasingly practical and necessary. However, these two practicalconsiderations have yet to be jointly explored. In this work, we formulate thereference-free MT evaluation into a pairwise ranking problem. Given the sourcesentence and a pair of translations, our system predicts which translation is better. In addition to proposing this new formulation, we further show that this newparadigm can demonstrate superior performance by merely using indirect supervision from natural language inference and weak supervision from our syntheticdata. In the context of reference-free evaluation, our system, trained without anyhuman annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21. On a more challengingbenchmark ACES, which contains fine-grained evaluation criteria such as addition, omission, and mistranslation errors our system marks state-of-the-art againstreference-free as well as reference-based baselines.

POSTER-344: Exploring the Relationship Between Model Architecture and In-Context Learning Ability

Primary Area: general machine learning (i.e., none of the above)

Keywords: in-context learning neural architectures

Scores: [ 5 6 5 6 ]

What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps towards answering this question. In particular, we evaluate fifteen model architectures across a suite of synthetic in-context learning tasks. The selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, and state-space models. We discover that all considered architectures can perform in-context learning under certain conditions. However, contemporary architectures are found to be the best performing, especially as task complexity grows. Additionally, our follow-up experiments delve into various factors that influence in-context learning. We observe varied sensitivities among architectures with respect to hyperparameter settings. Our study of training dynamics reveals that certain architectures exhibit a smooth, progressive learning trajectory, while others demonstrate periods of stagnation followed by abrupt mastery of the task. Finally, and somewhat surprisingly, we find that several state-space model variants are more robust in-context learners than transformers; since state-space models have constant-sized memory footprints at inference time, this result opens the future possibility of scaling up in-context learning to vastly larger numbers of in-context examples.

POSTER-345: Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Primary Area: general machine learning (i.e., none of the above)

Keywords: Data Selection Large Language Model Optimal Transport Data-centric AI Data Efficiency

Scores: [ 8 5 3 6 ]

This work focuses on leveraging and selecting from vast, unlabeled, open data to \emph{pre-fine-tune} a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced \footnote{Code repository: \url{https://anonymous.4open.science/r/DV4LLM-D761/}}. While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.

POSTER-346: Adaptive Sharpness-Aware Pruning for Robust Sparse Networks

Primary Area: general machine learning (i.e., none of the above)

Keywords: sparsity pruning natural robustness flatness sharpness generalization deep learning out of distribution

Scores: [ 6 6 6 6 ]

Robustness and compactness are two essential attributes of deep learning models that are deployed in the real world. The goals of robustness and compactness may seem to be at odds, since robustness requires generalization across domains, while the process of compression exploits specificity in one domain. We introduce \textit{Adaptive Sharpness-Aware Pruning (AdaSAP)}, which unifies these goals through the lens of network sharpness. The AdaSAP method produces sparse networks that are robust to input variations which are \textit{unseen at training time}. We achieve this by strategically incorporating weight perturbations in order to optimize the loss landscape. This allows the model to be both primed for pruning and regularized for improved robustness. AdaSAP improves the robust accuracy of pruned models on image classification by up to +6% on ImageNet C and +4% on ImageNet V2, and on object detection by +4% on a corrupted Pascal VOC dataset, over a wide range of compression ratios, pruning criteria, and network architectures, outperforming recent pruning art by large margins.

POSTER-347: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

Primary Area: general machine learning (i.e., none of the above)

Keywords: quantization sparsity large language models

Scores: [ 6 8 6 6 ]

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. Quantizing models to 3-4 bits per parameter can lead to moderate to high accuracy losses, especially for smaller models (1-10B parameters), which are suitable for edge deployment. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique that enables for the first time \emph{near-lossless} compression of LLMs across model scales while reaching similar compression levels to previous methods. SpQR works by identifying and isolating \emph{outlier weights}, which cause particularly large quantization errors, and storing them in higher precision while compressing all other weights to 3-4 bits, and achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs. This makes it possible to run a 33B parameter LLM on a single 24 GB consumer GPU without performance degradation at 15% speedup, thus making powerful LLMs available to consumers without any downsides. SpQR comes with efficient algorithms for both encoding weights into its format, as well as decoding them efficiently at runtime. Specifically, we provide an efficient GPU inference algorithm for SpQR, which yields faster inference than 16-bit baselines at similar accuracy while enabling memory compression gains of more than 4x.

POSTER-348: A Simple and Effective Pruning Approach for Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: network pruning sparsity large language models network architectures outlier features

Scores: [ 6 6 5 8 ]

As their size increases, Large Languages Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also becomputationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, ona per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates.

POSTER-349: Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control

Primary Area: general machine learning (i.e., none of the above)

Keywords: web navigation large language models prompting

Scores: [ 8 6 8 ]

Building agents with large language models (LLMs) for computer control is a burgeoning research area, where the agent receives the states, e.g., texts or webpages, of the computer and generate actions to complete the complex tasks. Previous methods leverage in-context learning (ICL), using a few exemplars to prompt LLMs for computer control. However, their performance is hindered by several issues: i) The limited context length of LLMs and complex states restrict the number of exemplars, for a single webpage could occupy all the context. ii) The exemplars in current methods, such as high-level plans and multi-choice questions, cannot represent the full trajectories of the controlling process, leading to suboptimal performance in many-step tasks, e.g., more than 20 steps. iii) Existing methods require task-specific exemplars and neglect the reuse of experiences from similar tasks, leading to the poor generalizability to new tasks. To address these challenges, we introduce Synapse, an agent which incorporates trajectory-as-exemplar prompting with the associated memory for solving computer control tasks. Our contributions are three-fold. First, we propose the state abstraction method, which abstracts the task-relevant information from the raw state, thereby allowing more exemplars as the contexts of LLMs. Second, we propose the trajectory-as-exemplar (TaE) prompting method to prompt the LLM with the full trajectories of the task, including the abstracted states and user actions, significantly improving performance in many-step decision-making. Third, we introduce the associated TaE memory to store the trajectories and retrieve relevant trajectories as exemplars from the memory via similarity search, thus significantly improving the generalizability to novel tasks. We conduct the experiments on MiniWoB++ and the more realistic benchmark Mind2Web. In MiniWoB++, Synapse achieves 99.2% (10% relative improvement) average success rate across 64 tasks with demonstrations from merely 48 tasks. Notably, Synapse is the first prompting method to solve book-flight in MiniWoB++. Synapse also exhibits a 53% relative improvement in average step success rate over previous state-of-the-art prompting scheme in Mind2Web.

POSTER-350: From Matching to Mixing: A Graph Interpolation Approach for SAT Instance Generation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Combinatorial Optimization Boolean Satisfiability Problem Graph Generation Graph Matching

Scores: [ 5 5 8 ]

The Boolean satisfiability problem (SAT) stands as a canonical NP-complete combinatorial optimization (CO) problem, with wide impact on both theoretical and industrial scenarios. In particular, the scarcity of real-world SAT instances and their usefulness for tuning SAT solvers underscore the necessity for effective and efficient ways of hard instance generation, whereas existing methods either struggle to maintain plausible hardness or suffer from limited applicability. Different from the typical construction-based methods, this paper introduces an adaptive and efficient graph interpolation approach that in place modifies the raw structure of graph-represented SAT instance by replacing it with a counterpart from another instance. Specifically, our method involves a two-stage matching and mixing pipeline. The matching aims to find a correspondence map of literal nodes from two instance graphs via learned features from a matching network; while the mixing stage involves iteratively exchanging clause pairs with the highest correspondence scores until a specified replacement ratio is achieved. We further show that under our matching-mixing framework, moderate randomness can avoid hardness degradation of SAT instances by introducing Gumbel noise. Experimental results show the superiority of the proposed method with both resemblance in structure and hardness, as well as general applicability in an efficient way. Source code will be released.

POSTER-351: ACRF: Compressing Explicit Neural Radiance Fields via Attribute Compression

Primary Area: general machine learning (i.e., none of the above)

Keywords: NeRF; NeRF Compression; 3D Data Compression

Scores: [ 6 8 8 6 ]

In this work, we study the problem of explicit NeRF compression. Through analyzing recent explicit NeRF models, we reformulate the task of explicit NeRF compression as 3D data compression. We further introduce our NeRF compression framework, Attributed Compression of Radiance Field (ACRF), which focuses on the compression of the explicit neural 3D representation. The neural 3D structure is pruned and converted to points with features, which are further encoded using importance-guided feature encoding. Furthermore, we employ an importance-prioritized entropy model to estimate the probability distribution of transform coefficients, which are then entropy coded with an arithmetic coder using the predicted distribution. Within this framework, we present two models, ACRF and ACRF-F, to strike a balance between compression performance and encoding time budget. Our experiments, which include both synthetic and real-world datasets such as Synthetic-NeRF and Tanks&Temples, demonstrate the superior performance of our proposed algorithm.

POSTER-352: RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual speech separation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Audio-Visual Multi-Modal Time-Frequency-Domain Speech-Separation Model-Compression

Scores: [ 8 6 8 8 ]

Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the MACs. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.

POSTER-353: ALAM: Averaged Low-Precision Activation for Memory-Efficient Training of Transformer Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Memory efficient training Activation-compressed training Average Quantization NLP Transformer

Scores: [ 6 6 6 6 ]

One of the key challenges in deep neural network training is the substantial amount of GPU memory required to store activations obtained in the forward pass. Various Activation-Compressed Training (ACT) schemes have been proposed to mitigate this issue; however, it is challenging to adopt those approaches in recent transformer-based large language models (LLMs), which experience significant performance drops when the activations are deeply compressed during training. In this paper, we introduce ALAM, a novel ACT framework that utilizes average quantization and a lightweight sensitivity calculation scheme, enabling large memory saving in LLMs while maintaining training performance. We first demonstrate that compressing activations into their group average values minimizes the gradient variance. Employing this property, we propose Average Quantization which provides high-quality deeply compressed activations with an effective precision of less than 1 bit and improved flexibility of precision allocation. In addition, we present a cost-effective yet accurate sensitivity calculation algorithm that solely relies on the L2 norm of parameter gradients, substantially reducing memory overhead due to sensitivity calculation. In experiments, the ALAM framework significantly reduces activation memory without compromising accuracy, achieving up to a 12.5$\times$ compression rate in LLMs.

POSTER-354: Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge Distillation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Knowledge Distillation Bilevel Optimization Re-weighting

Scores: [ 8 8 5 5 ]

Knowledge distillation aims to train a compact student network using soft supervision from a larger teacher network and hard supervision from ground truths. However, determining an optimal knowledge fusion ratio that balances these supervisory signals remains challenging. Prior methods generally resort to a constant or heuristic-based fusion ratio, which often falls short of a proper balance. In this study, we introduce a novel adaptive method for learning a sample-wise knowledge fusion ratio, exploiting both the correctness of teacher and student, as well as how well the student mimics the teacher on each sample. Our method naturally leads to the \textit{intra-sample} trilateral geometric relations among the student prediction (S), teacher prediction (T), and ground truth (G). To counterbalance the impact of outliers, we further extend to the \textit{inter-sample} relations, incorporating the teacher's global average prediction (T¯) for samples within the same class. A simple neural network then learns the implicit mapping from the intra- and inter-sample relations to an adaptive, sample-wise knowledge fusion ratio in a bilevel-optimization manner. Our approach provides a simple, practical, and adaptable solution for knowledge distillation that can be employed across various architectures and model sizes. Extensive experiments demonstrate consistent improvements over other loss re-weighting methods on image classification, attack detection, and click-through rate prediction.

POSTER-355: CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

Primary Area: general machine learning (i.e., none of the above)

Keywords: 3D referring 3D visual grounding localization 3D

Scores: [ 6 6 6 6 ]

3D visual grounding is the ability to localize objects in 3D scenes conditioned onan input utterance. Most existing methods devote the referring head to localize thereferred object directly. However, this approach will fail in complex scenarios andnot illustrate how and why the network reaches the final decision. In this paper,we address this question “Can we design an interpretable 3D visual groundingframework that has the potential to mimic the human perception system?”. To thisend, we formulate the 3D visual grounding problem as a sequence-to-sequence(Seq2Seq) task by first predicting a chain of anchors and then utilizing them to pre-dict the final target. Following the chain of thoughts approach enables us to decom-pose the referring task into interpretable intermediate steps, which in turn, booststhe performance and makes our framework extremely data-efficient. Interpretabil-ity not only improves the overall performance but also helps us identify failurecases. Moreover, our proposed framework can be easily integrated into any existingarchitecture. We validate our approach through comprehensive experiments on theNr3D and Sr3D benchmarks and show consistent performance gains compared toexisting methods without requiring any manually annotated data. Furthermore, ourproposed framework, dubbed CoT3DRef, is significantly data-efficient, whereaswhen trained only on 10% of the data, we match the SOTA performance that trainedon the entire data. The code is available at https://cot3dref.github.io/.

POSTER-356: Biased Temporal Convolution Graph Network for Time Series Forecasting with Missing Values.

Primary Area: general machine learning (i.e., none of the above)

Keywords: Spatial and temporal graph neural network bias missing values time series forecasting

Scores: [ 6 6 6 ]

Multivariate time series forecasting plays an important role in various applications ranging from meteorology study, traffic management to economics planning. In the past decades, many efforts have been made toward accurate and reliable forecasting by exploring both temporal dynamics and spatial correlation. Especially, the development of Transformer-based methods has significantly enhanced long-term forecasting accuracy in very recent years. The existing forecasting methods often assume intact input data, however, in practice the time series data is often partially observed due to device malfunction or costly data acquisition, which can seriously impede the performance of the existing approaches. A naive employment of imputation methods unavoidably involves error accumulation and leads to suboptimal solutions. Motivated by this, we propose a Biased Temporal Convolution Graph Network that jointly captures the temporal dependencies and spatial structure. In particular, we inject bias into the two carefully developed modules---the Multi-Scale Instance PartialTCN and Biased GCN---to account for missing patterns. The experimental results show that our proposed model is able to achieve up to 11% improvements over the existing methods on five real-world benchmark datasets. Code is available at this repository: https://anonymous.4open.science/r/BiaTCGNet-1F80/.

POSTER-357: InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Primary Area: general machine learning (i.e., none of the above)

Keywords: Vision generalists Text-to-image models Multi-task learning

Scores: [ 3 8 6 5 ]

Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.

POSTER-358: The Expressive Power of Transformers with Chain of Thought

Primary Area: general machine learning (i.e., none of the above)

Keywords: The Expressive Power of Transformers with Chain of Thought

Scores: [ 6 8 8 8 ]

Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps make them recognize exactly the class of polynomial-time solvable problems---the first exact characterization of a type of transformers in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer’s chain of thought or scratchpad impacts its reasoning power.

POSTER-359: Fake It Till Make It: Federated Learning with Consensus-Oriented Generation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated learning data heterogeneity

Scores: [ 6 6 6 6 ]

In federated learning (FL), data heterogeneity is one key bottleneck that causes model divergence and limits performance. Addressing this, existing methods often regard data heterogeneity as an inherent property and propose to mitigate its adverse effects by correcting models. In this paper, we seek to break this inherent property by generating data to complement the original dataset to fundamentally mitigate heterogeneity level. As a novel attempt from the perspective of data, we propose federated learning with consensus-oriented generation (FedCOG). FedCOG consists of two key components at the client side: complementary data generation, which generates data extracted from the shared global model to complement the original dataset, and knowledge-distillation-based model training, which distills knowledge from global model to local model based on the generated data to mitigate over-fitting the original heterogeneous dataset.FedCOG has two critical advantages: 1) it can be a plug-and-play module to further improve the performance of most existing FL methods, and 2) it is naturally compatible with standard FL protocols such as Secure Aggregation since it makes no modification in communication process.Extensive experiments on classical and real-world FL datasets show that FedCOG consistently outperforms state-of-the-art methods and has the plug-and-play property.

POSTER-360: Improving LoRA in Privacy-preserving Federated Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated Learning Parameter-efficient Fine-tuning Differential Privacy Large Language Model

Scores: [ 6 5 5 6 ]

POSTER-361: Hebbian Learning based Orthogonal Projection for Continual Learning of Spiking Neural Networks

Primary Area: general machine learning (i.e., none of the above)

Keywords: Hebbian learning neuromorphic computing continual learning spiking neural networks

Scores: [ 8 6 6 6 ]

Neuromorphic computing with spiking neural networks is promising for energy-efficient artificial intelligence (AI) applications. However, different from humans who continually learn different tasks in a lifetime, neural network models suffer from catastrophic forgetting. How could neuronal operations solve this problem is an important question for AI and neuroscience. Many previous studies draw inspiration from observed neuroscience phenomena and propose episodic replay or synaptic metaplasticity, but they are not guaranteed to explicitly preserve knowledge for neuron populations. Other works focus on machine learning methods with more mathematical grounding, e.g., orthogonal projection on high dimensional spaces, but there is no neural correspondence for neuromorphic computing. In this work, we develop a new method with neuronal operations based on lateral connections and Hebbian learning, which can protect knowledge by projecting activity traces of neurons into an orthogonal subspace so that synaptic weight update will not interfere with old tasks. We show that Hebbian and anti-Hebbian learning on recurrent lateral connections can effectively extract the principal subspace of neural activities and enable orthogonal projection. This provides new insights into how neural circuits and Hebbian learning can help continual learning, and also how the concept of orthogonal projection can be realized in neuronal systems. Our method is also flexible to utilize arbitrary training methods based on presynaptic activities/traces. Experiments show that our method consistently solves forgetting for spiking neural networks with nearly zero forgetting under various supervised training methods with different error propagation approaches, and outperforms previous approaches under various settings. Our method can pave a solid path for building continual neuromorphic computing systems.

POSTER-362: L2P-MIP: Learning to Presolve for Mixed Integer Programming

Primary Area: general machine learning (i.e., none of the above)

Keywords: Combinatorial Optimization; Mixed Integer Programming; Presolving

Scores: [ 8 6 6 6 ]

Modern solvers for solving mixed integer programming (MIP) often rely on the branch-and-bound (B&B) algorithm which could be of high time complexity, and presolving techniques are well designed to simplify the instance as pre-processing before B&B. However, such presolvers in existing literature or open-source solvers are mostly set by default agnostic to specific input instances, and few studies have been reported on tailoring presolving settings. In this paper, we aim to dive into this open question and show that the MIP solver can be indeed largely improved when switching the default instance-agnostic presolving into instance-specific presolving. Specifically, we propose a combination of supervised learning and classic heuristics to achieve efficient presolving adjusting, avoiding tedious reinforcement learning. Notably, our approach is orthogonal from many recent efforts in incorporating learning modules into the B&B framework after the presolving stage, and to our best knowledge, this is the first work for introducing learning to presolve in MIP solvers. Experiments on multiple real-world datasets show that well-trained neural networks can infer proper presolving for arbitrary incoming MIP instances in less than 0.5s, which is neglectable compared with the solving time often hours or days. We plan to open-source our code as a benchmark for this new task, and currently, the code is available at an anonymous repository.

POSTER-363: Unraveling the Key Components of OOD Generalization via Diversification

Primary Area: general machine learning (i.e., none of the above)

Keywords: Algorithm Design Diversity OOD Generalization Spurious Correlation Understanding Neural Networks

Scores: [ 5 6 6 6 ]

Real-world datasets may contain multiple features that explain the training data equally well, i.e., learning any of them would lead to correct predictions on the training data. However, many of them can be spurious, i.e., lose their predictive power under a distribution shift and fail to generalize to out-of-distribution (OOD) data.Recently developed diversification'' methods approach this problem by finding multiple diverse hypotheses that rely on different features.This paper aims to study this class of methods and identify the key components contributing to their OOD generalization abilities.We show that (1) diversification methods are highly sensitive to the distribution of the unlabeled data used for diversification and can underperform significantly when away from a method-specific sweet spot.(2) Diversification alone is insufficient for OOD generalization. The choice of the used learning algorithm, e.g., the model's architecture and pretraining, is crucial, and using the second-best choice leads to an up to 20% absolute drop in accuracy.(3) The optimal choice of learning algorithm depends on the unlabeled data, and vice versa.(4) Finally, we show that the above pitfalls cannot be alleviated by increasing the number of diverse hypotheses, allegedly the major feature of diversification methods.These findings provide a clearer understanding of the critical design factors influencing the OOD generalization of diversification methods. They can guide practitioners in how to use the existing methods best and guide researchers in developing new, better ones.

POSTER-364: Transformer Fusion with Optimal Transport

Primary Area: general machine learning (i.e., none of the above)

Keywords: Fusion Transformers

Scores: [ 8 6 6 6 ]

Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities. Past attempts have been restricted to the case of fully-connected, convolutional, and residual networks. In this paper, we present a systematic approach for fusing two or more transformer-based networks exploiting Optimal Transport to (soft-)align the various architectural components. We flesh out an abstraction for layer alignment, that can generalize to arbitrary architectures -- in principle -- and we apply this to the key ingredients of Transformers such as multi-head self-attention, layer-normalization, and residual connections, and we discuss how to handle them via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way for compression of Transformers. The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion, and, after a surprisingly short finetuning, also outperforms the individual converged parent models. In our analysis, we uncover intriguing insights about the significant role of soft alignment in the case of Transformers. Our results showcase the potential of fusing multiple Transformers, thus compounding their expertise, in the budding paradigm of model fusion and recombination.

POSTER-365: Robust Network Pruning With Sparse Entropic Wasserstein Regression

Primary Area: general machine learning (i.e., none of the above)

Keywords: Network Pruning Model Compression Optimal Transport Wasserstein Distance Deep Learning

Scores: [ 6 6 6 8 6 ]

This study unveils a cutting-edge technique for neural network pruning that judiciously addresses noisy gradients during the computation of the empirical Fisher Information Matrix (FIM). We introduce an entropic Wasserstein regression (EWR) formulation, capitalizing on the geometric attributes of the optimal transport (OT) problem. This is analytically showcased to excel in noise mitigation by adopting neighborhood interpolation across data points. The unique strength of the Wasserstein distance is its intrinsic ability to strike a balance between noise reduction and covariance information preservation. Extensive experiments performed on various networks show comparable performance of the proposed method with state-of-the-art (SoTA) network pruning algorithms. Our proposed method outperforms the SoTA when the network size or the target sparsity is large, the gain is even larger with the existence of noisy gradients, possibly from noisy data, analog memory, or adversarial attacks. Notably, our proposed method achieves a gain of 6% improvement in accuracy and 8% improvement in testing loss for MobileNetV1 with less than one-fourth of the network parameters remaining.

POSTER-366: Who to imitate: Imitating desired behavior from divserse multi-agent datasets

Primary Area: general machine learning (i.e., none of the above)

Keywords: multi-agent systems imitation learning credit assignment learning from human data

Scores: [ 6 8 8 8 5 ]

AI agents are commonly trained with large datasets of demonstrations of human behavior.However, not all behaviors are equally safe or desirable.Desired characteristics for an AI agent can be expressed by assigning desirability scores, which we assume are assigned to collective trajectories, but not to individual behaviors.For example, in a dataset of vehicle interactions, these scores might relate to the number of incidents that occurred. We first assess the effect of each individual agent's behavior on the collective desirability score, e.g., assessing how likely an agent is to cause incidents.This allows us to afterward only imitate agents with desired behavior, e.g., only imitating agents that are unlikely to cause incidents. To enable this, we propose the concept of an agent's \textit{Exchange Value}, which quantifies an individual agent's contribution to the collective desirability score. This is expressed as the expected change in desirability score when substituting the agent for a randomly selected agent.We propose additional methods for estimating Exchange Values from real-world datasets, enabling us to learn aligned imitation policies that outperform relevant baselines.

POSTER-367: Teaching Large Language Models to Self-Debug

Primary Area: general machine learning (i.e., none of the above)

Keywords: large language model self-debugging

Scores: [ 6 6 6 6 ]

Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose self-debugging, which teaches a large language model to debug its predicted program. In particular, we demonstrate that self-debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on the code correctness or error messages, the model is able to identify its mistakes by leveraging code execution and explaining the generated code in natural language. Self-debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, self-debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest level by 9%. On TransCoder and MBPP where unit tests are available, self-debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, self-debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10$\times$ candidate programs.

POSTER-368: FedHyper: A Universal and Robust Learning Rate Scheduler for Federated Learning with Hypergradient Descent

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federate Learning Learning rate Hyperparameter Hypergradient

Scores: [ 5 8 8 8 ]

The theoretical landscape of federated learning (FL) undergoes rapid evolution, but its practical application encounters a series of intricate challenges, and hyperparameter optimization is one of these critical challenges. Amongst the diverse adjustments in hyperparameters, the adaptation of the learning rate emerges as a crucial component, holding the promise of significantly enhancing the efficacy of FL systems. In response to this critical need, this paper presents FedHyper, a novel hypergradient-based learning rate adaptation algorithm specifically designed for FL. FedHyper serves as a universal learning rate scheduler that can adapt both global and local rates as the training progresses. In addition, FedHyper not only showcases unparalleled robustness to a spectrum of initial learning rate configurations but also significantly alleviates the necessity for laborious empirical learning rate adjustments. We provide a comprehensive theoretical analysis of FedHyper’s convergence rate and conduct extensive experiments on vision and language benchmark datasets. The results demonstrate that FEDHYPER consistently converges 1.1-3× faster than FedAvg and the competing baselines while achieving superior final accuracy. Moreover, FEDHYPER catalyzes a remarkable surge in accuracy, augmenting it by up to 15% compared to FedAvg under suboptimal initial learning rate settings.

POSTER-369: Constraint-Free Structure Learning with Smooth Acyclic Orientations

Primary Area: general machine learning (i.e., none of the above)

Keywords: Structure Learning Continuous Acyclic Optimization Directed Acyclic Graphs DAG

Scores: [ 6 5 6 6 5 ]

The structure learning problem consists of fitting data generated by a Directed Acyclic Graph (DAG) to correctly reconstruct its arcs. In this context, differentiable approaches constrain or regularize an optimization problem with a continuous relaxation of the acyclicity property. The computational cost of evaluating graph acyclicity is cubic on the number of nodes and significantly affects scalability. In this paper, we introduce COSMO, a constraint-free continuous optimization scheme for acyclic structure learning. At the core of our method lies a novel differentiable approximation of an orientation matrix parameterized by a single priority vector. Differently from previous works, our parameterization fits a smooth orientation matrix and the resulting acyclic adjacency matrix without evaluating acyclicity at any step. Despite this absence, we prove that COSMO always converges to an acyclic solution. In addition to being asymptotically faster, our empirical analysis highlights how COSMO performance on graph reconstruction compares favorably with competing structure learning methods.

POSTER-370: \textit{M}\textit{3}: Towards Robust Multi-Modal Reasoning via Model Selection

Primary Area: general machine learning (i.e., none of the above)

Keywords: robust model selection multi-modal learning multi-step reasoning with LLM

Scores: [ 6 5 6 5 ]

The reasoning capabilities of LLM (Large Language Model) are widely acknowledged in recent research, inspiring studies on tool learning and autonomous agents. LLM serves as the brain'' of agent, orchestrating multiple tools for collaborative multi-step task solving. Unlike methods invoking tools like calculators or weather APIs for straightforward tasks, multi-modal agents excel by integrating diverse AI models for complex challenges. However, current multi-modal agents neglect the significance of model selection: they primarily focus on the planning and execution phases, and will only invoke predefined task-specific models for each subtask, making the execution fragile. Meanwhile, other traditional model selection methods are either incompatible with or suboptimal for the multi-modal agent scenarios, due to ignorance of dependencies among subtasks arising by multi-step reasoning.To this end, we identify the key challenges therein and propose the \textit{M}\textit{3} framework as a plug-in with negligible runtime overhead at test-time. This framework improves model selection and bolsters the robustness of multi-modal agents in multi-step reasoning. In the absence of suitable benchmarks, we create MS-GQA, a new dataset specifically designed to investigate the model selection challenge in multi-modal agents. Our experiments reveal that our framework enables dynamic model selection, considering both user inputs and subtask dependencies, thereby robustifying the overall reasoning process.

POSTER-371: Conditional Information Bottleneck Approach for Time Series Imputation

Primary Area: general machine learning (i.e., none of the above)

Keywords: time series imputation; time series interpolation; information bottleneck

Scores: [ 8 6 6 6 ]

Time series imputation presents a significant challenge because it requires capturing the underlying temporal dynamics from partially observed time series data. Among the recent successes of imputation methods based on generative models, the information bottleneck (IB) framework offers a well-suited theoretical foundation for multiple imputations, allowing us to account for the uncertainty associated with the imputed values. However, directly applying the IB framework to time series data without considering their temporal context can lead to a substantial loss of temporal dependencies, which, in turn, can degrade the overall imputation performance. To address such a challenge, we propose a novel conditional information bottleneck (CIB) approach for time series imputation, which aims to mitigate the potentially negative consequences of the regularization constraint by focusing on reducing the redundant information conditioned on the temporal context. We provide a theoretical analysis of its effect by adapting variational decomposition. We use the resulting insight and propose a novel deep learning method that can approximately achieve the proposed CIB objective for time series imputation as a combination of evidence lower bound and novel temporal kernel-enhanced contrastive optimization. Our experiments, conducted on multiple real-world datasets, consistently demonstrate that our method significantly improves imputation performance (including both interpolation and extrapolation), and also enhances classification performance based on the imputed values.

POSTER-372: Rethinking Channel Dependence for Multivariate Time Series Forecasting: Learning from Leading Indicators

Primary Area: general machine learning (i.e., none of the above)

Keywords: Multivariate time series forecasting channel dependence lead-lag relationships

Scores: [ 6 6 6 6 ]

Recently, channel-independent methods have achieved state-of-the-art performance in multivariate time series (MTS) forecasting. Despite reducing overfitting risks, these methods miss potential opportunities in utilizing channel dependence for accurate predictions. We argue that there exist locally stationary lead-lag relationships between variates, i.e., some lagged variates may follow the leading indicators within a short time period. Exploiting such channel dependence is beneficial since leading indicators offer advance information that can be used to reduce the forecasting difficulty of the lagged variates. In this paper, we propose a new method named LIFT that first efficiently estimates leading indicators and their leading steps at each time step and then judiciously allows the lagged variates to utilize the advance information from leading indicators. LIFT plays as a plugin that can be seamlessly collaborated with arbitrary time series forecasting methods. Extensive experiments on six real-world datasets demonstrate that LIFT improves the state-of-the-art methods by 5.6% in average forecasting performance.

POSTER-373: When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations

Primary Area: general machine learning (i.e., none of the above)

Keywords: prompt prefix LLM fine-tuning theory

Scores: [ 8 6 6 6 ]

Context-based fine-tuning methods like prompting, in-context learning, soft prompting (prompt tuning) and prefix-tuning have gained popularity as they often match the performance of full fine-tuning with a fraction of the parameters. Despite their empirical successes, there is little theoretical understanding of how these techniques influence the internal computation of the model and their expressiveness limitations. We show that despite the continuous embedding space being much more expressive than the discrete token space, soft-prompting and prefix-tuning are strictly less expressive than full fine-tuning. Concretely, context-based fine-tuning cannot change the relative attention pattern over the content and can only bias the outputs of an attention layer in a fixed direction. While this means that fine-tuning techniques such as prompting, in-context learning, soft prompting and prefix-tuning can successfully elicit or combine skills already present in the pretrained model, they cannot learn tasks requiring new attention patterns.

POSTER-374: Scaling Laws of RoPE-based Extrapolation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Position Embeddin Length Extrapolation Large Language Model Natural Language Processing

Scores: [ 8 6 10 ]

The extrapolation capability of Large Language Models (LLMs) based on Rotary Position Embedding \cite{su2021roformer} is currently a topic of considerable interest. The mainstream approach to addressing extrapolation with LLMs involves modifying RoPE by replacing 10000, the rotary base of θn=100002n/d in the original RoPE, with a larger value and providing longer fine-tuning text. In this work, we first observe that fine-tuning a RoPE-based LLM with either a smaller or larger base in pre-training context length could significantly enhance its extrapolation performance. After that, we propose \textbf{\textit{Scaling Laws of RoPE-based Extrapolation}}, a unified framework from the periodic perspective, to describe the relationship between the extrapolation performance and base value as well as tuning context length. In this process, we also explain the origin of the RoPE-based extrapolation issue by \textbf{\textit{critical dimension for extrapolation}}. Besides these observations and analyses, we achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B \citep{touvron2023llama2}.

POSTER-375: Large Language Models Cannot Self-Correct Reasoning Yet

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large Language Models Self-Correction Reasoning

Scores: [ 8 8 6 5 ]

Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance might even degrade post self-correction. Drawing from these insights, we offer suggestions for future research and practical applications in this field.

POSTER-376: Vanishing Gradients in Reinforcement Finetuning of Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Vanishing Gradients Reinforcement Finetuning Supervised Finetuning Language Models

Scores: [ 6 6 5 8 ]

Pretrained language models are commonly aligned with human preferences and downstream tasks via finetuning.A prominent finetuning approach is reinforcement finetuning (RFT), which entails maximizing a (possibly learned) reward function using policy gradient algorithms. This work highlights an optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. We then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental in an RFT benchmark for lanaguage models. In particular, we show that in datasets where inputs with small reward standard deviation under the pretrained model are more prevalent, the reward that RFT achieves compared to \emph{supervised finetuning (SFT)} is worse. Controlled experiments and a theoretical analysis further establish that vanishing gradients in RFT can lead to extremely slow optimization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial SFT phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, our experiments reveal that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts.

ORAL-21: Improved Active Learning via Dependent Leverage Score Sampling

Primary Area: general machine learning (i.e., none of the above)

Keywords: leverage score sampling active learning polynomial regression differential equations pivotal sampling

Scores: [ 8 8 6 6 8 ]

We show how to obtain improved active learning methods in the agnostic (adversarial noise) setting by combining marginal leverage score sampling with non-independent sampling strategies that promote spatial coverage. In particular, we propose an easily implemented method based on the \emph{pivotal sampling algorithm}, which we test on problems motivated by learning-based methods for parametric PDEs and uncertainty quantification. In comparison to independent sampling, our method reduces the number of samples needed to reach a given target accuracy by up to 50%.We support our findings with two theoretical results. First, we show that any non-independent leverage score sampling method that obeys a weak \emph{one-sided independence condition} (which includes pivotal sampling) can actively learn d dimensional linear functions with O(dlogd) samples, matching independent sampling. This result extends recent work on matrix Chernoff bounds under independence, and may be of interest for analyzing other sampling strategies beyond pivotal sampling. Second, we show that, for the important case of polynomial regression, our pivotal method obtains an improved bound of O(d) samples.

POSTER-377: Adversarial Feature Map Pruning for Backdoor

Primary Area: general machine learning (i.e., none of the above)

Keywords: Backdoor Defense Data Poisoning

Scores: [ 6 6 6 ]

Deep neural networks have been widely used in many critical applications, such as autonomous vehicles and medical diagnosis. However, their security is threatened by backdoor attack, which is achieved by adding artificial patterns to specific training data. Existing defense strategies primarily focus on using reverse engineering to reproduce the backdoor trigger generated by attackers and subsequently repair the DNN model by adding the trigger into inputs and fine-tuning the model with ground-truth labels. However, once the trigger generated by the attackers is complex and invisible, the defender can not successfully reproduce the trigger. Consequently, the DNN model will not be repaired since the trigger is not effectively removed.In this work, we propose Adversarial Feature Map Pruning for Backdoor~(FMP) to mitigate backdoor from the DNN. Different from existing defense strategies, which focus on reproducing backdoor triggers, FMP tries to prune the backdoor feature maps, which are trained to extract backdoor information from the inputs. After pruning these backdoor feature maps, FMP will fine-tune the model with a secure subset of training data. Our experiments demonstrate that, compared to existing defense strategies, FMP can effectively reduce the Attack Success Rate (ASR) even against the most complex and invisible attack triggers~(e.g., FMP decreases the ASR to 2.86% in CIFAR10, 19.2%-65.41% lower than previous arts). Second, unlike conventional defense methods that tend to exhibit low Robust Accuracy (i.e., the model's accuracy on the poisoned data), FMP achieves higher RA, indicating its superiority in maintaining model performance while mitigating the effects of backdoor attacks~(e.g., FMP obtains 87.40% RA in CIFAR10). Third, compared to existing feature map pruning techniques, FMP can cover more backdoor feature maps~(e.g., FMP removes 83.33% of backdoor feature maps from the model in the CIFAR10 & BadNet scenario).

POSTER-378: Understanding Domain Generalization: A Noise Robustness Perspective

Primary Area: general machine learning (i.e., none of the above)

Keywords: out-of-distribution generalization distribution shifts spurious correlation noise robustness

Scores: [ 5 6 3 5 ]

Despite the rapid development of machine learning algorithms for domain generalization (DG), there is no clear empirical evidence that the existing DG algorithms outperform the classic empirical risk minimization (ERM) across standard benchmarks. To better understand this phenomenon, we investigate whether there are benefits of DG algorithms over ERM through the lens of label noise.Specifically, our finite-sample analysis reveals that label noise exacerbates the effect of spurious correlations for ERM, undermining generalization. Conversely, we illustrate that DG algorithms exhibit implicit label-noise robustness during finite-sample training even when spurious correlation is present.Such desirable property helps mitigate spurious correlations and improve generalization in synthetic experiments. However, additional comprehensive experiments on real-world benchmark datasets indicate that label-noise robustness does not necessarily translate to better performance compared to ERM. We conjecture that the failure mode of ERM arising from spurious correlations may be less prevalent in practice.

POSTER-379: Chain-of-Experts: When LLMs Meet Complex Operations Research Problems

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large Language Model Operations Research

Scores: [ 6 8 8 6 ]

POSTER-380: Novel Quadratic Constraints for Extending LipSDP beyond Slope-Restricted Activations

Primary Area: general machine learning (i.e., none of the above)

Keywords: Neural Networks Semidefinite Programming Lipschitz Constant Estimation

Scores: [ 6 6 6 ]

Recently, semidefinite programming (SDP) techniques have shown great promise in providing accurate Lipschitz bounds for neural networks. Specifically, the LipSDP approach (Fazlyab et al., 2019) has received much attention and provides the least conservative Lipschitz upper bounds that can be computed with polynomial time guarantees. However, one main restriction of LipSDP is that its formulation requires the activation functions to be slope-restricted on [0,1], preventing its further use for more general activation functions such as GroupSort, MaxMin, and Householder. One can rewrite MaxMin activations for example as residual ReLU networks. However, a direct application of LipSDP to the resultant residual ReLU networks is conservative and even fails in recovering the well-known fact that the MaxMin activation is 1-Lipschitz. Our paper bridges this gap and extends LipSDP beyond slope-restricted activation functions. To this end, we provide novel quadratic constraints for GroupSort, MaxMin, and Householder activations via leveraging their underlying properties such as sum preservation. Our proposed analysis is general and provides a unified approach for estimating 2 and Lipschitz bounds for a rich class of neural network architectures, including non-residual and residual neural networks and implicit models, with GroupSort, MaxMin, and HouseHolder activations. Finally, we illustrate the utility of our approach with a variety of experiments and show that our proposed SDPs generate less conservative Lipschitz bounds in comparison to existing approaches.

POSTER-381: Principled Architecture-aware Scaling of Hyperparameters

Primary Area: general machine learning (i.e., none of the above)

Keywords: Hyperparameter Transfer Neural Network Architecture Neural Network Initialization Learning Rate

Scores: [ 6 6 6 8 ]

Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Current works try to automatically optimize or design principles of hyperparameters, such that they can generalize to diverse unseen scenarios. However, most designs of principles or optimization methods are agnostic to the choice of network structures, and thus largely ignore the impact of neural architectures on hyperparameters. In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture, which includes the network depth, width, convolutional kernel size, and connectivity patterns. By pursuing every parameter to be maximally updated with the same mean squared change in pre-activations, we can generalize our initialization and learning rates across MLPs (multi-layer perception) and CNNs (convolutional neural network) with sophisticated graph topologies. We verify our principles with comprehensive experiments. More importantly, our strategy further sheds light on advancing current benchmarks for architecture design. A fair comparison of AutoML algorithms requires accurate network rankings. However, we demonstrate that network rankings can be easily changed by better training networks in benchmarks with our architecture-aware learning rates and initialization.

POSTER-382: Set Learning for Accurate and Calibrated Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: set learning calibration overconfidence class imbalance long-tailed classification low data classification calibration safety

Scores: [ 8 6 3 8 ]

Model overconfidence and poor calibration are common in machine learning and difficult to account for when applying standard empirical risk minimization. In this work, we propose a novel method to alleviate these problems that we call odd-k-out learning (OKO), which minimizes the cross-entropy error for sets rather than for single examples. This naturally allows the model to capture correlations across data examples and achieves both better accuracy and calibration, especially in limited training data and class-imbalanced regimes. Perhaps surprisingly, OKO often yields better calibration even when training with hard labels and dropping any additional calibration parameter tuning, such as temperature scaling. We provide theoretical justification, establishing that OKO naturally yields better calibration, and provide extensive experimental analyses that corroborate our theoretical findings. We emphasize that OKO is a general framework that can be easily adapted to many settings and the trained model can be applied to single examples at inference time, without introducing significant run-time overhead or architecture changes.

ORAL-22: "What Data Benefits My Classifier?" Enhancing Model Performance and Interpretability through Influence-Based Data Selection

Primary Area: general machine learning (i.e., none of the above)

Keywords: Data Selection Interpretability Fairness Robustness

Scores: [ 6 8 6 6 6 ]

Classification models are ubiquitously deployed in society and necessitate high utility, fairness, and robustness performance. Current research efforts mainly focus on improving model architectures and learning algorithms on fixed datasets to achieve this goal. In contrast, in this paper, we address an orthogonal yet crucial problem: given a fixed convex learning model (or a convex surrogate for a non-convex model) and a function of interest, we assess what data benefits the model by interpreting the feature space, and then aim to improve performance as measured by this function. To this end, we propose the use of influence estimation models for interpreting the classifier's performance from the perspective of the data feature space. Additionally, we propose data selection approaches based on influence that enhance model utility, fairness, and robustness. Through extensive experiments on synthetic and real-world datasets, we validate and demonstrate the effectiveness of our approaches not only for conventional classification scenarios, but also under more challenging scenarios such as distribution shifts, fairness poisoning attacks, utility evasion attacks, online learning, and active learning.

POSTER-383: DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Diffusion models speech generation text to speech autoregressive model

Scores: [ 5 6 6 6 ]

Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has been focused on generating spectrograms, and as such, they further require a subsequent model to convert the spectrogram to a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize an unlimited speech duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which makes the overall waveform sounds more natural. Furthermore, the proposed diffusion model is stochastic and not deterministic; therefore, each inference generates a slightly different waveform variation, enabling abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.

POSTER-384: CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets

Primary Area: general machine learning (i.e., none of the above)

Keywords: Tool Creation Tool Retrieval LLM Multimodality Reasoning

Scores: [ 8 6 6 ]

Large language models (LLMs) are often augmented with tools to solve complex tasks. By generating code snippets and executing them through task-specific Application Programming Interfaces (APIs), they can offload certain functions to dedicated external modules, such as image encoding and performing calculations. However, most existing approaches to augment LLMs with tools are constrainedby general-purpose APIs and lack the flexibility for tailoring them to specific tasks. In this work, we present CRAFT, a general tool creation and retrieval framework for LLMs. It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. For each task, we collect specific code solutions by promptingGPT-4 to solve the training examples. Following a validation step ensuring the correctness, these solutions are abstracted into code snippets to enhance reusability, and deduplicated for higher quality. At inference time, the language model retrieves snippets from the toolsets and then executes them or generates the output conditioning on the retrieved snippets. Our method is designed to be flexible andoffers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning. Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines. In addition, our in-depth analysis reveals that: (1) consistent performance improvement can be achieved byscaling up the number of tools and the capability of the backbone models; (2) each component of our approach contributes to the performance gains; (3) the created tools are well-structured and reliable with low complexity and atomicity.

SPOTLIGHT-68: Adaptive Causal Balancing for Collaborative Filtering

Primary Area: general machine learning (i.e., none of the above)

Keywords: Recommender System Causal Inference Bias Debias Balancing

Scores: [ 8 6 6 ]

Collaborative filtering builds personalized models from the collected user feedback. However, the collected data is observational rather than experimental, leading to various biases in the data, which can significantly affect the learned model. To address this issue, many studies have focused on propensity-based methods to combat the selection bias by reweighting the sample loss, and demonstrate thatbalancing is important for debiasing both theoretically and empirically. However, there are two questions that still need to be addressed: which function class should be balanced and how to effectively balance that function class? In this paper, we first perform theoretical analysis to show the effect of balancing finite-dimensional function classes on the bias of IPS and DR methods, and based on this, we propose a universal kernel-based balancing method to balance functions on the reproducing kernel Hilbert space. In addition, we propose a novel adaptive causal balancing method during the alternating update between unbiased evaluation and training of the prediction model. Specifically, the prediction loss of the model is projected in the kernel-based covariate function space, and the projection coefficients are used to determine which functions should be prioritized for balancing to reduce the estimation bias. We conduct extensive experiments on three real-world datasets to demonstrate the effectiveness of the proposed approach.

ORAL-23: Multisize Dataset Condensation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Dataset Condensation Dataset Distillation Image Classification

Scores: [ 8 8 8 6 ]

While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 6.40% average accuracy gains on condensing CIFAR-10 to ten images per class.

POSTER-385: A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors

Primary Area: general machine learning (i.e., none of the above)

Keywords: bayesian posterior bayesian neural networks dataset ensembles uncertainty quantification weight-space symmetries computer vision

Scores: [ 6 3 5 6 8 ]

The distribution of the weights of modern deep neural networks (DNNs) - crucial for uncertainty quantification and robustness - is an eminently complex object due to its extremely high dimensionality. This paper proposes one of the first large-scale explorations of the posterior distribution of deep Bayesian Neural Networks (BNNs), expanding our study to real-world vision tasks and architectures. Specifically, we investigate the optimal approach for approximating the posterior, analyze the connection between posterior quality and uncertainty quantification, delve into the impact of modes on the posterior, and explore methods for visualizing the posterior. Moreover, we uncover weight-space symmetries as a critical aspect for understanding the posterior. To this extent, we develop an in-depth assessment of the impact of both permutation and scaling symmetries that tend to obfuscate the Bayesian posterior. While the first type of transformation is known for duplicating modes, we explore the relationship between the latter and L2 regularization, challenging previous misconceptions. Finally, to help the community improve our understanding of the Bayesian posterior, we will release the first large-scale checkpoint dataset, including thousands of real-world models, along with our codes, after the anonymity period.

POSTER-386: An Emulator for Fine-tuning Large Language Models using Small Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: pre-training fine-tuning decouple scale reward alignment

Scores: [ 8 6 6 6 ]

Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a pre-training stage that uses a very large, diverse dataset of text and a fine-tuning (sometimes, 'alignment') stage using more targeted examples of specific behaviors and/or human preferences. While it has been hypothesized that knowledge and skills come from pre-training, and fine-tuning mostly filters this knowledge and skillset, this intuition has not been rigorously tested. In this paper, we test this hypothesis with a novel methodology for scaling these two stages independently, essentially asking, What would happen if we combined the knowledge learned by a large model during pre-training with the knowledge learned by a small model during fine-tuning (or vice versa)? Using an RL-based framework derived from recent developments in learning from human preferences, we introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales. Our experiments with EFT show that scaling up fine-tuning tends to improve helpfulness, while scaling up pre-training tends to improve factuality. Further, we show that EFT enables test-time adjustment of competing behavioral factors like helpfulness and harmlessness without additional training. Finally, we find that a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling small fine-tuned models with large pre-trained models, essentially 'emulating' the result of fine-tuning the large pre-trained model. Up-scaling consistently improves helpfulness and factuality of widely used pre-trained models like Llama, Llama-2, and Falcon, without additional hyperparameters or training.

POSTER-387: L2MAC: Large Language Model Automatic Computer for Unbounded Code Generation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Code Generation Memory-augmented LLMs Large Language Models (LLMs) LLM coder agent

Scores: [ 8 8 6 6 8 ]

Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and logically consistent code. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long code generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation of new memories or (2) use very specialized memories that cannot adapt to other domains. This paper presents L2MAC, the first practical LLM-based stored-program automatic computer for long and consistent code generation. Its memory has two components: the instruction registry, which is populated with a prompt program to solve the user-given task, and a file store, which will contain the final and intermediate outputs. Each instruction is executed by a separate LLM instance, whose context is managed by a control unit capable of precise memory reading and writing to ensure effective interaction with the file store. These components enable L2MAC to generate virtually unbounded code structures, bypassing the constraints of the finite context window while producing code that fulfills complex user-specified requirements. We empirically show that L2MAC succeeds in generating large code bases for system design tasks where other coding methods fall short in implementing user requirements and provide insight into the reasons for this performance gap.

POSTER-388: HyperAttention: Long-context Attention in Near-Linear Time

Primary Area: general machine learning (i.e., none of the above)

Keywords: self-attention locality sensitive hashing long-range context

Scores: [ 5 8 6 ]

We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, the quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small.HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. This development presents substantial implications for enabling LLMs to handle significantly larger contexts.

POSTER-389: One-shot Active Learning Based on Lewis Weight Sampling for Multiple Deep Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: active learning Lewis weight sampling machine learning deep learning

Scores: [ 5 5 6 5 ]

Active learning (AL) for multiple target models aims to reduce labeled data querying while effectively training multiple models concurrently. Existing AL algorithms often rely on iterative model training, which can be computationally expensive, particularly for deep models. In this paper, we propose a one-shot AL method to address this challenge, which performs all label queries without repeated model training. Specifically, we extract different representations of the same dataset using distinct network backbones, and actively learn linear prediction layer on each representation via an p-regression formulation. The regression problems are solved approximately by sampling and reweighting the unlabeled instances based on their maximum Lewis weights across the representations. An upper bound on the number of samples needed is provided with a rigorous analysis for p(0,+). Notably, in the case of p=2, our result substantially improves the bound of applying (Gajjar et al., 2023) to our setting. Experimental results on 8 benchmarks show that our one-shot approach achieves competitive performances with the state-of-the-art AL methods for multiple target models.

POSTER-390: Parallelizing non-linear sequential models over the sequence length

Primary Area: general machine learning (i.e., none of the above)

Keywords: parallel algorithm recurrent neural networks neural ordinary differential equations sequential models

Scores: [ 6 6 6 ]

Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherent sequential nature.For many years this bottleneck has persisted, as many thought sequential models could not be parallelized.We challenge this long-held belief with our parallel algorithm that accelerates GPU evaluation of sequential models by up to 3 orders of magnitude faster without compromising output accuracy.The algorithm does not need any special structure in the sequential models' architecture, making it applicable to a wide range of architectures.Using our method, training sequential models can be more than 10 times faster than the common sequential method without any meaningful difference in the training results.Leveraging this accelerated training, we discovered the efficacy of the Gated Recurrent Unit in a long time series classification problem with 17k time samples.By overcoming the training bottleneck, our work serves as the first step to unlock the potential of non-linear sequential models for long sequence problems.

SPOTLIGHT-69: Robustifying State-space Models for Long Sequences via Approximate Diagonalization

Primary Area: general machine learning (i.e., none of the above)

Keywords: state-space models sequence models Long-Range Arena recurrent neural networks

Scores: [ 6 6 8 6 ]

SPOTLIGHT-70: Improving Out-of-Domain Generalization with Domain Relations

Primary Area: general machine learning (i.e., none of the above)

Keywords: Out-of-Domain Generalization; Domain Relations; Distribution Shift

Scores: [ 6 6 6 8 8 6 ]

Distribution shift presents a significant challenge in machine learning, where models often underperform during the test stage when faced with a different distribution than the one they were trained on. In this paper, we focus on domain shifts, which occur when the model is applied to new domains that are different from the ones it was trained on, and propose a new approach called DG. Unlike previous approaches that aim to learn a single model that is domain invariant, DG leverages domain similarities based on domain metadata to learn domain-specific models. Concretely, DG learns a set of training-domain-specific functions during the training stage and reweights them based on domain relations during the test stage. These domain relations can be directly obtained and learned from domain metadata. Under mild assumptions, we theoretically prove that using domain relations to reweight training-domain-specific functions achieves stronger out-of-domain generalization compared to the conventional averaging approach. Empirically, we evaluate the effectiveness of DG using both toy and real-world datasets for tasks such as temperature regression, land use classification, and molecule-protein binding affinity prediction. Our results show that DG consistently outperforms state-of-the-art methods.

SPOTLIGHT-71: Finite-State Autoregressive Entropy Coding for Efficient Learned Lossless Compression

Primary Area: general machine learning (i.e., none of the above)

Keywords: Lossless Compression Autoregressive Model Acceleration Entropy Coding Autoencoder

Scores: [ 6 6 6 6 ]

Learned lossless data compression has garnered significant attention recently due to its superior compression ratios compared to traditional compressors. However, the computational efficiency of these models jeopardizes their practicality. This paper proposes a novel system for improving the compression ratio while maintaining computational efficiency for learned lossless data compression. Our approach incorporates two essential innovations. First, we propose the Finite-State AutoRegressive (FSAR) entropy coder, an efficient autoregressive Markov model based entropy coder that utilizes a lookup table to expedite autoregressive entropy coding. Next, we present a Straight-Through Hardmax Quantization (STHQ) scheme to enhance the optimization of discrete latent space. Our experiments show that the proposed lossless compression method could improve the compression ratio by up to 6% compared to the baseline, with negligible extra computational time. Our work provides valuable insights into enhancing the computational efficiency of learned lossless data compression, which can have practical applications in various fields. Code will be available publicly after the paper is accepted.

POSTER-391: PAC-FNO: Parallel-Structured All-Component Fourier Neural Operators for Recognizing Low-Quality Images

Primary Area: general machine learning (i.e., none of the above)

Keywords: Low-quality image Visual recognition Classification Low-resolution Input variations

Scores: [ 6 6 6 ]

A standard practice in developing image recognition models is to train a model on a specific image resolution and then deploy it. However, in real-world inference, models often encounter images different from the training sets in resolution and/or subject to natural variations such as weather changes, noise types and compression artifacts. While traditional solutions involve training multiple models for different resolutions or input variations, these methods are computationally expensive and thus do not scale in practice. To this end, we propose a novel neural network model, parallel-structured and all-component Fourier neural operator (PAC-FNO), that addresses the problem. Unlike conventional feed-forward neural networks, PAC-FNO operates in the frequency domain, allowing it to handle images of varying resolutions within a single model. We also propose a two-stage algorithm for training PAC-FNO with a minimal modification to the original, downstream model. Moreover, the proposed PAC-FNO is ready to work with existing image recognition models. Extensively evaluating methods with seven image recognition benchmarks, we show that the proposed PAC-FNO improves the performance of existing baseline models on images with various resolutions by up to 77.1% and various types of natural variations in the images at inference.

POSTER-392: Internal Cross-layer Gradients for Extending Homogeneity to Heterogeneity in Federated Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated Learning heterogeneity convex optimization.

Scores: [ 8 5 8 8 ]

Federated learning (FL) inevitably confronts the challenge of system heterogeneity in practical scenarios. To enhance the capabilities of most model-homogeneous FL methods in handling system heterogeneity, we propose a training scheme that can extend their capabilities to cope with this challenge. In this paper, we commence our study with a detailed exploration of homogeneous and heterogeneous FL settings and discover three key observations: (1) a positive correlation between client performance and layer similarities, (2) higher similarities in the shallow layers in contrast to the deep layers, and (3) the smoother gradients distributions indicate the higher layer similarities. Building upon these observations, we propose InCo Aggregation that leverages internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, to augment the similarity in the deep layers without requiring additional communication between clients. Furthermore, our methods can be tailored to accommodate model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, to expand their capabilities to handle the system heterogeneity. Copious experimental results validate the effectiveness of InCo Aggregation, spotlighting internal cross-layer gradients as a promising avenue to enhance the performance in heterogeneous FL.

POSTER-393: MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process

Primary Area: general machine learning (i.e., none of the above)

Keywords: denoising diffusion models multi-granularity time series forecasting

Scores: [ 6 6 6 ]

Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature. To address this challenge, we introduce a novel Multi-Granularity Time Series Diffusion (MG-TSD) model, which achieves state-of-the-art predictive performance by leveraging the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. The way to construct the targets is motivated by the observation that forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarse-grained representation, both of which result in a gradual loss of fine distribution features. In the study, we derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels.More importantly, our approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate that our MG-TSD model outperforms existing time series prediction methods.

POSTER-394: Test-Time Training on Nearest Neighbors for Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Test-time training

Scores: [ 5 5 5 6 ]

Many recent efforts aim to augment language models with relevant information retrieved from a database at test time. We avoid the need for prompt engineering by directly fine-tuning the model on data retrieved at test time using its standard training setup. For this purpose, we build a large-scale distributed nearest neighbor index based on text embeddings of the Pile dataset. Given a query to a language model, our system retrieves the neighbors of the query and fine-tunes the model on the text data corresponding to those neighbors. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than twenty language modeling tasks in the Pile benchmark. For example, test-time training significantly narrows the performance gap between a small GPT2 model and a GPTNeo model, more than ten times larger, that was specifically trained to convergence on the Pile. Sufficient index quality and size, however, are important. Our work establishes a valuable first baseline for implementing test-time training in the context of large language models, opening the door to numerous promising research avenues.

ORAL-24: InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Dynamic Data Pruning; Training acceleration

Scores: [ 6 8 6 8 ]

Data pruning aims to obtain lossless performances with less overall cost. A common approach is to filter out samples that make less contribution to the training. This could lead to gradient expectation bias compared to the original data. To solve this problem, we propose InfoBatch, a novel framework aiming to achieve lossless training acceleration by unbiased dynamic data pruning. Specifically, InfoBatchrandomly prunes a portion of less informative samples based on the loss distribution and rescales the gradients of the remaining samples to approximate the original gradient. As a plug-and-play and architecture-agnostic framework, InfoBatch consistently obtains lossless training results on classification, semantic segmentation, vision pertaining, and instruction fine-tuning tasks. On CIFAR10/100, ImageNet-1K, and ADE20K, InfoBatch losslessly saves 40% overall cost. For pertaining MAE and diffusion model, InfoBatch can respectively save 24.8% and 27% cost. For LLaMA instruction fine-tuning, InfoBatch is also able to save 20% cost and is compatible with coreset selection methods. The code will be made public.

POSTER-395: Large Language Models as Optimizers

Primary Area: general machine learning (i.e., none of the above)

Keywords: large language model optimizer prompting

Scores: [ 8 8 6 5 ]

Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks.

POSTER-396: Provably Robust Conformal Prediction with Improved Efficiency

Primary Area: general machine learning (i.e., none of the above)

Keywords: Conformal prediction randomized smoothing adversarial robustness

Scores: [ 6 8 8 6 ]

Conformal prediction is a powerful tool to generate uncertainty sets with guaranteed coverage using any predictive model, under the assumption that the training and test data are i.i.d.. Recently, it has been shown that adversarial examples are able to manipulate conformal methods to construct prediction sets with invalid coverage rates, as the i.i.d. assumption is violated. To address this issue, a recent work, Randomized Smoothed Conformal Prediction (RSCP), was first proposed to certify the robustness of conformal prediction methods to adversarial noise. However, RSCP has two major limitations: (i) its robustness guarantee is flawed when used in practice and (ii) it tends to produce large uncertainty sets. To address these limitations, we first propose a novel framework called RSCP+ to provide provable robustness guarantee in evaluation, which fixes the issues in the original RSCP method. Next, we propose two novel methods, Post-Training Transformation (PTT) and Robust Conformal Training (RCT), to effectively reduce prediction set size with little computation overhead. Experimental results in CIFAR10, CIFAR100, and ImageNet suggest the baseline method only yields trivial predictions including full label set, while our methods could boost the efficiency by up to 4.36×, 5.46×, and 16.9× respectively and provide practical robustness guarantee.

POSTER-397: What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning

Primary Area: general machine learning (i.e., none of the above)

Keywords: data selection instruction tuning large language models

Scores: [ 5 6 8 ]

Instruction tuning is a standard technique employed to align large language models to end tasks and user preferences after the initial pretraining phase. Recent research indicates the critical role of data engineering in instruction tuning -- when appropriately selected, only limited data is necessary to achieve superior performance. However, we still lack a principled understanding of what makes good instruction tuning data for alignment, and how we should select data automatically and effectively. In this work, we delve deeply into automatic data selection strategies for alignment. We start with controlled studies to measure data across three dimensions: complexity, quality, and diversity, along which we examine existing methods and introduce novel techniques for enhanced data measurement. Subsequently, we propose a simple strategy to select data samples based on the measurement. We present Deita (short for Data-Efficient Instruction Tuning for Alignment), a series of models fine-tuned from LLaMA models using data samples automatically selected with our proposed approach. When assessed through both automatic metrics and human evaluation, Deita performs better or on par with the state-of-the-art open-source alignment models such as Vicuna and WizardLM with only 6K training data samples -- 10x less than the data used in the baselines. We anticipate this work to provide clear guidelines and tools on automatic data selection, aiding researchers and practitioners in achieving data-efficient alignment.

POSTER-398: A 2-Dimensional State Space Layer for Spatial Inductive Bias

Primary Area: general machine learning (i.e., none of the above)

Keywords: SSM Dimensional State Spaces Spatial Representation

Scores: [ 6 6 6 6 ]

A central objective in computer vision is to design models with appropriate 2-D inductive bias. Desiderata for 2-D inductive bias include two-dimensional position awareness, dynamic spatial locality, and translation and permutation invariance. To address these goals, we leverage an expressive variation of the multidimensional State Space Model (SSM). Our approach introduces efficient parameterization,accelerated computation, and a suitable normalization scheme. Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT) significantly enhances performance for multiple ViT backbones and across datasets. The new layer is effective even with a negligible amount of additional parameters and inference time. Ablation studies and visualizations demonstrate that the layer has a strong 2-D inductive bias. For example, vision transformers equipped with our layer exhibit effective performance evenwithout positional encoding. Our code is attached as supplementary.

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large Language Model Tool Use Tree Search A* Search

Scores: [ 8 6 8 8 ]

Large language models (LLMs) have demonstrated powerful decision-making and planning capabilities in solving complicated real-world problems. LLM-based autonomous agents can interact with diverse tools (e.g., functional APIs) and generate solution plans that execute a series of API function calls in a step-by-step manner. The multitude of candidate API function calls significantly expands the action space, amplifying the critical need for efficient action space navigation. However, existing methods either struggle with unidirectional exploration in expansive action spaces, trapped into a locally optimal solution, or suffer from exhaustively traversing all potential actions, causing inefficient navigation. To address these issues, we propose ToolChain*, an efficient tree search-based planning algorithm for LLM-based agents. It formulates the entire action space as a decision tree, where each node represents a possible API function call involved in a solution plan. By incorporating the A$^$ search algorithm with task-specific cost function design, it efficiently prunes high-cost branches that may involve incorrect actions, identifying the most low-cost valid path as the solution. Extensive experiments on multiple tool-use and reasoning tasks demonstrate that ToolChain efficiently balances exploration and exploitation within an expansive action space. It outperforms state-of-the-art baselines on planning and reasoning tasks by 3.1% and 3.5% on average while requiring 7.35x and 2.31x less time, respectively.

POSTER-400: BatchPrompt: Accomplish more with less

Primary Area: general machine learning (i.e., none of the above)

Keywords: large language models token-resource utilization prompt

Scores: [ 6 8 5 6 ]

POSTER-401: Large Language Models as Analogical Reasoners

Primary Area: general machine learning (i.e., none of the above)

Keywords: large language model prompting analogical reasoning

Scores: [ 8 5 5 5 ]

Chain-of-thought (CoT) prompting for language models demonstrates impressive performance across reasoning tasks, but typically needs labeled exemplars of the reasoning process. In this work, we introduce a new prompting approach, Analogical Prompting, designed to automatically guide the reasoning process of large language models. Inspired by analogical reasoning, a cognitive process in which humans draw from relevant past experiences to tackle new problems, our approach prompts language models to self-generate relevant exemplars or knowledge in the context, before proceeding to solve the given problem. This method presents several advantages: it obviates the need for labeling or retrieving exemplars, offering generality and convenience; it can also tailor the generated exemplars and knowledge to each problem, offering adaptability. Experimental results show that our approach outperforms 0-shot CoT and few-shot CoT in a variety of reasoning tasks, including math problem solving in GSM8K and MATH, code generation in Codeforces, and other reasoning tasks in BIG-Bench, with an average accuracy gain of +4%.

POSTER-402: Out-of-Distribution Detection by Leveraging Between-Layer Transformation Smoothness

Primary Area: general machine learning (i.e., none of the above)

Keywords: out-of-distribution detection deep neural networks transformers representation analysis uncertainty quantification text classification

Scores: [ 6 6 5 5 ]

Effective OOD detection is crucial for reliable machine learning models, yet most current methods are limited in practical use due to requirements like access to training data or intervention in training. We present a novel method for detecting OOD data in deep neural networks based on transformation smoothness between intermediate layers of a network (BLOOD), which is applicable to pre-trained models without access to training data. BLOOD utilizes the tendency of between-layer representation transformations of in-distribution (ID) data to be smoother than the corresponding transformations of OOD data, a property that we also demonstrate empirically for Transformer networks. We evaluate BLOOD on several text classification tasks with Transformer networks and demonstrate that it outperforms methods with comparable resource requirements. Our analysis also suggests that when learning simpler tasks, OOD data transformations maintain their original sharpness, whereas sharpness increases with more complex tasks.

POSTER-403: Backdoor Federated Learning by Poisoning Backdoor-Critical Layers

Primary Area: general machine learning (i.e., none of the above)

Keywords: Federated Learning Backdoor Attack

Scores: [ 6 6 6 6 ]

Federated learning (FL) has been widely deployed to enable machine learning training on sensitive data across distributed devices. However, the decentralized learning paradigm and heterogeneity of FL further extend the attack surface for backdoor attacks. Existing FL attack and defense methodologies typically focus on the whole model. None of them recognizes the existence of backdoor-critical (BC) layers-a small subset of layers that dominate the model vulnerabilities. Attacking the BC layers achieves equivalent effects as attacking the whole model but at a far smaller chance of being detected by state-of-the-art (SOTA) defenses. This paper proposes a general in-situ approach that identifies and verifies BC layers from the perspective of attackers. Based on the identified BC layers, we carefully craft a new backdoor attack methodology that adaptively seeks a fundamental balance between attacking effects and stealthiness under various defense strategies. Extensive experiments show that our BC layer-aware backdoor attacks can successfully backdoor FL under seven SOTA defenses with only 10% malicious clients and outperform the latest backdoor attack methods.

POSTER-404: MuSc : Zero-Shot Anomaly Classification and Segmentation by Mutual Scoring of the Unlabeled Images

Primary Area: general machine learning (i.e., none of the above)

Keywords: zero-shot anomaly detection; Industrial Informatics;

Scores: [ 6 5 6 3 6 ]

This paper studies zero-shot anomaly classification (AC) and segmentation (AS) in industrial vision. We reveal that the abundant normal and abnormal cues implicit in unlabeled test images can be exploited for anomaly determination, which is ignored by prior methods. Our key observation is that for the industrial product images, the normal image patches could find a relatively large number of similar patches in other unlabeled images, while the abnormal ones only have a few similar patches. We leverage such a discriminative characteristic to design a novel zero-shot AC/AS method by Mutual Scoring (MuSc) of the unlabeled images, which does not need any training or prompts. Specifically, we perform \hzm{Local Neighborhood Aggregation with Multiple Degrees (LNAMD)} to obtain the patch features that are capable of representing anomalies in varying sizes. Then we propose the Mutual Scoring Mechanism (MSM) to leverage the unlabeled test images to assign the anomaly score to each other. Furthermore, we present an optimization approach named Re-scoring with Constrained Image-level Neighborhood (RsCIN) for image-level anomaly classification to suppress the false positives caused by noises in normal images. The superior performance on the challenging MVTec AD and VisA datasets demonstrates the effectiveness of our approach. Compared with the state-of-the-art zero-shot approaches, MuSc achieves a 29.2% PRO absolute gain (from 64.6% to 93.8%) on MVTec AD, a 19.4% pixel-AP gain and a 14.7% pixel-AUROC gain on VisA. In addition, our zero-shot approach outperforms most of the few-shot approaches and is comparable to some one-class methods. The source code will be released publicly available.

POSTER-405: Hyper Evidential Deep Learning to Quantify Composite Classification Uncertainty

Primary Area: general machine learning (i.e., none of the above)

Keywords: Evidential Neural Network hyperdomain vagueness

Scores: [ 6 6 6 6 ]

Deep neural networks (DNNs) have been shown to perform well on exclusive, multi-class classification tasks. However, when different classes have similar visual features, it becomes challenging for human annotators to differentiate them. When an image is ambiguous, such as a blurry one where an annotator can't distinguish between a husky and a wolf, it may be labeled with both classes: {husky, wolf}. This scenario necessitates the use of composite set labels. In this paper, we propose a novel framework called Hyper-Evidential Neural Network (HENN) that explicitly models predictive uncertainty caused by composite set labels in training data in the context of the belief theory called Subjective Logic (SL).By placing a Grouped Dirichlet distribution on the class probabilities, we treat predictions of a neural network as parameters of hyper-subjective opinions and learn the network that collects both single and composite evidence leading to these hyper-opinions by a deterministic DNN from data.We introduce a new uncertainty type called vagueness originally designed for hyper-opinions in SL to quantify composite classification uncertainty for DNNs.Our experiments prove that HENN outperforms its state-of-the-art counterparts based on four image datasets.The code and datasets are available at: https://shorturl.at/dhoqx.

POSTER-406: DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Influence function Data valuation

Scores: [ 6 6 6 6 ]

POSTER-407: On the Learnability of Watermarks for Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: watermarking large language models distillation

Scores: [ 6 6 6 5 ]

Language model watermarking enables reliable detection of model-generated text, which has many applications in the responsible deployment of language models. Existing watermarking strategies operate by altering the decoder of an existing language model, and the ability for a language model to directly learn to generate the watermark would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, allowing for open models to benefit from watermarking. Second, if watermarking is used to determine the provenance of generated text, an adversary can damage the reputation of a victim model by spoofing its watermark and generating harmful watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three distinct decoding-based watermarking strategies, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.

SPOTLIGHT-72: What does automatic differentiation compute for neural networks?

Primary Area: general machine learning (i.e., none of the above)

Keywords: automatic differentiation correctness neural networks clarke subdifferential

Scores: [ 6 8 8 8 6 ]

SPOTLIGHT-73: Poisoned Forgery Face: Towards Backdoor Attacks on Face Forgery Detection

Primary Area: general machine learning (i.e., none of the above)

Keywords: Deepfake Detection Backdoor Attack

Scores: [ 6 8 8 ]

The proliferation of face forgery techniques has raised significant concerns within society, thereby motivating the development of face forgery detection methods. These methods aim to distinguish forged faces from genuine ones and have proven effective in practical applications. However, this paper introduces a novel and previously unrecognized threat in face forgery detection scenarios caused by backdoor attack. By embedding backdoors into models and incorporating specific trigger patterns into the input, attackers can deceive detectors into producing erroneous predictions for forged faces. To achieve this goal, this paper proposes \emph{Poisoned Forgery Face} framework, which enables clean-label backdoor attacks on face forgery detectors. Our approach involves constructing a scalable trigger generator and utilizing a novel convolving process to generate translation-sensitive trigger patterns. Moreover, we employ a relative embedding method based on landmark-based regions to enhance the stealthiness of the poisoned samples. Consequently, detectors trained on our poisoned samples are embedded with backdoors. Notably, our approach surpasses SoTA backdoor baselines with a significant improvement in attack success rate (+16.39% BD-AUC) and reduction in visibility (-12.65% L). Furthermore, our attack exhibits promising performance against backdoor defenses. We anticipate that this paper will draw greater attention to the potential threats posed by backdoor attacks in face forgery detection scenarios. \emph{Our codes can be found in the anonymous website\footnote{\url{https://anonymous.4open.science/r/iclr24_2182}}.}

POSTER-408: Neural Auto-designer for Enhanced Quantum Kernels

Primary Area: general machine learning (i.e., none of the above)

Keywords: Quantum machine learning kernel learning quantum kernels feature map quantum circuit design

Scores: [ 3 6 6 6 ]

Quantum kernels hold great promise for offering computational advantages over classical learners, with the effectiveness of these kernels closely tied to the design of the feature map. However, the challenge of designing effective quantum feature maps for real-world datasets, particularly in the absence of sufficient prior information, remains a significant obstacle. In this study, we present a data-driven approach that automates the design of problem-specific quantum feature maps. Our approach leverages feature-selection techniques to handle high-dimensional data on near-term quantum machines with limited qubits, and incorporates a deep neural predictor to efficiently evaluate the performance of various candidate quantum kernels. Through extensive numerical simulations on different datasets, we demonstrate the superiority of our proposal over prior methods, especially for the capability of eliminating the kernel concentration issue and identifying the feature map with prediction advantages. Our work not only unlocks the potential of quantum kernels for enhancing real-world tasks, but also highlights the substantial role of deep learning in advancing quantum machine learning.

POSTER-409: Amortized Network Intervention to Steer the Excitatory Point Processes

Primary Area: general machine learning (i.e., none of the above)

Keywords: Time series application Point Process Amortized Learning

Scores: [ 5 6 6 5 ]

We tackle the challenge of large-scale network intervention for guiding excitatory point processes, such as infectious disease spread or traffic congestion control. Our model-based reinforcement learning utilizes neural ODEs to capture how the networked excitatory point processes will evolve subject to the time-varying changes in network topology. Our approach incorporates Gradient-Descent based Model Predictive Control (GD-MPC), offering policy flexibility to accommodate prior knowledge and constraints. To address the intricacies of planning and overcome the high dimensionality inherent to such decision-making problems, we design an Amortize Network Interventions (ANI) framework, allowing for the pooling of optimal policies from history and other contexts, while ensuring a permutation equivalent property. This property enables efficient knowledge transfer and sharing across diverse contexts. Our approach has broad applications, from curbing infectious disease spread to reducing carbon emissions through traffic light optimization, and thus has the potential to address critical societal and environmental challenges.

SPOTLIGHT-74: OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large Language Model Compression Differentiable Quantization

Scores: [ 6 8 6 6 6 ]

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM, they hand-craft quantization parameters, which leads to low performance and fails to deal with extremely low-bit quantization. To tackle this issue, we introduce an Omnidirectionally calibrated Quantization (OmniQuant) technique for LLMs, which achieves good performance in diverse quantization settings while maintaining the computational efficiency of PTQ by efficiently optimizing various quantization parameters. OmniQuant comprises two innovative components including Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). LWC modulates the extreme values of weights by optimizing the clipping threshold. Meanwhile, LET tackles activation outliers by shifting the challenge of quantization from activations to weights through a learnable equivalent transformation. Operating within a differentiable framework using block-wise error minimization, OmniQuant can optimize the quantization process efficiently for both weight-only and weight-activation quantization. For instance, the LLaMA-2 model family with the size of 7-70B can be processed with OmniQuant on a single A100-40G GPU within 1-16 hours using 128 samples. Extensive experiments validate OmniQuant's superior performance across diverse quantization configurations such as W4A4, W6A6, W4A16, W3A16, and W2A16. Additionally, OmniQuant demonstrates effectiveness in instruction-tuned models and delivers notable improvements in inference speed and memory reduction on real devices. Codes and models are available at https://github.com/anonymous998899/OmniQuant.

POSTER-410: Context is Environment

Primary Area: general machine learning (i.e., none of the above)

Keywords: Domain Generalization; In-Context Learning

Scores: [ 6 6 6 8 ]

Two lines of work are taking the central stage in AI research. On the one hand, the community is making increasing efforts to build models that discard spurious correlations and generalize better in novel test environments. Unfortunately, the hard lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to eclectic contextual circumstances that users enforce by means of prompting. In this paper, we argue that context is environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context$\unicode{x2013}\unicode{x2013}unlabeledexamplesastheyarrive\unicode{x2013}\unicode{x2013}$allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom-in on the test environment risk minimizer, leading to significant out-of-distribution performance improvements. From all of this, two messages are worth taking home. Researchers in domain generalization should consider environment as context, and harness the adaptive power of in-context learning. Researchers in LLMs should consider context as environment, to better structure data towards generalization.

POSTER-411: Learning Personalized Causally Invariant Representations for Heterogeneous Federated Clients

Primary Area: general machine learning (i.e., none of the above)

Keywords: Personalized Federated Learning Invariant Learning Shortcut Learning Causality Out-of-distribution Generalization

Scores: [ 3 6 6 ]

Personalized federated learning (PFL) has gained great success in tackling the scenarios where target datasets are heterogeneous across the local clients. However, the application of the existing PFL methods to real-world setting is hindered by the common assumption that the test data on each client is in-distribution (IND) with respect to its training data. Due to the bias of training dataset, the modern machine learning model prefers to rely on shortcut which can perform well on the training data but fail to generalize to the unseen test data that is out-of-distribution (OOD). This pervasive phenomenon is called shortcut learning and has attracted plentiful efforts in centralized situations. In PFL, the limited data diversity on federated clients makes mitigating shortcut and meanwhile preserving personalization knowledge rather difficult. In this paper, we analyse this challenging problem by formulating the structural causal models (SCMs) for heterogeneous federated clients. From the proposed SCMs, we derive two significant causal signatures which inspire a provable shortcut discovery and removal method under federated learning, namely FedSDR. Specifically, FedSDR is divided into two steps: 1) utilizing the available training data distributed among local clients to discover all the shortcut features in a collaborative manner. 2) developing the optimal personalized causally invariant predictor for each client by eliminating the discovered shortcut features. We provide theoretical analysis to prove that our method can draw complete shortcut features and produce the optimal personalized invariant predictor that can generalize to unseen OOD data on each client. The experimental results on diverse datasets validate the superiority of FedSDR over the state-of-the-art PFL methods on OOD generalization performance.

POSTER-412: Heterogeneous Personalized Federated Learning by Local-Global Updates Mixing via Convergence Rate

Primary Area: general machine learning (i.e., none of the above)

Keywords: Personalized Federated Learning Heterogeneous Data Feature Distribution Shift

Scores: [ 5 6 8 ]

Personalized federated learning (PFL) has emerged as a promising technique for addressing the challenge of data heterogeneity. While recent studies have made notable progress in mitigating heterogeneity associated with label distributions, the issue of effectively handling feature heterogeneity remains an open question. In this paper, we propose a personalization approach by Local-global updates Mixing (LG-Mix) via Neural Tangent Kernel (NTK)-based convergence. The core idea is to leverage the convergence rate induced by NTK to quantify the importance of local and global updates, and subsequently mix these updates based on their importance. Specifically, we find the trace of the NTK matrix can manifest the convergence rate, and propose an efficient and effective approximation to calculate the trace of a feature matrix instead of the NTK matrix. Such approximation significantly reduces the cost of computing NTK, and the feature matrix explicitly considers the heterogeneous features among samples. We have theoretically analyzed the convergence of our method in the over-parameterize regime, and experimentally evaluated our method on five datasets. These datasets present heterogeneous data features in natural and medical images. With comprehensive comparison to existing state-of-the-art approaches, our LG-Mix has consistently outperformed them across all datasets (largest accuracy improvement of 5.01%), demonstrating the outstanding efficacy of our method for model personalization.

POSTER-413: Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Primary Area: general machine learning (i.e., none of the above)

Keywords: Filter Pruning; Model Compression; Vision Transformer

Scores: [ 8 6 6 ]

Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model.

POSTER-414: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large Language Models Jailbreak Attack Adversarial Attack

Scores: [ 6 8 6 8 ]

The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.

POSTER-415: Designing Skill-Compatible AI: Methodologies and Frameworks in Chess

Primary Area: general machine learning (i.e., none of the above)

Keywords: Skill-AI compatibility Agent Systems Decision-making Chess Deep RL

Scores: [ 8 8 5 6 ]

POSTER-416: Continuous Invariance Learning

Primary Area: general machine learning (i.e., none of the above)

Keywords: Causality; Domain Generalization; Invariance Learning

Scores: [ 8 5 8 5 ]

Invariance learning methods aim to learn invariant features in the hope that they generalize under distributional shift. Although many tasks are naturally characterized by continuous domains, current invariance learning techniques generally assume categorically indexed domains. For example, auto-scaling in cloud computing often needs a CPU utilization prediction model that generalizes across different times (e.g., time of a day and date of a year), where `time' is a continuous domain index. In this paper, we start by theoretically showing that existing invariance learning methods can fail for continuous domain problems. Specifically, the naive solution of splitting continuous domains into discrete ones ignores the underlying relationship among domains, and therefore potentially leads to suboptimal performance. To address this challenge, we then propose Continuous Invariance Learning (CIL), which extracts invariant features across continuously indexed domains. CIL is a novel adversarial procedure which measures and controls the conditional independence between the labels and continuous domain indices given the extracted features. Our theoretical analysis demonstrates that CIL learns features that satisfy the invariant constraint with infinite samples. Empirical results on both synthetic and real-world datasets (including data collected from production systems) show that CIL consistently outperforms strong baselines among all the tasks.

POSTER-417: Learning to solve Class-Constrained Bin Packing Problems via Encoder-Decoder Model

Primary Area: general machine learning (i.e., none of the above)

Keywords: Combinatorial Optimization Class-Contrained Bin Packing Problems Graph Convolution Network Cluster Decode

Scores: [ 6 6 8 6 6 ]

Neural methods have shown significant merit in solving combinatorial optimization (CO) problems, including the Bin Packing Problem (BPP). However, most existing ML-based approaches focus on geometric BPP like 3DBPP, neglecting complex vector BPP. In this study, we introduce a vector BPP variant called Class-Constrained Bin Packing Problem (CCBPP), dealing with items of both classes and sizes, and the objective is to pack the items in the least amount of bins respecting the bin capacity and the number of different classes that it can hold. To enhance the efficiency and practicality of solving CCBPP, we propose a learning-based Encoder-Decoder Model. The Encoder employs a Graph Convolution Network (GCN) to generate a heat-map, representing probabilities of different items packing together. The Decoder decodes and fine-tunes the solution through Cluster Decode and Active Search methods, thereby producing high-quality solutions for CCBPP instances. Extensive experiments demonstrate that our proposed method consistently yields high-quality solutions for various kinds of CCBPP with a very small gap from the optimal. Moreover, our Encoder-Decoder Model also shows promising performance on one practical application of CCBPP, the \emph{Manufacturing Order Consolidation Problem} (OCP).

SPOTLIGHT-75: Dynamic Discounted Counterfactual Regret Minimization

Primary Area: general machine learning (i.e., none of the above)

Keywords: imperfect-information games regret minimization Nash equilibrium

Scores: [ 8 8 8 8 ]

Counterfactual regret minimization (CFR) is a family of iterative algorithms showing promising results in solving imperfect-information games. Recent novel CFR variants (e.g., CFR+, DCFR) have significantly improved the convergence rate of the vanilla CFR. The key to these CFR variants’ performance is weighting each iteration non-uniformly, i.e., discounting earlier iterations. However, these algorithms use a fixed, manually-specified scheme to weight each iteration, which enormously limits their potential. In this work, we propose Dynamic Discounted CFR (DDCFR), the first equilibrium-finding framework that discounts prior iterations using a dynamic, automatically-learned scheme. We formalize CFR’s iteration process as a carefully designed Markov decision process and transform the discounting scheme learning problem into a policy optimization problem within it. The learned discounting scheme dynamically weights each iteration on the fly using information available at runtime. Experimental results across multiple games demonstrate that DDCFR’s dynamic discounting scheme has a strong generalization ability and leads to faster convergence with improved performance.

POSTER-418: Principled Federated Domain Adaptation: Gradient Projection and Auto-Weighting

Primary Area: general machine learning (i.e., none of the above)

Keywords: federated domain adaptation federated learning domain adaptation machine learning

Scores: [ 6 6 6 5 ]

Federated Domain Adaptation (FDA) describes the federated learning (FL) setting where source clients and a server work collaboratively to improve the performance of a target client where limited data is available. The domain shift between the source and target domains, coupled with limited data of the target client, makes FDA a challenging problem, e.g., common techniques such as federated averaging and fine-tuning fail due to domain shift and data scarcity. To theoretically understand the problem, we introduce new metrics that characterize the FDA setting and a theoretical framework with novel theorems for analyzing the performance of server aggregation rules. Further, we propose a novel lightweight aggregation rule, Federated Gradient Projection (FedGP), which significantly improves the target performance with domain shift and data scarcity. Moreover, our theory suggests an auto-weighting scheme that finds the optimal combinations of the source and target gradients. This scheme improves both FedGP and a simpler heuristic aggregation rule. Extensive experiments verify the theoretical insights and illustrate the effectiveness of the proposed methods in practice.

POSTER-419: Compressed Context Memory for Online Language Model Interaction

Primary Area: general machine learning (i.e., none of the above)

Keywords: context compression efficient inference natural language processing transformer

Scores: [ 6 5 6 6 ]

This paper presents a novel context compression method for Transformer language models in online scenarios such as ChatGPT, where the context continually expands. As the context lengthens, the attention process requires more memory and computational resources, which in turn reduces the throughput of the language model. To this end, we propose a compressed context memory system that continually compresses the growing context into a compact memory space. The compression process simply involves integrating a lightweight adapter into the language model's forward pass during inference. Based on the compressed context memory, the language model can perform inference with reduced memory and attention operations. Through evaluations on multi-task learning, personalization, and conversation, we demonstrate that our approach achieves the performance level of a full context model with 5× smaller context memory space.

POSTER-420: Linear Log-Normal Attention with Unbiased Concentration

Primary Area: general machine learning (i.e., none of the above)

Keywords: Neural Networks Transformers Self-Attention Linear Attention Scalable Transformers Efficient Attention Attention with Linear Complexity Linearized Attention Self-Attention Analysis

Scores: [ 6 5 5 8 ]

POSTER-421: When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

Primary Area: general machine learning (i.e., none of the above)

Keywords: LLM finetuning Scaling Laws Full-model finetuning Parameter efficient tuning Machine Translation Multilingual Summarization

Scores: [ 8 5 8 6 ]

While large language models (LLMs) often adopt finetuning to unlock their capabilities for downstream applications, our understanding on the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance. We consider two types of finetuning – full-model tuning (FMT) and parameter efficient tuning (PET, including prompt tuning and LoRA), and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a powerbased multiplicative joint scaling law between finetuning data size and each other scaling factor; 2) LLM finetuning benefits more from LLM model scaling than pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task- and finetuning data-dependent. We hope our findings could shed light on understanding, selecting and developing LLM finetuning methods.

POSTER-422: Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference for Recommendation

Primary Area: general machine learning (i.e., none of the above)

Keywords: Recommender system Selection Bias Neighborhood effect

Scores: [ 8 5 6 8 ]

The interaction between users and recommender systems is not only affected by selection bias but also the neighborhood effect, i.e., the interaction between a user and an item is affected by the interactions between other users and other items, or between the same user and other items, or between other users and the same item. Many previous studies have focused on addressing selection bias to achieve unbiased learning of the prediction model, but the lack of consideration of neighborhood effects can lead to biased estimates and suboptimal performance of the prediction model. In this paper, we formally formulate the neighborhood effect as an interference problem from the perspective of causal inference and introduce a treatment representation to capture the neighborhood effect. On this basis, we propose a novel ideal loss that can be used to deal with selection bias in the presence of neighborhood effects. In addition, we further develop two novel estimators for the ideal loss. We theoretically establish the connection between the proposed methods and previous methods ignoring the neighborhood effect and show that the proposed methods achieve unbiased learning when both selection bias and neighborhood effects are present, while the existing methods are biased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed methods.

POSTER-423: NfgTransformer: Equivariant Representation Learning for Normal-form Games

Primary Area: general machine learning (i.e., none of the above)

Keywords: Game Theory Deep Learning Representation Learning Nash Equilibrium

Scores: [ 8 5 8 3 ]

Normal-form games (NFGs) are the fundamental model of strategic interaction. We study their representation using neural networks, enabling deep learning systems to reason about game-theoretic concepts such as equilibrium, deviation gain, equity and social welfare. We describe the inherent equivariance of NFGs --- any permutation of strategies describes an equivalent game --- as well as the challenges this poses for representation learning. We then propose the NfgTransformer architecture that leverages this equivariance, significantly outperforming strong baselines in a range of game-theoretic applications. We show that the resulting model is interpretable and versatile, paving the way towards systems capable of game-theoretic reasoning when interacting with humans and with each other.

SPOTLIGHT-76: Stable Neural Stochastic Differential Equations in Analyzing Irregular Time Series Data

Primary Area: general machine learning (i.e., none of the above)

Keywords: Neural Ordinary Differential Equations Neural Stochastic Differential Equations irregular time series data time series classification

Scores: [ 8 6 6 ]

Irregular sampling intervals and missing values in real-world time series data present challenges for conventional methods that assume consistent intervals and complete data. Neural Ordinary Differential Equations (Neural ODEs) offer an alternative approach, utilizing neural networks combined with ODE solvers to learn continuous latent representations through parameterized vector fields. Neural Stochastic Differential Equations (Neural SDEs) extend Neural ODEs by incorporating a diffusion term, although this addition is not trivial, particularly when addressing irregular intervals and missing values. Consequently, careful design of drift and diffusion functions is crucial for maintaining stability and enhancing performance, while incautious choices can result in adverse properties such as the absence of strong solutions, stochastic destabilization, or unstable Euler discretizations, significantly affecting Neural SDEs' performance. In this study, we propose three stable classes of Neural SDEs—Langevin-type SDE, Linear Noise SDE, and Geometric SDE. Then, we rigorously demonstrate their robustness in maintaining excellent performance under distribution shift, while effectively preventing overfitting. To assess the effectiveness of our approach, we conduct extensive experiments on three benchmark datasets for interpolation and classification tasks, and analyze the robustness of our methods with 30 public datasets under different missing rates. Our results demonstrate the efficacy of the proposed method in handling real-world irregular time series data.

POSTER-424: LCOT: Linear Circular Optimal Transport

Primary Area: general machine learning (i.e., none of the above)

Keywords: Optimal Transport Circular Measure Probability Metrics

Scores: [ 6 6 6 ]

The optimal transport problem for measures supported on non-Euclidean spaces has recently gained ample interest in diverse applications involving representation learning. In this paper, we focus on circular probability measures, i.e., probability measures supported on the unit circle, and introduce a new computationally efficient metric for these measures, denoted as Linear Circular Optimal Transport (LCOT). The proposed metric comes with an explicit linear embedding that allows one to apply Machine Learning (ML) algorithms to the embedded measures and seamlessly modify the underlying metric for the ML algorithm to LCOT. We show that the proposed metric is rooted in the Circular Optimal Transport (COT) and can be considered the linearization of the COT metric with respect to a fixed reference measure. We provide a theoretical analysis of the proposed metric and derive the computational complexities for pairwise comparison of circular probability measures. Lastly, through a set of numerical experiments, we demonstrate the benefits of LCOT in learning representations from circular measures.

POSTER-425: Generative Learning for Solving Non-Convex Problem with Multi-Valued Input-Solution Mapping

Primary Area: general machine learning (i.e., none of the above)

Keywords: Non-convex optimization Multi-valued solution mapping Generative model Ordinary differential equation Supervised learning

Scores: [ 8 8 5 6 ]

General non-convex problems often present multiple optimal solutions for identical inputs, signifying a complex, multi-valued input-solution mapping. Conventional learning techniques, primarily tailored to learn single-valued mappings, struggle to train neural networks (NN) to accurately decipher multi-valued ones, leading to inferior solutions. We address this fundamental issue by developing a generative learning approach using a rectified flow (RectFlow) model built upon ordinary differential equations. In contrast to learning input-solution mapping, we learn the mapping from input to solution-distribution, exploiting the universal approximation capability of the RectFlow model. Upon receiving a new input, we employ the trained RectFlow model to sample high-quality solutions from the input-dependent distribution it has learned. Our approach outperforms conceivable GAN and Diffusion models in terms of training stability and run-time complexity. We provide a detailed characterization of the optimality loss and runtime complexity associated with our generative approach. Simulation results for solving non-convex problems show that our method achieves significantly better solution optimality than recent NN schemes, with comparable feasibility and speedup performance.

POSTER-426: Fusion is Not Enough: Single Modal Attack on Fusion Models for 3D Object Detection

Primary Area: general machine learning (i.e., none of the above)

Keywords: Adversarial Attacks 3D Object Detection Autonomous Driving

Scores: [ 8 6 6 5 ]

POSTER-427: Droplets of Good Representations: Grokking as a First Order Phase Transition in Two Layer Networks

Primary Area: general machine learning (i.e., none of the above)

Keywords: Grokking deep neural networks Gaussian Process phase transitions

Scores: [ 6 6 8 6 3 ]

A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a beyond lazy-learning/Gaussian Process (GP) phenomenon involving feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN generates useful internal representations of the teacher that are sharply distinct from those before the transition.

SPOTLIGHT-77: GIO: Gradient Information Optimization for Training Dataset Selection

Primary Area: general machine learning (i.e., none of the above)

Keywords: data selection data-centric AI information theory kl divergence gradient natural language processing computer vision

Scores: [ 8 8 6 6 ]

It is often advantageous to train models on a subset of the available train examples, because the examples are of variable quality or because one would like to train with fewer examples, without sacrificing performance. We present Gradient Information Optimization (GIO), a scalable, task-agnostic approach to this data selection problem that requires only a small set of (unlabeled) examples representing a target distribution. GIO begins from a natural, information-theoretic objective that is intractable in practice. Our contribution is in showing that it can be made highly scalable through a simple relaxation of the objective and a highly efficient implementation. In experiments with machine translation, spelling correction, and image recognition, we show that GIO delivers outstanding results with very small train sets. These findings are robust to different representation models and hyperparameters for GIO itself. GIO is task- and domain-agnostic and can be applied out-of-the-box to new datasets and domains.

SPOTLIGHT-78: Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints

Primary Area: general machine learning (i.e., none of the above)

Keywords: Language models Preference optimization

Scores: [ 5 8 8 8 ]

The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model. Direct Preference Optimization (DPO) has been proposed as an alternative; and it remains equivalent to RLHF under the reverse KL regularization constraint. This paper presents f-DPO, a generalized approach to DPO by incorporating diverse divergence constraints. We show that under certain f-divergences, including Jensen-Shannon divergence, forward KL divergences and α-divergences, the complex relationship between the reward and optimal policy can also be simplified by addressing the Karush–Kuhn–Tucker conditions. This eliminates the need for estimating the normalizing constant in the Bradley-Terry model and enables a tractable mapping between the reward function and the optimal policy. Our approach optimizes LLMs to align with human preferences in a more efficient and supervised manner under a broad set of divergence constraints. Empirically, adopting these divergences ensures a balance between alignment performance and generation diversity. Importantly, our f-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE).

POSTER-428: GOAt: Explaining Graph Neural Networks via Graph Output Attribution

Primary Area: general machine learning (i.e., none of the above)

Keywords: Graph Neural Networks Explainability Interpretability Local-level explanation Instance-level explanation

Scores: [ 6 8 6 5 ]

Understanding the decision-making process of Graph Neural Networks (GNNs) is crucial to their interpretability. Most existing methods for explaining GNNs typically rely on training auxiliary models, resulting in the explanations remain black-boxed. This paper introduces Graph Output Attribution (GOAt), a novel method to attribute graph outputs to input graph features, creating GNN explanations that are faithful, discriminative, as well as stable across similar samples. By expanding the GNN as a sum of scalar products involving node features, edge features and activation patterns, we propose an efficient analytical method to compute contribution of each node or edge feature to each scalar product and aggregate the contributions from all scalar products in the expansion form to derive the importance of each node and edge. Through extensive experiments on synthetic and real-world data, we show that our method not only outperforms various state-of-the-art GNN explainers in terms of the commonly used fidelity metric, but also exhibits stronger discriminability, and stability by a remarkable margin.

POSTER-429: Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large Language Models Network Sparsity

Scores: [ 6 6 6 ]

The ever-increasing large language models (LLMs), though opening a potential path for the upcoming artificial general intelligence, sadly drops a daunting obstacle on the way towards their on-device deployment. As one of the most well-established pre-LLMs approaches in reducing model complexity, network pruning appears to lag behind in the era of LLMs, due mostly to its costly fine-tuning (or re-training) necessity under the massive volumes of model parameter and training data. To close this industry-academia gap, we introduce Dynamic Sparse No Training (DSNT), a training-free fine-tuning approach that slightly updates sparse LLMs without the expensive backpropagation and any weight updates. Inspired by the Dynamic Sparse Training, DSNT minimizes the reconstruction error between the dense and sparse LLMs, in the fashion of performing iterative weight pruning-and-growing on top of sparse LLMs. To accomplish this purpose, DSNT particularly takes into account the anticipated reduction in reconstruction error for pruning and growing, as well as the variance w.r.t. different input data for growing each weight. This practice can be executed efficiently in linear time since its obviates the need of backpropagation for fine-tuning LLMs. Extensive experiments on LLaMA-V1/V2, Vicuna, and OPT across various benchmarks demonstrate the effectiveness of DSNT in enhancing the performance of sparse LLMs, especially at high sparsity levels. For instance, DSNT is able to outperform the state-of-the-art Wanda by 26.79 perplexity at 70% sparsity with LLaMA-7B. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs. Codes will be released.

POSTER-430: Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes

Primary Area: general machine learning (i.e., none of the above)

Keywords: Deep Neural Networks Gaussian processes Neural Networks initialisation Edge of chaos Large width limit Mean-Field

Scores: [ 5 8 6 6 ]

The infinitely wide neural network has been proven a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes that enables a rigorous analysis of the way the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al., 2018 to a larger class of initial weight distributions (which we call pseudo-iid), including the established cases of iid and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialised with pseudo-iid distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge of Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training.

POSTER-431: Network Memory Footprint Compression Through Jointly Learnable Codebooks and Mappings

Primary Area: general machine learning (i.e., none of the above)

Keywords: quantization codebooks hashlists compression efficient inference deep learning

Scores: [ 6 8 5 ]

The massive interest in deep neural networks (DNNs) for both computer vision and natural language processing has been sparked by the growth in computational power. However, this led to an increase in the memory footprint, to a point where it can be challenging to simply load a model on commodity devices such as mobile phones. To address this limitation, quantization is a favored solution as it maps high precision tensors to a low precision, memory efficient format. In terms of memory footprint reduction, its most effective variants are based on codebooks. These methods, however, suffer from two limitations. First, they either define a single codebook for each tensor, or use a memory-expensive mapping to multiple codebooks. Second, gradient descent optimization of the mapping favors jumps toward extreme values, hence not defining a proximal search. In this work, we propose to address these two limitations. First, we initially group similarly distributed neurons and leverage the re-ordered structure to either apply different scale factors to the different groups, or map weights that fall in these groups to several codebooks, without any mapping overhead. Second, stemming from this initialization, we propose a joint learning of the codebook and weight mappings that bears similarities with recent gradient-based post-training quantization techniques. Third, drawing estimation from straight-through estimation techniques, we introduce a novel gradient update definition to enable a proximal search of the codebooks and their mappings. The proposed jointly learnable codebooks and mappings (JLCM) method allows a very efficient approximation of any DNN: as such, a Llama 7B can be compressed down to 2Go and loaded on 5-year-old smartphones.

POSTER-432: Neural Architecture Retrieval

Primary Area: general machine learning (i.e., none of the above)

Keywords: Information Retrieval Vector Database Neural Architecture Search

Scores: [ 8 8 6 ]

With the increasing number of new neural architecture designs and substantial existing neural architectures, it becomes difficult for the researchers to situate their contributions compared with existing neural architectures or establish the connections between their designs and other relevant ones. To discover similar neural architectures in an efficient and automatic manner, we define a new problem Neural Architecture Retrieval which retrieves a set of existing neural architectures which have similar designs to the query neural architecture. Existing graph pre-training strategies cannot address the computational graph in neural architectures due to the graph size and motifs. To fulfill this potential, we propose to divide the graph into motifs which are used to rebuild the macro graph to tackle these issues, and introduce multi-level contrastive learning to achieve accurate graph representation learning. Extensive evaluations on both human-designed and synthesized neural architectures demonstrate the superiority of our algorithm. Such a dataset which contains 12k real-world network architectures, as well as their embedding, is built for neural architecture retrieval.

POSTER-433: Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

Primary Area: general machine learning (i.e., none of the above)

Keywords: Large language models Efficient ML Query Routing

Scores: [ 6 6 3 8 ]

Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower cost (e.g., edge) devices, tend to lag behind in terms of response quality. Therefore in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.

POSTER-434: Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

Primary Area: generative models

Keywords: Diffusion model; Content Suppression; Image Editing; Text Embeddings

Scores: [ 6 6 6 6 ]

The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two contributions, which we refer to as soft-weighted regularization and inference-time text embedding optimization. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively on extensive experiments, validating its effectiveness. Furthermore, our method is generalizability to both the pixel-space diffusion models (i.e. DeepFloyd-IF) and the latent-space diffusion models (i.e. Stable Diffusion).

SPOTLIGHT-79: CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling

Primary Area: generative models

Keywords: diffusion models diversity generative models

Scores: [ 8 8 8 ]

While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively.

POSTER-435: MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning

Primary Area: generative models

Keywords: mathematical reasoning large language models code generation

Scores: [ 6 6 5 6 ]

The recently released GPT-4 Code Interpreter has demonstrated remarkable proficiency in solving challenging math problems, primarily attributed to its ability to seamlessly reason with natural language, generate code, execute code, and continue reasoning based on the execution output. In this paper, we present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations and, consequently, enhancing their mathematical reasoning abilities. We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions, referred to as MathCodeInstruct. Each solution interleaves natural language, code, and execution results. We also introduce a customized supervised fine-tuning and inference approach. This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems. Impressively, the MathCoder models achieve state-of-the-art scores among open-source LLMs on the MATH (45.2%) and GSM8K (83.9%) datasets, substantially outperforming other open-source alternatives. Notably, the MathCoder model not only surpasses ChatGPT-3.5 and PaLM-2 on GSM8K and MATH but also outperforms GPT-4 on the competition-level MATH dataset. The proposed dataset and models will be released upon acceptance.

POSTER-436: Revisit and Outstrip Entity Alignment: A Perspective of Generative Models

Primary Area: generative models

Keywords: entity alignment variational autoencoder generative models knowledge graphs

Scores: [ 8 6 6 ]

Recent embedding-based methods have achieved great successes in exploiting entity alignment from knowledge graph (KG) embeddings of multiple modalities. In this paper, we study embedding-based entity alignment (EEA) from a perspective of generative models. We show that EEA shares similarities with typical generative models and prove the effectiveness of the recently developed generative adversarial network (GAN)-based EEA methods theoretically. We then reveal that their incomplete objective limits the capacity on both entity alignment and entity synthesis (i.e., generating new entities). We mitigate this problem by introducing a generative EEA (GEEA) framework with the proposed mutual variational autoencoder (M-VAE) as the generative model. M-VAE enables entity conversion between KGs and generation of new entities from random noise vectors. We demonstrate the power of GEEA with theoretical analysis and empirical experiments on both entity alignment and entity synthesis tasks.

POSTER-437: Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

Primary Area: generative models

Keywords: Diffusion Models; Score-based Models; Generative Models; Generative AI; Deep Generative Models; Distillation Models

Scores: [ 6 8 6 6 ]

Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass -- output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance and achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at 64X64 resolution (FID 2.13). CTM also enables a new family of sampling schemes, both deterministic and stochastic, involving long jumps along the ODE solution trajectories. It consistently improves sample quality as computational budgets increase, avoiding the degradation seen in CM. Furthermore, CTM's access to the score accommodates all diffusion model inference techniques, including exact likelihood computation.

POSTER-438: Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation

Primary Area: generative models

Keywords: Human Motion Transfer Diffusion Model Human Animation Generative model.

Scores: [ 6 6 8 6 ]

We introduce a method to generate temporally coherent human animation from a single image, a video, or a random noise.This problem has been formulated as modeling of an auto-regressive generation, i.e., to regress past frames to decode future frames.However, such unidirectional generation is highly prone to motion drifting over time, generating unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the appearance ambiguity.To prove our claim, we design a novel human animation framework using a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noises whose intermediate results are cross-conditioned bidirectionally between consecutive frames. In the experiments, our method demonstrates strong performance compared to existing unidirectional approaches with realistic temporal coherence.

POSTER-439: AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models

Primary Area: generative models

Keywords: Diffusion Probabilistic Model Neural ODE Adjoint Sensitivity Method

Scores: [ 6 6 6 ]

This paper considers a ubiquitous problem underlying several applications of DPMs, i.e., optimizing the parameters of DPMs when the objective is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denoising UNet, naive gradient backpropagation requires storing the intermediate states of all iterations, resulting in extremely high memory consumption. To overcome this issue, we propose a novel method AdjointDPM, which first generates new samples from diffusion models by solving the corresponding probability-flow ODEs. It then uses the adjoint sensitivity method to backpropagate the gradients of the loss to the models' parameters (including conditioning signals, network weights, and initial noises) by solving another augmented ODE. To reduce numerical errors in both the forward generation and gradient backpropagation processes, we further reparameterize the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. AdjointDPM can effectively compute the gradients of all types of parameters in DPMs, including the network weights, conditioning text prompts, and noisy states.Finally, we demonstrate the effectiveness of AdjointDPM on several interesting tasks: guided generation via modifying sampling trajectories, finetuning DPM weights for stylization, and converting visual effects into text embeddings.

POSTER-440: Is Self-Repair a Silver Bullet for Code Generation?

Primary Area: generative models

Keywords: program synthesis code generation large language models machine learning for code

Scores: [ 8 8 6 ]

POSTER-441: SALMON: Self-Alignment with Principle-Following Reward Models

Primary Area: generative models

Keywords: AI Alignment Large Language Models Principle-Driven Self-Align

Scores: [ 6 6 6 8 ]

Supervised Fine-Tuning (SFT) on human demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful alignment paradigm for Large Language Model (LLM) AI-assistant agents. However, a significant limitation of this approach is its substantial dependency on high-quality human annotations, making its broader application to intricate tasks challenging due to difficulties in obtaining consistent response demonstrations and task-specific response preferences. To address this issue, we present a novel alignment paradigm in this paper, termed SALMON (Self-ALignMent with principle-fOllowiNg reward models). This paradigm offers the ability to align base language models with minimal human supervision, using only a select set of human-defined principles, yet achieves superior performance. Central to our approach is a principle-following reward model. Trained on synthetic preference data, this reward model can generate reward scores based on arbitrary human-defined principles. Therefore, during the RL training phase, by merely adjusting these principles, we gain full control over the preferences of the reward model, subsequently influencing the behavior of the RL-trained policy model, and eliminating the traditional reliance on exhaustive online human preference collection. Applying our method to the LLaMA-2-70b base language model, we developed an AI assistant named Dromedary-2. With only 6 exemplars for in-context learning and 31 human-defined principles, Dromedary-2 significantly surpasses the performance of several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. We have open-sourced the code and model weights to encourage further research into aligning LLM-based AI agents with enhanced supervision efficiency, improved controllability, and scalable oversight.

POSTER-442: Elucidating the Exposure Bias in Diffusion Models

Primary Area: generative models

Keywords: Generative models Diffusion models

Scores: [ 8 6 6 6 6 ]

Diffusion models have demonstrated impressive generative capabilities, but their exposure bias problem, described as the input mismatch between training and sampling, lacks in-depth exploration. In this paper, we systematically investigate the exposure bias problem in diffusion models by first analytically modelling the sampling distribution, based on which we then attribute the prediction error at each sampling step as the root cause of the exposure bias issue. Furthermore, we discuss potential solutions to this issue and propose an intuitive metric for it. Along with the elucidation of exposure bias, we propose a simple, yet effective, training-free method called Epsilon Scaling to alleviate the exposure bias. We show that Epsilon Scaling explicitly moves the sampling trajectory closer to the vector field learned in the training phase by scaling down the network output (Epsilon), mitigating the input mismatch between training and sampling. Experiments on various diffusion frameworks (ADM, DDPM/DDIM, EDM, LDM), unconditional and conditional settings, and deterministic vs. stochastic sampling verify the effectiveness of our method. Remarkably, our ADM-ES, as a SOTA stochastic sampler, obtains 2.17 FID on CIFAR-10 under 100-step unconditional generation.

POSTER-443: GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

Primary Area: generative models

Keywords: diffusion model controllable generation object detection autonomous driving

Scores: [ 6 6 6 8 ]

Diffusion models have attracted significant attention due to the remarkable ability to create content and generate data for tasks like image classification. However, the usage of diffusion models to generate the high-quality object detection data remains an underexplored area, where not only image-level perceptual quality but also geometric conditions such as bounding boxes and camera views are essential. Previous studies have utilized either copy-paste synthesis or layout-to-image (L2I) generation with specifically designed modules to encode semantic layouts. In this paper, we propose GeoDiffusion, a simple framework that can flexibly translate various geometric conditions into text prompts and empower pre-trained text-to-image (T2I) diffusion models for high-quality detection data generation. Unlike previous L2I methods, our GeoDiffusion is able to encode not only the bounding boxes but also extra geometric conditions such as camera views in self-driving scenes. Extensive experiments demonstrate GeoDiffusion outperforms previous L2I methods while maintaining 4x training time faster. To the best of our knowledge, this is the first work to adopt diffusion models for layout-to-image generation with geometric conditions and demonstrate that L2I-generated images can be beneficial for improving the performance of object detectors.

POSTER-444: Are Bert Family Good Instruction Followers? A Study on Their Potential And Limitations

Primary Area: generative models

Keywords: Instruction tuning Large language models BERT family Natural language generation

Scores: [ 6 8 6 ]

Language modeling at scale has proven very effective and brought unprecedented success to natural language models. Many typical representatives, especially decoder-only models, e.g., BLOOM and LLaMA, and encoder-decoder models, e.g., Flan-T5 and AlexaTM, have exhibited incredible instruction-following capabilities while keeping strong task completion ability. These large language models can achieve superior performance in various tasks and even yield emergent capabilities, e.g., reasoning and universal generalization. Though the above two paradigms are mainstream and well explored, the potential of the BERT family, which are encoder-only based models and have ever been one of the most representative pre-trained models, also deserves attention, at least should be discussed. In this work, we adopt XML-R to explore the effectiveness of the BERT family for instruction following and zero-shot learning. We first design a simple yet effective strategy to utilize the encoder-only models for generation tasks and then conduct multi-task instruction tuning. Experimental results demonstrate that our fine-tuned model, Instruct-XMLR, outperforms Bloomz on all evaluation tasks and achieves comparable performance with mT0 on most tasks. Surprisingly, Instruct-XMLR also possesses strong task and language generalization abilities, indicating that Instruct-XMLR can also serve as a good instruction follower and zero-shot learner. Besides, Instruct-XMLR can accelerate decoding due to its non-autoregressive generation manner, achieving around 3 times speedup compared with current autoregressive large language models. Although we also witnessed several limitations through our experiments, such as the performance decline in long-generation tasks and the shortcoming of length prediction, Instruct-XMLR can still become a good member of the family of current large language models.

POSTER-445: Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation

Primary Area: generative models

Keywords: Text-to-Image Model Customization Parameter-Efficient Fine-Tuning Model Evaluation

Scores: [ 6 6 6 8 ]

Text-to-image generative models have garnered immense attention for their ability to produce high-fidelity images from text prompts. Among these, Stable Diffusion distinguishes itself as a leading open-source model in this fast-growing field. However, the intricacies of fine-tuning these models pose multiple challenges from new methodology integration to systematic evaluation. Addressing these issues, this paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion), an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present a thorough framework for the systematic assessment of varied fine-tuning techniques. This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning, including hyperparameter adjustments and the evaluation with different prompt types across various concept categories. Through this comprehensive approach, our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.

POSTER-446: YaRN: Efficient Context Window Extension of Large Language Models

Primary Area: generative models

Keywords: transformers nlp fine-tuning context window attention rotary embedding

Scores: [ 6 8 6 6 ]

Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x less tokens and 2.5x less training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing previous the state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN has been made available and reproduced online up to 128k context length.

POSTER-447: Score-based generative models break the curse of dimensionality in learning a family of sub-Gaussian distributions

Primary Area: generative models

Keywords: score-based generative models Barron space curse of dimensionality

Scores: [ 8 6 8 5 ]

While score-based generative models (SGMs) have achieved remarkable successes in enormous image generation tasks, their mathematical foundations are still limited. In this paper, we analyze the approximation and generalization of SGMs in learning a family of sub-Gaussian probability distributions that are defined by exponential tilting of Gaussian distributions. We prove that if the tilting exponent is a Barron function, then the distribution generated by empirical score matching approximates the target distribution in the total variation distance with a dimension-independent rate. An essential ingredient of our proof is to derive a dimension-free deep neural network approximation rate for the true score function associated to the forward process, which is interesting in its own right.

ORAL-25: Batched Low-Rank Adaptation of Foundation Models

Primary Area: generative models

Keywords: LLM Adaptation Low-rank Code Generation

Scores: [ 8 8 8 8 ]

Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While \lora/ offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its incapability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request.To address this, we introduce FLORA (Fast LoRA), a framework in which each input example in a minibatch can be associated with its unique low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that \flora/ retains the performance merits of \lora/, showcasing competitive results on the MultiPL-E code generation benchmark spanning over 8 languages and a multilingual speech recognition task across 6 languages.

POSTER-448: SuRe: Improving Open-domain Question Answering of LLMs via Summarized Retrieval

Primary Area: generative models

Keywords: question answering large language model retrieval

Scores: [ 6 8 6 6 6 ]

Large language models (LLMs) have made significant advancements in various natural language processing tasks but face challenges such as hallucinations and integration of up-to-date knowledge, which is particularly critical for question answering (QA). While incorporating new information with the retrieval of relevant passages is a promising way to improve QA with LLMs, the existing methods often require additional fine-tuning which becomes infeasible with recent LLMs. Retrieval augmentation via prompting has the potential to address this limitation, but this direction has been limitedly explored. To this end, we design a simple yet effective framework to enhance open-domain QA (ODQA) with LLMs, based on the summarized retrieval (SuRe). SuRe helps LLMs predict more grounded answers, which are well-supported by the summarization of retrieved passages that could be viewed as an explicit rationale extracted from the retrieved passages. Specifically, SuRe first constructs summaries of the retrieved passages for each of the multiple answer candidates. Then, SuRe confirms the most plausible answer from the candidate set by evaluating the validity and ranking of the generated summaries. Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.4% in exact match (EM) and 3.9% in F1 score over standard prompting approaches. SuRe also can be integrated with a broad range of retrieval methods and LLMs. Finally, the generated summaries from SuRe show additional advantages to measure the importance of retrieved passages and serve as more preferred rationales by models and humans.

POSTER-449: Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Primary Area: generative models

Keywords: audio synthesis vocoder GAN

Scores: [ 6 5 8 5 ]

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in reduntant and computionally-intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced.

POSTER-450: SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Primary Area: generative models

Keywords: compression sparsification large language models

Scores: [ 6 6 6 5 5 ]

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for OPT 66B and LLAMA-2 70B models with negligible loss in accuracy. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA-2 70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models.

POSTER-451: Let's Verify Step by Step

Primary Area: generative models

Keywords: LLMs chain of thought verifiers human feedback reasoning

Scores: [ 8 3 5 6 ]

POSTER-452: RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment

Primary Area: generative models

Keywords: Language Model RLHF Alignment Instruction Tuning

Scores: [ 6 5 6 6 6 ]

We propose Reinforcement Learning from Contrastive Distillation (RLCD), a method for aligning language models to follow principles expressed in natural language (e.g., to be more harmless) without using human feedback. RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them. Using two different prompts causes model outputs to be more differentiated on average, resulting in cleaner preference labels in the absence of human annotations. We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning. Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks—harmlessness, helpfulness, and story outline generation—and when using both 7B and 30B model scales for simulating preference data

POSTER-453: AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ

Primary Area: generative models

Keywords: Vector Graphics Generation Code Generation Science Generation TikZ Text-to-Image

Scores: [ 6 6 8 6 ]

Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ the first large-scale TikZ dataset, consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available.

POSTER-454: Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping

Primary Area: generative models

Keywords: 3D generation diffusion model

Scores: [ 6 6 6 ]

SPOTLIGHT-80: Unbiased Watermark for Large Language Models

Primary Area: generative models

Keywords: watermark bias

Scores: [ 6 8 6 8 5 ]

The recent advancements in large language models (LLMs) have sparked a growing apprehension regarding the potential misuse. One approach to mitigating this risk is to incorporate watermarking techniques into LLMs, allowing for the tracking and attribution of model outputs. This study examines a crucial aspect of watermarking: how significantly watermarks impact the quality of model-generated outputs. Previous studies have suggested a trade-off between watermark strength and output quality. However, our research demonstrates that it is possible to integrate watermarks without affecting the output probability distribution with appropriate implementation. We refer to this type of watermark as an unbiased watermark. This has significant implications for the use of LLMs, as it becomes impossible for users to discern whether a service provider has incorporated watermarks or not. Furthermore, the presence of watermarks does not compromise the performance of the model in downstream tasks, ensuring that the overall utility of the language model is preserved. Our findings contribute to the ongoing discussion around responsible AI development, suggesting that unbiased watermarks can serve as an effective means of tracking and attributing model outputs without sacrificing output quality.

POSTER-455: Unpaired Image-to-Image Translation via Neural Schrödinger Bridge

Primary Area: generative models

Keywords: Style Transfer Schrödinger Bridge Problem Adversarial Training

Scores: [ 6 6 6 8 ]

Diffusion models are a powerful class of generative models which simulate stochastic differential equations (SDEs) to generate data from noise. Although diffusion models have achieved remarkable progress in recent years, they have limitations in the unpaired image-to-image translation tasks due to the Gaussian prior assumption. Schrödinger Bridge (SB), which learns an SDE to translate between two arbitrary distributions, have risen as an attractive solution to this problem. However, none of SB models so far have been successful at unpaired translation between high-resolution images. In this work, we propose the Unpaired Neural Schrödinger Bridge (UNSB), which expresses SB problem as a sequence of adversarial learning problems. This allows us to incorporate advanced discriminators and regularization to learn a SB between unpaired data. We demonstrate that UNSB is scalable and successfully solves various unpaired image-to-image translation tasks.

SPOTLIGHT-81: ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Primary Area: generative models

Keywords: text-to-image generation diffusion models high resolution generation

Scores: [ 6 6 8 6 ]

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis. More results are available at the anonymous website: https://scalecrafter.github.io/ScaleCrafter/

POSTER-456: Diffusion Posterior Sampling for Linear Inverse Problem Solving: A Filtering Perspective

Primary Area: generative models

Keywords: Diffusion Models linear Inverse problem Bayesian posterior sampling Bayesian filtering importance sampling

Scores: [ 10 6 6 3 ]

POSTER-457: Compositional Preference Models for Aligning LMs

Primary Area: generative models

Keywords: language model alignment preference model Reinforcement Learning from Human Feedback (RLHF) overoptimization interpretability scalable oversight reward hacking

Scores: [ 6 6 3 8 ]

As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset.We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. CPMs allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment.Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs.Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.

ORAL-26: Generative Modeling with Phase Stochastic Bridge

Primary Area: generative models

Keywords: Generative Modeling Stochastic Optimal Control Diffusion Model

Scores: [ 8 8 8 8 ]

Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs. DMs work by constructing a Stochastic Differential Equation (SDE) in the input space (ie, position space), and using a neural network to reverse it. In this work, we introduce a novel generative modeling framework grounded in \textbf{phase space dynamics}, where a phase space is defined as {an augmented space encompassing both position and velocity.} Leveraging insights from Stochastic Optimal Control, we construct a path measure in the phase space that enables efficient sampling. {In contrast to DMs, our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation.} This early prediction sets the stage for efficient data generation by leveraging additional velocity information along the trajectory. On standard image generation benchmarks, our model yields favorable performance over baselines in the regime of small Number of Function Evaluations (NFEs). Furthermore, our approach rivals the performance of diffusion models equipped with efficient sampling techniques, underscoring its potential as a new tool generative modeling.

POSTER-458: Matryoshka Diffusion Models

Primary Area: generative models

Keywords: diffusion models text-to-image generation nesting architecture progressive training

Scores: [ 5 6 8 6 ]

Diffusion models are the de-facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space, or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion (MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.

SPOTLIGHT-82: Likelihood Training of Cascaded Diffusion Models via Hierarchical Volume-preserving Maps

Primary Area: generative models

Keywords: likelihood-based modeling diffusion modeling density estimation

Scores: [ 8 8 8 5 ]

Cascaded models are multi-scale generative models with a marked capacity for producing perceptually impressive samples at high resolutions. In this work, we show that they can also be excellent likelihood models, so long as we overcome a fundamental difficulty with probabilistic multi-scale models: the intractability of the likelihood function. Chiefly, in cascaded models each intermediary scale introduces extraneous variables that cannot be tractably marginalized out for likelihood evaluation. This issue vanishes by modeling the diffusion process on latent spaces induced by a class of transformations we call hierarchical volume-preserving maps, which decompose spatially structured data in a hierarchical fashion without introducing local distortions in the latent space. Not only do such reparameterizations allow the model likelihood to be directly expressed as a joint likelihood over the scales, they also pave the way for significant improvements to the state-of-the-art on a selection of benchmarks, including density estimation, lossless compression, and out-of-distribution detection. Investigating the theoretical basis of our empirical gains we uncover deep connections to score matching under the Earth Mover's Distance (EMD), which is a well-known surrogate for perceptual similarity.

SPOTLIGHT-83: Large Language Models Are Not Robust Multiple Choice Selectors

Primary Area: generative models

Keywords: large language model bias robustness multiple choice question

Scores: [ 5 8 8 8 ]

Multiple choice questions (MCQs) serve as a common yet important task format in the research of large language models (LLMs). This work shows that LLMs are vulnerable to option position changes in MCQs due to their inherent “selection bias”, namely, they prefer to select specific option IDs as answers (like “Option A”). Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs’ token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. To mitigate selection bias, we propose a label-free, inference-time debiasing method, called PriDe, which separates the model’s prior bias for option IDs from the overall prediction distribution. PriDe first estimates the prior by permutating option contents on a small number of test samples, which is then applied to debias the subsequent samples. We demonstrate that PriDe achieves superior debiasing effectiveness and computational efficiency to strong baselines. Furthermore, the prior estimated by PriDe is interpretable and can generalize well across different domains, highlighting its practical potential in broader scenarios.

POSTER-459: Consistent4D: Consistent 360° Dynamic Object Generation from Monocular Video

Primary Area: generative models

Keywords: Dynamic object generation

Scores: [ 10 6 6 6 6 ]

In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an interpolation-driven consistency loss. It is optimized by minimizing the L2 distance between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. Extensive experiments show that our Consistent4D can perform competitively to prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating advantage for conventional text-to-3D generation tasks

POSTER-460: Branch-GAN: Improving Text Generation with (not so) Large Language Models

Primary Area: generative models

Keywords: GAN GANN Text Generation NLP NLU Deep Learning Machine Learning

Scores: [ 6 6 6 8 ]

The current advancements in open domain text generation have been spearheaded by Transformer-based large language models. Leveraging efficient parallelization and vast training datasets, these models achieve unparalleled text generation capabilities. Even so, current models are known to suffer from deficiencies such as repetitive texts, looping issues, and lack of robustness. While adversarial training through generative adversarial networks (GAN) is a proposed solution, earlier research in this direction has predominantly focused on older architectures, or narrow tasks. As a result, this approach is not yet compatible with modern language models for open-ended text generation, leading to diminished interest within the broader research community. We propose a computationally efficient GAN approach for sequential data that utilizes the parallelization capabilities of Transformer models. Our method revolves around generating multiple branching sequences from each training sample, while also incorporating the typical next-step prediction loss on the original data. In this way, we achieve a dense reward and loss signal for both the generator and the discriminator, resulting in a stable training dynamic. We apply our training method to pre-trained language models, using data from their original training set but less than 0.01% of the available data. A comprehensive human evaluation shows that our method significantly improves the quality of texts generated by the model while avoiding the previously reported sparsity problems of GAN approaches. Even our smaller models outperform larger original baseline models with more than 16 times the number of parameters. Finally, we corroborate previous claims that perplexity on held-out data is not a sufficient metric for measuring the quality of generated texts.

POSTER-461: MagicDrive: Street View Generation with Diverse 3D Geometry Control

Primary Area: generative models

Keywords: diffusion models street view generation 3D control

Scores: [ 6 8 5 5 ]

Recent advancements in diffusion models have significantly enhanced the data synthesis with 2D control. Yet, precise 3D control in street view generation, crucial for 3D perception tasks, remains elusive. Specifically, utilizing Bird's-Eye View (BEV) as the primary condition often leads to challenges in geometry control (e.g., height), affecting the representation of object shapes, occlusion patterns, and road surface elevations, all of which are essential to perception data synthesis, especially for 3D object detection tasks. In this paper, we introduce MagicDrive, a novel street view generation framework offering diverse 3D geometry controls, including camera poses, road maps, and 3D bounding boxes, together with textual descriptions, achieved through tailored encoding strategies. Besides, our design incorporates a cross-view attention module, ensuring consistency across multiple camera views. With MagicDrive, we achieve high-fidelity street-view synthesis that captures nuanced 3D geometry and various scene descriptions, enhancing tasks like BEV segmentation and 3D object detection. Project Website: https://magic-drive.github.io/

Primary Area: generative models

Keywords: language modeling; retrieval; legality of language modeling

Scores: [ 8 8 8 5 ]

POSTER-462: DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Primary Area: generative models

Keywords: large language model knowledge distillation speculative decoding

Scores: [ 6 6 6 ]

Speculative decoding~(SD) accelerates large language model inference by employing a faster {\em draft} model for generating multiple tokens, which are then verified in parallel by the larger {\em target} model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose {\em DistillSpec} that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improve the draft and target alignment: utilizing \emph{on-policy} data generation from the draft model, and \emph{tailoring the divergence function} to the task and decoding strategy. Notably, DistillSpec yields impressive 1045% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 610× with minimal performance drop, compared to standard decoding without distillation.

SPOTLIGHT-85: Intriguing Properties of Generative Classifiers

Primary Area: generative models

Keywords: diffusion models zero-shot text-to-image generative models human visual perception psychophysics cognitive science neuroscience

Scores: [ 8 8 8 ]

What is the best paradigm to recognize objects---discriminative inference (fast but potentially prone to shortcut learning) or using a generative model (slow but potentially more robust)? We build on recent advances in generative modeling that turn text-to-image models into classifiers. This allows us to study their behavior and to compare them against discriminative models and human psychophysical data.We report four intriguing emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level out-of-distribution accuracy, state-of-the-art alignment with human classification errors, and they understand certain perceptual illusions. Our results indicate that while the current dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data surprisingly well.

POSTER-463: A Branching Decoder for Set Generation

Primary Area: generative models

Keywords: set generation keyphrase generation

Scores: [ 6 6 8 8 ]

Generating a set of text is a common challenge for many NLP applications, for example, automatically providing multiple keyphrases for a document to facilitate user reading. Existing generative models use a sequential decoder which generates a single sequence successively, and the set generation problem is converted to sequence generation via concatenating multiple texts into a long text sequence. However, the elements of a set are unordered, which makes this scheme suffer from biased or conflicting training signals. In this paper, we propose a novel branching decoder. It can generate a dynamic number of tokens at each time-step and branch multiple generation paths. In particular, paths are generated individually so that no order dependence is required. Moreover, multiple paths can be generated in parallel which greatly reduces inference time. Experiments on three keyphrase generation datasets demonstrate that our branching decoder is more effective and efficient than the existing sequential decoder.

SPOTLIGHT-86: Image Inpainting via Iteratively Decoupled Probabilistic Modeling

Primary Area: generative models

Keywords: Inpainting Decoupled Probabilistic Modeling Pixel Spread Model

Scores: [ 6 8 6 5 ]

Generative adversarial networks (GANs) have made great success in image inpainting yet still have difficulties tackling large missing regions. In contrast, iterative probabilistic algorithms, such as autoregressive and denoising diffusion models, have to be deployed with massive computing resources for decent effect. To achieve high-quality results with low computational cost, we present a novel pixel spread model (PSM) that iteratively employs decoupled probabilistic modeling, combining the optimization efficiency of GANs with the prediction tractability of probabilistic models. As a result, our model selectively spreads informative pixels throughout the image in a few iterations, largely enhancing the completion quality and efficiency. On multiple benchmarks, we achieve new state-of-the-art performance. Our code and models will be publicly available.

POSTER-464: Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models

Primary Area: generative models

Keywords: Diffusion Model; Pose-Guided Image Synthesis

Scores: [ 5 6 8 6 ]

Recent work has showcased the significant potential of diffusion models in pose-guided person image synthesis.However, owing to the inconsistency in pose between the source and target images, synthesizing an image with a distinct pose, relying exclusively on the source image and target pose information, remains a formidable challenge.This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages.Specifically, in the first stage, we design a simple prior conditional diffusion model that predicts the global features of the target image by mining the global alignment relationship between pose coordinates and image appearance.Then, the second stage establishes a dense correspondence between the source and target images using the global features from the previous stage, and an inpainting conditional diffusion model is proposed to further align and enhance the contextual features, generating a coarse-grained person image.In the third stage, we propose a refining conditional diffusion model to utilize the coarsely generated image from the previous stage as a condition, achieving texture restoration and enhancing fine-detail consistency.The three-stage PCDMs work progressively to generate the final high-quality and high-fidelity synthesized image.Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.Our code and models will be publicly available.

POSTER-465: Demystifying Embedding Spaces using Large Language Models

Primary Area: generative models

Keywords: large language models interpretability methods representation recommender systems

Scores: [ 8 6 8 5 ]

Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing large language models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs.

POSTER-466: Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling

Primary Area: generative models

Keywords: Efficient diffusion models short-span attention local-feature enrichment global-content orchestration multi-scale generation

Scores: [ 6 8 6 ]

Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training.This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models.

POSTER-467: Transformer-Modulated Diffusion Models for Probabilistic Multivariate Time Series Forecasting

Primary Area: generative models

Keywords: time series forecasting generative modeling deep learning diffusion model

Scores: [ 8 6 5 ]

Transformers have gained widespread usage in multivariate time series (MTS) forecasting, delivering impressive performance. Nonetheless, these existing transformer-based methods often neglect an essential aspect: the incorporation of uncertainty into the predicted series, which holds significant value in decision-making. In this paper, we introduce a Transformer-Modulated Diffusion Model (TMDM), uniting conditional diffusion generative process with transformers into a unified framework to enable precise distribution forecasting for MTS. TMDM harnesses the power of transformers to extract essential insights from historical time series data. This information is then utilized as prior knowledge, capturing covariate-dependence in both the forward and reverse processes within the diffusion model. Furthermore, we seamlessly integrate well-designed transformer-based forecasting methods into TMDM to enhance its overall performance. Additionally, we introduce two novel metrics for evaluating uncertainty estimation performance. Through extensive experiments on six datasets using four evaluation metrics, we establish the effectiveness of TMDM in probabilistic MTS forecasting.

SPOTLIGHT-87: Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

Primary Area: generative models

Keywords: generative models diffusion model image synthesis

Scores: [ 6 6 8 8 ]

Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at https://anonymous.4open.science/r/RelayDiffusion-anonymous-1B3E.

POSTER-468: Seer: Language Instructed Video Prediction with Latent Diffusion Models

Primary Area: generative models

Keywords: video prediction multi-modality generation diffusion model language-vision model

Scores: [ 8 6 6 6 ]

Imagining the future trajectory is the key for robots to make sound planning and successfully reach their goals. Therefore, text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.To tackle this task and empower robots with the ability to foresee the future, we propose a sample and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We enhance the U-Net and language conditioning model by incorporating computation-efficient spatial-temporal attention. Furthermore, we introduce a novel Frame Sequential Text Decomposer module that dissects a sentence's global instruction into temporally aligned sub-instructions, ensuring precise integration into each frame of generation. Our framework allows us to effectively leverage the extensive prior knowledge embedded in pretrained T2I models across the frames. With the adaptable-designed architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a few layers on a small amount of data. The experimental results on Something Something V2 (SSv2), Bridgedata and EpicKitchens-100 datasets demonstrate our superior video prediction performance with around 480-GPU hours versus CogVideo with over 12,480-GPU hours: achieving the 31% FVD improvement compared to the current SOTA model on SSv2 and 83.7% average preference in the human evaluation.

POSTER-469: Protein-Ligand Interaction Prior for Binding-aware 3D Molecule Diffusion Models

Primary Area: generative models

Keywords: Diffusion Models Structure-Based Drug Design Molecule Generation

Scores: [ 6 5 8 6 ]

Generating 3D ligand molecules that bind to specific protein targets via diffusion models has shown great promise for structure-based drug design. The key idea is to disrupt molecules into noise through a fixed forward process and learn its reverse process to generate molecules from noise in a denoising way. However, existing diffusion models primarily focus on incorporating protein-ligand interaction information solely in the reverse process, and neglect the interactions in the forward process. The inconsistency between forward and reverse processes may impair the binding affinity of generated molecules towards target protein. In this paper, we propose a novel Interaction Prior-guided Diffusion model (IPDiff) for the protein-specific 3D molecular generation by introducing geometric protein-ligand interactions into both diffusion and sampling process. Specifically, we begin by pretraining a protein-ligand interaction prior network (IPNet) by utilizing the binding affinity signals as supervision. Subsequently, we leverage the pretrained prior network to (1) integrate interactions between the target protein and the molecular ligand into the forward process for adapting the molecule diffusion trajectories (prior-shifting), and (2) enhance the binding-aware molecule sampling process (prior-conditioning). Empirical studies on CrossDocked2020 dataset show IPDiff can generate molecules with more realistic 3D structures and state-of-the-art binding affinities towards the protein targets, with up to -6.42 Avg. Vina Score, while maintaining proper molecular properties.

POSTER-470: LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models

Primary Area: generative models

Keywords: layout generation large language model code generation

Scores: [ 5 6 6 8 ]

Graphic layout generation, a growing research field, plays a significant role in user engagement and information perception. Existing methods primarily treat layout generation as a numerical optimization task, focusing on quantitative aspects while overlooking the semantic information of layout, such as the relationship between each layout element. In this paper, we propose LayoutNUWA, the first model that treats layout generation as a code generation task to enhance semantic information and harness the hidden layout expertise of large language models~(LLMs). Concretely, we develop a Code Instruct Tuning (CIT) approach comprising three interconnected modules: 1) the Code Initialization (CI) module quantifies the numerical conditions and initializes them as HTML code with strategically placed masks; 2) the Code Completion (CC) module employs the formatting knowledge of LLMs to fill in the masked portions within the HTML code; 3) the Code Rendering (CR) module transforms the completed code into the final layout output, ensuring a highly interpretable and transparent layout generation procedure that directly maps code to a visualized layout. We attain significant state-of-the-art performance (even over 50% improvements compared to previous works) on multiple datasets, showcasing the strong capabilities of LayoutNUWA.

POSTER-471: Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models

Primary Area: generative models

Keywords: Large Language Models; Prompt Engineering; Boosting Mechanism;

Scores: [ 8 5 6 6 5 ]

The reasoning performance of Large Language Models (LLMs) on a wide range of problems critically relies on chain-of-thought prompting, which involves providing a few chain of thought demonstrations as exemplars in prompts. Recent work, e.g., Tree of Thoughts, has pointed out the importance of exploration and self-evaluation in reasoning step selection for complex problem solving. In this paper, we present Boosting of Thoughts (BoT), an automated prompting framework for problem solving with LLMs by iteratively exploring and self-evaluating many trees of thoughts in order to acquire an ensemble of trial-and-error reasoning experiences, which will serve as a new form of prompting to solve the complex problem. Starting from a simple prompt without requiring examples, BoT iteratively explores and evaluates a large collection of reasoning steps, and more importantly, uses error analysis obtained from the LLM on them to explicitly revise prompting, which in turn enhances reasoning step generation, until a final answer is attained. Our experiments with GPT-4 and Llama2 across extensive complex mathematical problems demonstrate that BoT consistently achieves higher or comparable problem-solving rates than other advanced prompting approaches.

POSTER-472: Efficient Integrators for Diffusion Generative Models

Primary Area: generative models

Keywords: Diffusion models Deep Generative Models Efficient Sampling

Scores: [ 6 8 6 5 ]

Diffusion models suffer from slow sample generation at inference time. Therefore, developing a principled framework for fast deterministic/stochastic sampling for a broader class of diffusion models is a promising direction. We propose two complementary frameworks for accelerating sample generation in pre-trained models: Conjugate Integrators and Splitting Integrators. Conjugate integrators generalize DDIM, mapping the reverse diffusion dynamics to a more amenable space for sampling. In contrast, splitting-based integrators, commonly used in molecular dynamics, reduce the numerical simulation error by cleverly alternating between numerical updates involving the data and auxiliary variables. After extensively studying these methods empirically and theoretically, we present a hybrid method that leads to the best-reported performance for diffusion models in augmented spaces. Applied to Phase Space Langevin Diffusion [Pandey & Mandt, 2023] on CIFAR-10, our deterministic and stochastic samplers achieve FID scores of 2.11 and 2.36 in only 100 network function evaluations (NFE) as compared to 2.57 and 2.63 for the best-performing baselines, respectively. Our code and model checkpoints will be made publicly available.

POSTER-473: Large-Vocabulary 3D Diffusion Model with Transformer

Primary Area: generative models

Keywords: Large-vocabulary Objects; 3D Generation; Transformer; Diffusion model

Scores: [ 6 6 6 ]

Creating diverse and high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on the generation of a single category or a few categories. In this paper, we introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects \textit{with a single generative model}. Notably, there are three major challenges for this large-vocabulary 3D generation: \textbf{a}) the need for expressive yet efficient 3D representation; \textbf{b}) large diversity in geometry and texture across categories; \textbf{c}) complexity in the appearances of real-world objects. To this end, we propose a novel triplane-based 3D-aware \textbf{Diff}usion model with \textbf{T}rans\textbf{F}ormer, \textbf{DiffTF}, for handling challenges via three aspects. \textbf{1}) Considering efficiency and robustness, we adopt a revised triplane representation and improve the fitting speed and accuracy. \textbf{2}) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention. It learns the cross-plane relations across different planes and aggregates the generalized 3D knowledge with specialized 3D features. \textbf{3}) In addition, we devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge in the encoded triplanes for handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance with large diversity, rich semantics, and high quality. More results are available at https://difftf.github.io/

POSTER-474: Universal Guidance for Diffusion Models

Primary Area: generative models

Keywords: Generative Models Computer Vision Diffusion Models

Scores: [ 6 3 6 6 ]

Typical diffusion models are trained to accept a particular form of conditioning, most commonly text, and cannot be conditioned on other modalities without retraining. In this work, we propose a universal guidance algorithm that enables diffusion models to be controlled by arbitrary guidance modalities without the need to retrain any use-specific components. We show that our algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, style guidance and classifier signals.

ORAL-27: How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models

Primary Area: generative models

Keywords: diffusion models; temporal coherency; Gaussian noise field; continuous white noise; noise transport

Scores: [ 6 8 8 8 ]

Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This either produces high-frequency flickering, or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed -noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses -noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed -noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation.

POSTER-475: Probabilistic Adaptation of Black-Box Text-to-Video Models

Primary Area: generative models

Keywords: Controllable video generation probabilistic adaptation video diffusion black box adaptation

Scores: [ 8 6 8 3 ]

Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, similar to proprietary language models, large text-to-video models are often black boxes whose weight parameters are not publicly available, posing a significant challenge to adapting these models to specific domains such as robotics, animation, and personalized stylization. Inspired by how a large language model can be prompted to perform new tasks without access to the model weights, we investigate how to adapt a black-box pretrained text-to-video model to a variety of downstream domains without weight access to the pretrained model. In answering this question, we propose \emph{\methodname}, which leverages the score function of a large pretrained video diffusion model as a probabilistic prior to guide the generation of a task-specific small video model. Our experiments show that, by incorporating broad knowledge and fidelity of the pretrained model probabilistically, a small model with as few as 1.25% parameters of the pretrained model can generate high-quality yet domain-specific videos for a variety of downstream domains such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. As large text-to-video models starting to become available as a service similar to large language models, we advocate for private institutions to expose scores of video diffusion models as outputs in addition to generated videos to allow flexible adaptation of large pretrained text-to-video models by the general public.

POSTER-476: SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning

Primary Area: generative models

Keywords: large language models LLMs

Scores: [ 8 5 6 5 ]

The recent progress in large language models (LLMs), especially the invention of chain-of-thought prompting, has made it possible to automatically answer questions by stepwise reasoning. However, when faced with more complicated problems that require non-linear thinking, even the strongest LLMs make mistakes. To address this, we explore whether LLMs are able to recognize errors in their own step-by-step reasoning, without resorting to external resources. To this end, we propose SelfCheck, a general-purpose zero-shot verification schema for recognizing such errors. We then use the results of these checks to improve question-answering performance by conducting weighted voting on multiple solutions to the question. We test SelfCheck on three datasets---GSM8K, MathQA, and MATH---and find that it successfully recognizes errors and, in turn, increases final answer accuracies.

POSTER-477: Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Primary Area: generative models

Keywords: Generative Models Diffusion Models Image Editing Image Composition

Scores: [ 6 6 8 ]

Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics.

POSTER-478: The Truth Is In There: Improving Reasoning with Layer-Selective Rank Reduction

Primary Area: generative models

Keywords: Large Language models Transformer model understanding

Scores: [ 8 3 10 ]

Transformer-based Large Language Models (LLMs) have become a fixture in modern machine learning. Correspondingly, significant resources are allocated towards research that aims to further advance this technology, typically resulting in models of increasing size that are trained on increasing amounts of data. This work, however, demonstrates the surprising result that it is often possible to im-prove the performance of LLMs by simply removing higher-order components of their constituent weight matrices in the multi-layer perception (MLP) layers. This simple intervention, which we call LAyer-SElective Rank reduction (LASER), can be done on a model after training has completed, and requires no additional parameters or data. LASER can dramatically boost predictive performance—at times by 80% over the model’s original performance—on question-answering tasks and across various modalities for which Transformers are used.

POSTER-479: Quality-Diversity through AI Feedback

Primary Area: generative models

Keywords: quality diversity large language models derivative-free optimization AI feedback

Scores: [ 6 6 6 5 ]

In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through \emph{AI feedback}, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.

ORAL-28: ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis

Primary Area: generative models

Keywords: Program Synthesis Programming By Example Generalization Compositional Generalization

Scores: [ 6 8 8 6 ]

POSTER-480: Transformer-VQ: Linear-Time Transformers via Vector Quantization

Primary Area: generative models

Keywords: Transformer Transformer Decoder Decoder-Only Transformer Natural Language Processing NLP Vector Quantization VQ K-Means Clustering Causal Attention Autoregressive Attention Efficient Attention Linear-Time Attention Autoregressive Modeling Generative Modeling Gated Attention Compressive Attention Kernelized Attention Kernelizable Attention Hierarchical Attention Segment-Level Recurrent Attention Long-Context Modeling Long-Range Modeling Long-Range Dependencies Long-Term Dependencies Cached Attention Shift-Equivariant Attention

Scores: [ 8 8 6 ]

We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In large-scale experiments, Transformer-VQ is shown highly competitive in quality, with strong results on Enwik8 (0.99 bpb), PG-19 (26.6 ppl), and ImageNet64 (3.16 bpb). Code: https://github.com/transformer-vq/transformer_vq

POSTER-481: MaGIC: Multi-modality Guided Image Completion

Primary Area: generative models

Keywords: Multi-modality Image Completion Diffusion Model

Scores: [ 8 6 6 ]

Vanilla image completion approaches exhibit sensitivity to large missing regions, attributed to the limited availability of reference information for plausible generation. To mitigate this, existing methods incorporate the extra cue as guidance for image completion. Despite improvements, these approaches are often restricted to employing a single modality (e.g., segmentation or sketch maps), which lacks scalability in leveraging multi-modality for more plausible completion.In this paper, we propose a novel, simple yet effective method for Multi-modal Guided Image Completion, dubbed MaGIC, which not only supports a wide range of single modality as the guidance (e.g., text, canny edge, sketch, segmentation, depth, and pose), but also adapts to arbitrarily customized combinations of these modalities (i.e., arbitrary multi-modality) for image completion.For building MaGIC, we first introduce a modality-specific conditional U-Net (MCU-Net) that injects single-modal signal into a U-Net denoiser for single-modal guided image completion. Then, we devise a consistent modality blending (CMB) method to leverage modality signals encoded in multiple learned MCU-Nets through gradient guidance in latent space. Our CMB is training-free, thereby avoids the cumbersome joint re-training of different modalities, which is the secret of MaGIC to achieve exceptional flexibility in accommodating new modalities for completion.Experiments show the superiority of MaGIC over state-of-the-art methods and its generalization to various completion tasks.

POSTER-482: Manifold Preserving Guided Diffusion

Primary Area: generative models

Keywords: generative model diffusion model controllable generation

Scores: [ 6 6 6 8 ]

Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad range of tasks. Specifically, we leverage the manifold hypothesis to refine the guided diffusion steps and introduce a shortcut algorithm in the process. We then propose two methods for on-manifold training-free guidance using pre-trained autoencoders and demonstrate that our shortcut inherently preserves the manifolds when applied to latent diffusion models. Our experiments show that MPGD is efficient and effective for solving a variety of conditional generation applications in low-compute settings, and can consistently offer up to 3.8× speed-ups with the same number of diffusion steps while maintaining high sample quality compared to the baselines.

POSTER-483: A Simple and Scalable Representation for Graph Generation

Primary Area: generative models

Keywords: Graph generative models graph neural networks graph representation

Scores: [ 5 6 8 6 ]

Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL.

POSTER-484: Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing

Primary Area: generative models

Keywords: Diffusion Model Text-to-Image Generation Text-to-Video Editing

Scores: [ 6 8 5 6 ]

Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes maylimit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations.

Primary Area: generative models

Keywords: language models decoding planning game theory

Scores: [ 6 8 6 10 ]

When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using them to score or rank a set of candidate answers). These procedures sometimes yield very different predictions. How do we reconcile mutually incompatible scoring procedures to obtain coherent LM predictions? We introduce a new, a training-free, game-theoretic procedure for language model decoding. Our approach casts language model decoding as a regularized imperfect-information sequential signaling game—which we term the concensus game—in which a generator seeks to communicate an abstract correctness parameter using natural language sentences to a discriminator. We develop computational procedures for finding approximate equilibria of this game, resulting in a decoding algorithm we call equilibrium-ranking. Applied to a large number of tasks (including reading comprehension, commonsense reasoning, mathematical problem-solving, and assistive dialog), equilibrium-ranking consistently improves performance over existing LM decoding procedures. These improvements are sometimes substantial—on multiple benchmarks, we observe that applying equilibrium-ranking to LLaMA-7B outperforms the much larger LLaMA-65B and PaLM-540B models.

SPOTLIGHT-89: OctoPack: Instruction Tuning Code Large Language Models

Primary Area: generative models

Keywords: large language models large code models instruction tuning

Scores: [ 8 8 6 ]

Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack's benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data will be made freely available.

POSTER-485: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Primary Area: generative models

Keywords: Prompting Mechanisms Zero-Shot Text-to-Speech Multi-Sentence Prompts

Scores: [ 6 6 6 8 ]

Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage. 2) The prosodic information in prompts is highly coupled with timbre, making it untransferable to each other. This paper introduces Mega-TTS, a generic prompting mechanism for zero-shot TTS, to tackle the aforementioned challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes the prosody and timbre information into the compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS could not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but consistently outperform the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables to transfer various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found in https://boostprompt.github.io/boostprompt/.

POSTER-486: Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding

Primary Area: generative models

Keywords: large language model efficient inference data-centric optimization parallel decoding prompt engineering

Scores: [ 6 6 6 8 5 3 ]

This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One of the major causes of the high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. In this work, motivated by the thinking and writing process of humans, we propose Skeleton-of-Thought (SoT), which first guides LLMs to generate the skeleton of the answer, and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-ups across 12 LLMs, but it can also potentially improve the answer quality on several question categories. SoT is an initial attempt at data-centric optimization for inference efficiency, and further underscores the potential of pushing LLMs to think more like a human for answer quality.

POSTER-487: Multi-Resolution Diffusion Models for Time Series Forecasting

Primary Area: generative models

Keywords: diffusion model time series multiscale

Scores: [ 6 6 8 6 ]

POSTER-488: Noise Map Guidance: Inversion with Spatial Context for Real Image Editing

Primary Area: generative models

Keywords: Generative Models Diffusion Models Image Editing

Scores: [ 6 8 6 ]

POSTER-489: Improving Code Style for Accurate Code Generation

Primary Area: generative models

Keywords: Code Generation Data Quality Synthetic Datasets

Scores: [ 5 8 8 ]

Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have exclusively concentrated on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered a lot of interest and multiple works have showcased its importance for improving performance. In this work, we investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs by 1.) renaming variables, 2.) modularizing and decomposing complex code into smaller helper sub-functions, and 3.) inserting natural-language based planning annotations. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on our transformed programs improves the performance by up to \textbf{30%} compared to fine-tuning on the original dataset. Additionally, we demonstrate improved performance from using a smaller amount of higher-quality data, finding that a model fine-tuned on the entire original dataset is outperformed by a model trained on one-eighth of our cleaned dataset. Even in comparison to closed-source models, our models outperform the much larger AlphaCode models.

POSTER-490: LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Primary Area: generative models

Keywords: diffusion LLM

Scores: [ 6 5 6 5 ]

Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs. Our iterative framework offers a promising solution for enhancing text-to-image generation models' fidelity with lengthy, multifaceted descriptions, opening new possibilities for accurate and diverse image synthesis from textual inputs.

POSTER-491: Emu: Generative Pretraining in Multimodality

Primary Area: generative models

Keywords: Generative pretraining Multimodal generalist Foundation models In-context multimodal learning

Scores: [ 6 6 6 6 ]

We present Emu, a multimodal foundation model that seamlessly generates images and text in multimodal context. This omnivore model can take in any single- modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the leverage of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, supporting in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.

POSTER-492: The Cost of Scaling Down Large Language Models: Reducing Model Size Affects Memory before In-context Learning

Primary Area: generative models

Keywords: large language model scaling in-context learning pruning

Scores: [ 6 6 6 6 ]

POSTER-493: Grounding Multimodal Large Language Models to the World

Primary Area: generative models

Keywords: large language model multimodal large language model grounding referring

Scores: [ 6 8 6 8 ]

POSTER-494: TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Primary Area: generative models

Keywords: Computer Vision Applications Video Editing and Synthesis Text-to-Image Diffusion Models Text-driven Video Editing

Scores: [ 6 8 6 8 ]

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos.

POSTER-495: Navigating the Design Space of Equivariant Diffusion-Based Generative Models for De Novo 3D Molecule Generation

Primary Area: generative models

Keywords: Generative Modelling Molecule Design Denoising Diffusion Probabilistic Models Ablation Study Equivariant Graph Neural Network 3D Molecule Generation Diffusion Model

Scores: [ 3 6 8 6 ]

Deep generative diffusion models are a promising avenue for 3D de novo molecular design in material science and drug discovery. However, their utility is still constrained by suboptimal performance with large molecular structures and limited training data. Addressing this gap, we explore the design space of E(3) equivariant diffusion models, focusing on previously blank spots. Our extensive comparative analysis evaluates the interplay between continuous and discrete state spaces. Out of this investigation, we introduce the EQGAT-diff model, which consistently surpasses the performance of established models on the QM9 and GEOM-Drugs datasets by a large margin.Distinctively, EQGAT-diff takes continuous atomic positions while chemical elements and bond types are categorical and employ a time-dependent loss weighting that significantly increases training convergence and the quality of generated samples.To further strengthen the applicability of diffusion models to limited training data, we examine the transferability of EQGAT-diff trained on the large PubChem3D dataset with implicit hydrogens to target distributions with explicit hydrogens. Fine-tuning EQGAT-diff for a couple of iterations further pushes state-of-the-art performance across datasets.We envision that our findings will find applications in structure-based drug design, where the accuracy of generative models for small datasets of complex molecules is critical.

POSTER-496: JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

Primary Area: generative models

Keywords: Generative AI Computer Vision Machine Learning

Scores: [ 5 6 8 3 ]

We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model.We demonstrate the effectiveness of JointNet by using the RGB-D diffusion as an example and through extensive experiments, showcasing its applicability in a variety of applications, including joint RGB-D generation, dense depth prediction, depth-conditioned image generation, and high-resolution 3D panorama generation.

POSTER-497: Towards image compression with perfect realism at ultra-low bitrates

Primary Area: generative models

Keywords: image compression diffusion models generative image models

Scores: [ 6 6 6 6 ]

Image codecs are typically optimized to trade-off bitrate vs. distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual distortion metrics or adversarial losses. To improve image quality, and to make it less dependent on the bitrate, we propose to leverage diffusion models as a decoder module, which rely on an iterative decoding process instead of feed-forward decoders trained using MSE or LPIPS distortions used in most neural codecs. In addition to conditioning the model on a vector-quantized image representation, we also condition on a global textual image description to provide additional context. We dub our model PerCo for ''perceptual compression'', and compare it to state-of-the-art codecs at bitrates from 0.1 down to 0.003 bits per pixel (bpp). The latter rate is an order of magnitude smaller than those considered in most prior work. At this bitrate a 512x768 Kodak image is encoded in just 148 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID, and that the visual quality is less dependent on the bitrate than previous methods. Image codecs are typically optimized to trade-off bitrate vs. distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual or adversarial losses. To improve image quality, and to make it less dependent on the bitrate, we propose to decode with iterative diffusion models, instead of feed-forward decoders trained using MSE or LPIPS distortions used in most neural codecs. In addition to conditioning the model on a vector-quantized image representation, we also condition on a global textual image description to provide additional context. We dub our model PerCo for ''perceptual compression'', and compare it to state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel. The latter rate is an order of magnitude smaller than those considered in most prior work. At this bitrate a 512x768 Kodak image is encoded in less than 153 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID, and that the visual quality is less dependent on the bitrate than previous methods.

POSTER-498: On Accelerating Diffusion-Based Sampling Processes via Improved Integration Approximation

Primary Area: generative models

Keywords: diffusion EDM DDIM DPM-Solver optimal stepsize

Scores: [ 6 6 6 6 ]

POSTER-499: Text-to-3D with Classifier Score Distillation

Primary Area: generative models

Keywords: 3D modeling generative models

Scores: [ 5 6 8 8 ]

Text-to-3D generation has made remarkable progress recently, particularly with methods based on Score Distillation Sampling (SDS) that leverages pre-trained 2D diffusion models. While the usage of classifier-free guidance is well acknowledged to be crucial for successful optimization, it is considered an auxiliary trick rather than the most essential component. In this paper, we re-evaluate the role of classifier-free guidance in score distillation and discover a surprising finding: the guidance alone is enough for effective text-to-3D generation tasks. We name this method Classifier Score Distillation (CSD), which can be interpreted as using an implicit classification model for generation. This new perspective reveals new insights for understanding existing techniques. We validate the effectiveness of CSD across a variety of text-to-3D tasks including shape generation, texture synthesis, and shape editing, achieving results superior to those of state-of-the-art methods.

POSTER-500: DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization

Primary Area: generative models

Keywords: Unsupervised Learning Object-Centric Representation Learning Dynamic Scene Decomposition Inverse Rendering

Scores: [ 3 6 6 6 ]

Unsupervised learning of object-centric representations in dynamic visual scenes is challenging. Unlike most previous approaches that learn to decompose 2D images, we present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning in a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers the probability distribution over objects at individual spatial locations. These voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning via slot attention. The voxel features and global features are complementary and are both leveraged by a compositional NeRF decoder for volume rendering. DynaVol remarkably outperforms existing approaches for unsupervised dynamic scene decomposition. Once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve: it is possible to freely edit the geometric shapes or manipulate the motion trajectories of the objects.

SPOTLIGHT-90: Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

Primary Area: generative models

Keywords: vision-language generative model cycle consistency

Scores: [ 8 6 6 8 ]

Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g. via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT ($\textbf{I}n\textbf{T}$egrating $\textbf{I}$mage $\textbf{T}$ext): an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data. ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or texts. This is achieved by enforcing cycle consistency between the original unpaired samples and the cycle-generated counterparts. For instance, it generates a caption for a given input image and then uses the caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT with unpaired datasets exhibits similar scaling behavior as using high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer (only 3M) paired image-text data.

SPOTLIGHT-91: SE(3)-Stochastic Flow Matching for Protein Backbone Generation

Primary Area: generative models

Keywords: Proteins; Equivariance; Riemannian; Flow Matching; Generative models

Scores: [ 8 8 8 8 ]

The computational design of novel protein structures has the potential to impact numerous scientific disciplines greatly. Toward this goal, we introduce \foldflow, a series of novel generative models of increasing modeling power based on the flow-matching paradigm over 3D rigid motions---i.e. the group SE(3)---enabling accurate modeling of protein backbones. We first introduce FoldFlow-Base, a simulation-free approach to learning deterministic continuous-time dynamics and matching invariant target distributions on SE(3). We next accelerate training by incorporating Riemannian optimal transport to create FoldFlow-OT, leading to the construction of both more simple and stable flows. Finally, we design FoldFlow-SFM, coupling both Riemannian OT and simulation-free training to learn stochastic continuous-time dynamics over SE(3). Our family of FoldFlow, generative models offers several key advantages over previous approaches to the generative modeling of proteins: they are more stable and faster to train than diffusion-based approaches, and our models enjoy the ability to map any invariant source distribution to any invariant target distribution over SE(3). Empirically, we validate our FoldFlow, models on protein backbone generation of up to 300 amino acids leading to high-quality designable, diverse, and novel samples.

POSTER-501: Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints

Primary Area: generative models

Keywords: Diffusion model Layout generation Constrained Optimization

Scores: [ 6 6 8 6 ]

Controllable layout generation refers to the process of creating a plausible visual arrangement of elements within a graphic design (e.g., document and web designs) with constraints representing design intentions. Although recent diffusion-based models have achieved state-of-the-art FID scores, they tend to exhibit more pronounced misalignment compared to earlier transformer-based models. In this work, we propose the LAyout Constraint diffusion modEl (LACE), a unified model to handle a broad range of layout generation tasks, such as arranging elements with specified attributes and refining or completing a coarse layout design. The model is based on continuous diffusion models. Compared with existing methods that use discrete diffusion models, continuous state-space design can enable the incorporation of continuous aesthetic constraint functions in training more naturally. For conditional generation, we propose injecting layout conditions in the form of masks or gradient guidance during inference. Empirical results show that LACE produces high-quality layouts and outperforms existing state-of-the-art baselines. We will release our source code and model checkpoints.

POSTER-502: Lifting Architectural Constraints of Injective Flows

Primary Area: generative models

Keywords: normalizing flows injective flows manifold learning maximum likelihood generative model auto encoder

Scores: [ 5 8 5 8 ]

Normalizing Flows explicitly maximize a full-dimensional likelihood on the training data. However, real data is typically only supported on a lower-dimensional manifold leading the model to expend significant compute on modeling noise. Injective Flows fix this by jointly learning a manifold and the distribution on it. So far, they have been limited by restrictive architectures and/or high computational cost. We lift both constraints by a new efficient estimator for the maximum likelihood loss, compatible with free-form bottleneck architectures. We further show that naively learning both the data manifold and the distribution on it can lead to divergent solutions, and use this insight to motivate a stable maximum likelihood training objective. We perform extensive experiments on toy, tabular and image data, demonstrating the competitive performance of the resulting model.

POSTER-503: Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders

Primary Area: generative models

Keywords: Generative Clustering Multimodal VAEs Variational Autoencoder Multimodal Learning Generative Models

Scores: [ 8 6 6 ]

Multimodal VAEs have recently gained significant attention as generative models for weakly-supervised learning with multiple heterogeneous modalities. In parallel, VAE-based methods have been explored as probabilistic approaches for clustering tasks. At the intersection of these two research directions, we propose a novel multimodal VAE model in which the latent space is extended to learn data clusters, leveraging shared information across modalities. Our experiments show that our proposed model improves generative performance over existing multimodal VAEs, particularly for unconditional generation. Furthermore, we propose a post-hoc procedure to automatically select the number of true clusters thus mitigating critical limitations of previous clustering frameworks. Notably, our method favorably compares to alternative clustering approaches, in weakly-supervised settings. Finally, we integrate recent advancements in diffusion models into the proposed method to improve generative quality for real-world images.

POSTER-504: Language-Informed Visual Concept Learning

Primary Area: generative models

Keywords: Image generation; vision language models

Scores: [ 6 5 5 8 ]

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.

POSTER-505: Analyzing and Improving OT-based Adversarial Networks

Primary Area: generative models

Keywords: Optimal Transport Generative Adversarial Networks

Scores: [ 6 6 6 6 5 ]

Optimal transport (OT) problem aims to find a transport plan that bridges two distributions while minimizing a given cost function. OT theory has been widely utilized in generative modeling. In the beginning, OT distance has been used as a measure for assessing the distance between data and generated distributions. Recently, OT transport map between data and prior distributions has been utilized as a generative model. These OT-based generative models share a similar adversarial training objective. In this paper, we begin by unifying these OT-based adversarial methods within a single framework. Then, we elucidate the role of each component in training dynamics through a comprehensive analysis of this unified framework. Moreover, we suggest a simple but novel method that improves the previously best-performing OT-based model. Intuitively, our approach conducts a gradual refinement of the generated distribution, progressively aligning it with the data distribution. Our approach achieves a FID score of 2.51 on CIFAR-10, outperforming unified OT-based adversarial approaches.

POSTER-506: Lightweight Language Model Calibration for Open-ended Question Answering with Varied Answer Lengths

Primary Area: generative models

Keywords: calibration hallucination large language model

Scores: [ 8 6 5 ]

A model is considered well-calibrated when its probability estimate aligns with the true likelihood of the output being correct. Calibrating large language models (LLMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations, a common issue of LLMs, as well as building more trustworthy models. Yet popular neural model calibration techniques are not well-suited for LLMs due to their lack of flexibility in discerning answer correctness and their high computational costs. For instance, post-processing methods, e.g., temperature scaling, are often unable to reorder the candidate generations. Moreover, training-based methods require fine-tuning the entire model, which becomes impractical due to the increasing sizes of modern LLMs. In this paper, we present Litcab, a lightweight calibration mechanism consisting of a single linear layer that takes as input the sentence representation and predicts a bias term, which is then added to the LM output logits. Litcab results with better-calibrated models, by only adding and training <2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of six open-ended question-answering (QA) tasks, covering responses ranging from short phrases to paragraphs. We test Litcab with Llama2-7B, where it improves calibration across all tasks. We further conduct a comprehensive evaluation with multiple popular open-sourced LLMs from GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2 and Vicuna models despite having much fewer parameters. (iii) Fine-tuning pretrained model (e.g., LLaMA) with samples of focused purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of fine-tuning setups.

SPOTLIGHT-92: Solving Inverse Problems with Latent Diffusion Models via Hard Data Consistency

Primary Area: generative models

Keywords: Diffusion models inverse problems

Scores: [ 8 8 6 8 ]

Diffusion models have recently emerged as powerful generative priors for solving inverse problems. However, training diffusion models in the pixel space are both data-intensive and computationally demanding, which restricts their applicability as priors for high-dimensional real-world data such as medical images. Latent diffusion models, which operate in a much lower-dimensional space, offer a solution to these challenges. However, incorporating latent diffusion models to solve inverse problems remains a challenging problem due to the nonlinearity of the encoder and decoder. To address these issues, we propose ReSample, an algorithm that can solve general inverse problems with pre-trained latent diffusion models. Our algorithm incorporates data consistency by solving an optimization problem during the reverse sampling process, a concept that we term as hard data consistency. Upon solving this optimization problem, we propose a novel resampling scheme to map the measurement-consistent sample back onto the noisy data manifold and theoretically demonstrate its benefits. Lastly, we apply our algorithm to solve a wide range of linear and nonlinear inverse problems in both natural and medical images, demonstrating that our approach outperforms existing state-of-the-art approaches, including those based on pixel-space diffusion models.

POSTER-507: FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling

Primary Area: generative models

Keywords: diffusion video diffusion video generation tuning-free

Scores: [ 6 5 6 6 ]

With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference. Furthermore, these models only support single-text conditions, whereas real-life scenarios often require multi-text conditions as the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Then building upon the observation of noise, we propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noises for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them by window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. It is noteworthy that compared with the previous best-performing method which brought about 255% extra time cost, our method incurs only negligible time cost of approximately 17%. Generated video samples are available at the anonymous website: https://free-noise.github.io.

POSTER-508: Generating Images in Context with Multimodal Large Language Models

Primary Area: generative models

Keywords: Diffusion Models Vision-Language Image Generation

Scores: [ 6 6 6 ]

Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I) generation have made significant strides. However, the generation from generalized vision-language inputs, especially involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability of zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of image as a foreign language in image generation.''

ORAL-29: Multi-Source Diffusion Models for Simultaneous Music Generation and Separation

Primary Area: generative models

Keywords: source separation probabilistic diffusion models music generation

Scores: [ 8 8 8 8 ]

In this work, we define a diffusion-based generative model capable of both music generation and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture, separating the sources), we also introduce and experiment on the partial generation task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task based on Dirac likelihood functions. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the source separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.

POSTER-509: Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization

Primary Area: generative models

Keywords: diffusion models; score estimation; neural networks; neural tangent kernels

Scores: [ 6 5 8 6 ]

Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness. A key component of these models is to learn the score function through score matching. Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. Our analysis covers both the optimization and the generalization aspects of the learning procedure. In particular, we propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to the standard supervised learning setup, the score-matching problem introduces distinct challenges, including unbounded inputs, vector-valued outputs, and an additional time variable, preventing existing techniques from being applied directly. In this paper, we show that with a properly designed neural network architecture, the score function can be accurately approximated by a reproducing kernel Hilbert space induced by neural tangent kernels. Furthermore, by applying an early-stopping rule for gradient descent and leveraging certain coupling arguments between neural network training and kernel regression, we establish the first generalization error (sample complexity) bounds for learning the score function despite the presence of noise in the observations. Our analysis is grounded in a novel parametric form of the neural network and an innovative connection between score matching and regression analysis, facilitating the application of advanced statistical and optimization techniques.

POSTER-510: Language Model Beats Diffusion - Tokenizer is key to visual generation

Primary Area: generative models

Keywords: language model diffusion model video generation visual tokenization

Scores: [ 8 8 8 ]

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce \modelname{}, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.

POSTER-511: Generative Judge for Evaluating Alignment

Primary Area: generative models

Keywords: Generative Evaluation Alignment

Scores: [ 6 5 5 ]

The rapid development of Large Language Models (LLMs) has substantially expanded the range of tasks they can address. In the field of Natural Language Processing (NLP), researchers have shifted their focus from conventional NLP tasks (e.g., sequence tagging and parsing) towards tasks that revolve around aligning with human needs (e.g., brainstorming and email writing). This shift in task distribution imposes new requirements on evaluating these aligned models regarding generality (i.e., assessing performance across diverse scenarios), flexibility (i.e., examining under different protocols), and interpretability (i.e., scrutinizing models with explanations). In this paper, we propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios and accommodates diverse evaluation protocols (e.g., pairwise response comparison and single-response evaluation) with well-structured natural language critiques. To demonstrate the efficacy of our approach, we construct a new testbed covering 58 different scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. We also provide detailed analysis and case studies to further reveal the potential of our method and make a variety of resources public at https://anonymous.4open.science/r/Auto-J-ICLR-ver-0107.

POSTER-512: OmniControl: Control Any Joint at Any Time for Human Motion Generation

Primary Area: generative models

Keywords: 3D human motion generation diffusion models conditional generation

Scores: [ 6 6 6 8 ]

We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals. At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion. Both the spatial and realism guidance are essential and they are highly complementary for balancing control accuracy and motion realism. By combining them, OmniControl generates motions that are realistic, coherent, and consistent with the spatial constraints. Experiments on HumanML3D and KIT-ML datasets show that OmniControl not only achieves significant improvement over state-of-the-art methods on pelvis control but also shows promising results when incorporating the constraints over other joints. Our code and model weights will be publicly released.

ORAL-30: Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

Primary Area: generative models

Keywords: Latent Diffusion Model Text-to-Image Neural Architectures Foundation Models

Scores: [ 8 8 8 8 ]

We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models.A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study.The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favourably in terms of image quality.We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.

POSTER-513: Neural Optimal Transport with General Cost Functionals

Primary Area: generative models

Keywords: Optimal Transport Neural Networks Generative Modelling Unpaired Learning

Scores: [ 6 6 6 6 ]

We introduce a novel neural network-based algorithm to compute optimal transport (OT) plans for general cost functionals. In contrast to common Euclidean costs, i.e., 1 or 2, such functionals provide more flexibility and allow using auxiliary information, such as class labels, to construct the required transport map. Existing methods for general costs are discrete and have limitations in practice, i.e. they do not provide an out-of-sample estimation. We address the challenge of designing a continuous OT approach for general costs that generalizes to new data points in high-dimensional spaces, such as images. Additionally, we provide the theoretical error analysis for our recovered transport plans. As an application, we construct a cost functional to map data distributions while preserving the class-wise structure.

POSTER-514: From Posterior Sampling to Meaningful Diversity in Image Restoration

Primary Area: generative models

Keywords: Image restoration Diverse image restoration Inverse problems Posterior sampling Meaningful diversity

Scores: [ 8 6 8 ]

Image restoration problems are typically ill-posed in the sense that each degraded image can be restored in infinitely many valid ways. To accommodate this, many works generate a diverse set of outputs by attempting to randomly sample from the posterior distribution of natural images given the degraded input. Here we argue that this strategy is commonly of limited practical value because of the heavy tail of the posterior distribution. Consider for example inpainting a missing region of the sky in an image. Since there is a high probability that the missing region contains no object but clouds, any set of samples from the posterior would be entirely dominated by (practically identical) completions of sky. However, arguably, presenting users with only one clear sky completion, along with several alternative solutions such as airships, birds, and balloons, would better outline the set of possibilities. In this paper, we initiate the study of meaningfully diverse image restoration. We explore several post-processing approaches that can be combined with any diverse image restoration method to yield semantically meaningful diversity. Moreover, we propose a practical approach for allowing diffusion based image restoration methods to generate meaningfully diverse outputs, while incurring only negligent computational overhead. We conduct extensive user studies to analyze the proposed techniques, and find the strategy of reducing similarity between outputs to be significantly favorable over posterior sampling.

SPOTLIGHT-93: AnyText: Multilingual Visual Text Generation and Editing

Primary Area: generative models

Keywords: diffusion model text-to-image text generation

Scores: [ 6 6 8 6 ]

Diffusion model based Text-to-Image has achieved impressive achievements recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image, as synthesized text often contains blurred, unreadable, or incorrect characters, making visual text generation one of the most challenging issues in this field. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model, that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages, to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced soon to improve and promote the development of text generation technology.

POSTER-515: SweetDreamer: Aligning Geometric Priors in 2D diffusion for Consistent Text-to-3D

Primary Area: generative models

Keywords: Text-to-3D diffusion models

Scores: [ 5 5 6 8 ]

It is inherently ambiguous to lift 2D results from pre-trained diffusion models to a 3D world for text-to-3D generation. 2D diffusion models solely learn view-agnostic priors and thus lack 3D knowledge during the lifting, leading to the multi-view inconsistency problem. We find that this problem primarily stems from geometric inconsistency, and avoiding misplaced geometric structures substantially mitigates the problem in the final outputs. Therefore, we improve the consistency by aligning the 2D geometric priors in diffusion models with well-defined 3D shapes during the lifting, addressing the vast majority of the problem. This is achieved by fine-tuning the 2D diffusion model to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects. In our process, only coarse 3D information is used for aligning. This “coarse” alignment not only resolves the multi-view inconsistency in geometries but also retains the ability in 2D diffusion models to generate detailed and diversified high-quality ob-jects unseen in the 3D datasets. Furthermore, our aligned geometric priors (AGP) are generic and can be seamlessly integrated into various state-of-the-art pipelines, obtaining high generalizability in terms of unseen shapes and visual appearance while greatly alleviating the multi-view inconsistency problem. Our method represents a new state-of-the-art performance with a 85+% consistency rate by human evaluation, while many previous methods are around 30%.

POSTER-516: LLM-grounded Video Diffusion Models

Primary Area: generative models

Keywords: text-to-video generation diffusion models large language models

Scores: [ 6 6 6 ]

Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion (e.g., even lacking the ability to be prompted for objects moving from left to right). To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.

POSTER-517: SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer

Primary Area: generative models

Keywords: Generative adversarial network optimal transport sliced Wasserstein distance

Scores: [ 6 6 ]

Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its distribution close to the target distribution. We derive metrizable conditions, sufficient conditions for the discriminator to serve as the distance between the distributions, by connecting the GAN formulation with the concept of sliced optimal transport. Furthermore, by leveraging these theoretical results, we propose a novel GAN training scheme called the Slicing Adversarial Network (SAN). With only simple modifications, a broad class of existing GANs can be converted to SANs. Experiments on synthetic and image datasets support our theoretical results and the effectiveness of SAN as compared to the usual GANs. We also apply SAN to StyleGAN-XL, which leads to a state-of-the-art FID score amongst GANs for class conditional generation on ImageNet 256$\times$256.

POSTER-518: Generative Modeling of Regular and Irregular Time Series Data via Koopman VAEs

Primary Area: generative models

Keywords: Time Series Generation Koopman Theory; Variational Autoencoder; Generative Modeling

Scores: [ 8 5 6 6 ]

Generating realistic time series data is important for numerous engineering and scientific applications. Several existing works tackle this problem using generative adversarial networks, however, GANs are often unstable during training and suffer from mode collpase. While variational autoencoders (VAEs) are more robust to the above issues, surprisingly, they are considered less for time series generation. In this work, we introduce Koopman VAE (KVAE), a new generative framework that is based on a novel design for the model prior, and that can be optimized for either regular and irregular training data. Inspired by the Koopman theory, we represent the latent conditional prior dynamics using a linear map. Our approach enhances generative modeling with two desired features: (i) incorporating domain knowledge can be achieved by leverageing spectral tools that prescribe constraints on the eigenvalues of the linear map; and (ii) studying the qualitative behavior and stablity of the system can be performed using tools from dynamical systems theory. Our results show that KVAE outperforms state-of-the-art GAN and VAE methods across several challenging synthetic and real-world time series generation benchmarks. Whether trained on regular or irregular data, KVAE generates time series that improve both discriminative and predictive metrics. Further, we present visual evidence suggesting that KVAE learns probability density functions that better approximate the empirical ground truth distribution.

POSTER-519: Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation

Primary Area: generative models

Keywords: Diffusion Models;text-vision Condition;

Scores: [ 5 6 6 8 ]

Originating from the diffusion phenomenon in physics that describes particle movement, the diffusion generative models inherit the characteristics of stochastic random walk in the data space along the denoising trajectory. However, the intrinsic mutual interference among image regions contradicts the need for practical downstream application scenarios where the preservation of low-level pixel information from given conditioning is desired (e.g., customization tasks like personalized generation and inpainting based on a user-provided single image). In this work, we investigate the diffusion (physics) in diffusion (machine learning) properties and propose our Cyclic One-Way Diffusion (COW) method to control the direction of diffusion phenomenon given a pre-trained frozen diffusion model for versatile customization application scenarios, where the low-level pixel information from the conditioning needs to be preserved. Notably, unlike most current methods that incorporate additional conditions by fine-tuning the base text-to-image diffusion model or learning auxiliary networks, our method provides a novel perspective to understand the task needs and is applicable to a wider range of customization scenarios in a learning-free manner. Extensive experiment results show that our proposed COW can achieve more flexible customization based on strict visual conditions in different application settings.

SPOTLIGHT-94: PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Primary Area: generative models

Keywords: Text-to-Image Diffusion Transformer

Scores: [ 6 8 6 8 ]

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PixArt-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PixArt-α's training speed markedly surpasses existing large-scale T2I models, e.g., PixArt-α only takes 10.8% of Stable Diffusion v1.5's training time (~675 vs. ~6,250 A100 GPU days), saving nearly \300,000($26,000vs.$320,000)andreducing90\alpha$ excels in image quality, artistry, and semantic control. We hope PixArt-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

POSTER-520: HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion

Primary Area: generative models

Keywords: Human Image Generation Latent Structural Diffusion

Scores: [ 6 8 6 10 ]

Despite significant advances in large-scale text-to-image models, achieving hyper-realistic human image generation remains a desirable yet unsolved task. Existing models like Stable Diffusion and DALL·E 2 tend to generate human images with incoherent parts or unnatural poses. To tackle these challenges, our key insight is that human image is inherently structural over multiple granularities, from the coarse-level body skeleton to fine-grained spatial geometry. Therefore, capturing such correlations between the explicit appearance and latent structure in one model is essential to generate coherent and natural human images. To this end, we propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts. Specifically, 1) we first build a large-scale human-centric dataset, named HumanVerse, which consists of 340M images with comprehensive annotations like human pose, depth, and surface normal. 2) Next, we propose a Latent Structural Diffusion Model that simultaneously denoises the depth and surface normal along with the synthesized RGB image. Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network, where each branch in the model complements to each other with both structural awareness and textural richness. 3) Finally, to further boost the visual quality, we propose a Structure-Guided Refiner to compose the predicted conditions for more detailed generation of higher resolution. Extensive experiments demonstrate that our framework yields the state-of-the-art performance, generating hyper-realistic human images under diverse scenarios.

POSTER-521: Object-Aware Inversion and Reassembly for Image Editing

Primary Area: generative models

Keywords: Image editing Diffusion model

Scores: [ 8 3 8 5 ]

Diffusion-based image editing methods have achieved remarkable advances in text-driven image editing. The editing task aims to edit an input image with the original prompt to the desired image aligned with the target prompt. By comparing the original and target prompts, we can obtain numerous editing pairs, each comprising an object and its corresponding editing target. To allow editability while maintaining fidelity to the input image, existing editing methods typically involve a fixed number of inversion steps that project the whole input image to its noisier latent representation, followed by a denoising process guided by the target prompt. However, we find that the optimal number of inversion steps for achieving ideal editing results varies significantly among different editing pairs, owing to varying editing difficulties. Therefore, the current literature, which relies on a fixed number of inversion steps, produces suboptimal generation quality, especially when handling multiple editing pairs in a natural image.To this end, we propose a new image editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR), to enable object-level fine-grained editing. Specifically, we design a new search metric, which determines the optimal inversion step for each editing pair, by jointly considering the editability of the target and the fidelity of the non-editing region. When editing an image, we first search the optimal inversion step for each editing pair with our search metric and edit them separately to circumvent concept mismatch. Subsequently, we propose an additional reassembly step to seamlessly integrate the respective editing results and the non-editing region to obtain the final edited image. To systematically evaluate the effectiveness of our method, we collect two datasets for benchmarking single- and multi-object editing, respectively. Experiments demonstrate that our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios.

POSTER-522: Noise-free Score Distillation

Primary Area: generative models

Keywords: score distillation sampling generative models text to image

Scores: [ 6 8 6 6 ]

Score Distillation Sampling (SDS) has emerged as the de facto approach for text-to-content generation in non-image domains. In this paper, we reexamine the SDS process and introduce a straightforward interpretation that demystifies the necessity for large Classifier-Free Guidance (CFG) scales, rooted in the distillation of an undesired noise term. Building upon our interpretation, we propose a novel Noise-Free Score Distillation (NFSD) process, which requires minimal modifications to the original SDS framework. Through this streamlined design, we achieve more effective distillation of pre-trained text-to-image diffusion models while using a nominal CFG scale. This strategic choice allows us to prevent the over-smoothing of results, ensuring that the generated data is both realistic and complies with the desired prompt. To demonstrate the efficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as well as several other methods.

SPOTLIGHT-95: SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Primary Area: generative models

Keywords: Image Synthesis Diffusion Generative AI

Scores: [ 8 8 8 8 ]

We present Stable Diffusion XL (SDXL), a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone, achieved by significantly increasing the number of attention blocks and including a second text encoder. Further, we design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. To ensure highest quality results, we also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL improves dramatically over previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators such as Midjourney.

POSTER-523: Human Motion Diffusion as a Generative Prior

Primary Area: generative models

Keywords: diffusion prior composition generation motion generative

Scores: [ 6 6 6 6 ]

Recent work has demonstrated the significant potential of denoising diffusion modelsfor generating human motion, including text-to-motion capabilities.However, these methods are restricted by the paucity of annotated motion data,a focus on single-person motions, and a lack of detailed control.In this paper, we introduce three forms of composition based on diffusion priors:sequential, parallel, and model composition.Using sequential composition, we tackle the challenge of long sequencegeneration. We introduce DoubleTake, an inference-time method with whichwe generate long animations consisting of sequences of prompted intervalsand their transitions, using a prior trained only for short clips.Using parallel composition, we show promising steps toward two-person generation.Beginning with two fixed priors as well as a few two-person training examples, we learn a slimcommunication block, ComMDM, to coordinate interaction between the two resulting motions.Lastly, using model composition, we first train individual priorsto complete motions that realize a prescribed motion for a given joint.We then introduce DiffusionBlending, an interpolation mechanism to effectively blend severalsuch models to enable flexible and efficient fine-grained joint and trajectory-level control and editing.We evaluate the composition methods using an off-the-shelf motion diffusion model,and further compare the results to dedicated models trained for these specific tasks.

POSTER-524: Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

Primary Area: generative models

Keywords: video generation diffusion models

Scores: [ 6 8 8 6 ]

Video diffusion models have recently made great progress in generation quality, but are still limited by the high memory and computational requirements. This is because current video diffusion models often attempt to process high-dimensional videos directly. To tackle this issue, we propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. The former represents the common content, and the latter represents the underlying motion in the video, respectively. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation here is the design of a compact latent space that can directly utilizes a pretrained image diffusion model, which has not been done in previous latent video diffusion models. This leads to considerably better quality generation and reduced computational costs. For instance, CMD can sample a video 7.7$\times$ faster than prior approaches by generating a video of 512$\times$1024 resolution and length 16 in 3.1 seconds. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4.

POSTER-525: TUVF: Learning Generalizable Texture UV Radiance Fields

Primary Area: generative models

Keywords: texture synthesis generative model NeRF

Scores: [ 8 6 6 8 ]

Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper, we study the problem of generating high-fidelity texture given shapes of 3D assets, which has been relatively less explored compared with generic 3D shape modeling. Our goal is to facilitate a controllable texture generation process, such that one texture code can correspond to a particular appearance style independent of any input shapes from a category. We introduce Texture UV Radiance Fields (TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D shape. This allows the texture to be disentangled from the underlying shape and transferable to other shapes that share the same UV space, i.e., from the same category. We integrate the UV sphere space with the radiance field, which provides a more efficient and accurate representation of textures than traditional texture maps. We perform our experiments on synthetic and real-world object datasets where we achieve not only realistic synthesis but also substantial improvements over state-of-the-arts on texture controlling and editing.

POSTER-526: Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks

Primary Area: generative models

Keywords: ai-image detection image watermark deepfake detection watermark attack generative models

Scores: [ 5 6 6 6 ]

In light of recent advancements in generative AI models, it has become essential to distinguish genuine content from AI-generated one to prevent the malicious usage of fake materials as authentic ones and vice versa. Various techniques have been introduced for identifying AI-generated images, with watermarking emerging as a promising approach. In this paper, we analyze the robustness of various AI-image detectors including watermarking and classifier-based deepfake detectors. For watermarking methods that introduce subtle image perturbations (i.e., low perturbation budget methods), we reveal a fundamental trade-off between the evasion error rate (i.e., the fraction of watermarked images detected as non-watermarked ones) and the spoofing error rate (i.e., the fraction of non-watermarked images detected as watermarked ones) upon an application of a diffusion purification attack. In this regime, we also empirically show that diffusion purification effectively removes watermarks with minimal changes to images. For high perturbation watermarking methods where notable changes are applied to images, the diffusion purification attack is not effective. In this case, we develop a model substitution adversarial attack that can successfully remove watermarks. Moreover, we show that watermarking methods are vulnerable to spoofing attacks where the attacker aims to have real images (potentially obscene) identified as watermarked ones, damaging the reputation of the developers. In particular, by just having black-box access to the watermarking method, we show that one can generate a watermarked noise image which can be added to the real images to have them falsely flagged as watermarked ones. Finally, we extend our theory to characterize a fundamental trade-off between the robustness and reliability of classifier-based deep fake detectors and demonstrate it through experiments.

POSTER-527: Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

Primary Area: generative models

Keywords: LLMs Sparse Feedback Ratings Rankings Inconsistency Evaluation

Scores: [ 6 6 6 8 ]

Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect of this design choice for the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings significantly disagree 60% for both human and AI annotators. Our subsequent analysis identifies various facets of annotator biases that explain this phenomena such as human annotators would rate denser responses higher while preferring accuracy during pairwise judgments, for a particular comparison instance. To our surprise, we observe that the choice of feedback protocol has a significant effect on the evaluation of aligned LLMs. In particular, we find that LLMs that leverage rankings data for alignment (say model X) are preferred over those that leverage ratings data (say model Y), with a rank-based evaluation protocol (is X/Y's response better than reference response?) but not with a rating-based evaluation protocol (score Rank X/Y's response on a scale of 1-7). Our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment.

SPOTLIGHT-96: ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Primary Area: generative models

Keywords: Large Language Model Tool Use API Use

Scores: [ 6 8 6 8 ]

Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.

POSTER-528: GenCorres: Consistent Shape Matching via Coupled Implicit-Explicit Shape Generative Models

Primary Area: generative models

Keywords: Joint shape matching;shape generator;implicit representation;geometric regularization

Scores: [ 6 5 8 8 ]

This paper introduces GenCorres, a novel unsupervised joint shape matching (JSM) approach. The basic idea of GenCorres is to learn a parametric mesh generator to fit an unorganized deformable shape collection while constraining deformations between adjacent synthetic shapes to preserve geometric structures such as local rigidity and local conformality. GenCorres presents three appealing advantages over existing JSM techniques. First, GenCorres performs JSM among a synthetic shape collection whose size is much bigger than the input shapes and fully leverages the data-driven power of JSM. Second, GenCorres unifies consistent shape matching and pairwise matching (i.e., by enforcing deformation priors between adjacent synthetic shapes). Third, the generator provides a concise encoding of consistent shape correspondences. However, learning a mesh generator from an unorganized shape collection is challenging. It requires a good initial fitting to each shape and can easily get trapped by local minimums. GenCorres addresses this issue by learning an implicit generator from the input shapes, which provides intermediate shapes between two arbitrary shapes. We introduce a novel approach for computing correspondences between adjacent implicit surfaces and force the correspondences to preserve geometric structures and be cycle-consistent. Synthetic shapes of the implicit generator then guide initial fittings (i.e., via template-based deformation) for learning the mesh generator. Experimental results show that GenCorres considerably outperforms state-of-the-art JSM techniques on benchmark datasets. The synthetic shapes of GenCorres preserve local geometric features and yield competitive performance gains against state-of-the-art deformable shape generators.

POSTER-529: Energy-guided Entropic Neural Optimal Transport

Primary Area: generative models

Keywords: energy-based model generative model optimal transport entropic optimal transport general optimal transport cost function

Scores: [ 6 6 6 ]

Energy-based models (EBMs) are known in the Machine Learning community for decades. Since the seminal works devoted to EBMs dating back to the noughties, there have been a lot of efficient methods which solve the generative modelling problem by means of energy potentials (unnormalized likelihood functions). In contrast, the realm of Optimal Transport (OT) and, in particular, neural OT solvers is much less explored and limited by few recent works (excluding WGAN-based approaches which utilize OT as a loss function and do not model OT maps themselves). In our work, we bridge the gap between EBMs and Entropy-regularized OT. We present a novel methodology which allows utilizing the recent developments and technical improvements of the former in order to enrich the latter. From the theoretical perspective, we prove generalization bounds for our technique. In practice, we validate its applicability in toy 2D and image domains. To showcase the scalability, we empower our method with a pre-trained StyleGAN and apply it to high-res AFHQ 512×512 unpaired I2I translation. For simplicity, we choose simple short- and long-run EBMs as a backbone of our Energy-guided Entropic OT approach, leaving the application of more sophisticated EBMs for future research.

POSTER-530: Diffusion Models for Multi-Task Generative Modeling

Primary Area: generative models

Keywords: Diffusion Models Multi-Task Learning Generative Models

Scores: [ 6 6 6 5 ]

Generative modeling via diffusion-based models has been achieving state-of-the-art results on various generation tasks. Most existing diffusion models, however, are limited to a single-generation modeling. Can we generalize diffusion models with the ability of multi-task generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model for this purpose by constructing a unified multi-task diffusion model in a common {\em diffusion space}. We define the forward diffusion process to be driven by an information aggregation from multiple types of task-data, {\it e.g.}, images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional task-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-task data with a multi-task loss, which is derived from a multi-task variational lower bound that generalizes the standard diffusion model. We propose several multi-task generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling. Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-task generative modeling, which we believe is an important research direction worthy of more future explorations.

POSTER-531: Beyond Accuracy: Evaluating Self-Consistency of Code LLMs

Primary Area: generative models

Keywords: Large Language Models Code Generation Code Summarization LLM Evaluation Self-Consistency

Scores: [ 8 3 8 6 ]

Code Large Language Models (LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the general accuracy of Code LLMs on individual tasks has been substantially evaluated and improved, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating documentation for its own code and generating code for its own natural language specifications. Failure to preserve self-consistency reveals a model's lack of understanding of the shared semantics underlying natural language and programming language and therefore undermines its trustworthiness. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which can evaluate a model's self-consistency and general accuracy at the same time. We study eleven Code LLMs and show that their self-consistency is indeed a concerning aspect, distinct from general accuracy, which should be highlighted in the evaluation and improved in the training of Code LLMs in the future.

POSTER-532: Latent 3D Graph Diffusion

Primary Area: generative models

Keywords: 3D graphs latent diffusion models in/equivariant representations

Scores: [ 6 6 6 5 6 8 ]

Generating 3D graphs of \textit{symmetry-group equivariance} is of intriguing potential in broad applications from machine vision to molecular discovery.Emerging approaches adopt diffusion generative models (DGMs) with proper re-engineering to capture 3D graph distributions.In this paper, we raise an orthogonal and fundamental question of \textit{in what (latent) space we should diffuse 3D graphs}.\ding{182} We motivate the study with theoretical analysis showing that the performance bound of 3D graph diffusion could be improved in a latent space versus the original space, provided that there are (i) low dimensionality yet (ii) high quality (i.e., low reconstruction error) of the latent space, and (iii) symmetry preservation as an inductive bias of latent DGMs.\ding{183} Guided by the theoretical guidelines, we propose to perform 3D graph diffusion in a low-dimensional latent space, which is learned through cascaded 2D--3D graph autoencoders for low-error reconstruction and symmetry-group invariance.The overall pipeline is dubbed \textbf{latent 3D graph diffusion}.\ding{184} Motivated by applications in molecular discovery, we further extend latent 3D graph diffusion to conditional generation given SE(3)-invariant attributes or equivariant 3D objects.\ding{185} We also demonstrate empirically that out-of-distribution conditional generation can be further improved by regularizing the latent space via graph self-supervised learning.We validate through comprehensive experiments that our method generates 3D molecules of higher validity / drug-likeliness and comparable conformations / energetics, while being an order of magnitude faster in training. Codes will be released upon acceptance.

POSTER-533: Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Primary Area: generative models

Keywords: mathematical reasoning large language models zero-shot learning code generation prompting

Scores: [ 3 8 6 8 ]

Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit $\underline{\text{c}}$ode-based $\underline{\text{s}}elf\underline{\text{v}}$erification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on MATH dataset (53.9% → 84.3%).

POSTER-534: Denoising Diffusion Step-aware Models

Primary Area: generative models

Keywords: diffusion models efficiency network pruning

Scores: [ 8 6 6 6 ]

Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for data generation across various domains. However, a significant bottleneck is the necessity for whole-network computation during every step of the generative process, leading to high computational overheads. This paper presents a novel framework, Denoising Diffusion Step-aware Models (DDSM), to address this challenge. Unlike conventional approaches, DDSM employs a spectrum of neural networks whose sizes are adapted according to the importance of each generative step, as determined through evolutionary search. This step-wise network variation effectively circumvents redundant computational efforts, particularly in less critical steps, thereby enhancing the efficiency of the diffusion model. Furthermore, the step-aware design can be seamlessly integrated with other efficiency-geared diffusion models such as DDIMs and latent diffusion, thus broadening the scope of computational savings. Empirical evaluations demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all without compromising the generation quality. Our code and models will be publicly available.

POSTER-535: RAPPER: Reinforced Rationale-Prompted Paradigm for Natural Language Explanation in Visual Question Answering

Primary Area: generative models

Keywords: Vision Question Answering Natural Language Explanation (VQA-NLE)

Scores: [ 6 6 3 ]

Natural Language Explanation (NLE) in vision and language tasks aims to provide human-understandable explanations for the associated decision-making process. In practice, one might encounter explanations which lack informativeness or contradict visual-grounded facts, known as \textit{implausibility} and \textit{hallucination} problems, respectively. To tackle these challenging issues, we consider the task of visual question answering (VQA) and introduce \textit{Rapper}, a two-stage \textbf{R}einforced R\textbf{a}tionale-\textbf{P}rom\textbf{p}t\textbf{e}d Pa\textbf{r}adigm. By knowledge distillation, the former stage of \textit{Rapper} infuses rationale-prompting via large language models (LLMs), encouraging the rationales supported by language-based facts. As for the latter stage, a unique Reinforcement Learning from NLE Feedback (RLNF) is introduced for injecting visual facts into NLE generation. Finally, quantitative and qualitative experiments on two VL-NLE benchmarks show that \textsc{Rapper} surpasses state-of-the-art VQA-NLE methods while providing plausible and faithful NLE.

POSTER-536: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods

Primary Area: generative models

Keywords: Generative models Natural language generation Neural machine translation LLMs MBR decoding Quality estimation Finetuning Distillation

Scores: [ 6 6 6 6 ]

Recent research in decoding methods for Natural Language Generation (NLG) tasks has shown that MAP decoding is not optimal, because model probabilities do not always align with human preferences. Stronger decoding methods, including Quality Estimation (QE) reranking and Minimum Bayes' Risk (MBR) decoding, have since been proposed to mitigate the model-perplexity-vs-quality mismatch. While these decoding methods achieve state-of-the-art performance, they are prohibitively expensive to compute. In this work, we propose MBR finetuning and QE finetuning, which distill the quality gains from these decoding methods at training time, while using an efficient decoding algorithm at inference time. Using the canonical NLG task of Neural Machine Translation (NMT), we show that even with self-training, these finetuning methods significantly outperform the base model. Moreover, when using an external LLM as a teacher model, these finetuning methods outperform finetuning on human-generated references. These findings suggest new ways to leverage monolingual data to achieve improvements in model quality that are on par with, or even exceed, improvements from human-curated data, while maintaining maximum efficiency during decoding.

POSTER-537: Future Language Modeling from Temporal Document History

Primary Area: generative models

Keywords: Future Language Modeling Future Language Model Temporal Document History

Scores: [ 8 8 6 ]

POSTER-538: Unveiling and Manipulating Prompt Influence in Large Language Models

Primary Area: generative models

Keywords: Large language models prompt interpretability prompt manipulation controlled LLM generation

Scores: [ 6 6 8 6 ]

Prompts play a crucial role in guiding the responses of Large Language Models (LLMs). However, the intricate role of individual tokens in prompts, known as input saliency, in shaping the responses remains largely underexplored. Existing saliency methods either misalign with LLM generation objectives or rely heavily on linearity assumptions, leading to potential inaccuracies. To address this, we propose Token Distribution Dynamics (TDD), an elegantly simple yet remarkably effective approach to unveil and manipulate the role of prompts in generating LLM outputs. TDD leverages the robust interpreting capabilities of the language model head (LM head) to assess input saliency. It projects input tokens into the embedding space and then estimates their significance based on distribution dynamics over the vocabulary. We introduce three TDD variants: forward, backward, and bidirectional, each offering unique insights into token relevance. Extensive experiments reveal that the TDD surpasses state-of-the-art baselines with a big margin in elucidating the causal relationships between prompts and LLM outputs. Beyond mere interpretation, we apply TDD to two prompt manipulation tasks for controlled text generation: zero-shot toxic language suppression and sentiment steering. Empirical results underscore TDD's proficiency in identifying both toxic and sentimental cues in prompts, subsequently mitigating toxicity or modulating sentiment in the generated content.

POSTER-539: Circumventing Concept Erasure Methods For Text-To-Image Generative Models

Primary Area: generative models

Keywords: Model Editing Diffusion Model Concept Erasure

Scores: [ 8 6 8 5 6 ]

Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine seven recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.

POSTER-540: Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models

Primary Area: generative models

Keywords: Diffusion Models Expressive Bottleneck Soft Mixture Denoising (SMD)

Scores: [ 8 8 5 6 ]

Because diffusion models have shown impressive performances in a number of tasks, such as image synthesis, there is a trend in recent works to prove (with certain assumptions) that these models have strong approximation capabilities. In this paper, we show that current diffusion models actually have an expressive bottleneck in backward denoising and some assumption made by existing theoretical guarantees is too strong. Based on this finding, we prove that diffusion models have unbounded errors in both local and global denoising. In light of our theoretical studies, we introduce soft mixture denoising (SMD), an expressive and efficient model for backward denoising. SMD not only permits diffusion models to well approximate any Gaussian mixture distributions in theory, but also is simple and efficient for implementation. Our experiments on multiple image datasets show that SMD significantly improves different types of diffusion models (e.g., DDPM), espeically in the situation of few backward iterations.

POSTER-541: Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

Primary Area: generative models

Keywords: Large language model safety alignment mistake analysis

Scores: [ 8 5 6 5 ]

The rapid advancement of large language models (LLMs) presents both opportunities and challenges, particularly concerning unintentional generation of harmful and toxic responses. While the traditional alignment methods strive to steer LLMs towards desired performance and shield them from malicious content, this study proposes a novel alignment strategy rooted in mistake analysis by exposing LLMs to flawed outputs purposefully and then conducting a thorough assessment to fully comprehend the internal reasons via natural language. Thus, harmful responses can be transformed into instruction tuning corpus for model alignment, and LLMs can not only be deterred from generating toxic responses but also trained to self-criticize, leveraging its innate ability to discriminate toxic content. Experimental results demonstrate that the proposed method outperforms conventional alignment techniques for safety instruction following, while maintaining superior efficiency.

ORAL-31: Diffusion Model for Dense Matching

Primary Area: generative models

Keywords: Diffusion Models Visual Correspondence

Scores: [ 8 8 8 8 ]

The objective for establishing dense correspondence between paired images consists of two terms: a data term and a prior term. While conventional techniques focused on defining hand-designed prior terms, which are difficult to formulate, recent approaches have focused on learning the data term with deep neural networks without explicitly modeling the prior, assuming that the model itself has the capacity to learn an optimal prior from a large-scale dataset. The performance improvement was obvious, however, they often fail to address inherent ambiguities of matching, such as textureless regions, repetitive patterns, large displacements, or noises. To address this, we propose DiffMatch, a novel conditional diffusion-based framework designed to explicitly model both the data and prior terms for dense matching. This is accomplished by leveraging a conditional denoising diffusion model that explicitly takes matching cost and injects the prior within generative process. However, limited resolution of the diffusion model is a major hindrance. We address this with a cascaded pipeline, starting with a low-resolution model, followed by a super-resolution model that successively upsamples and incorporates finer details to the matching field. Our experimental results demonstrate significant performance improvements of our method over existing approaches, and the ablation studies validate our design choices along with the effectiveness of each component. The code and pretrained weights will be available.

POSTER-542: LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Primary Area: generative models

Keywords: lip-to-speech lip-reading diffusion models

Scores: [ 8 5 6 5 ]

Lip-to-speech involves generating a natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high levels of intelligibility for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism where a pre-trained automatic speech recognition (ASR ) serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech, readily perceptible while listening, and is empirically reflected in the substantial reduction of the word error rate ( WER ) metric. We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech signals compared to competing methods. Finally, we created a demo showcasing LipVoicer’s superiority in producing natural, synchronized, and intelligible speech, providing additional evidence of its effectiveness. Project page: https://lipvoicer.github.io

POSTER-543: Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps

Primary Area: generative models

Keywords: diffusion models sampling methods exposure bias

Scores: [ 6 6 6 6 ]

Diffusion Probabilistic Models (DPM) have shown remarkable efficacy in the synthesis of high-quality images. However, their inference process characteristically requires numerous, potentially hundreds, of iterative steps, which could exaggerate the problem of exposure bias due to the training and inference discrepancy. Previous work has attempted to mitigate this issue by perturbing inputs during training, which consequently mandates the retraining of the DPM. In this work, we conduct a systematic study of exposure bias in DPM and, intriguingly, we find that the exposure bias could be alleviated with a novel sampling method that we propose, without retraining the model. We empirically and theoretically show that, during inference, for each backward time step t and corresponding state x^t, there might exist another time step ts which exhibits superior coupling with x^t. Based on this finding, we introduce a sampling method named Time-Shift Sampler. Our framework can be seamlessly integrated to existing sampling algorithms, such as DDPM, DDIM and other high-order solvers, inducing merely minimal additional computations. Experimental results show our method brings significant and consistent improvements in FID scores on different datasets and sampling methods. For example, integrating Time-Shift Sampler to F-PNDM yields a FID=3.88, achieving 44.49% improvements as compared to F-PNDM, on CIFAR-10 with 10 sampling steps, which is more performant than the vanilla DDIM with 100 sampling steps. We will release the code upon acceptance.

POSTER-544: PanoDiffusion: 360-degree Panorama Outpainting via Diffusion

Primary Area: generative models

Keywords: Panorama Outpainting Latent Diffusion Multi-modal Generation

Scores: [ 6 8 8 6 ]

Generating complete 360\textdegree{} panoramas from narrow field of view images is ongoing research as omnidirectional RGB data is not readily available. Existing GAN-based approaches face some barriers to achieving higher quality output, and have poor generalization performance over different mask types. In this paper, we present our 360\textdegree{} indoor RGB panorama outpainting model using latent diffusion models (LDM), called PanoDiffusion. We introduce a new bi-modal latent diffusion structure that utilizes both RGB and depth panoramic data during training, which works surprisingly well to outpaint depth-free RGB images during inference. We further propose a novel technique of introducing progressive camera rotations during each diffusion denoising step, which leads to substantial improvement in achieving panorama wraparound consistency. Results show that our PanoDiffusion not only significantly outperforms state-of-the-art methods on RGB panorama outpainting by producing diverse well-structured results for different types of masks, but can also synthesize high-quality depth panoramas to provide realistic 3D indoor models.

POSTER-545: BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference

Primary Area: generative models

Keywords: diffusion model Baysian uncertainty

Scores: [ 8 6 6 6 ]

Diffusion models have impressive image generation capability, but low-quality generations still exist, and their identification remains challenging due to the lack of a proper sample-wise metric. To address this, we propose BayesDiff, a pixel-wise uncertainty estimator for generations from diffusion models based on Bayesian inference. In particular, we derive a novel uncertainty iteration principle to characterize the uncertainty dynamics in diffusion, and leverage the last-layer Laplace approximation for efficient Bayesian inference. The estimated pixel-wise uncertainty can not only be aggregated into a sample-wise metric to filter out low-fidelity images but also aids in augmenting successful generations and rectifying artifacts in failed generations in text-to-image tasks. Extensive experiments demonstrate the efficacy of BayesDiff and its promise for practical applications.

SPOTLIGHT-97: Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning

Primary Area: generative models

Keywords: multi-modality; large language models; generation; model efficiency;

Scores: [ 6 6 8 8 ]

This paper introduces an efficient strategy to transform Large Language Models(LLMs) into Multi-Modal Large Language Models (MLLMs). By conceptualizingthis transformation as a domain adaptation process, i.e., transitioning from textunderstanding to embracing multiple modalities, we intriguingly note that, withineach attention block, tuning LayerNorm suffices to yield strong performance.Moreover, when benchmarked against other tuning approaches like full parameterfinetuning or LoRA, its benefits on efficiency are substantial. For example, whencompared to LoRA on a 13B model scale, performance can be enhanced by anaverage of over 20% across five multi-modal tasks, and meanwhile, results in asignificant reduction of trainable parameters by 41.9% and a decrease in GPUmemory usage by 17.6%. On top of this LayerNorm strategy, we showcase thatselectively tuning only with conversational data can improve efficiency further.Beyond these empirical outcomes, we provide a comprehensive analysis to explorethe role of LayerNorm in adapting LLMs to the multi-modal domain and improvingthe expressive power of the model.

POSTER-546: In-Context Learning Learns Label Relationships but Is Not Conventional Learning

Primary Area: generative models

Keywords: in-context learning large language models

Scores: [ 6 6 8 6 ]

The predictions of Large Language Models (LLMs) on downstream tasks often improve significantly when including examples of the input–label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works. For example, while Xie et al. (2022) liken ICL to a general-purpose learning algorithm, Min et al. (2022b) argue ICL does not even learn label relationships from in-context examples. In this paper, we provide novel insights into how ICL leverages label information, revealing both capabilities and limitations. To ensure we obtain a comprehensive picture of ICL behavior, we study probabilistic aspects of ICL predictions and thoroughly examine the dynamics of ICL as more examples are provided. Our experiments show that ICL predictions almost always depend on in-context labels, and that ICL can learn truly novel tasks in-context. However, we also find that ICL struggles to fully overcome prediction preferences acquired from pre-training data, and, further, that ICL does not consider all in-context information equally.

POSTER-547: Conditional Variational Diffusion Models

Primary Area: generative models

Keywords: Denoising Diffusion Probabilistic Models Inverse Problems Generative Models Super Resolution Phase Quantification Variational Methods

Scores: [ 5 8 8 5 3 ]

Inverse problems aim to determine causal factors from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly, and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results.

POSTER-548: Differentially Private Synthetic Data via Foundation Model APIs 1: Images

Primary Area: generative models

Keywords: synthetic data differential privacy model API foundation models

Scores: [ 5 6 8 6 ]

Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider.In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID≤7.9 with privacy cost ε = 0.67, significantly improving the previous SOTA from ε = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images.

POSTER-549: Training Diffusion Models with Reinforcement Learning

Primary Area: generative models

Keywords: reinforcement learning RLHF diffusion models

Scores: [ 6 6 8 5 ]

Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation.

POSTER-550: GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction

Primary Area: generative models

Keywords: Information Extraction Zero-Shot Annotation Guidelines Large Language Models LLM prompt

Scores: [ 8 5 6 6 ]

Large Language Models (LLMs) combined with instruction tuning have made significant progress when generalizing to unseen tasks. However, they have been less successful in Information Extraction (IE), lagging behind task-specific models. Typically, IE tasks are characterized by complex annotation guidelines which describe the task and give examples to humans. Previous attempts to leverage such information have failed, even with the largest models, as they are not able to follow the guidelines out-of-the-box. In this paper we propose GoLLIE (Guideline-following Large Language Model for IE), a model able to improve zero-shot results on unseen IE tasks by virtue of being fine-tuned to comply with annotation guidelines. Comprehensive evaluation empirically demonstrates that GoLLIE is able to generalize to and follow unseen guidelines, outperforming previous attempts at zero-shot information extraction. The ablation study shows that detailed guidelines is key for good results. Code, data and models will be made publicly available.

POSTER-551: Idempotent Generative Network

Primary Area: generative models

Keywords: Generative model idempotent energy based models

Scores: [ 8 5 3 8 ]

We propose a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely f(f(z))=f(z). The proposed model f is trained to map a source distribution (e.g, Gaussian noise) to a target distribution (e.g. realistic images) using the following objectives: (1) Instances from the target distribution should map to themselves, namely f(x)=x. We define the target manifold as the set of all instances that f maps to themselves.(2) Instances that form the source distribution should map onto the defined target manifold. This is achieved by optimizing the idempotence term, f(f(z))=f(z) which encourages the range of f(z) to be on the target manifold. Under ideal assumptions such a process provably converges to the target distribution. This strategy results in a model capable of generating an output in one step, maintaining a consistent latent space, while also allowing sequential applications for refinement. Additionally, we find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back to the target manifold. This work is a first step towards a global projector'' that enables projecting any input into a target data distribution.

POSTER-552: Light Schrödinger Bridge

Primary Area: generative models

Keywords: Schrödinger Bridge Optimal transport Entropy regularized OT Unpaired Learning

Scores: [ 8 5 5 8 8 ]

Despite the recent advances in the field of computational Schrodinger Bridges (SB), most existing SB solvers are still heavy-weighted and require complex optimization of several neural networks. It turns out that there is no principal solver which plays the role of simple-yet-effective baseline for SB just like, e.g., k-means method in clustering, logistic regression in classification or Sinkhorn algorithm in discrete optimal transport. We address this issue and propose a novel fast and simple SB solver. Our development is a smart combination of two ideas which recently appeared in the field: (a) parameterization of the Schrodinger potentials with sum-exp quadratic functions and (b) viewing the log-Schrodinger potentials as the energy functions. We show that combined together these ideas yield a lightweight, simulation-free and theoretically justified SB solver with a simple straightforward optimization objective. As a result, it allows solving SB in moderate dimensions in a matter of minutes on CPU without a painful hyperparameter selection. Our light solver resembles the Gaussian mixture model which is widely used for density estimation. Inspired by this similarity, we also prove an important theoretical result showing that our light solver is a universal approximator of SBs.

POSTER-553: Latent Intuitive Physics: Learning to Transfer Hidden Physics from a 3D Video

Primary Area: generative models

Keywords: Intuitive physics inverse graphics particle-based fluid simulation neural rendering

Scores: [ 6 6 6 8 ]

We introduce latent intuitive physics, a transfer learning framework for physics simulation that can infer hidden properties of fluids from a single 3D video and simulate the observed fluid in novel scenes. Our key insight is to use latent features drawn from a learnable prior distribution conditioned on the underlying particle states to capture the invisible and complex physical properties. To achieve this, we train a parametrized prior learner given visual observations to approximate the visual posterior of inverse graphics, and both the particle states and the visual posterior are obtained from a learned neural renderer. The converged prior learner is embedded in our probabilistic physics engine, allowing us to perform novel simulations on unseen geometries, boundaries, and dynamics without knowledge of the true physical parameters. We validate our model in three ways: (i) novel scene simulation with the learned visual-world physics, (ii) future prediction of the observed fluid dynamics, and (iii) supervised particle simulation. Our model demonstrates strong performance in all three tasks.

POSTER-554: TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields

Primary Area: generative models

Keywords: 3D Open Vocabulary Conditional Generation

Scores: [ 6 6 8 ]

Recent works learn 3D representation explicitly under text-3D guidance. However, limited text-3D data restricts the vocabulary scale and text control of generations. Generators may easily fall into a stereotype concept for certain text prompts, thus losing open-vocabulary generation ability. To tackle this issue, we introduce a conditional 3D generative model, namely TextField3D. Specifically, rather than using the text prompts as input directly, we suggest to inject dynamic noise into the latent space of given text prompts, i.e., Noisy Text Fields (NTFs). In this way, limited 3D data can be mapped to the appropriate range of textual latent space that is expanded by NTFs. To this end, an NTFGen module is proposed to model general text latent code in noisy fields. Meanwhile, an NTFBind module is proposed to align view-invariant image latent code to noisy fields, further supporting image-conditional 3D generation. To guide the conditional generation in both geometry and texture, multi-modal discrimination is constructed with a text-3D discriminator and a text-2.5D discriminator. Compared to previous methods, TextField3D includes three merits: 1) large vocabulary, 2) text consistency, and 3) low latency. Extensive experiments demonstrate that our method achieves a potential open-vocabulary 3D generation capability.

ORAL-32: Improved Techniques for Training Consistency Models

Primary Area: generative models

Keywords: Consistency Models Consistency Training Diffusion Models Score-Based Generative Models Score-Based Diffusion Models Distillation

Scores: [ 8 8 6 6 ]

Consistency models are a nascent family of generative models that can sample high quality data in one step without the need for adversarial training. Current consistency models achieve optimal sample quality by distilling from pre-trained diffusion models, and employing learned metrics such as LPIPS. However, distillation limits the quality of consistency models to that of the pre-trained diffusion model, and LPIPS causes undesirable bias in evaluation. To tackle these challenges, we present improved techniques for consistency training, where consistency models learn directly from data without distillation. We delve into the theory behind consistency training and identify a previously overlooked flaw, which we address by eliminating Exponential Moving Average from the teacher consistency model. To replace learned metrics like LPIPS, we borrow Pseudo-Huber losses from robust statistics. Additionally, we introduce a new noise schedule for the consistency training objective, and propose a new curriculum for total discretization steps. Collectively, these modifications enable consistency models to achieve FID scores of 2.62 and 3.91 on CIFAR-10 and ImageNet 64×64 respectively in a single sampling step. These scores mark a 3.3$\times$ improvement compared to prior consistency training approaches. Through two-step sampling, we further reduce FID scores to 2.28 and 3.64, surpassing those obtained via distillation in both one-step and two-step settings, while narrowing the gap between consistency models and state-of-the-art generative models on both datasets.

ORAL-33: Lipschitz Singularities in Diffusion Models

Primary Area: generative models

Keywords: Image Generation Generative models Diffusion models

Scores: [ 6 8 8 8 ]

Diffusion models, which employ stochastic differential equations to sample images through integrals, have emerged as a dominant class of generative models. However, the rationality of the diffusion process itself receives limited attention, leaving the question of whether the problem is well-posed and well-conditioned. In this paper, we uncover a vexing propensity of diffusion models: they frequently exhibit the infinite Lipschitz near the zero point of timesteps. We provide theoretical proofs to illustrate the presence of infinite Lipschitz constants and empirical results to confirm it. The Lipschitz singularities pose a threat to the stability and accuracy during both the training and inference processes of diffusion models. Therefore, the mitigation of Lipschitz singularities holds great potential for enhancing the performance of diffusion models. To address this challenge, we propose a novel approach, dubbed E-TSDM, which alleviates the Lipschitz singularities of the diffusion model near the zero point. Remarkably, our technique yields a substantial improvement in performance. Moreover, as a byproduct of our method, we achieve a dramatic reduction in the Fréchet Inception Distance of acceleration methods relying on network Lipschitz, including DDIM and DPM-Solver, by over 33%. Extensive experiments on diverse datasets validate our theory and method. Our work may advance the understanding of the general diffusion process, and also provide insights for the design of diffusion models.

POSTER-555: Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

Primary Area: generative models

Keywords: Optical Flow Image Editing Guidance Diffusion Models

Scores: [ 6 6 8 8 ]

Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, the ability to precisely edit the layout, position, pose, and shape of objects in images with diffusion models is still difficult. To this end, we propose {\it motion guidance}, a zero-shot technique that allows a user to specify dense, complex motion fields that indicate where each pixel in an image should move. Motion guidance works by steering the diffusion sampling process with the gradients through an off-the-shelf optical flow network. Specifically, we design a guidance loss that encourages the sample to have the desired motion, as estimated by a flow network, while also being visually similar to the source image. By simultaneously sampling from a diffusion model and guiding the sample to have low guidance loss, we can obtain a motion-edited image. We demonstrate that our technique works on complex motions and produces high quality edits of real and generated images.

POSTER-556: A Flexible Generative Model for Heterogeneous Tabular EHR with Missing Modality

Primary Area: generative models

Keywords: Generative Model Synthetic EHR

Scores: [ 6 5 5 6 6 ]

Realistic synthetic electronic health records (EHRs) can be leveraged to acceler- ate methodological developments for research purposes while mitigating privacy concerns associated with data sharing. However, the training of Generative Ad- versarial Networks remains challenging, often resulting in issues like mode col- lapse. While diffusion models have demonstrated progress in generating qual- ity synthetic samples for tabular EHRs given ample denoising steps, their perfor- mance wanes when confronted with missing modalities in heterogeneous tabular EHRs data. For example, some EHRs contain solely static measurements, and some contain only contain temporal measurements, or a blend of both data types. To bridge this gap, we introduce FLEXGEN-EHR– a versatile diffusion model tai- lored for heterogeneous tabular EHRs, equipped with the capability of handling missing modalities in an integrative learning framework. We define an optimal transport module to align and accentuate the common feature space of hetero- geneity of EHRs. We empirically show that our model consistently outperforms existing state-of-the-art synthetic EHR generation methods both in fidelity by up to 3.10% and utility by up to 7.16%. Additionally, we show that our method can be successfully used in privacy-sensitive settings, where the original patient-level data cannot be shared.

POSTER-557: A unified sampling framework for solver searching of Diffusion Probabilistic Models

Primary Area: generative models

Keywords: Diffusion Probabilistic Model Diffusion Sampler Solver Schedule

Scores: [ 6 6 6 6 6 ]

Recent years have witnessed the rapid progress and broad application of diffusion probabilistic models (DPMs). Sampling from DPMs can be viewed as solving an ordinary differential equation (ODE). Despite the promising performance, the generation of DPMs usually consumes much time due to the large number of function evaluations (NFE). Though recent works have accelerated the sampling to around 20 steps with high-order solvers, the sample quality with less than 10 NFE can still be improved. In this paper, we propose a unified sampling framework (USF) to study the optional strategies for solver. Under this framework, we further reveal that taking different solving strategies at different timesteps may help further decrease the truncation error, and a carefully designed \emph{solver schedule} has the potential to improve the sample quality by a large margin. Therefore, we propose a new sampling framework based on the exponential integral formulation that allows free choices of solver strategy at each step and design specific decisions for the framework. Moreover, we propose S3, a predictor-based search method that automatically optimizes the solver schedule to get a better time-quality trade-off of sampling. We demonstrate that S3 can find outstanding solver schedules which outperform the state-of-the-art sampling methods on CIFAR-10, CelebA, ImageNet-64, and LSUN-Bedroom datasets. Specifically, we achieve 3.09 FID with 9 NFE and 7.65 FID with 5 NFE on CIFAR-10 dataset, outperforming the SOTA method significantly. We further apply S3 to Stable-Diffusion model and get an acceleration ratio of 2$\times$, showing the feasibility of sampling in very few steps without retraining of the neural network.

POSTER-558: Towards Green AI in Fine-tuning Large Language Models via Adaptive Backpropagation

Primary Area: generative models

Keywords: Green AI Large Language Models Fine-Tuning Adaptive Backpropagation

Scores: [ 6 5 6 5 ]

Fine-tuning is essential to adapting pre-trained large language models to downstream applications. With the increasing popularity of LLM-enabled applications, fine-tuning has been performed intensively worldwide, incurring a tremendous amount of computing costs that correspond to big carbon footprint and environmental impact. Mitigating such environmental impact directly correlates to reducing the fine-tuning FLOPs. Existing fine-tuning schemes focus on either saving memory or reducing the overhead of computing weight updates, but cannot achieve sufficient FLOPs reduction due to their ignorance of the training cost in backpropagation. To address this limitation, in this paper we present GreenTrainer, a new technique that minimizes the FLOPs of LLM fine-tuning via adaptive backpropagation, which adaptively selects the most appropriate set of LLM tensors for fine-tuning based on their importance and backpropagation cost in training. Experiment results show that GreenTrainer can save up to 64% training FLOPs compared to full fine-tuning, without any noticeable accuracy loss. Compared to the existing schemes such as Prefix Tuning and LoRA, GreenTrainer can achieve up to 4% improvement of model accuracy, with on-par FLOPs reduction.

POSTER-559: WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space

Primary Area: generative models

Keywords: 3D-aware image synthesis diffusion model latent diffusion 3D-aware generative model

Scores: [ 6 8 8 6 ]

Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionallycaptures the images’ underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly,our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.

ORAL-34: DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

Primary Area: generative models

Keywords: Text-to-3D Image-to-3D 3D Generation Efficiency

Scores: [ 8 8 10 8 ]

Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS).Though promising results have been exhibited, these methods often suffer from slow per-sample optimization, limiting their practical usage. In this paper, we propose DreamGaussian, a novel 3D content generation framework that achieves both efficiency and quality simultaneously. Our key insight is to design a generative 3D Gaussian Splatting model with companioned mesh extraction and texture refinement in UV space.In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks.To further enhance the texture quality and facilitate downstream applications, we introduce an efficient algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning stage to refine the details.Extensive experiments demonstrate the superior efficiency and competitive generation quality of our proposed approach.Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes from a single-view image, achieving approximately 10 times acceleration compared to existing methods.

POSTER-560: WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Primary Area: generative models

Keywords: Large Language Models Code Instruction Fine-tuning

Scores: [ 8 5 6 6 ]

POSTER-561: VDT: General-purpose Video Diffusion Transformers via Mask Modeling

Primary Area: generative models

Keywords: diffusion video generation

Scores: [ 6 6 6 ]

This work introduces Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation.It features transformer blocks with modularized temporal and spatial attention modules to leverage the rich spatial-temporal representation inherited in transformers.Additionally, we propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.VDT offers several appealing benefits.(1) It excels at capturing temporal dependencies to produce temporally consistent video frames and even simulate the physics and dynamics of 3D objects over time.(2) It facilitates flexible conditioning information, \eg, simple concatenation in the token space, effectively unifying different token lengths and modalities.(3) Pairing with our proposed spatial-temporal mask modeling mechanism, it becomes a general-purpose video diffuser for harnessing a range of tasks, including unconditional generation, video prediction, interpolation, animation, and completion, etc.%Extensive experiments on these tasks spanning various scenarios, including autonomous driving, natural weather, human action, and physics-based simulation, demonstrate the effectiveness of VDT.% Moreover, we provide a comprehensive study on the capabilities of VDT in capturing accurate temporal dependencies, handling conditioning information, and the spatial-temporal mask modeling mechanism.Additionally, we present comprehensive studies on how VDT handles conditioning information with the mask modeling mechanism, which we believe will benefit future research and advance the field. Codes and models are available at the https://VDT-2023.github.io.

SPOTLIGHT-98: EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models

Primary Area: generative models

Keywords: Diffusion Models Model Quantization Model Compression Efficient Models

Scores: [ 6 6 8 6 ]

Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for low-latency real-world applications is constrained by substantial computational costs and latency issues. Quantization is a dominant way to compress and accelerate diffusion models, where post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches, each bearing its own properties. While PTQ exhibits efficiency in terms of both time and data usage, it may lead to diminished performance in low bit-width settings. On the other hand, QAT can help alleviate performance degradation but comes with substantial demands on computational and data resources. To capitalize on the advantages while avoiding their respective drawbacks, we introduce a data-free, quantization-aware and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with model weights and jointly quantized to low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the requirement for training data. To further enhance performance, we introduce scale-aware optimization to address ineffective learning of QALoRA due to variations in weight quantization scales across different layers. We also employ temporal learned step-size quantization to handle notable variations in activation distributions across denoising steps. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency. Specifically, there is only a marginal 0.05 sFID increase when quantizing both weights and activations of LDM-4 to 4-bit on ImageNet 256×256. Compared to QAT-based methods, our EfficientDM also boasts a 16.2× faster quantization speed with comparable generation quality, rendering it a compelling choice for practical applications.

POSTER-562: Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Primary Area: generative models

Keywords: Diffusion models NeRF 3D synthesis

Scores: [ 8 5 6 ]

Text-to-3D generation has shown rapid progress in recent days with the advent of score distillation sampling (SDS), a methodology of using pretrained text-to-2D diffusion models to optimize a neural radiance field (NeRF) in a zero-shot setting. However, the lack of 3D awareness in the 2D diffusion model often destabilizes previous methods from generating a plausible 3D scene. To address this issue, we propose 3DFuse, a novel framework that incorporates 3D awareness into the pretrained 2D diffusion model, enhancing the robustness and 3D consistency of score distillation-based methods. Specifically, we introduce a consistency injection module which constructs a 3D point cloud from the text prompt and utilizes its projected depth map at given view as a condition for the diffusion model. The 2D diffusion model, through its generative capability, robustly infers dense structure from the sparse point cloud depth map and generates a geometrically consistent and coherent 3D scene. We also introduce a new technique called semantic coding that reduces semantic ambiguity of the text prompt for improved results. Our method can be easily adapted to various text-to-3D baselines, and we experimentally demonstrate how our method notably enhances the 3D consistency of generated scenes in comparison to previous baselines, achieving state-of-the-art performance in geometric robustness and fidelity.

POSTER-563: Separate and Diffuse: Using a Pretrained Diffusion Model for Better Source Separation

Primary Area: generative models

Keywords: Source Separation Deep Learning Diffusion Models Upper Bound Information Bound

Scores: [ 6 6 6 ]

The problem of speech separation, also known as the cocktail party problem,refers to the task of isolating a single speech signal from a mixture of speechsignals. Previous work on source separation derived an upper bound for thesource separation task in the domain of human speech. This bound is derived fordeterministic models. Recent advancements in generative models challenge thisbound. We show how the upper bound can be generalized to the case of randomgenerative models. Applying a diffusion model Vocoder that was pretrained tomodel single-speaker voices on the output of a deterministic separation model leadsto state-of-the-art separation results. It is shown that this requires one to combinethe output of the separation model with that of the diffusion model. In our method,a linear combination is performed, in the frequency domain, using weights that areinferred by a learned model. We show state-of-the-art results on 2, 3, 5, 10, and 20speakers on multiple benchmarks. In particular, for two speakers, our method isable to surpass what was previously considered the upper performance bound.

POSTER-564: Sin3DM: Learning a Diffusion Model from a Single 3D Textured Shape

Primary Area: generative models

Keywords: 3D generation shape analysis and synthesis diffusion models

Scores: [ 8 5 6 5 ]

Synthesizing novel 3D models that resemble the input example as long been pursued by graphics artists and machine learning researchers. In this paper, we present Sin3DM, a diffusion model that learns the internal patch distribution from a single 3D textured shapeand generates high-quality variations with fine geometry and texture details. Training a diffusion model directly in 3D would induce large memory and computational cost. Therefore, we first compress the input into a lower-dimensional latent space and then train a diffusion model on it. Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input. The denoising network of our diffusion model has a limited receptive field to avoid overfitting, and uses triplane-aware 2D convolution blocks to improve the result quality. Aside from randomly generating new samples, our model also facilitates applications such as retargeting, outpainting and local editing. Through extensive qualitative and quantitative evaluation, we show that our method outperforms prior methods in generation quality of 3D shapes.

SPOTLIGHT-99: Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

Primary Area: generative models

Keywords: large language models behavior simulation

Scores: [ 10 5 8 6 ]

Shannon, in his seminal paper introducing information theory, divided the communication into three levels: technical, semantic, and effectivenss. While the technical level is concerned with accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Thanks to telecommunications, the first level problem has produced great advances like the internet. Large Language Models (LLMs) make some progress towards the second goal, but the third level still remains largely untouched. The third problem deals with predicting and optimizing communication for desired receiver behavior. LLMs, while showing wide generalization capabilities across a wide range of tasks, are unable to solve for this. One reason for the underperformance could be a lack of behavior tokens'' in LLMs' training corpora. Behavior tokens define receiver behavior over a communication, such as shares, likes, clicks, purchases, retweets, \textit{etc}. While preprocessing data for LLM training, behavior tokens are often removed from the corpora as noise. Therefore, in this paper, we make some initial progress towards reintroducing behavior tokens in LLM training. The trained models, other than showing similar performance to LLMs on content understanding tasks, show generalization capabilities on behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. Using a wide range of tasks on two corpora, we show results on all these capabilities. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior.

POSTER-565: DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

Primary Area: generative models

Keywords: Large Language Models Hallucination Factuality Decoding Text Generation

Scores: [ 8 8 5 8 ]

Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLMs has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves the truthfulness across multiple choices tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts.

POSTER-566: Patched Denoising Diffusion Models For High-Resolution Image Synthesis

Primary Area: generative models

Keywords: Diffusion Models

Scores: [ 5 6 5 6 ]

We propose an effective denoising diffusion model for generating high-resolution images (e.g., 1024$\times$512), trained on small-size image patches (e.g., 64$\times$64). We name our algorithm Patch-DM, in which a new feature collage strategy is designed to avoid the boundary artifact when synthesizing large-size images. Feature collage systematically crops and combines partial features of the neighboring patches to predict the features of a shifted image patch, allowing the seamless generation of the entire image due to the overlap in the patch feature space. Patch-DM produces high-quality image synthesis results on our newly collected dataset of nature images (1024$\times$512), as well as on standard benchmarks of LHQ(1024$\times$ 1024), FFHQ(1024$\times$ 1024) and on other datasets with smaller sizes (256$\times$256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our method with previous patch-based generation methods and achieve state-of-the-art FID scores on all six datasets. Further, Patch-DM also reduces memory complexity compared to the classic diffusion models.

Primary Area: generative models

Keywords: Text-guided diffusion models Robustness Adversarial examiner Failure detection

Scores: [ 6 10 6 6 ]

Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables that generate vastly different, and even unrelated, outputs despite being conditioned on the same text prompt. In this work, we aim to study and understand the failure modes of TDMs in more detail. To achieve this, we propose SAGE, the first adversarial search method on TDMs that systematically explores the discrete prompt space and the high-dimensional latent space, to automatically discover undesirable behaviors and failure cases in image generation. We use image classifiers as surrogate loss functions during searching, and employ human inspections to validate the identified failures. For the first time, our method enables efficient exploration of both the discrete and intricate human language space and the challenging latent space, overcoming the gradient vanishing problem. Then, we demonstrate the effectiveness of SAGE on five widely used generative models and reveal four typical failure modes that have not been systematically studied before: (1) We find a variety of natural text prompts that generate images failing to capture the semantics of input texts. We further discuss the underlying causes and potential solutions based on the results. (2) We find regions in the latent space that lead to distorted images independent of the text prompt, suggesting that parts of the latent space are not well-structured. (3) We also find latent samples that result in natural-looking images unrelated to the text prompt, implying a possible misalignment between the latent and prompt spaces. (4) By appending a single adversarial token embedding to any input prompts, we can generate a variety of specified target objects, with minimal impact on CLIP scores, demonstrating the fragility of language representations.

POSTER-568: Generalized Schrödinger Bridge Matching

Primary Area: generative models

Keywords: Distribution matching diffusion models generalized Schrödinger bridge stochastic optimal control

Scores: [ 8 8 6 6 ]

Modern distribution matching algorithms for training diffusion or flow models directly prescribe the time evolution of the marginal distributions between two boundary distributions. In this work, we consider a generalized distribution matching setup, where these marginals are only implicitly described as a solution to some task-specific objective function. The problem setup, known as the Generalized Schrödinger Bridge (GSB), appears prevalently in many scientific areas both within and without machine learning. We propose Generalized Schödinger Bridge Matching (GSBM), a new matching algorithm inspired by recent advances, generalizing them beyond kinetic energy minimization and to account for nonlinear state costs. We show that such a generalization can be cast as solving conditional stochastic optimal control, for which efficient variational approximations can be used, and further debiased with the aid of path integral theory. Compared to prior methods for solving GSB problems, our GSBM algorithm always preserves a feasible transport map between the boundary distributions throughout training, thereby enabling stable convergence and significantly improved scalability. We empirically validate our claims on an extensive suite of experimental setups, including crowd navigation, opinion depolarization, LiDAR manifolds, and image domain transfer. Our work brings new algorithmic opportunities for training diffusion models enhanced with task-specific optimality structures.

POSTER-569: Self-Consuming Generative Models Go MAD

Primary Area: generative models

Keywords: Generative Models Synthetic Data Sampling Bias Diffusion Models GANs

Scores: [ 6 6 8 ]

Seismic advances in generative AI algorithms for imagery, text, and other data types have led to the temptation to use AI-synthesized data to train next-generation models. Repeating this process creates an autophagous ("self-consuming") loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis using state-of-the-art generative image models of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and whether the samples from previous-generation models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease. We term this condition Model Autophagy Disorder (MAD), by analogy to mad cow disease, and show that appreciable MADness arises in just a few generations.

SPOTLIGHT-100: Controlled Text Generation via Language Model Arithmetic

Primary Area: generative models

Keywords: Controlled text generation LLM Natural Language Processing

Scores: [ 8 8 6 8 5 ]

As Large Language Models (LLMs) are deployed more widely, customization with respect to vocabulary, style and character becomes more important. In this work we introduce model arithmetic, a novel inference framework for composing and biasing LLMs without the need for model (re)training or highly specific datasets. In addition, the framework allows for more precise control of generated text than direct prompting and prior controlled text generation (CTG) techniques. Using model arithmetic, we can express prior CTG techniques as simple formulas and naturally extend them to new and more effective formulations. Further, we show that speculative sampling, a technique for efficient LLM sampling, extends to our setting. This enables highly efficient text generation with multiple composed models with only marginal overhead over a single model. Our empirical evaluation demonstrates that model arithmetic allows fine-grained control of generated text while outperforming state-of-the-art on the task of toxicity reduction. We release an open source easy-to-use implementation of our framework at [ANONYMIZED].

SPOTLIGHT-101: Guiding Instruction-based Image Editing via Multimodal Large Language Models

Primary Area: generative models

Keywords: image editing multimodal large language model

Scores: [ 8 8 6 6 ]

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.

POSTER-570: On Error Propagation of Diffusion Models

Primary Area: generative models

Keywords: Diffusion Models Error Propagation Theoretical Explanation Regularization

Scores: [ 8 8 6 8 ]

POSTER-571: Achieving Human Parity in Content-Grounded Datasets Generation

Primary Area: generative models

Keywords: Data Generation content grounded long-form question-answering summarization

Scores: [ 5 5 6 ]

The lack of high-quality data for content-grounded generation tasks has been identified as a major obstacle to advancing these tasks. To address this gap, we propose a novel method for automatically generating high-quality content-grounded data. It consists of three stages: (a) Content Preparation, (b) Generation: creating task-specific examples from the content (e.g., question-answer pairs or summaries). (c) Filtering mechanism aiming to ensure the quality and faithfulness of the generated data. We showcase this methodology by generating large-scale data for synthetic Long-form question-answering (LFQA) and summarization. In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data – ELI5 and ASQA for LFQA and CNN-DailyMail for Summarization. We show that our models are on par with or outperforming models trained on human-generated data and consistently outperforming them in faithfulness. Finally, we applied our method to create LFQA data within the medical domain and compared a model trained on it with models trained on other domains.

POSTER-572: Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Primary Area: generative models

Keywords: large language models sensitivity analysis prompt engineering evaluation prompting robustness in-context learning spurious features

Scores: [ 6 8 6 ]

As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.

POSTER-573: Graph Generation with K2-trees

Primary Area: generative models

Keywords: Graph generative models graph neural networks

Scores: [ 5 5 8 8 ]

Generating graphs from a target distribution is a significant challenge across many domains, including drug discovery and social network analysis. In this work, we introduce a novel graph generation method leveraging K2 representation, originally designed for lossless graph compression. The K2 representation enables compact generation while concurrently capturing an inherent hierarchical structure of a graph. In addition, we make contributions by (1) presenting a sequential K2 representation that incorporates pruning, flattening, and tokenization processes and (2) introducing a Transformer-based architecture designed to generate the sequence by incorporating a specialized tree positional encoding scheme. Finally, we extensively evaluate our algorithm on four general and two molecular graph datasets to confirm its superiority for graph generation.

POSTER-574: Localizing and Editing Knowledge In Text-to-Image Generative Models

Primary Area: generative models

Keywords: text-to-image interpretability model editing

Scores: [ 8 5 8 5 ]

Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and understand how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis for text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in the (i) UNet and (ii) text-encoder of the diffusion model. In particular, we show that unlike large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes (e.g., style} / objects). Remarkably, we find that the text-encoder in public text-to-image models such as Stable-Diffusion contains {\it only} one causal state across different visual attributes, and this is the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce a fast, data-free model editing method DiffQuickFix which can effectively edit concepts (remove or update knowledge) in text-to-image models. DiffQuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup and comparable editing performance to existing fine-tuning based editing methods.

POSTER-575: OpenTab: Advancing Large Language Models as Open-domain Table Reasoners

Primary Area: generative models

Keywords: Large Language Models Table Reasoning

Scores: [ 6 8 6 ]

Large Language Models (LLMs) trained on large volumes of data excel at various natural language tasks, but they cannot handle tasks with data that has not been trained on. One solution is to use a retriever that fetches relevant information to expand LLM's knowledge scope. However, existing textual-oriented retrieved-based LLMs are not ideal on structured table data due to diversified data modalities and large table sizes. In this work, we propose OpenTab, an open-domain table reasoning framework powered by LLMs. Overall, OpenTab leverages table retriever to fetch relevant tables and then generates SQL programs to parse such tables efficiently. Utilizing the intermediate data derived from the SQL executions, it then conducts grounded inference to produce the accurate response. Extensive experimental evaluation shows that OpenTab significantly outperforms baselines in both open- and close-domain settings, achieving up to 21.5% higher accuracy. We further run detailed ablation studies to validate the efficacy of our proposed designs.

POSTER-576: Fusing Models with Complementary Expertise

Primary Area: generative models

Keywords: Fusion of Experts Mixture of Experts Expert Models

Scores: [ 6 8 6 6 ]

Training AI models that generalize across tasks and domains has long been among the open problems driving AI research. The emergence of Foundation Models made it easier to obtain expert models for a given task, but the heterogeneity of data that may be encountered at test time often means that any single expert is insufficient. We consider the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution and formulate it as an instance of supervised learning. Our method is applicable to both discriminative and generative tasks and leads to significant performance improvements in image and text classification, text summarization, multiple-choice QA, and automatic evaluation of generated text. We also extend our method to the "frugal" setting where it is desired to reduce the number of expert model evaluations at test time.

POSTER-577: Understanding the Effects of RLHF on LLM Generalisation and Diversity

Primary Area: generative models

Keywords: reinforcement learning large language models rlhf ood generalisation diversity

Scores: [ 3 6 8 ]

Large language models (LLMs) fine-tuned with reinforcement learning from humanfeedback (RLHF) have been used in some of the most widely deployed AI modelsto date, such as OpenAI’s ChatGPT or Anthropic’s Claude. While there has beensignificant work developing these methods, our understanding of the benefits anddownsides of each stage in RLHF is still limited. To fill this gap, we present anextensive analysis of how each stage of the process (i.e. supervised fine-tuning(SFT), reward modelling, and RLHF) affects two key properties: out-of-distributiongeneralisation (OOD) and output diversity. OOD generalisation is crucial given thewide range of real-world scenarios in which these models are being used, whileoutput diversity refers to the model’s ability to generate varied outputs, and isimportant for a variety of use cases. We perform our analysis across two basemodels on both summarisation and instruction following tasks, the latter beinghighly relevant for current LLM use cases. We find that RLHF generalises betterthan SFT to new inputs, particularly as the distribution shift between train and testbecomes larger. However, RLHF significantly reduces output diversity compared toSFT across a variety of measures, implying a tradeoff in current LLM fine-tuningmethods between generalisation and diversity. Our results provide guidance onwhich fine-tuning method should be used depending on the application, and showthat more research is needed to improve the tradeoff between generalisation anddiversity.

ORAL-35: Learning Energy Decompositions for Partial Inference of GFlowNets

Primary Area: generative models

Keywords: Generative flow networks reinforcement learning generative models

Scores: [ 8 8 8 8 ]

This paper studies generative flow networks (GFlowNets) to sample objects from the Boltzmann energy distribution via a sequence of actions. In particular, we focus on improving GFlowNet with partial inference: training flow functions with the evaluation of the intermediate states or transitions. To this end, the recently developed forward-looking GFlowNet reparameterizes the flow functions based on evaluating the energy of intermediate states. However, such an evaluation of intermediate energies may (i) be too expensive or impossible to evaluate and (ii) even provide misleading training signals under large energy fluctuations along the sequence of actions. To resolve this issue, we propose learning energy decompositions for GFlowNets (LED-GFN). Our main idea is to (i) decompose the energy of an object into learnable potential functions defined on state transitions and (ii) reparameterize the flow functions using the potential functions. In particular, to produce informative local credits, we propose to regularize the potential to change smoothly over the sequence of actions. It is also noteworthy that training GFlowNet with our learned potential can preserve the optimal policy. We empirically verify the superiority of LED-GFN in five problems including the generation of unstructured and maximum independent sets, molecular graphs, and RNA sequences.

POSTER-578: TapMo: Shape-aware Motion Generation of Skeleton-free Characters

Primary Area: generative models

Keywords: 3D animation text-driven animation motion generation mesh deformation

Scores: [ 8 6 6 6 ]

Previous motion generation methods are limited to the pre-rigged 3D human model, hindering their applications in the animation of various non-rigged characters. In this work, we present TapMo, a Text-driven Animation PIpeline for synthesizing Motion in a broad spectrum of skeleton-free 3D characters. The pivotal innovation in TapMo is its use of shape deformation-aware features as a condition to guide the diffusion model, thereby enabling the generation of mesh-specific motions for various characters. Specifically, TapMo comprises two main components - Mesh Handle Predictor and Shape-aware Diffusion Module. Mesh Handle Predictor predicts the skinning weights and clusters mesh vertices into adaptive handles for deformation control, which eliminates the need for traditional skeletal rigging. Shape-aware Motion Diffusion synthesizes motion with mesh-specific adaptations. This module employs text-guided motions and mesh features extracted during the first stage, preserving the geometric integrity of the animations by accounting for the character's shape and deformation. Trained in a weakly-supervised manner, TapMo can accommodate a multitude of non-human meshes, both with and without associated text motions. We demonstrate the effectiveness and generalizability of TapMo through rigorous qualitative and quantitative experiments. Our results reveal that TapMo consistently outperforms existing auto-animation methods, delivering superior-quality animations for both seen or unseen heterogeneous 3D characters.

POSTER-579: Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

Primary Area: generative models

Keywords: large language models automatic prompt engineering in-context learning

Scores: [ 6 6 5 8 ]

Large language models (LLMs) have made impressive progress in natural language processing. These models rely on proper human instructions (or prompts) to generate suitable responses. However, the potential of LLMs are not fully harnessed by commonly-used prompting methods: many human-in-the-loop algorithms employ ad-hoc procedures for prompt selection; while auto prompt generation approaches are essentially searching all possible prompts randomly and inefficiently. We propose Evoke, an automatic prompt refinement framework. In Evoke, there are two instances of a same LLM: one as a reviewer (LLM-Reviewer), it scores the current prompt; the other as an author (LLM-Author), it edits the prompt by considering the edit history and the reviewer's feedback. Such an author-reviewer feedback loop ensures that the prompt is refined in each iteration. We further aggregate a data selection approach to Evoke, where only the hard samples are exposed to the LLM. The hard samples are more important because the LLM can develop deeper understanding of the tasks out of them, while the model may already know how to solve the easier cases. Experimental results show that Evoke significantly outperforms existing methods. For instance, in the challenging task of logical fallacy detection, Evoke scores above 80, while all other baseline methods struggle to reach 20.

POSTER-580: -Diff: Infinite Resolution Diffusion with Subsampled Mollified States

Primary Area: generative models

Keywords: Generative Models Diffusion Models Infinite Resolution

Scores: [ 6 8 8 6 ]

We introduce -Diff, a generative diffusion model defined in an infinite-dimensional Hilbert space that allows infinite resolution data to be modelled. By randomly sampling subsets of coordinates during training and learning to denoise the content at those coordinates, a continuous function is learned that allows sampling at arbitrary resolutions. Prior infinite-dimensional generative models use point-wise functions that require latent compression for global context. In contrast, we propose using non-local integral operators to map between Hilbert spaces, allowing spatial information aggregation; to facilitate this, we design a powerful and efficient multi-scale architecture that operates directly on raw sparse coordinates. Training on high-resolution datasets we demonstrate that high-quality diffusion models can be learned with even 8× subsampling rates, enabling substantial improvements in run-time and memory requirements, achieving significantly higher sample quality as evidenced by lower FID scores, while also being able to effectively scale to higher resolutions than the training data while retaining detail.

SPOTLIGHT-102: DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models

Primary Area: generative models

Keywords: Diffusion model Image editing Image generation

Scores: [ 6 6 6 6 6 ]

Despite the ability of existing large-scale text-to-image (T2I) diffusion models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we treat image editing as the change of feature correspondence in a pre-trained diffusion model. By leveraging feature correspondence, we develop energy functions that align with the editing target, transforming image editing operations into gradient guidance. Based on this guidance approach, we also construct multi-scale guidance that considers both semantic and geometric alignment. Furthermore, we incorporate a visual cross-attention strategy based on a memory bank design to ensure consistency between the edited result and original image. Benefiting from these efficient designs, all content editing and consistency operations come from the feature correspondence without extra model fine-tuning or additional modules. Extensive experiments demonstrate that our method has promising performance on various image editing tasks, including editing within a single image (e.g., object moving, resizing, and content dragging) and across images (e.g., appearance replacing and object pasting).

POSTER-581: Modelling complex vector drawings with stroke-clouds

Primary Area: generative models

Keywords: Generative Model Representation Learning Sets Vector Graphics

Scores: [ 6 6 5 ]

Vector drawings are innately interactive as they preserve creational cues. Despitethis desirable property they remain relatively under explored due to the difficultiesin modeling complex vector drawings. This is in part due to the primarily sequential and auto-regressive nature of existing approaches failing to scale beyond simpledrawings. In this paper, we define generative models over highly complex vectordrawings by first representing them as “stroke-clouds” – sets of arbitrary cardinality comprised of semantically meaningful strokes. The dimensionality of thestrokes is a design choice that allows the model to adapt to a range of complexities.We learn to encode these set of strokes into compact latent codes by a probabilisticreconstruction procedure backed by De-Finetti’s Theorem of Exchangability. Theparametric generative model is then defined over the latent vectors of the encodedstroke-clouds. The resulting “Latent stroke-cloud generator (LSG)” thus capturesthe distribution of complex vector drawings on an implicit set space. We demonstrate the efficacy of our model on complex drawings (a newly created Animeline-art dataset) through a rangeof generative tasks.

POSTER-582: Progressive3D: Progressively Local Editing for Text-to-3D Content Creation with Complex Semantic Prompts

Primary Area: generative models

Keywords: Text-to-3D Creation 3D Content Editing Attribute Mismatching

Scores: [ 6 6 6 5 ]

Recent text-to-3D generation methods achieve impressive 3D content creation capacity thanks to the advances in image diffusion models and optimizing strategies. However, current methods struggle to generate correct 3D content for a complex prompt in semantics, i.e., a prompt describing multiple interacted objects binding with different attributes. In this work, we propose a general framework named Progressive3D, which decomposes the entire generation into a series of locally progressive editing steps to create precise 3D content for complex prompts, and we constrain the content change to only occur in regions determined by user-defined region prompts in each editing step. Furthermore, we propose an overlapped semantic component suppression technique to encourage the optimization process to focus more on the semantic differences between prompts. Extensive experiments demonstrate that the proposed Progressive3D framework generates precise 3D content for prompts with complex semantics through progressive editing steps and is general for various text-to-3D methods driven by different 3D representations.

POSTER-583: Toward effective protection against diffusion-based mimicry through score distillation

Primary Area: generative models

Keywords: Diffusion models safety protection

Scores: [ 6 5 6 6 ]

While generative diffusion models excel in producing high-quality images, they can also be misused to mimic authorized images, posing a significant threat to AI systems. Efforts have been made to add calibrated perturbations to protect images from diffusion-based mimicry pipelines. However, most of the existing methods are too ineffective and even impractical to be used by individual users due to their high computation and memory requirements. In this work, we present novel findings on attacking latent diffusion models (LDM) and propose new plug-and-play strategies for more effective protection. In particular, we explore the bottleneck in attacking an LDM, discovering that the encoder module rather than the denoiser module is the vulnerable point. Based on this insight, we present our strategy using Score Distillation Sampling (SDS) to double the speed of protection and reduce memory occupation by half without compromising its strength. Additionally, we provide a robust protection strategy by counterintuitively minimizing the semantic loss, which can assist in generating more natural perturbations. Finally, we conduct extensive experiments to substantiate our findings and comprehensively evaluate our newly proposed strategies. We hope our insights and protective measures can contribute to better defense against malicious diffusion-based mimicry, advancing the development of secure AI systems.

POSTER-584: Masked Audio Generative Modeling

Primary Area: generative models

Keywords: Audio modeling audio generation music generation non-autoregressive models

Scores: [ 8 8 6 ]

We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of discrete audio representation, i.e., tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer encoder. During training, we predict spans of masked tokens obtained from the masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel model rescorer method. In which, we leverage an external pre-trained model to rescore and rank predictions from MAGNeT which will be then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse between autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is being decoded in parallel. We demonstrate the efficiency of MAGNeT over the task of text-to-music generation and conduct extensive empirical evaluation, considering both automatic and human studies. We show the proposed approach is comparable to the evaluated baselines while being significantly faster (x7 faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive considering latency, throughput, and generation quality. Samples are available as part of the supplemental material.

POSTER-585: Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

Primary Area: generative models

Keywords: large multimodal pre-training open-world embodied agent large language model

Scores: [ 6 5 5 6 ]

Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics.However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to a blindfolded text-based game.''Consequently, LLM-based agents frequently encounter challenges in intuitively comprehending their surroundings and producing responses that are easy to understand.In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation.Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback.In addition, we use a semi-automatic strategy to collect an extensive dataset comprising 850K open-world instruction pairs, empowering our model to encompass three essential functions for an agent: multimodal perception, foundational knowledge base, and skill prediction and planning.Lastly, we develop three open-world evaluation benchmarks, then carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan.Codes and datasets will be released.

POSTER-586: Long-tailed Diffusion Models with Oriented Calibration

Primary Area: generative models

Keywords: diffusion model long tail distribution score matching

Scores: [ 6 6 6 ]

Diffusion models have the ability to produce high-quality images with remarkable realism and diversity. Their effectiveness heavily relies on massive training on large-scale datasets, which, however, can be considerably impaired in the presence of real-world long-tail data. For long tail diffusion model generation, current works focus on the calibration and enhancement of the tail generation with head-tail knowledge transfer. The transfer process relies on the abundant diversity derived from the head class and, more significantly, the condition capacity of the model prediction. However, it is worth noting that the dependency on the conditional model prediction to realize the knowledge transfer might exhibit bias during training, leading to unsatisfactory generation results and lack of robustness. To address the issue, we directly establish the knowledge transfer from head data samples, based on the multi-objective characteristics of the score function in the diffusion process. To this end, a directional calibration for the estimation of noisy tail sample score is performed towards the clean head samples~(T2H), leveraging the similarity within the data distribution from head to tail classes. This augmentation for the tail score estimation encourages better diversity in generating the samples of tail categories. We extensively evaluate our approach with experiments on multiple benchmark datasets, demonstrating its effectiveness and superior performance compared to existing methods.

POSTER-587: Denoising Task Routing for Diffusion Models

Primary Area: generative models

Keywords: Diffusion Model Architecture Multi-Task Learning (MTL) Diffusion Models

Scores: [ 6 8 8 ]

Diffusion models generate highly realistic images through learning a multi-step denoising process, naturally embodying the principles of multi-task learning (MTL). Despite the inherent connection between diffusion models and MTL, there remains an unexplored area in designing neural architectures that explicitly incorporate MTL into the framework of diffusion models. In this paper, we present Denoising Task Routing (DTR), a simple add-on strategy for existing diffusion model architectures to establish distinct information pathways for individual tasks within a single architecture by selectively activating subsets of channels in the model. What makes DTR particularly compelling is its seamless integration of prior knowledge of denoising tasks into the framework: (1) Task Affinity: DTR activates similar channels for tasks at adjacent timesteps and shifts activated channels as sliding windows through timesteps, capitalizing on the inherent strong affinity between tasks at adjacent timesteps. (2) Task Weights: During the early stages (higher timesteps) of the denoising process, DTR assigns a greater number of task-specific channels, leveraging the insight that diffusion models prioritize reconstructing global structure and perceptually rich contents in earlier stages, and focus on simple noise removal in later stages. Our experiments demonstrate that DTR consistently enhances the performance of diffusion models across various evaluation protocols, all without introducing additional parameters. Furthermore, DTR contributes to accelerating convergence during training. Finally, we show the complementarity between our architectural approach and existing MTL optimization techniques, providing a more complete view of MTL within the context of diffusion training.

SPOTLIGHT-103: SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Primary Area: generative models

Keywords: diffusion model; single-view reconstruction; 3D generation; generative models

Scores: [ 8 10 8 8 6 ]

In this paper, we present a novel diffusion model called SyncDreamer that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, recent work Zero123 demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors for the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, thus making it well-suited for various 3D generation tasks such as novel-view-synthesis, text-to-3D, and image-to-3D.

SPOTLIGHT-104: InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior

Primary Area: generative models

Keywords: 3D indoor scene synthesis conditional generateive models graph diffusion models

Scores: [ 8 6 8 8 ]

Comprehending natural language instructions is a charming property for 3D indoor scene synthesis systems. Existing methods suffer from directly modeling the object distributions within a scene, thereby hindering the controllability of generation. We introduce InstructScene, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 3D scene synthesis. The proposed semantic graph prior jointly learns indoor scene appearance and layout distributions, exhibiting versatility across various generative tasks. To facilitate the benchmarking for text-driven 3D scene synthesis, we curate a high-quality dataset of scene-instruction pairs with large language and multimodal models. Extensive experimental results reveal that the proposed method surpasses existing state-of-the-art approaches by a large margin. Thorough ablation studies confirm the efficacy of crucial design components. Both our code and dataset will be publicly available after the review period.

POSTER-588: Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models

Primary Area: generative models

Keywords: text-to-image generation overoptimization confidence calibration

Scores: [ 6 6 6 ]

Fine-tuning text-to-image models using reward functions trained on human feedback data has emerged as a powerful approach for aligning model behavior with human intent. However, excessive optimization with such reward models, which are only proxy objectives, can degrade the performance of the fine-tuned models, a phenomenon commonly referred to as reward overoptimization. We introduce the Text-Image Alignment Assessment (TIA2) benchmark, a diverse collection of text prompts, images, and human annotations, for studying the issue in depth. We evaluate several state-of-the-art reward models for text-to-image generation on our benchmark and find that they are often not well-aligned with human assessment. We empirically demonstrate that overoptimization can occur when a poorly aligned reward model is used as a fine-tuning objective. To address this, we introduce a simple method, TextNorm, for inducing confidence calibration in reward models by normalizing the scores across prompts that are semantically different from the original prompt. We demonstrate that using the confidence-calibrated scores in fine-tuning effectively reduces the risk of overoptimization.

SPOTLIGHT-105: Distort, Distract, Decode: Instruction-Tuned Model Can Refine its Response from Noisy Instructions

Primary Area: generative models

Keywords: Instruction Following Language Model Decoding

Scores: [ 8 6 8 8 ]

While instruction-tuned language models have demonstrated impressive zero-shot generalization, these models often struggle to generate accurate responses when faced with instructions that fall outside their training set. This paper presents Instructive Decoding (ID), a simple yet effective approach that augments the efficacy of instruction-tuned models. Specifically, ID adjusts the logits for next-token prediction in a contrastive manner, utilizing predictions generated from a manipulated version of the original instruction, referred to as a noisy instruction. This noisy instruction aims to elicit responses that could diverge from the intended instruction yet remain plausible. We conduct experiments across a spectrum of such noisy instructions, ranging from those that insert semantic noise via random words to others like 'opposite' that elicit the deviated responses. Our approach achieves considerable performance gains across various instruction-tuned models and tasks without necessitating any additional parameter updates. Notably, utilizing 'opposite' as the noisy instruction in ID, which shows the maximum divergence from the original instruction, consistently produces the most significant performance gains across multiple models and tasks.

SPOTLIGHT-106: H-GAP: Humanoid Control with a Generalist Planner

Primary Area: generative models

Keywords: Generative Modelling Humanoid Control Model Predictive Control Model-based Reinforcement Learning Offline Reinforcement Learning

Scores: [ 8 8 6 ]

Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations.The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, paves the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC).For 56 degrees of freedom humanoid, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviors. Further, without any learning from online interactions, it can also flexibly transfer these behaviours to solve novel downstream control tasks via planning. Notably, H-GAP excels established MPC baselines with access to the ground truth model, and is superior or comparable to offline RL methods trained for individual tasks.Finally, we do a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not computing.

POSTER-589: Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering

Primary Area: generative models

Keywords: large language model in-context learning calibration prompt

Scores: [ 6 6 6 6 ]

Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.

POSTER-590: Particle Guidance: non-I.I.D. Diverse Sampling with Diffusion Models

Primary Area: generative models

Keywords: diffusion models score-based models diversity guidance conformer generation image generation

Scores: [ 5 5 8 6 ]

In light of the widespread success of generative models, a significant amount of research has gone into speeding up their sampling time. However, generative models are often sampled multiple times to obtain a diverse set incurring in a cost that is orthogonal to sampling time. We tackle the question of how to improve diversity and sample efficiency by moving beyond the common assumption of independent samples. For this we propose particle guidance, an extension of diffusion-based generative sampling where a joint-particle time-evolving potential enforces diversity. We analyze theoretically the joint distribution that particle guidance generates, its implications on the choice of potential, and the connections with methods in other disciplines. Empirically, we test the framework both in the setting of conditional image generation, where we are able to increase diversity without affecting quality, and molecular conformer generation, where we reduce the state-of-the-art median error by 13% on average.

POSTER-591: Generative Pre-training for Speech with Flow Matching

Primary Area: generative models

Keywords: speech pre-training speech generation generative pre-training flow matching

Scores: [ 6 8 6 3 ]

Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.

POSTER-592: Language Model Cascades: Token-Level Uncertainty And Beyond

Primary Area: generative models

Keywords: Cascades Efficient Inference Language Models

Scores: [ 6 6 8 8 ]

Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. A simple strategy to achieve more favorable cost-quality tradeoffs is cascading: here, a small model is invoked for most “easy” instances, while a few “hard” instances are deferred to the large model. While the principles underpinning effective cascading are well-studied for classification tasks — with deferral based on predicted class uncertainty favored theoretically and practically — a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either over- or under-emphasizing outputs based on their lengths. This is because LMs produce a sequence of uncertainty values, one for each output token; and moreover, the number of output tokens is variable across different examples. To mitigate the length bias, we propose to exploit the richer token-level uncertainty information implicit in generative LMs. We argue that naive predicted sequence uncertainty corresponds to a simple aggregation of these uncertainties. By contrast, we show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform such simple aggregation strategies, via experiments on a range of natural language benchmarks with FLAN-T5 models. We further show that incorporating embeddings from the smaller model and intermediate layers of the larger model can give an additional boost in the overall cost-quality tradeoff.

POSTER-593: HiGen: Hierarchical Graph Generative Networks

Primary Area: generative models

Keywords: Generative Models Graph Generative Network Graph Neural Network Probabilistic Model

Scores: [ 6 6 6 6 8 ]

Most real-world graphs exhibit a hierarchical structure, which is often overlooked by existing graph generation methods. To address this limitation, we propose a novel graph generative network that captures the hierarchical nature of graphs and successively generates the graph sub-structures in a coarse-to-fine fashion. At each level of hierarchy, this model generates communities in parallel, followed by the prediction of cross-edges between communities using separate neural networks. This modular approach enables scalable graph generation for large and complex graphs. Moreover, we model the output distribution of edges in the hierarchical graph with a multinomial distribution and derive a recursive factorization for this distribution. This enables us to generate community graphs with integer-valued edge weights in an autoregressive manner.Empirical studies demonstrate the effectiveness and scalability of our proposed generative model, achieving state-of-the-art performance in terms of graph quality across various benchmark datasets.

SPOTLIGHT-107: At Which Training Stage Does Code Data Help LLMs Reasoning?

Primary Area: generative models

Keywords: code data large language models reasoning capabilities

Scores: [ 8 8 8 5 ]

POSTER-594: VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

Primary Area: generative models

Keywords: video generation diffusion model temproal modeling

Scores: [ 6 6 3 6 ]

Creating stable, controllable videos is a complex task due to the need for significant variation in temporal dynamics and cross-frame temporal consistency. To address this, we enhance the spatial-temporal capability and introduce a versatile video generation model, VersVideo, which leverages textual, visual, and stylistic conditions. Current video diffusion models typically extend image diffusion architectures by supplementing 2D operations (such as convolutions and attentions) with temporal operations. While this approach is efficient, it often restricts spatial-temporal performance due to the oversimplification of standard 3D operations. To counter this, we incorporate two key elements: (1) multi-excitation paths for spatial-temporal convolutions with dimension pooling across different axes, and (2) multi-expert spatial-temporal attention blocks. These enhancements boost the model's spatial-temporal performance without significantly escalating training and inference costs. We also tackle the issue of information loss that arises when a variational autoencoder is used to transform pixel space into latent features and then back into pixel frames. To mitigate this, we incorporate temporal modules into the decoder to maintain inter-frame consistency. Lastly, by utilizing the innovative denoising UNet and decoder, we develop a unified ControlNet model suitable for various conditions, including image, Canny, HED, depth, and style. Examples of the videos generated by our model can be found at https://anonymous-pages.github.io/video_demos/.

POSTER-595: Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models

Primary Area: generative models

Keywords: large language models quantization model compression

Scores: [ 6 6 6 ]

Large Language Models (LLMs) have recently demonstrated a remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to its large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than the conventional per-output channel (OC). Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so similarly grouping the weights in the IC direction can isolate outliers to be within a group. We also find that activation outliers do not dictate quantization difficulty, and inherent weight sensitivities also exist. With per-IC quantization as a new outlier-friendly scheme, we then propose Adaptive Dimensions (AdaDim), a versatile quantization framework that can adapt to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs.

POSTER-596: FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Primary Area: generative models

Keywords: diffusion model video editing text-to-video

Scores: [ 8 5 5 8 ]

Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts.A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention.Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in diffusion model's U-Net to address the inconsistency issue for text-to-video editing.Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion based text-to-video editing methods and improve their visual consistency.Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.

SPOTLIGHT-108: On the Stability of Iterative Retraining of Generative Models on their own Data

Primary Area: generative models

Keywords: Generative Models Iterative Training Diffusion

Scores: [ 8 5 8 6 ]

Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to the striking performance of these models combined with their ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models must contend with the reality that their training is curated from both clean data and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact on the stability of training generative models on mixed datasets of real and synthetic data. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. Building on this foundation we quantify the error incurred by iterative retraining of generative models and we also provide a radius stability in parameter space. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.

POSTER-597: Bellman Optimal Step-size Straightening of Flow-Matching Models

Primary Area: generative models

Keywords: flow matching generative model efficient sampling distillation responsible ML

Scores: [ 6 6 6 ]

Flow matching is a powerful framework for generating high-quality samples in various applications, especially image synthesis. However, the intensive computational demands of these models, especially during the fine-tuning process and sampling processes, pose significant challenges for low-resource scenarios. This paper introduces Bellman Optimal Step-size Straightening (BOSS) technique for distilling flow-matching generative models: it aims specifically for a few-step efficient image sampling while adhering to a computational budget constraint. First, this technique involves a dynamic programming algorithm that optimizes the step sizes of the pretrained network. Then, it refines the velocity network to match the optimal step sizes, aiming to straighten the generation paths. Extensive experimental evaluations across image generation tasks demonstrate the efficacy of BOSS in terms of both resource utilization and image quality. Our results reveal that BOSS achieves substantial gains in efficiency while maintaining competitive sample quality, effectively bridging the gap between low-resource constraints and the demanding requirements of flow-matching generative models. Our paper also fortifies the responsible development of artificial intelligence, offering a more sustainable generative model that reduces computational costs and environmental footprints.

POSTER-598: Talk like a Graph: Encoding Graphs for Large Language Models

Primary Area: generative models

Keywords: Graph problems large language models encoding graphs generative models

Scores: [ 6 6 6 6 ]

POSTER-599: Consistent Video-to-Video Transfer Using Synthetic Dataset

Primary Area: generative models

Keywords: Computer Vision Video Editing Diffusion Model

Scores: [ 5 6 5 8 ]

We introduce a novel and efficient approach for text-based video-to-video editing that eliminates the need for resource-intensive per-video-per-model finetuning. At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain. Extending the Prompt-to-Prompt to videos, we efficiently generate paired samples, each with an input video and its edited counterpart. Alongside this, we introduce the Long Video Sampling Correction during sampling, ensuring consistent long videos across batches. Our method surpasses current methods like Tune-A-Video, heralding substantial progress in text-based video-to-video editing and suggesting exciting avenues for further exploration and deployment.

POSTER-600: Towards Generative Abstract Reasoning: Completing Raven’s Progressive Matrix via Rule Abstraction and Selection

Primary Area: generative models

Keywords: Deep Latent Variable Models Generative Models Raven’s Progressive Matrix Abstract Visual Reasoning

Scores: [ 8 6 6 6 ]

Endowing machines with abstract reasoning ability has been a long-term research topic in artificial intelligence. Raven's Progressive Matrix (RPM) is widely used to probe abstract visual reasoning in machine intelligence, where models need to understand the underlying rules and select the missing bottom-right images out of candidate sets to complete image matrices. The participators can display powerful reasoning ability by inferring the underlying attribute-changing rules and imagining the missing images at arbitrary positions. However, existing solvers can hardly manifest such an ability in realistic RPM problems. In this paper, we propose a conditional generative model to solve answer generation problems through Rule AbstractIon and SElection (RAISE) in the latent space. RAISE encodes image attributes as latent concepts and decomposes underlying rules into atomic rules by means of concepts, which are abstracted as global learnable parameters. When generating the answer, RAISE selects proper atomic rules out of the global knowledge set for each concept and composes them into the integrated rule of an RPM. In most configurations, RAISE outperforms the compared generative solvers in tasks of generating bottom-right and arbitrary-position answers. We test RAISE in the odd-one-out task and two held-out configurations to demonstrate how learning decoupled latent concepts and atomic rules helps find the image breaking the underlying rules and handle RPMs with unseen combinations of rules and attributes.

POSTER-601: HIFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance

Primary Area: generative models

Keywords: Text-to-3D Generative models Diffusion models

Scores: [ 6 8 6 ]

The advancements in automatic text-to-3D generation have been remarkable. Most existing methods use pre-trained text-to-image diffusion models to optimize 3D representations like Neural Radiance Fields (NeRFs) via latent-space denoising score matching. Yet, these methods often result in artifacts and inconsistencies across different views due to their suboptimal optimization approaches and limited understanding of 3D geometry. Moreover, the inherent constraints of NeRFs in rendering crisp geometry and stable textures usually lead to a two-stage optimization to attain high-resolution details. This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation, all in a single-stage optimization. We compute denoising scores in the text-to-image diffusion model's latent and image spaces. Instead of randomly sampling timesteps (also referred to as noise levels in denoising score matching), we introduce a novel timestep annealing approach that progressively reduces the sampled timestep throughout optimization. To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel smoothing technique that refines importance sampling weights coarse-to-fine, ensuring accurate and thorough sampling in high-density regions. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling the generation of highly detailed and view-consistent 3D assets through a single-stage training process.

POSTER-602: DiffusionSat: A Generative Foundation Model for Satellite Imagery

Primary Area: generative models

Keywords: satellite images generative models diffusion models computer vision

Scores: [ 8 6 3 8 ]

Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets .As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, multi-spectral superrresolution and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale \textit{generative} foundation model for satellite imagery.

POSTER-603: Diffusion Sampling with Momentum for Mitigating Divergence Artifacts

Primary Area: generative models

Keywords: diffusion models heavy ball momentum divergence artifacts numerical method ode solver image generation

Scores: [ 8 5 6 8 ]

Despite the remarkable success of diffusion models in image generation, slow sampling remains a persistent issue. To accelerate the sampling process, prior studies have reformulated diffusion sampling as an ODE/SDE and introduced higher-order numerical methods. However, these methods often produce divergence artifacts, especially with a low number of sampling steps, which limits the achievable acceleration. In this paper, we investigate the potential causes of these artifacts and suggest that the small stability regions of these methods could be the principal cause. To address this issue, we propose two novel techniques. The first technique involves the incorporation of Heavy Ball (HB) momentum, a well-known technique for improving optimization, into existing diffusion numerical methods to expand their stability regions. We also prove that the resulting methods have first-order convergence. The second technique, called Generalized Heavy Ball (GHVB), constructs a new high-order method that offers a variable trade-off between accuracy and artifact suppression. Experimental results show that our techniques are highly effective in reducing artifacts and improving image quality, surpassing state-of-the-art diffusion solvers on both pixel-based and latent-based diffusion models for low-step sampling. Our research provides novel insights into the design of numerical methods for future diffusion work.

POSTER-604: Label-Noise Robust Diffusion Models

Primary Area: generative models

Keywords: diffusion model noisy label robustness

Scores: [ 6 6 6 5 ]

Conditional diffusion models have shown remarkable performance in various generative tasks, but training them requires large-scale datasets that often contain noise in conditional inputs, a.k.a. noisy labels. This noise leads to condition mismatch and quality degradation of generated data. This paper proposes Transition-aware weighted Denoising Score Matching (TDSM) for training conditional diffusion models with noisy labels, which is the first study in the line of diffusion models. The TDSM objective contains a weighted sum of score networks, where the weights represent instance-wise and time-dependent label transition probabilities. These weights are derived from the relationship between the conditional scores on noisy and clean labels. Also, we introduce a transition-aware weight estimator, which leverages a time-dependent noisy-label classifier distinctively customized to the diffusion process. We conduct experiments on various datasets and noisy label settings, and we verify that models trained with the TDSM objective generate high-quality samples that closely match the given conditions. Furthermore, our models improve generation performance even on benchmark datasets, which implies the potential noisy labels and their risk of generative model learning in prevalent benchmark datasets. Finally, we show improved performance of TDSM on top of conventional noisy label corrections, which empirically proves its contribution as a part of label-noise robust generative models.

SPOTLIGHT-109: DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model

Primary Area: generative models

Keywords: 3D Generation; Single-view 3D Reconstruction; text-to-3D

Scores: [ 10 8 6 8 ]

We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and, functioning as a denoiser, can denoise noisy multi-view images via 3D NeRF reconstruction and rendering, achieving single-stage 3D generation in the 2D diffusion denoising process. We train DMV3D on large-scale multi-view image datasets of extremely diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://dmv3d.github.io/.

POSTER-605: Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment

Primary Area: generative models

Keywords: motion generation multi-modality dance generation human human-interaction GPT reinforcement learning

Scores: [ 8 3 8 6 ]

We introduce a novel task within the field of human motion generation, termed dance accompaniment, which necessitates the generation of responsive movements from a dance partner, the "follower", synchronized with the lead dancer’s movements and the underlying musical rhythm. Unlike existing solo or group dance generation tasks, a duet dance scenario entails a heightened degree of interaction between the two participants, requiring delicate coordination in both pose and position. To support this task, we first build a large-scale and diverse duet interactive dance dataset, DD100, by recording about 115.4 minutes of professional dancers’ performances. To address the challenges inherent in this task, we propose a GPT based model, Duolando, which autoregressively predicts the subsequent tokenized motion conditioned on the coordinated information of the music, the leader’s and the follower’s movements. To further enhance the GPT’s capabilities of generating stable results on unseen conditions (music and leader motions), we devise an off-policy reinforcement learning strategy that allows the model to explore viable trajectories from out-of-distribution samplings, guided by human-defined rewards. Based on the collected dataset and proposed method, we establish a benchmark with several carefully designed metrics.

SPOTLIGHT-110: DreamFlow: High-quality text-to-3D generation by Approximating Probability Flow

Primary Area: generative models

Keywords: Text-to-3D generation Diffusion model Score-based Generative Model Score Distillation Sampling

Scores: [ 6 8 6 5 ]

Recent progress in text-to-3D generation has been achieved through the utilization of score distillation methods: they make use of the pre-trained text-to-image (T2I) diffusion models by distilling via the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance the text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to-3D optimization as a multi-view image-to-image translation problem, and propose a solution by approximating the probability flow. By leveraging the proposed novel optimization algorithm, we design DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality and high-resolution (i.e., 1024×1024) 3D contents. For example, we demonstrate that DreamFlow is 5 times faster than the existing state-of-the-art text-to-3D method, while producing more photorealistic 3D contents.

POSTER-606: The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”

Primary Area: generative models

Keywords: LLMs Large Language Models Question Answering Generalization Knowledge Representation Logical Inference Relations

Scores: [ 6 6 8 6 ]

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Olaf Scholz was the ninth Chancellor of Germany", it will not automatically be able to answer the question, "Who was the ninth Chancellor of Germany?". Moreover, the likelihood of the correct answer ("Olaf Scholz") will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if "A is B" occurs, "B is A" is more likely to occur). We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and showing that they fail to correctly answer "Who composed Abyssal Melodies?". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse.

POSTER-607: RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Primary Area: generative models

Keywords: Retrieval Augmented Language Models Information Retrieval summarization QA

Scores: [ 8 8 6 6 ]

Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.

POSTER-608: Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning

Primary Area: generative models

Keywords: Modular skill learning Multi-task learning Parameter-Efficient Fine-Tuning

Scores: [ 6 6 6 6 ]

Modular skill learning is an emerging direction in the field of Parameter Efficient Fine-Tuning (PEFT), as it enables neural networks to better organize and clarify various aspects of knowledge, leading to improved knowledge transfer for new tasks. In this paper, we introduce a novel approach that categorizes skills into shared domain skills and specialized skills, with the skill parameters being highly parameterized using low-rank or sparse techniques. Each task is associated with an exclusive specialized skill while also benefiting from shared domain skills. Moreover, tasks can selectively utilize specialized skills from other tasks as needed. To facilitate this approach, we propose a skill assignment matrix that can be jointly learned, and the task network is instantiated based on the skill parameters. To evaluate the effectiveness of our approach, we conducted extensive experiments on the Super Natural Instructions and SuperGLUE datasets. Our results demonstrate that compared to fully-shared, task-specific, or skill-indistinguishable baselines. Modular learning with skill-type discrimination significantly enhances the sample efficiency of multi-task learning. Furthermore, the freezing of a substantial number of base model parameters greatly improves parameter efficiency, leading to boosted training efficiency.

SPOTLIGHT-111: AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Primary Area: generative models

Keywords: Deep Learning Diffusion Model Video Generation

Scores: [ 6 8 8 6 ]

With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity.

POSTER-609: Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

Primary Area: generative models

Keywords: diffusion models image editing diffusion inversion

Scores: [ 6 8 6 6 ]

Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt. Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce “Direct Inversion,” a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of speed-up.

POSTER-610: DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation

Primary Area: generative models

Keywords: diffusion model subject-driven text-to-image generation disentangled finetuning

Scores: [ 8 8 8 6 ]

Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in the generated images heavily dependent on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding can not be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle the problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. We further design the novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.

POSTER-611: Alpagasus: Training a Better Alpaca Model with Fewer Data

Primary Area: generative models

Keywords: data selection; large language model; instruction tuning;

Scores: [ 6 6 6 ]

Large language models~(LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and removes low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce Alpagasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. Alpagasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human study. Its 13B variant matches >90% performance of its teacher LLM (i.e., Text-Davinci-003) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes \footnote{We apply IFT for the same number of epochs as Alpaca(7B) but on fewer data, using 4$\times$NVIDIA A100 (80GB) GPUs and following the original Alpaca setting and hyperparameters.}. In the experiment, we also demonstrate that our method can work not only for machine-generated datasets but also for human-written datasets. Overall, Alpagasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models.

POSTER-612: MVDream: Multi-view Diffusion for 3D Generation

Primary Area: generative models

Keywords: Image Generation 3D Generation Diffusion Model Multi-view consistency

Scores: [ 8 6 6 6 ]

We introduce MVDream, a multi-view diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view prior can serve as a generalizable 3D prior that is agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation.

POSTER-613: ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Primary Area: generative models

Keywords: Large Language Model Multi-Agent Debate LLM evaluators

Scores: [ 5 6 5 6 6 ]

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality.Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies.In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of different texts. Our experiments on two benchmarks illustrate that ChatEval delivers superior accuracy and correlation in alignment with human assessment. Furthermore, we find that the diverse role prompts (different personas) are essential in the multi-agent debate process; that is, utilizing the same role description in the prompts can lead to a degradation in performance. Our qualitative analysis also shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.

POSTER-614: SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

Primary Area: generative models

Keywords: generative model; video generation; diffusion model

Scores: [ 6 6 5 5 ]

Recently video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level'') depicting a single scene. To deliver a coherent long video ("story-level''), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos. Specifically, we propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three assessing criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos.

POSTER-615: Denoising Diffusion Bridge Models

Primary Area: generative models

Keywords: Generative Modeling Diffusion Models

Scores: [ 6 8 6 8 ]

Diffusion models are powerful generative models that map noise to data using stochastic processes. However, for many applications such as image editing, the model input comes from a distribution that is not random noise. As such, diffusion models must rely on cumbersome methods like guidance or projected sampling to incorporate this information in the generative process. In our work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural alternative to this paradigm based on diffusion bridges, a family of processes that interpolate between two paired distributions given as endpoints. Our method learns the score of the diffusion bridge from data and maps from one endpoint distribution to the other by solving a (stochastic) differential equation based on the learned score. Our method naturally unifies several classes of generative models, such as score-based diffusion models and OT-Flow-Matching, allowing us to adapt existing design and architectural choices to our more general problem. Empirically, we apply DDBMs to challenging image datasets in both pixel and latent space. On standard image translation problems, DDBMs achieve significant improvement over baseline methods, and, when we reduce the problem to image generation by setting the source distribution to random noise, DDBMs achieve comparable FID scores to state-of-the-art methods despite being built for a more general task.

POSTER-616: SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking

Primary Area: generative models

Keywords: Sequence Modelling Imitiation Learning Language Modelling

Scores: [ 6 6 6 6 ]

In many domains, autoregressive models can attain high likelihood on the task of predicting the next observation. However, this maximum-likelihood (MLE) objective does not necessarily match a downstream use-case of autoregressively generating high-quality sequences. The MLE objective weights sequences proportionally to their frequency under the data distribution, with no guidance for the model's behaviour out of distribution (OOD): leading to compounding error during autoregressive generation. In order to address this compounding error problem, we formulate sequence generation as an imitation learning (IL) problem. This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset, including divergences with weight on OOD generated sequences. The IL framework also allows us to incorporate backtracking by introducing a backspace action into the generation process. This further mitigates the compounding error problem by allowing the model to revert a sampled token if it takes the sequence OOD. Our resulting method, SequenceMatch, can be implemented without adversarial training or major architectural changes. We identify the SequenceMatch-χ2 divergence as a more suitable training objective for autoregressive models which are used for generation. We show that empirically, SequenceMatch training leads to improvements over MLE on text generation with language models and arithmetic

POSTER-617: Multimarginal Generative Modeling with Stochastic Interpolants

Primary Area: generative models

Keywords: multi-marginal unsupervised learning generative modeling measure transport optimal transport

Scores: [ 5 5 8 ]

Given a set of K probability densities, we consider the multimarginal generative modeling problem of learning a joint distribution that recovers these densities as marginals. The structure of this joint distribution should identify multi-way correspondences among the prescribed marginals. We formalize an approach to this task within a generalization of the stochastic interpolant framework, leading to efficient learning algorithms built upon dynamical transport of measure. Our generative models are defined by velocity and score fields that can be characterized as the minimizers of simple quadratic objectives, and they are defined on a simplex that generalizes the time variable in the usual dynamical transport framework. The resulting transport on the simplex is influenced by all marginals, and we show that multi-way correspondences can be extracted. The identification of such correspondences has applications to style transfer, algorithmic fairness, and data decorruption. In addition, the multimarginal perspective enables an efficient algorithm for optimizing the dynamical transport cost in the ordinary two-marginal setting. We demonstrate these capacities with several numerical examples.

POSTER-618: Few-shot Hybrid Domain Adaptation of Image Generator

Primary Area: generative models

Keywords: Few shot; Domain Adaptation; Image Generator

Scores: [ 5 8 8 ]

Can a pre-trained generator be adapted to the hybrid of multiple target domains and generate images with integrated attributes of them? In this work, we introduce a new task -- Few-shot Hybrid Domain Adaptation (HDA). Given a source generator and several target domains, HDA aims to acquire an adapted generator that preserves the integrated attributes of all target domains, without overriding the source domain's characteristics. Compared with Domain Adaptation (DA), HDA offers greater flexibility and versatility to adapt generators to more composite and expansive domains. Simultaneously, HDA also presents more challenges than DA as we have access only to images from individual target domains and lack authentic images from the hybrid domain. To address this issue, we introduce a discriminator-free framework that directly encodes different domains' images into well-separable subspaces. To achieve HDA, we propose a novel directional subspace loss comprised of a distance loss and a direction loss. Concretely, the distance loss blends the attributes of all target domains by reducing the distances from generated images to all target subspaces. The direction loss preserves the characteristics from the source domain by guiding the adaptation along the perpendicular to subspaces. Experiments show that our method can obtain numerous domain-specific attributes in a single adapted generator, which surpasses the baseline methods in semantic similarity, image fidelity, and cross-domain consistency.

POSTER-619: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

Primary Area: generative models

Keywords: diffusion models layout-to-image domain generalization

Scores: [ 6 6 6 6 ]

Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).

POSTER-620: AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

Primary Area: generative models

Keywords: large language mode agent multi-agent

Scores: [ 6 6 6 6 ]

Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose a multi-agent framework AgentVerse that can effectively orchestrate a collaborative group of expert agents as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that AgentVerse can proficiently deploy multi-agent groups that outperform a single agent. Extensive experiments on text understanding, reasoning, coding, tool utilization, and embodied AI confirm the effectiveness of AgentVerse. Moreover, our analysis of agent interactions within AgentVerse reveals the emergence of specific collaborative behaviors, contributing to heightened group efficiency. We will release our codebase, AgentVerse, to further facilitate multi-agent research.

SPOTLIGHT-112: CAS: A Probability-Based Approach for Universal Condition Alignment Score

Primary Area: generative models

Keywords: Generative model diffusion model score-based prior conditional diffusion model text-to-image alignment score inversion process image quality assessment T2I alignment score

Scores: [ 8 8 8 ]

Recent conditional diffusion models have shown remarkable advancements and have been widely applied in fascinating real-world applications. However, samples generated by these models often do not strictly comply with user-provided conditions. Due to this, there have been few attempts to evaluate this alignment via pre-trained scoring models to select well-generated samples. Nonetheless, current studies are confined to the text-to-image domain and require large training datasets. This suggests that crafting alignment scores for various conditions will demand considerable resources in the future. In this context, we introduce a universal condition alignment score that leverages the conditional probability measurable through the diffusion process. Our technique operates across all conditions and requires no additional models beyond the diffusion model used for generation, effectively enabling self-rejection. Our experiments validate that our met- ric effectively applies in diverse conditional generations, such as text-to-image, {instruction, image}-to-image, edge-/scribble-to-image, and text-to-audio.

POSTER-621: Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems

Primary Area: generative models

Keywords: Diffusion models; Inverse problems; Krylov subspace

Scores: [ 8 6 6 6 ]

SPOTLIGHT-113: Single Motion Diffusion

Primary Area: generative models

Keywords: Deep Learning Motion synthesis Animation Single Instance Learning Generative models

Scores: [ 8 8 8 6 ]

Synthesizing realistic animations of humans, animals, and even imaginary creatures, has long been a goal for artists and computer graphics professionals. Compared to the imaging domain, which is rich with large available datasets, the number of data instances for the motion domain is limited, particularly for the animation of animals and exotic creatures (e.g., dragons), which have unique skeletons and motion patterns. In this work, we introduce SinMDM, a Single Motion Diffusion Model. It is designed to learn the internal motifs of a single motion sequence with arbitrary topology and synthesize a variety of motions of arbitrary length that remain faithful to the learned motifs. We harness the power of diffusion models and present a denoising network explicitly designed for the task of learning from a single input motion. SinMDM is crafted as a lightweight architecture, which avoids overfitting by using a shallow network with local attention layers that narrow the receptive field and encourage motion diversity. Our work applies to multiple contexts, including spatial and temporal in-betweening, motion expansion, style transfer, and crowd animation. Our results show that SinMDM outperforms existing methods both qualitatively and quantitatively. Moreover, while prior network-based approaches require additional training for different applications, SinMDM supports these applications during inference. Our code is included as supplementary material and will be published.

POSTER-622: Group Preference Optimization: Few-Shot Alignment of Large Language Models

Primary Area: generative models

Keywords: Large Language Models alignment group preference alignment few-shot learning in-context learning fine-tuning

Scores: [ 6 5 6 ]

Many applications of large language models (LLMs), ranging from chatbots tocreative writing, require nuanced subjective judgments that can differ significantlyacross different groups. Existing alignment algorithms can be expensive to alignfor each group, requiring prohibitive amounts of group-specific preference dataand computation for real-world use cases. We introduce Group Preference Optimization (GPO), an alignment framework that steers language models to preferences of individual groups in a few-shot manner. In GPO, we augment the baseLLM with an independent transformer module trained to predict the preferencesof a group for the LLM generations. For few-shot learning, we parameterize thismodule as an in-context autoregressive transformer and train it via meta-learningon several groups. We empirically validate the efficacy of GPO through rigorous evaluations using LLMs with varied sizes on three human opinion adaptation tasks. These tasks involve adapting to the preferences of US demographicgroups, global countries, and individual users. Our results demonstrate that GPOnot only aligns models more accurately but also requires fewer group-specificpreferences and less training and inference computing resources, outperformingexisting strategies such as in-context steering and fine-tuning methods.

POSTER-623: The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing

Primary Area: generative models

Keywords: Diffusion models Stochastic differential equations Image dragging Image editing Score-based models

Scores: [ 8 6 5 ]

We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose \emph{SDE-Drag} -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed \emph{DragBench}) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.

POSTER-624: The LLM Surgeon

Primary Area: generative models

Keywords: large language model LLM compression pruning fisher

Scores: [ 5 5 5 ]

State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to large language models. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of large language models. We will open source our code on GitHub upon acceptance.

POSTER-625: DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior

Primary Area: generative models

Keywords: text to 3D diffusion models NeRF neural rendering 3d synthesis

Scores: [ 6 5 6 8 ]

We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose bootstrapped score distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.

SPOTLIGHT-114: Idempotence and Perceptual Image Compression

Primary Area: generative models

Keywords: perceptual image compression neural image compression

Scores: [ 6 8 8 8 ]

Idempotence is the stability of image codec to re-compression. At the first glance, it is unrelated to perceptual image compression. However, we find that theoretically: 1) Conditional generative model-based perceptual codec satisfies idempotence; 2) Unconditional generative model with idempotence constraint is equivalent to conditional generative codec. Based on this newfound equivalence, we propose a new paradigm of perceptual image codec by inverting unconditional generative model with idempotence constraints. Our codec is theoretically equivalent to conditional generative codec, and it does not require training new models. Instead, it only requires a pre-trained mean-square-error codec and unconditional generative model. Empirically, we show that our proposed approach outperforms state-of-the-art methods such as HiFiC and ILLM, in terms of Fréchet Inception Distance (FID). The source code is provided as supplementary material.

POSTER-626: WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions

Primary Area: generative models

Keywords: Large Language Model Instruction Fine-tuning

Scores: [ 6 6 6 6 ]

Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Both automatic and human evaluations consistently indicate that WizardLM outperforms baselines such as Alpaca (trained from Self-Instruct) and Vicuna (trained from human-created instructions). The experimental results demonstrate that the quality of instruction-following dataset crafted by Evol-Instruct can significantly improve the performance of LLMs.

SPOTLIGHT-115: Generative Learning for Financial Time Series with Irregular and Scale-Invariant Patterns

Primary Area: generative models

Keywords: generative model time series pattern recognition diffusion model financial time series

Scores: [ 8 8 6 ]

Limited data availability poses a major obstacle in training deep learning models for financial applications. Synthesizing financial time series to augment real-world data is challenging due to the irregular and scale-invariant patterns uniquely associated with financial time series - temporal dynamics that repeat with varying duration and magnitude. Such dynamics cannot be captured by existing approaches, which often assume regularity and uniformity in the underlying data. We develop a novel generative framework called FTS-Diffusion to model irregular and scale-invariant patterns that consists of three modules. First, we develop a scale-invariant pattern recognition algorithm to extract recurring patterns that vary in duration and magnitude. Second, we construct a diffusion-based generative network to synthesize segments of patterns. Third, we model the temporal transition of patterns in order to aggregate the generated segments. Extensive experiments show that FTS-Diffusion generates synthetic financial time series highly resembling observed data, outperforming state-of-the-art alternatives. Two downstream experiments demonstrate that augmenting real-world data with synthetic data generated by FTS-Diffusion reduces the error of stock market prediction by up to 17.9%. To the best of our knowledge, this is the first work on generating intricate time series with irregular and scale-invariant patterns, addressing data limitation issues in finance.

POSTER-627: The Generative AI Paradox: “What It Can Create, It May Not Understand”

Primary Area: generative models

Keywords: LMs language models vision models GPT4 ChatGPT Midjourney generative models understanding models

Scores: [ 6 8 6 8 ]

The recent wave of generative AI has sparked unprecedented global attention, with both excitement and concern over potentially superhuman levels of artificial intelligence: models now take only seconds to produce outputs that would challenge or exceed the capabilities even of expert humans. At the same time, models still show basic errors in understanding that would not be expected even in non-expert humans. This presents us with an apparent paradox: how do we reconcile seemingly superhuman capabilities with the persistence of errors that few humans would make? In this work, we posit that this tension reflects a divergence in the configuration of intelligence in today's generative models relative to intelligence in humans. Specifically, we propose and test the Generative AI Paradox hypothesis: generative models, having been trained directly to reproduce expert-like outputs, acquire generative capabilities that are not contingent upon---and can therefore exceed---their ability to understand those same types of outputs. This contrasts with humans, for whom basic understanding almost always precedes the ability togenerate expert-level outputs. We test this hypothesis through controlled experiments analyzing generation vs.~understanding in generative models, across both language and image modalities. Our results show that although models can outperform humans in generation, they consistently fall short of human capabilities in measures of understanding, as well as weaker correlation between generation and understanding performance, and more brittleness to adversarial inputs. Our findings support the hypothesis that models' generative capability may not be contingent upon understanding capability, and call for caution in interpreting artificial intelligence by analogy to human intelligence.

POSTER-628: Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution

Primary Area: generative models

Keywords: diffusion models diffusion ODE image super-resolution

Scores: [ 6 8 6 ]

Diffusion models, as a kind of powerful generative model, have given impressive results on image super-resolution (SR) tasks. However, due to the randomness introduced in the reverse process of diffusion models, the performances of diffusion-based SR models are fluctuating at every time of sampling, especially for samplers with few resampled steps. This inherent randomness of diffusion models results in ineffectiveness and instability, making it challenging for users to guarantee the quality of SR results. However, our work takes this randomness as an opportunity: fully analyzing and leveraging it leads to the construction of an effective plug-and-play sampling method that owns the potential to benefit a series of diffusion-based SR methods. More in detail, we propose to steadily sample high-quality SR images from pre-trained diffusion-based SR models by solving diffusion ordinary differential equations (diffusion ODEs) with optimal boundary conditions (BCs) and analyze the characteristics between the choices of BCs and their corresponding SR results. Our analysis shows the route to obtain an approximately optimal BC via an efficient exploration in the whole space. The quality of SR results sampled by the proposed method with fewer steps outperforms the quality of results sampled by current methods with randomness from the same pre-trained diffusion-based SR model, which means that our sampling method ''boosts'' current diffusion-based SR models without any additional training.

POSTER-629: PB-LLM: Partially Binarized Large Language Models

Primary Area: generative models

Keywords: Large Language Model Network Compression

Scores: [ 8 6 5 8 ]

This paper explores network binarization, a radical form of quantization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression. Due to previous binarization methods collapsing LLMs, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs. Specifically, our exploration first uncovers the ineffectiveness of naïve applications of existing binarization algorithms and highlights the imperative role of salient weights in achieving low-bit quantization. Thus, PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage, i.e., partially-binarization. PB-LLM is extended to recover the capacities of quantized LMMs, by analyzing from the perspective of post-training quantization (PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian matrix and successfully recover the reasoning capacity of PB-LLM in low-bit. Under QAT, we freeze the salient weights during training, explore the derivation of optimal scaling factors crucial for minimizing the quantization error, and propose a scaling mechanism based on this derived scaling strategy for residual binarized weights. Those explorations and the developed methodologies significantly contribute to rejuvenating the performance of low-bit quantized LLMs and present substantial advancements in the field of network binarization for LLMs.

POSTER-630: Manifold Diffusion Fields

Primary Area: generative models

Keywords: Diffusion models riemannian manifolds graphs

Scores: [ 6 6 8 ]

We present Manifold Diffusion Fields (MDF), an approach that unlocks learning of diffusion models of data in general non-euclidean geometries. Leveraging insights from spectral geometry analysis, we define an intrinsic coordinate system on the manifold via the eigen-functions of the Laplace-Beltrami Operator. MDF represents functions using an explicit parametrization formed by a set of multiple input-output pairs. Our approach allows to sample continuous functions on manifolds and is invariant with respect to rigid and isometric transformations of the manifold. In addition, we show that MDF generalizes to the case where the training set contains functions on different manifolds. Empirical results on multiple datasets and manifolds including challenging scientific problems like weather prediction or molecular conformation show that MDF can capture distributions of such functions with better diversity and fidelity than previous approaches.

SPOTLIGHT-116: Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood

Primary Area: generative models

Keywords: Energy-based model recovery-likelihood cooperative learning

Scores: [ 8 6 6 6 8 ]

Training energy-based models (EBMs) on high-dimensional data can be both challenging and time-consuming, and there exists a noticeable gap in sample quality between EBMs and other generative frameworks like GANs and diffusion models. To close this gap, inspired by the recent efforts of learning EBMs by maximimizing diffusion recovery likelihood (DRL), we propose cooperative diffusion recovery likelihood (CDRL), an effective approach to tractably learn and sample from a series of EBMs defined on increasingly noisy versons of a dataset, paired with an initializer model for each EBM. At each noise level, the two models are jointly estimated within a cooperative training framework: Samples from the initializer serve as starting points that are refined by a few MCMC sampling steps from the EBM. The EBM is then optimized by maximizing recovery likelihood, while the initializer model is optimized by learning from the difference between the refined samples and the initial samples. In addition, we made several practical designs for EBM training to further improve the sample quality. Combining these advances, we significantly boost the generation performance compared to existing EBM methods on CIFAR-10 and ImageNet 32x32. And we have shown that CDRL has great potential to largely reduce the sampling time. We also demonstrate the effectiveness of our models for several downstream tasks, including classifier-free guided generation, compositional generation, image inpainting and out-of-distribution detection.

POSTER-631: GAIA: Data-driven Zero-shot Talking Avatar Generation

Primary Area: generative models

Keywords: Talking Avatar Generation Video Generation Disentanglement Diffusion Models

Scores: [ 6 6 6 8 ]

Previous methods for zero-shot talking avatar generation have relied on domain-specific heuristics such as warping-based motion representation and 3D Morphable Models (3DMMs), etc. However, these heuristics limit the diversity and naturalness of the generated avatars due to the absence of end-to-end learning. In this work, we instead provide a data-driven solution, named GAIA (Generative AI for Avatar), that eliminates the domain priors in talking avatar generation. In light of the observation that the speech only drives the motion of the avatar while the appearance of the target avatar and the background typically remain the same throughout the entire video, we divide our approach into two stages: 1) motion and appearance disentanglement and 2) speech-to-motion generation. The first stage disentangles each frame into motion and appearance representations, while the second stage generates motion sequences conditioned on the speech and reference avatar image. To evaluate our approach, we collect a large-scale high-quality talking avatar dataset and train the model on it with different scales (up to 2B parameters). Experimental results verify the superiority, scalability, and flexibility of GAIA as 1) the resulting model beats previous baseline models in terms of naturalness, diversity, lip-sync quality, and visual quality; 2) the framework is scalable and larger models yield better results; 3) it is general and enables different applications like controllable talking avatar generation and text-driven video generation.

ORAL-36: Generalization in diffusion models arises from geometry-adaptive harmonic representation

Primary Area: generative models

Keywords: diffusion models memorization generalization inductive bias curse of dimensionality denoising geometry-adaptive harmonic basis

Scores: [ 8 8 10 8 ]

High-quality samples generated with score-based reverse diffusion algorithms provide evidence that deep neural networks (DNN) trained for denoising can learn high-dimensional densities, despite the curse of dimensionality. However, recent reports of memorization of the training set raise the question of whether these networks are learning the true'' density of the data. Here, we show that two denoising DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density, with a surprisingly small number of training images. This strong generalization demonstrates the existence of powerful inductive biases in the DNN architecture and/or training algorithm. We analyze these, demonstrating that the denoiser performs a shrinkage operation in a basis adapted to the underlying image. Examination of these bases reveals oscillating harmonic structures along contours and in homogeneous image regions. We show that trained denoisers are inductively biased towards these geometry-adaptive harmonic representations by demonstrating that they arise even when the network is trained on image classes such as low-dimensional manifolds for which the harmonic basis is suboptimal. Additionally, we show that the denoising performance of the networks is near-optimal when trained on regular image classes for which the optimal basis is known to be geometry-adaptive and harmonic.

SPOTLIGHT-117: Coeditor: Leveraging Repo-level Diffs for Code Auto-editing

Primary Area: generative models

Keywords: language model for code editing refactoring

Scores: [ 5 6 8 6 ]

Developers often dedicate significant time to maintaining and refactoring existing code. However, most prior work on generative models for code focuses solely on creating new code, overlooking the distinctive needs of editing existing code. In this work, we explore a multi-round code auto-editing setting, aiming to predict edits to a code region based on recent changes within the same codebase. Our model, Coeditor, is a fine-tuned language model specifically designed for code editing tasks. We represent code changes using a line diff format and employ static analysis to form large customized model contexts, ensuring the availability of appropriate information for prediction. We collect a code editing dataset from the commit histories of 1650 open-source Python projects for training and evaluation. In a simplified single-round, single-edit task, Coeditor significantly outperforms GPT-3.5 and SOTA open-source code completion models (bringing exact-match accuracy from 34.7 up to 60.4), demonstrating the benefits of incorporating editing history for code completion. In a multi-round, multi-edit setting, we observe substantial gains by iteratively conditioning on additional user edits. We have open-sourced our code, data, and model weights to encourage future research and have released a VSCode extension powered by our model for interactive IDE usage.

POSTER-632: TOSS: High-quality Text-guided Novel View Synthesis from a Single Image

Primary Area: generative models

Keywords: novel view synthesis 3D object synthesis diffusion model

Scores: [ 6 6 6 6 ]

In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero123 has demonstrated impressive zero-shot open-set NVS capabilities, it treats NVS as a pure image-to-image translation problem. This approach suffers from the challengingly under-constrained nature of single-view NVS: the process lacks means of explicit user control and often result in implausible NVS generations.To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space.TOSS fine-tunes text-to-image Stable Diffusion pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments are conducted with results showing that our proposed TOSS outperforms Zero123 with higher-quality NVS results and faster convergence. We further support these results with comprehensive ablations that underscore the effectiveness and potential of the introduced semantic guidance and architecture design.

POSTER-633: ED-NeRF: Efficient Text-Guided Editing of 3D Scene With Latent Space NeRF

Primary Area: generative models

Keywords: NeRF Diffusion model 3D scene editing

Scores: [ 6 6 5 5 ]

Recently, there has been a significant advancement in text-to-image diffusion models, leading to groundbreaking performance in 2D image generation. These advancements have been extended to 3D models, enabling the generation of novel 3D objects from textual descriptions. This has evolved into NeRF editing methods, which allow the manipulation of existing 3D objects through textual conditioning. However, existing NeRF editing techniques have faced limitations in their performance due to slow training speeds and the use of loss functions that do not adequately consider editing. To address this, here we present a novel 3D NeRF editing approach dubbed ED-NeRF by successfully embedding real-world scenes into the latent space of the latent diffusion model (LDM) through a unique refinement layer. This approach enables us to obtain a NeRF backbone that is not only faster but also more amenable to editing compared to traditional image space NeRF editing. Furthermore, we propose an improved loss function tailored for editing by migrating the delta denoising score (DDS) distillation loss, originally used in 2D image editing to the three-dimensional domain. This novel loss function surpasses the well-known score distillation sampling (SDS) loss in terms of suitability for editing purposes. Our experimental results demonstrate that ED-NeRF achieves faster editing speed while producing improved output quality compared to state-of-the-art 3D editing models.

POSTER-634: Elucidating the design space of classifier-guided diffusion generation

Primary Area: generative models

Keywords: conditional diffusion sampling classifier guidance

Scores: [ 5 6 5 6 8 ]

Guidance in conditional diffusion generation is of great importance for sample quality and controllability. However, existing guidance schemes are to be desired. On one hand, mainstream methods such as classifier guidance and classifier-free guidance both require extra training with labeled data, which is time-consuming and unable to adapt to new conditions.On the other hand, training-free methods such as universal guidance, though more flexible, have yet to demonstrate comparable performance. In this work, through a comprehensive investigation into the design space, we show that it is possible to achieve significant performance improvements over existing guidance schemes by leveraging off-the-shelf classifiers in a training-free fashion, enjoying the best of both worlds. Employing calibration as a general guideline, we propose several pre-conditioning techniques to better exploit pretrained off-the-shelf classifiers for guiding diffusion generation. Extensive experiments on ImageNet validate our proposed method, showing that state-of-the-art (SOTA) diffusion models (DDPM, EDM, DiT) can be further improved (up to 20%) using off-the-shelf classifiers with barely any extra computational cost.With the proliferation of publicly available pretrained classifiers, our proposed approach has great potential and can be readily scaled up to text-to-image generation tasks.

POSTER-635: On gauge freedom, conservativity and intrinsic dimensionality estimation in diffusion models

Primary Area: generative models

Keywords: gauge freedom conservativitym intrinsic dimensionality estimation diffusion models explainable AI theory

Scores: [ 6 5 8 8 ]

Diffusion models are generative models that have recently demonstrated impressive performances in terms of sampling quality and density estimation in high dimensions. They rely on a forward continuous diffusion process and a backward continuous denoising process, which can be described by a time-dependent vector field and is used as a generative model. In the original formulation of the diffusion model, this vector field is assumed to be the score function (i.e. it is the gradient of the log-probability at a given time in the diffusion process). Curiously, on the practical side, most studies on diffusion models implement this vector field as a neural network function and do not constrain it be the gradient of some energy function (that is, most studies do not constrain the vector field to be conservative). Even though some studies investigated empirically whether such a constraint will lead to a performance gain with contradicting results, they lack analytical evidence. Here, we provide three analytical results regarding the extent of the modeling freedom of this vector field. Firstly, we show that to obtain exact density estimation and exact sampling, it is neither necessary nor sufficient to assume the vector field to be conservative. Secondly, we derive the full (gauge) freedom satisfied by the vector field. Finally, we show that when it comes to inferring local information of the data manifold, conservativity is sufficient. In particular, we provide a novel algorithm to infer the intrinsic dimensionality of manifolds based on diffusion models.

POSTER-636: DiffEnc: Variational Diffusion with a Learned Encoder

Primary Area: generative models

Keywords: DDPM diffusion image generation encoder

Scores: [ 6 6 6 5 ]

Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditionals in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while potentially adding model flexibility. Firstly, we introduce a data and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves state of the art results on CIFAR-10. Secondly, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to one. This leads to theoretical insights: For a finite depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be one to have a well-defined ELBO.

POSTER-637: R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Primary Area: generative models

Keywords: Stable Diffusion training-free grounded text-to-image generation controllable generation

Scores: [ 6 6 6 6 ]

Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images given text-prompts as input. However, these models fail to convey appropriate spatial composition specified by a layout instruction. In this work, we probe into zero-shot grounded T2I generation with diffusion models, that is, generating images corresponding to the input layout information without training auxiliary modules or finetuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of diffusion model during generative process, and assists the model to synthesize images (1) with high fidelity, (2) highly compatible with textual input, and (3) interpreting layout instructions accurately. Specifically, we leverage the discrete sampling to bridge the gap between consecutive attention maps and discrete layout constraints, and design a region-aware loss to refine the generative layout during diffusion process. We further propose a boundary-aware loss to strengthen object discriminability within the corresponding regions. Experimental results show that our method outperforms existing state-of-the-art zero-shot grounded T2I generation methods by a large margin both qualitatively and quantitatively on several benchmarks.

POSTER-638: Language-Interfaced Tabular Oversampling via Progressive Imputation and Self-Authentication

Primary Area: generative models

Keywords: Tabular data imbalanced learning language models

Scores: [ 6 6 6 ]

Tabular data in the wild are frequently afflicted with class-imbalance, biasing machine learning models towards major classes. A straightforward, data-centric approach to this problem is oversampling - where synthetic minority samples are generated to balance the classes. Although tabular generative models are capable of generating synthetic samples, their integrity suffers when the number of minority samples is low. To this end, language models primed with rich prior knowledge are a fitting candidate for the task at hand. However, an oversampling strategy utilizing the extensive capabilities of such language models is yet to emerge. In this paper, we propose a novel tabular oversampling framework to channel the power of language interfaces. By leveraging its conditional sampling capabilities, we synthesize minority samples by progressively masking the important features of the majority class samples and imputing them towards the minority distribution. To reduce the inclusion of imperfectly converted samples, we utilize the power of the language model itself to self-authenticate the labels of the samples generated by itself, sifting out ill-converted samples. Extensive experiments on a variety of datasets and imbalance ratios reveal that the proposed method successfully generates reliable minority samples to boost the performance of machine learning classifiers, even under heavy imbalance ratios.

ORAL-37: Detecting, Explaining, and Mitigating Memorization in Diffusion Models

Primary Area: generative models

Keywords: Diffusion Model Memorization

Scores: [ 8 8 8 8 ]

POSTER-639: Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach

Primary Area: generative models

Keywords: Diffusion models image outpainting

Scores: [ 6 6 8 ]

Image outpainting aims to generate the content of an input sub-image beyond its original boundaries. It is an important task in content generation yet remains an open problem for generative models. This paper pushes the technical frontier of image outpainting in two directions that have not been resolved in literature: 1) outpainting with arbitrary and continuous multiples (without restriction), and 2) outpainting in a single step (even for large expansion multiples). Moreover, we develop a method that does not depend on a pre-trained backbone network, which is in contrast commonly required by the previous SOTA outpainting methods. The arbitrary multiple outpainting is achieved by utilizing randomly cropped views from the same image during training to capture arbitrary relative positional information. Specifically, by feeding one view and positional embeddings as queries, we can reconstruct another view. At inference, we generate images with arbitrary expansion multiples by inputting an anchor image and its corresponding positional embeddings. The one-step outpainting ability here is particularly noteworthy in contrast to previous methods that need to be performed for times to obtain a final multiple which is times of its basic and fixed multiple. We evaluate the proposed approach (called PQDiff as we adopt a diffusion-based generator as our embodiment, under our proposed \textbf{P}ositional \textbf{Q}uery scheme) on public benchmarks, demonstrating its superior performance over state-of-the-art approaches. Specifically, PQDiff achieves state-of-the-art FID scores on the Scenery (\textbf{21.512}), Building Facades (\textbf{25.310}), and WikiArts (\textbf{36.212}) datasets. Furthermore, under the 2.25x, 5x, and 11.7x outpainting settings, PQDiff only takes \textbf{51.3%}, \textbf{25.6%}, and \textbf{17.1%} of the time of the benchmark state-of-the-art (SOTA) method.

POSTER-640: Human Feedback is not Gold Standard

Primary Area: generative models

Keywords: human evaluation large language models evaluation natural language generation

Scores: [ 8 6 6 6 ]

Human feedback has become the de facto standard for evaluating the performance of Large Language Models, and is increasingly being used as a training objective. However, it is not clear which properties of a generated output this single `preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback for both training and evaluation, to verify whether it fully captures a range of crucial error criteria. We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality. We further hypothesise that both preference scores and error annotation may be affected by confounders, and leverage instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity. We find that the assertiveness of an output skews the perceived rate of factuality errors, indicating that human annotations are not a fully reliable evaluation metric or training objective. Finally, we offer preliminary evidence that using human feedback as a training objective disproportionately increases the assertiveness of model outputs. We encourage future work to carefully consider whether preference scores are well aligned with the desired objective.

POSTER-641: STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models

Primary Area: generative models

Keywords: Generative Models Video Generative Models Evaluation Fidelity Diversity Assessment

Scores: [ 6 3 6 6 ]

Image generative models have made significant progress in generating realistic and diverse images, supported by comprehensive guidance from various evaluation metrics. However, current video generative models struggle to generate evenshort video clips, with limited tools that provide insights for improvements. Current video evaluation metrics are simple adaptations of image metrics by switching the embeddings with video embedding networks, which may underestimate the unique characteristics of video. Our analysis reveals that the widely used Frechet Video Distance (FVD) has a stronger emphasis on the spatial aspect than the temporal naturalness of video and is inherently constrained by the input size of the embedding networks used, limiting it to 16 frames. Additionally, it demonstrates considerable instability and diverges from human evaluations. To address the limitations, we propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects. This feature allows comprehensive analysis and evaluation of video generative models from various perspectives, unconstrained by video length. We provide analytical and experimental evidence demonstrating that STREAM provides an effective evaluation tool for both visual and temporal quality of videos, offering insights into area of improvement for video generative models. To the best of our knowledge, STREAM is the first evaluation metric that can separately assess the temporal and spatial aspects of videos. Our code is available at https://anonymous.4open.science/r/STREAM-4200/stream.py.

SPOTLIGHT-118: Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Primary Area: generative models

Keywords: Neural Radiance Field One-shot Talking Face Generation

Scores: [ 8 10 8 8 ]

One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video. The existing methods fail to simultaneously achieve the goals of accurate 3D avatar reconstruction and stable talking face animation. Besides, while the existing works mainly focus on synthesizing the head part, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Potrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and switchable background using a head-torso-background super-resolution model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos compared to previous methods. Video samples are available at https://real3dportrait.github.io.

SPOTLIGHT-119: Local Search GFlowNets

Primary Area: generative models

Keywords: GFlowNet molecule optimization biological sequence design local search reinforcement learning

Scores: [ 8 6 6 ]

Generative Flow Networks (GFlowNets) are amortized sampling methods that learn a distribution over discrete objects proportional to their rewards. GFlowNets exhibit a remarkable ability to generate diverse samples, yet occasionally struggle to consistently produce samples with high rewards due to over-exploration on wide sample space. This paper proposes to train GFlowNets with local search which focuses on exploiting high rewarded sample space to resolve this issue. Our main idea is to explore the local neighborhood via destruction and reconstruction guided by backward and forward policies, respectively. This allows biasing the samples toward high-reward solutions, which is not possible for a typical GFlowNet solution generation scheme which uses the forward policy to generate the solution from scratch. Extensive experiments demonstrate a remarkable performance improvement in several biochemical tasks. Source code is available: \url{https://anonymous.4open.science/r/ls-gfn-FC9A/README.md}.

POSTER-642: IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs

Primary Area: generative models

Keywords: Efficient Transformers Inference-time Efficiency CPU

Scores: [ 6 6 3 6 6 ]

One limitation of existing transformer-based models is that they cannot handle very long sequences as input since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained transformer models out-of-the-box without requiring retraining. We experiment using our method to accelerate various long-sequence transformers on various benchmarks and demonstrate a greater speedup compared to the baselines.

SPOTLIGHT-120: Beyond Linear Spherical Interpolation: Noise Correction for Image Interpolation with Diffusion Models

Primary Area: generative models

Keywords: Image Interpolation;Diffusion Models

Scores: [ 8 8 8 ]

Image interpolation based on diffusion models is promising in creating fresh and interesting images. Advanced interpolation methods mainly focus on linear spherical interpolation, delivering remarkable success for images generated by diffusion models.However, existing methods struggle with natural images (not generated by diffusion models), limiting practical applications.Our investigation into the interpolation process has unveiled that its shortcomings are rooted in the introduction of inappropriate noise, which may either exceed or fall below the denoising threshold, leading to issues such as image artifacts and information loss in the interpolated images. To address this issue, we initially investigated a direct noise addition method, which improved image quality but introduced unwanted information. Drawing from these findings, we subsequently developed a novel interpolation approach that harnesses the advantages of both techniques. This approach retains the valuable noise with information from the original images while introducing a subtle Gaussian noise to enhance interpolation quality. Moreover, we introduced an innovative constraint on the noise component responsible for generating artifacts and incorporated original image to supplement missing information.These enhancements not only improved the interpolation results for images within the training domain but also extended the capability to interpolate with natural images beyond the training domain, achieving in the best interpolation results to date.

POSTER-643: DreamClean: Restoring Clean Image Using Deep Diffusion Prior

Primary Area: generative models

Keywords: Image Restoration Diffusion Models Image Prior Blind Restoration Super-Resolution Image Denoising Colorization JPEG Artifacts Correction

Scores: [ 8 6 8 6 ]

Image restoration poses a garners substantial interest due to the exponential surge in demands for recovering high-quality images from diverse mobile camera devices, adverse lighting conditions, suboptimal shooting environments, and frequent image compression for efficient transmission purposes. Yet this problem gathers significant challenges as people are blind to the type of restoration the images suffer, which, is usually the case in real-day scenarios and is most urgent to solve for this field. Current research, however, heavily relies on prior knowledge of the restoration type, either explicitly through rules or implicitly through the availability of degraded-clean image pairs to define the restoration process, and consumes considerable effort to collect image pairs of vast degradation types. This paper introduces DreamClean, a training-free method that needs no degradation prior knowledge but yields high-fidelity and generality towards various types of image degradation. DreamClean embeds the degraded image back to the latent of pre-trained diffusion models and re-sample it through a carefully designed diffusion process that mimics those generating clean images. Thanks to the rich image prior in diffusion models and our novel Variance Preservation Sampling (VPS) technique, DreamClean manages to handle various different degradation types at one time and reaches far more satisfied final quality than previous competitors. DreamClean relies on elegant theoretical supports to assure its convergence to clean image when VPS has appropriate parameters, and also enjoys superior experimental performance over various challenging tasks that could be overwhelming for previous methods when degradation prior is unavailable.

POSTER-644: Scale-Adaptive Diffusion Model for Complex Sketch Synthesis

Primary Area: generative models

Keywords: Generative model Classifier-based Diffusion Model Sketch

Scores: [ 6 6 6 5 ]

While diffusion models have revolutionized generative AI, their application to human sketch generation, especially in the creation of complex yet concise and recognizable sketches, remains largely unexplored. Existing efforts have primarily focused on vector-based sketches, limiting their ability to handle intricate sketch data. This paper introduces an innovative extension of diffusion models to pixellevel sketch generation, addressing the challenge of dynamically optimizing the guidance scale for classifier-guided diffusion. Our approach achieves a delicate balance between recognizability and complexity in generated sketches through scale-adaptive classifier-guided diffusion models, a scaling indicator, and the concept of a residual sketch. We also propose a three-phase sampling strategy to enhance sketch diversity and quality. Experiments on the QuickDraw dataset showcase the potential of diffusion models to push the boundaries of sketch generation, particularly in complex scenarios unattainable by vector-based methods.

POSTER-645: Jointly Training Large Autoregressive Multimodal Models

Primary Area: generative models

Keywords: Large Multimodal Models; Joint Training; Interleaved Image-Text Generation; Autoregressive Models

Scores: [ 6 5 6 5 ]

In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.

POSTER-646: MiniLLM: Knowledge Distillation of Large Language Models

Primary Area: generative models

Keywords: Large Lanauge Models Knowledge Distillation

Scores: [ 6 8 5 6 ]

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge from white-box generative LLMs is still under-explored, which becomes more and more important with the prosperity of LLMs. In this work, we propose MiniLLM that distills smaller language models from generative larger language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. Extensive experiments in the instruction-following setting show that the MiniLLM models generate more precise responses with the higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance. Our method is also scalable for different model families with 120M to 13B parameters. Our code can be found in the supplementary material.

POSTER-647: ControlVideo: Training-free Controllable Text-to-video Generation

Primary Area: generative models

Keywords: Diffusion models video generation

Scores: [ 6 5 6 5 ]

POSTER-648: Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders

Primary Area: generative models

Keywords: variational autoencoders posterior collapse

Scores: [ 6 6 6 6 8 ]

POSTER-649: Diffusion-TS: Interpretable Diffusion for General Time Series Generation

Primary Area: generative models

Keywords: Diffusion models Time series

Scores: [ 6 5 8 ]

Denoising diffusion probabilistic models (DDPMs) are becoming the leading paradigm for generative models. It has recently shown breakthroughs in audio synthesis and time series imputation. However, little attention has been given to leveraging the powerful generative ability for general time series production. In this paper, we propose Diffusion-TS, the first DDPM-based framework that generates multivariate time series samples of high quality by using an encoder-decoder Transformer with disentangled temporal representations, in which the decomposition technique guides Diffusion-TS to capture the semantic meaning of time series while Transformers mine detailed sequential information from the noisy model input. For more interpretable and accurate pattern modeling, we train the model to directly reconstruct the sample instead of the noise in each diffusion step, combining a Fourier-based loss term. In addition, it is shown that the proposed Diffusion-TS can be easily extended to conditional generation tasks, such as forecasting and imputation, without any model changes. This also motivates us to further explore the performance of Diffusion-TS under irregular settings. Finally, through qualitative and quantitative experiments, results show that Diffusion-TS achieves the state-of-the-art results on various realistic analyses of time series. Our code and models are attached in the supplementary material and will be made publicly available.

POSTER-650: Llemma: An Open Language Model for Mathematics

Primary Area: generative models

Keywords: reasoning language models pretraining

Scores: [ 8 6 6 ]

We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known openly released models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.

POSTER-651: Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding

Primary Area: generative models

Keywords: Table Understanding In-context Learning Large Language Model

Scores: [ 5 6 5 6 ]

Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context samples to iteratively generate operations and update the table to represent a complex reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.

ORAL-38: Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space

Primary Area: generative models

Keywords: Tabular data tabular generation diffusion models

Scores: [ 8 6 8 5 ]

Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging due to the intricately varied distributions and a blend of data types of tabular data. This paper introduces TABSYN, a methodology that synthesizes tabular data by leveraging a diffusion model within a variational autoencoder (VAE) crafted latent space.The key advantages of the proposed TabSyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and explicitly capture inter-column relations, (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data, (3) Speed: much fewer number of reverse steps and faster synthesis speed than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that TabSyn outperforms existing methods. Specifically, it reduces the error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimations compared with the most competitive baselines, its superiority in accuratelylearning the data distributions of tabular data.

POSTER-652: DORSal: Diffusion for Object-centric Representations of Scenes et al.

Primary Area: generative models

Keywords: novel view synthesis object-centric scene representations camera control scene editing 3D diffusion generative models

Scores: [ 6 6 6 5 ]

Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing, is now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches.

POSTER-653: Generalized Knowledge Distillation for Auto-regressive Language Models

Primary Area: generative models

Keywords: Language models Distillation RLHF

Scores: [ 8 6 6 6 ]

Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive T5 language models on summarization, translation, and arithmetic reasoning tasks.

POSTER-654: Efficient-3Dim: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

Primary Area: generative models

Keywords: Novel View Synthesis 3D from Single Image Efficient Training

Scores: [ 8 8 8 5 ]

The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Nevertheless, synthesizing novel views from a single image still remains a significant challenge in the realm of computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plain image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model that is specifically designed for 2D image synthesis has demonstrated its capability in producing photorealistic novel views, if sufficiently optimized on a 3D finetuning task. Although the fidelity and generalizability are greatly improved, training such a powerful diffusion model requires a vast volume of training data and model parameters, resulting in a notoriously long time and high computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework to learn single-image novel-view generative models. Motivated by our in-depth visual analysis of the inference process of diffusion models, we propose several pragmatic strategies to reduce the training overhead to a manageable scale, including a crafted timestep sampling strategy, a superior 3D feature extractor, and an enhanced training scheme. When combined, our framework is able to reduce the total training time from 10 days to less than 1 day, significantly accelerating the training process under the same computational platform (one instance with 8 Nvidia A100 GPUs). Comprehensive experiments are conducted to demonstrate the efficiency and generalizability of our proposed method on several common benchmarks.

POSTER-655: Step-Back Prompting Enables Reasoning Via Abstraction in Large Language Models

Primary Area: generative models

Keywords: Prompting Large Language Models Reasoning Abstraction

Scores: [ 8 8 8 ]

We present Step-Back Prompting, a simple prompting technique that enables LLMs to do abstractions to derive high-level concepts and first principles from instances containing specific details. Using the concepts and principles to guide the reasoning steps, LLMs significantly improve their abilities in following a correct reasoning path towards the solution. We conduct experiments of Step-Back Prompting with PaLM-2 models and observe substantial performance gains on a wide range of challenging reasoning-intensive tasks including STEM, Knowledge QA, and Multi-Hop Reasoning. For instance, Step-Back Prompting improves PaLM-2L performance on MMLU Physics and Chemistry by 7% and 11%, TimeQA by 34%, and MuSiQue by 7%.

POSTER-656: Don't Play Favorites: Minority Guidance for Diffusion Models

Primary Area: generative models

Keywords: diffusion models fairness generative models minority generation

Scores: [ 6 6 6 3 ]

We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating a sufficient number of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves ineffective and time-consuming for the minority generating task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first highlight that Tweedie's denoising formula yields favorable results for majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating high-quality minority samples over existing generative samplers. We showcase that the performance benefit of our framework persists even in demanding real-world scenarios such as medical imaging, further underscoring the practical significance of our work.

POSTER-657: Training Unbiased Diffusion Models From Biased Dataset

Primary Area: generative models

Keywords: diffusion model density ratio estimation dataset bias

Scores: [ 6 5 8 5 ]

With significant advancements in diffusion models, addressing the potential risks of dataset bias becomes increasingly important. Since generated outputs directly suffer from dataset bias, mitigating latent bias becomes a key factor in improving sample quality and proportion. This paper proposes time-dependent importance reweighting to mitigate the bias for the diffusion models. We demonstrate that the time-dependent density ratio becomes more precise than previous approaches, thereby minimizing error propagation in generative learning. Whiledirectly applying it to score-matching is intractable, we discovered that using the time-dependent density ratio both for reweighting and score correction can lead to a tractable form of the objective function to regenerate the unbiased data density. Furthermore, we theoretically established a connection with traditional score-matching, and we demonstrated its convergence to an unbiased distribution. The experimental evidence supports the usefulness of the proposed method, which outperforms baselines including time-independent importance reweighting on CIFAR-10, CIFAR-100, FFHQ, and CelebA with various bias settings.

POSTER-658: Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations

Primary Area: generative models

Keywords: Implicit neural representation generative model domain agnostic

Scores: [ 6 6 6 6 ]

Recent studies have introduced a new class of generative models for synthesizing implicit neural representations (INRs) that capture arbitrary continuous signals in various domains.These models opened the door for domain-agnostic generative models, but they often fail to achieve high-quality generation.We observed that the existing methods generate the weights of neural networks to parameterize INRs and evaluate the network with fixed positional embeddings (PEs).Arguably, this architecture limits the expressive power of generative models and results in low-quality INR generation.To address this limitation, we propose Domain-agnostic Latent Diffusion Model for INRs (DDMI) that generates adaptive positional embeddings instead of neural networks' weights.Specifically, we develop a Discrete-to-continuous space Variational AutoEncoder (D2C-VAE), which seamlessly connects discrete data and the continuous signal functions in the shared latent space. Additionally, we introduce a novel conditioning mechanism for evaluating INRs with the generated hierarchically decomposed basis fields to further enhance expressive power.Extensive experiments across four modalities, \eg, 2D images, 3D shapes, Neural Radiance Fields, and videos, with seven benchmark datasets, demonstrate the versatility of DDMI and its superior performance compared to the existing INR generative models.

POSTER-659: Instant3D: Fast Text-to-3D with Sparse-view Generation and Large Reconstruction Model

Primary Area: generative models

Keywords: text-to-3d generative models diffusion models 3D reconstruction 3D generation sparse-view reconstruction

Scores: [ 8 8 6 ]

Text-to-3D with diffusion models have achieved remarkable progress in recent years. However, existing methods either rely on score distillation-based optimization which suffer from slow inference, low diversity and Janus problems, or are feed-forward methods that generate low quality results due to the scarcity of 3D training data. In this paper, we propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. We adopt a two-stage paradigm, which first generates a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model, and then directly regresses the NeRF from the generated images with a novel transformer-based sparse-view reconstructor. Through extensive experiments, we demonstrate that our method can generate high-quality, diverse and Janus-free 3D assets within 20 seconds, which is two order of magnitude faster than previous optimization-based methods that can take 1 to 10 hours. Our project webpage: https://instant-3d.github.io/.

POSTER-660: Efficient and Scalable Graph Generation by Spectrum Preserving Local Expansion

Primary Area: generative models

Keywords: Graph Generation Denoising Diffusion Spectral Graph Theory

Scores: [ 6 5 8 5 ]

In the realm of generative models for graphs, a substantial body of research has been undertaken. However, most existing methods encounter difficulties when applied to large graphs. These issues stem from the need to express the entire joint distribution across all node pairs, the difficulty of accurately representing both the global structure and local details of the graph at the same time, and the computational complexity of the neural networks used.To overcome these issues, we introduce a method that generates a graph by progressively expanding a single node to a target graph. In each step, nodes and edges are added in a localized manner through denoising diffusion, building first the global structure, and then refining the local details. Since each generation step is local, our method does not need to model the complete joint distribution over all node pairs at once, allowing for significant computational efficiency, that is, an empirical subquadratic runtime in relation to the number of nodes. At the same time, multiscale generation allows us to retain high expressivity. Our experiments show that our model achieves state-of-the-art performance on well-established benchmark datasets while successfully scaling to graphs with at least 5000 nodes. Our method is also the first to successfully extrapolate to graphs outside of the training distribution, showcasing a much better generalization capability over existing methods.

POSTER-661: Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation

Primary Area: generative models

Keywords: optimal transport domain translation image translation flow matching

Scores: [ 6 6 6 6 ]

In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, whichmakes it prone to outliers and limits its applicability in real-world scenarios. The latter can be particularly harmful in OT domain translation tasks, where the relative position of a sample within a distribution is explicitly taken into account. While unbalanced OT tackles this challenge in the discrete setting, its integration into neural Monge map estimators has received limited attention. We propose a theoreticallygrounded method to incorporate unbalancedness into any Monge map estimator. We improve existing estimators to model cell trajectories over time and to predict cellular responses to perturbations. Moreover, our approach seamlessly integrates with the OT flow matching (OT-FM) framework. While we show that OT-FM performs competitively in image translation, we further improve performance byincorporating unbalancedness (UOT-FM), which better preserves relevant features. We hence establish UOT-FM as a principled method for unpaired image translation.

POSTER-662: Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models

Primary Area: generative models

Keywords: Computer Vision Diffusion Models Video Editing

Scores: [ 6 6 8 6 ]

This paper introduces a novel grounding-guided video-to-video translation framework called Ground-A-Video for multi-attribute video editing.Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training T2V models on text-video data or adopting training-free methods.However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact.Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without aforementioned shortcomings.Central to our method is the introduction of cross-frame gated attention which incorporates groundings information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing.Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.Further results and codes are provided at our anonymous project page (http://ground-a-video.github.io).

POSTER-663: RA-DIT: Retrieval-Augmented Dual Instruction Tuning

Primary Area: generative models

Keywords: retrieval-augmented language model large language model knowledge intensive NLP

Scores: [ 5 6 6 8 ]

Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in 0-shot setting and +1.4% in 5-shot setting on average.

POSTER-664: Directly Fine-Tuning Diffusion Models on Differentiable Rewards

Primary Area: generative models

Keywords: diffusion models preference-based learning

Scores: [ 5 8 8 3 ]

We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of reward functions, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.

ORAL-39: Self-Alignment with Instruction Backtranslation

Primary Area: generative models

Keywords: large language models self-supervised learning data augmentation

Scores: [ 8 8 8 8 ]

We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.

POSTER-665: InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

Primary Area: generative models

Keywords: Diffusion Models Generative Models Acceleration

Scores: [ 8 8 6 6 ]

Diffusion models have revolutionized text-to-image generation with its exceptional quality and creativity. However, its multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve its sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model.In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fréchet Inception Distance) of 23.3 on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin (37.2 23.3 in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to 22.4. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of 13.1 in just 0.09 second, the best in 0.1 second regime, outperforming the recent StyleGAN-T (13.9 in 0.1 second). Notably, the training of InstaFlow only costs 199 A100 GPU days.

POSTER-666: DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation

Primary Area: generative models

Keywords: Score Distillation 3D Content Creation Diffusion Model

Scores: [ 8 6 3 6 ]

Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled 3D content creation by optimizing a randomly initialized differentiable 3D representation with score distillation. However, the optimization process suffers slow convergence and the resultant 3D models often exhibit two limitations: (a) quality concerns such as missing attributes and distorted shape and textures; (b) extremely low diversity comparing to text-guided image synthesis. In this paper, we show that the conflict between the 3D optimization process and uniform timestep sampling in score distillation is the main reason for these limitations. To resolve this conflict, we propose to prioritize timestep sampling with monotonically non-increasing functions, which aligns the 3D optimization process with the sampling process of diffusion model. Extensive experiments show that our simple redesign significantly improves 3D content creation with faster convergence, better quality and diversity.

POSTER-667: A Variational Perspective on Solving Inverse Problems with Diffusion Models

Primary Area: generative models

Keywords: diffusion models score matching variational approximation regularization by denoising inverse problems

Scores: [ 5 5 6 6 ]

Diffusion models have emerged as a key pillar of foundation models in visual domains. One of their critical applications is to universally solve different downstream inverse tasks via a single diffusion prior without re-training for each task. Most inverse tasks can be formulated as inferring a posterior distribution over data (e.g., a full image) given a measurement (e.g., a masked image). This is however challenging in diffusion models since the nonlinear and iterative nature of the diffusion process renders the posterior intractable. To cope with this challenge, we propose a variational approach that by design seeks to approximate the true posterior distribution. We show that our approach naturally leads to regularization by denoising diffusion process (RED-diff) where denoisers at different timesteps concurrently impose different structural constraints over the image. To gauge the contribution of denoisers from different timesteps, we propose a weighting mechanism based on signal-to-noise-ratio (SNR). Our approach provides a new variational perspective for solving inverse problems with diffusion models, allowing us to formulate sampling as stochastic optimization, where one can simply apply off-the-shelf solvers with lightweight iterates. Our experiments for image restoration tasks such as inpainting and superresolution demonstrate the strengths of our method compared with state-of-the-art sampling-based diffusion models.

SPOTLIGHT-121: Bespoke Solvers for Generative Flow Models

Primary Area: generative models

Keywords: generative models flow matching diffusion models normalizing flows ode solver fast sampling distillation

Scores: [ 6 8 8 8 6 ]

Diffusion or flow-based models are powerful generative paradigms that are notoriously hard to sample as samples are defined as solutions to high-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs) which require a large Number of Function Evaluations (NFE) to approximate well. Existing methods to alleviate the costly sampling process include model distillation and designing dedicated ODE solvers. However, distillation is costly to train and sometimes can deteriorate quality, while dedicated solvers still require relatively large NFE to produce high quality samples. In this paper we introduce Bespoke solvers'', a novel framework for constructing custom ODE solvers tailored to the ODE of a given pre-trained flow model. Our approach optimizes an order consistent and parameter-efficient solver (e.g., with 80 learnable parameters), is trained for roughly 1% of the GPU time required for training the pre-trained model, and significantly improves approximation and generation quality compared to dedicated solvers. For example, a Bespoke solver for a CIFAR10 model produces samples with Fréchet Inception Distance (FID) of 2.73 with 10 NFE, and gets to 1% of the Ground Truth (GT) FID (2.59) for this model with only 20 NFE. On the more challenging ImageNet-64$\times$64, Bespoke samples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20 NFE.

POSTER-668: Denoising Diffusion via Image-Based Rendering

Primary Area: generative models

Keywords: Neural Scene Representations Generative Models Denoising Diffusion 3D Reconstruction

Scores: [ 6 8 6 6 ]

Generating and reconstructing 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks, depths, or regularizers, required by prior works. Third, we introduce a fast and unified architecture that supports conditioning on varying numbers of images, including a priori generation of 3D scenes and reconstruction from one or several images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction.

POSTER-669: Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

Primary Area: generative models

Keywords: Neural Radiance Fields 3D generation generative 3D models

Scores: [ 5 8 5 8 ]

We present Magic123'', a two-stage coarse-to-fine approach for high-quality, textured 3D mesh generation from a single image in the wild using both 2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference-view supervision and novel-view guidance by a joint 2D and 3D diffusion prior. We introduce a trade-off parameter between the 2D and 3D priors to control the details and 3D consistencies of the generation. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on diverse synthetic and real-world images.

POSTER-670: NEFTune: Noisy Embeddings Improve Instruction Finetuning

Primary Area: generative models

Keywords: Instruction Finetuning

Scores: [ 6 5 6 6 ]

We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training.Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings.NEFTune also improves over strong baselines on modern instruction datasets.Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.

POSTER-671: Inner Classifier-Free Guidance and Its Taylor Expansion for Diffusion Models

Primary Area: generative models

Keywords: diffusion models classifier-free guidance

Scores: [ 5 6 8 5 ]

Classifier-free guidance (CFG) is a pivotal technique for balancing the diversity and fidelity of samples in conditional diffusion models. This approach involves utilizing a single model to jointly optimize the conditional score predictor and unconditional score predictor, eliminating the need for additional classifiers. It delivers impressive results and can be employed for continuous and discrete condition representations. However, when the condition is continuous, it prompts the question of whether the trade-off can be further enhanced. Our proposed inner classifier-free guidance (ICFG) provides an alternative perspective on the CFG method when the condition has a specific structure, demonstrating that CFG represents a first-order case of ICFG. Additionally, we offer a second-order implementation, highlighting that even without altering the training policy, our second-order approach can introduce new valuable information and achieve an improved balance between fidelity and diversity for Stable Diffusion.

POSTER-672: Zero Bubble Pipeline Parallelism

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: Pipeline Parallelism Zero Bubble

Scores: [ 8 6 6 8 ]

Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge, is the first to successfully achieve zero pipeline bubbles. The key idea behind this improvement is to split the backward computation into two parts, one that computes gradient for the input and another that computes for the parameters. Based on this idea, we handcraft novel pipeline schedules that significantly outperform the baseline methods. We further develop an algorithm that automatically finds an optimal schedule based on specific model configuration and memory limit. Additionally, to truly achieve zero bubble, we introduce a novel technique to bypass synchronizations during the optimizer step. Experimental evaluations show that our method outperforms the 1F1B schedule up to 15% in throughput under a similar memory limit. This number can be further pushed to 30% when the memory constraint is relaxed. We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism.

SPOTLIGHT-122: BatteryML:An Open-source platform for Machine Learning on Battery Degradation

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: BatteryML Battery life prediction Machine learning Open-source platform Unified standards Collaborative research

Scores: [ 8 6 8 6 ]

SPOTLIGHT-123: DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: programming models prompting techniques in-context learning few-shot learning chain of thought multi-hop reasoning language agents

Scores: [ 6 8 8 ]

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded “prompt templates”, i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, or imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies, showing that succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, DSPy can automatically produce prompt pipelines and finetune pipelines that outperform out-of-the-box few-shot prompting as well as expert-created demonstrations for GPT-3.5 and Llama2-13b-chat. On top of that, DSPy programs compiled to relatively small LMs like 770M parameter T5 and Llama2- 13b-chat are competitive with many approaches that rely on large and proprietary LMs like GPT-3.5 and on expert-written prompt chains.

POSTER-673: Canonpipe: Data Debugging with Shapley Importance over Machine Learning Pipelines

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: data debugging data valuation shapley value machine learning pipelines

Scores: [ 8 8 6 6 ]

When a machine learning (ML) model exhibits poor quality (e.g., poor accuracy or fairness), the problem can often be traced back to errors in the training data. Being able to discover the data examples that are the most likely culprits is a fundamental concern that has received a lot of attention recently. One prominent way to measure "data importance" with respect to model quality is the Shapley value. Unfortunately, existing methods only focus on the ML model in isolation, without considering the broader ML pipeline for data preparation and feature extraction, which appears in the majority of real-world ML code. This presents a major limitation to applying existing methods in practical settings. In this paper, we propose Canonpipe, a method for efficiently computing Shapley-based data importance over ML pipelines. We introduce several approximations that lead to dramatic improvements in terms of computational speed. Finally, our experimental evaluation demonstrates that our methods are capable of data error discovery that is as effective as existing Monte Carlo baselines, and in some cases even outperform them.

POSTER-674: Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: Deep Neural Networks Quantized Neural Networks Network Quantization Accumulators Accelerators Inference Computer Vision Language Models

Scores: [ 6 6 5 6 ]

The majority of the research on the quantization of Deep Neural Networks (DNNs) is focused on reducing the precision of tensors visible by high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still relies on high-accuracy core operations. Most significant is the operation of accumulating products. This high-precision accumulation operation is gradually becoming the main computational bottleneck. This is because, so far, the usage of low-precision accumulators led to a significant degradation in performance. In this work, we present a simple method to train and fine-tune DNNs, to allow, for the first time, utilization of cheaper, 12-bits accumulators, with no significant degradation in accuracy. Lastly, we show that as we decrease the accumulation precision further, using fine-grained gradient approximations can improve the DNN accuracy.

POSTER-675: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: Attention GPUs LLM training LLM inference

Scores: [ 5 8 10 6 ]

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention [5] exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2× speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).

POSTER-676: SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: Reinforcement Learning Distributed Systems Large Scale Training

Scores: [ 8 8 6 8 ]

The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system to efficiently generate and process a massive amount of data. However, existing open-source libraries suffer from various limitations, which impede their practical use in challenging scenarios where large-scale training is necessary. In this paper, we present a novel abstraction on the dataflows of RLtraining, which unifies diverse RL training applications into a general framework. Following this abstraction, we develop a scalable, efficient, and extensible distributed RL system called ReaLly Scalable RL (SRL), which allows efficient and massively parallelized training and easy development of customized algorithms. Our evaluation shows that SRL outperforms existing academic libraries, reaching at most 21x higher training throughput in a distributed setting. On learning performance, beyond performing and scaling well on common RL benchmarks with different RL algorithms, SRL can reproduce the same solution in the challenging hide-and-seek environment as reported by OpenAI with up to 5x speedup in wallclock time. Notably, SRL is the first in the academic community to perform RL experiments at a large scale with over 15k CPU cores. SRL anonymous repository is available at: https://anonymous.4open.science/r/srl-1E45/.

POSTER-677: Automatic Functional Differentiation in JAX

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: JAX automatic differentiation calculus of variation higher-order functions functionals operators JVP rules transpose rules

Scores: [ 5 6 8 8 ]

We extend JAX with the capability to automatically differentiate higher-order functions (functionals and operators). By representing functions as infinite dimensional generalization of arrays, we seamlessly use JAX's existing primitive system to implement higher-order functions. We present a set of primitive operators that serve as foundational building blocks for constructing several key types of functionals. For every introduced primitive operator, we derive and implement both linearization and transposition rules, aligning with JAX's internal protocols for forward and reverse mode automatic differentiation. This enhancement allows for functional differentiation in the same syntax traditionally use for functions. The resulting functional gradients are themselves functions ready to be invoked in python. We showcase this tool's efficacy and simplicity through applications where functional derivatives are indispensable.

POSTER-678: ZeRO++: Extremely Efficient Collective Communication for Large Model Training

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: low-precision LLM pretraining 2 bits auto compression low memory pretraining

Scores: [ 8 5 5 6 6 ]

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, and/or when small batch size per GPU is used, ZeRO’s effective throughput is limited due to communication overheads. To alleviate this limitation, this paper introduces ZeRO++ composing of three communication volume reduction techniques (lowprecision all-gather, data remapping, and low-precision gradient averaging) to significantly reduce the communication volume up to 4x that enables up to 2.16x better throughput at 384 GPU scale. Our results also show ZeRO++ can speedup the RLHF by 3.3x compared to vanilla ZeRO. To verify the convergence of ZeRO++, we test up to 13B model for pretraining with 8/6-bits all gather and up to 30B model for finetuning with 4/2-bits all gather, and demonstrate on-par accuracy as original ZeRO (aka standard training). As a byproduct, the model trained with ZeRO++ is naturally weight-quantized, which can be directly used for inference without post-training quantization or quantization-aware training.

POSTER-679: FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: convolutions GPUs hardware-efficient algorithms long context fast fourier transform I/O awareness

Scores: [ 6 8 8 ]

Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time.A major bottleneck is the Fast Fourier Transform (FFT)---which allows long convolutions to run in O(NlogN) time in sequence length N but has poor hardware utilization.In this paper, we study how to optimize the FFT convolution.We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy.In response, we propose FlashFFTConv.FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O.We also present two sparse convolution algorithms---1) partial convolutions and 2) frequency-sparse convolutions---which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings.FlashFFTConv speeds up exact FFT convolutions by up to 8.7$\times$ over PyTorch and achieves up to 4.4$\times$ speedup end-to-end.Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity and M2-BERT-base to achieve 3.3 points higher GLUE score---matching models with twice the parameter count.FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%.Furthermore, partial convolutions enable longer-sequence models---yielding the first DNA model that can process the longest human genes (2.3M base pairs)---and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality.

POSTER-680: LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: Neural Network Quantization Model Compression Generative Language Model Transformer Deep Learning

Scores: [ 8 6 6 8 ]

Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity.However, the growing size of NLP models introduces a memory wall problem during the generation phase.To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving full precision for activations, resulting in practical speed-ups during inference on a single GPU.However, these improvements primarily stem from reduced memory movement, which necessitates a resource-intensive dequantization process rather than actual computational reduction.In this paper, we introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication, which not only eliminates the resource-intensive dequantization process but also reduces computational costs compared to previous kernels for weight-only quantization.Furthermore, we proposed group-wise quantization to offer a flexible trade-off between compression ratio and accuracy.The impact of LUT-GEMM is facilitated by implementing high compression ratios through low-bit quantization and efficient LUT-based operations.We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially accelerates token generation latency, achieving a remarkable 2.1x improvement on a single GPU when compared to OPTQ, which relies on the costly dequantization process.

POSTER-681: AutoChunk: Automated Activation Chunk for Memory-Efficient Deep Learning Inference

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: Long Sequence Inference Inference Compiler Activation Memory Machine Learning Infrastructure Low Resource Inference

Scores: [ 5 8 6 ]

Large deep learning models have achieved impressive performance across a range of applications. However, their large memory requirements, including parameter memory and activation memory, have become a significant challenge for their practical serving. While existing methods mainly address parameter memory, the importance of activation memory has been overlooked. Especially for long input sequences, activation memory is expected to experience a significant exponential growth as the length of sequences increases. In this approach, we propose AutoChunk, an automatic and adaptive compiler system that efficiently reduces activation memory for long sequence inference by chunk strategies. The proposed system generates chunk plans by optimizing through multiple stages. In each stage, the chunk search pass explores all possible chunk candidates and the chunk selection pass identifies the optimal one. At runtime, AutoChunk employs code generation to automatically apply chunk strategies. The experiments demonstrate that AutoChunk can reduce over 80% of activation memory while maintaining speed loss within 10%, extend max sequence length by 3.2x to 11.7x, and outperform state-of-the-art methods by a large margin.

POSTER-682: Retrieval-Guided Reinforcement Learning for Boolean Circuit Minimization

Primary Area: infrastructure, software libraries, hardware, etc.

Keywords: Electronics Design Automation (EDA) Logic Synthesis Reinforcement Learning Hardware design Circuits

Scores: [ 8 6 6 5 6 ]

Logic synthesis, a pivotal stage in chip design, entails optimizing chip specifications encoded in hardware description languages like Verilog into highly efficient implementations using Boolean logic gates. The process involves a sequential application of logic minimization heuristics (synthesis recipe"), with their arrangement significantly impacting crucial metrics such as area and delay. Addressing the challenge posed by the broad spectrum of hardware design complexities — from variations of past designs (e.g., adders and multipliers) to entirely novel configurations (e.g., innovative processor instructions) — requires a nuanced 'synthesis recipe' guided by human expertise and intuition. This study conducts a thorough examination of learning and search techniques for logic synthesis, unearthing a surprising revelation: pre-trained agents, when confronted with entirely novel designs, may veer off course, detrimentally affecting the search trajectory. We present RGLS, a meticulously tuned α parameter that adeptly adjusts recommendations from pre-trained agents during the search process. Computed based on similarity scores through nearest neighbor retrieval from the training dataset, RGLS yields superior synthesis recipes tailored for a wide array of hardware designs. Our findings showcase substantial enhancements in the Quality of Result (QoR) of synthesized circuits, boasting improvements of up to 24.8% compared to state-of-the-art techniques. Furthermore, RGLS achieves an impressive up to 9x reduction in runtime (iso-QoR) when compared to current state-of-the-art methodologies.

SPOTLIGHT-124: Unleashing the Potential of Fractional Calculus in Graph Neural Networks with FROND

Primary Area: learning on graphs and other geometries & topologies

Keywords: graph neural network

Scores: [ 8 8 6 8 ]

We introduce the FRactional-Order graph Neural Dynamical network (FROND), a learning framework that extends traditional graph neural ordinary differential equation (ODE) models by incorporating the time-fractional Caputo derivative. Due to its non-local nature, fractional calculus allows our framework to capture long-term memories in the feature updating process, in contrast to the Markovian nature of updates in traditional graph neural ODE models. This can lead to improved graph representation learning.We offer an interpretation of the feature updating process on graphs from a non-Markovian random walk perspective when the feature updating is governed by a diffusion process. We demonstrate analytically that over-smoothing can be mitigated in this setting.To experimentally demonstrate the versatility of the FROND framework, we evaluate the fractional counterparts of various established graph ODE models. Their consistently superior performance, compared to their original counterparts, highlights the potential of the FROND framework as an effective extension to boost the efficacy of various graph neural ODE models.

POSTER-683: Universal Graph Random Features

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graphs kernels random walks Laplacian adjacency matrix kernel learning ordinary differential equation neural network

Scores: [ 8 8 8 8 ]

POSTER-684: Beyond Spatio-Temporal Representations: Evolving Fourier Transform for Temporal Graphs

Primary Area: learning on graphs and other geometries & topologies

Keywords: Temporal Dynamic Graphs Spectral Transform GNN

Scores: [ 6 6 6 6 ]

We present the Evolving Graph Fourier Transform (EFT), the first invertible spectral transform that captures evolving representations on temporal graphs. We motivate our work by the inadequacy of existing methods for capturing the evolving graph spectra, which are also computationally expensive due to the temporal aspect along with the graph vertex domain. We view the problem as an optimization over the Laplacian of the continuous time dynamic graph. Additionally, we propose pseudo-spectrum relaxations that decompose the transformation process, making it highly computationally efficient. The EFT method adeptly captures the evolving graph's structural and positional properties, making it effective for downstream tasks on evolving graphs. Hence, as a reference implementation, we develop a simple neural model induced with \eft for capturing evolving graph spectra. We empirically validate our theoretical findings on a number of large-scale and standard temporal graph benchmarks and demonstrate that our model achieves state-of-the-art performance.

POSTER-685: Partitioning Message Passing for Graph Fraud Detection

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Fraud Detection

Scores: [ 5 5 6 6 ]

Label imbalance and homophily-heterophily mixture are the fundamental problems encountered when applying Graph Neural Networks (GNNs) to Graph Fraud Detection (GFD) tasks. Existing GNN-based GFD models are designed to augment graph structure to accommodate the inductive bias of GNNs towards homophily, by excluding heterophilic neighbors during message passing. In our work, we argue that the key to applying GNNs for GFD is not to exclude but to {\em distinguish} neighbors with different labels. Grounded in this perspective, we introduce Partitioning Message Passing (PMP), an intuitive yet effective message passing paradigm expressly crafted for GFD. Specifically, in the neighbor aggregation stage of PMP, neighbors with different classes are aggregated with distinct node-specific aggregation functions. By this means, the center node can adaptively adjust the information aggregated from its heterophilic and homophilic neighbors, thus avoiding the model gradient being dominated by benign nodes which occupy the majority of the population. We theoretically establish a connection between the spatial formulation of PMP and spectral analysis to characterize that PMP operates an adaptive node-specific spectral graph filter, which demonstrates the capability of PMP to handle heterophily-homophily mixed graphs. Extensive experimental results show that PMP can significantly boost the performance on GFD tasks.

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Network Link Prediction

Scores: [ 3 8 6 ]

POSTER-687: Counting Graph Substructures with Graph Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: graph neural networks expressive power representation learning subgraph isomorphism cliques cycles motifs substructures count message-passing

Scores: [ 6 6 6 6 ]

Graph Neural Networks (GNNs) are powerful representation learning tools that have achieved remarkable performance in various important tasks. However, their ability to count substructures, which play a crucial role in biological and social networks, remains uncertain. In this work, we fill this gap and characterize the representation power of GNNs in terms of their ability to produce powerful representations that count graph substructures. In particular, we study the message-passing operations of GNNs with random stationary input and show that they can produce permutation equivariant representations that are associated with high-order statistical moments. Using these representations, we prove that GNNs can learn how to count cycles, quasi-cliques, and the number of connected components in a graph. We also provide new insights into the generalization capacity of GNNs. Our analysis is constructive and enables the design of a generic GNN architecture that shows remarkable performance in four distinct tasks: cycle detection, cycle counting, graph classification, and molecular property prediction.

POSTER-688: NP-GL: Extending Power of Nature from Binary Problems to Real-World Graph Learning

Primary Area: learning on graphs and other geometries & topologies

Keywords: graph learning nature-powered computing dynamic physical system

Scores: [ 6 6 5 ]

Nature performs complex computations constantly at clearly lower cost and higher performance than digital computers. It is crucial to understand how to harness the unique computational power of nature in Machine Learning (ML). In the past decade, besides the development of Neural Networks (NNs), the community has also relentlessly explored nature-powered ML paradigms. Although most of them are still predominantly theoretical, a new practical paradigm enabled by the recent advent of CMOS-compatible room-temperature nature-based computers has emerged. By harnessing the nature's power of entropy increase, this paradigm can solve binary learning problems delivering immense speedup and energy savings compared with NNs, while maintaining comparable accuracy. Regrettably, its values to the real world are highly constrained by its binary nature. A clear pathway to its extension to real-valued problems remains elusive. This paper aims to unleash this pathway by proposing a novel end-to-end Nature-Powered Graph Learning (NP-GL) framework. Specifically, through a three-dimensional co-design, NP-GL can leverage the nature's power of entropy increase to efficiently solve real-valued graph learning problems. Experimental results across 4 real-world applications with 6 datasets demonstrate that NP-GL delivers, on average, 6.97×103 speedup and 105 energy consumption reduction with comparable or even higher accuracy than Graph Neural Networks (GNNs).

POSTER-689: InterpGNN: Understand and Improve Generalization Ability of Transdutive GNNs through the Lens of Interplay between Train and Test Nodes

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Generalization bounds PAC-Bayesian Structural imbalance Graph Global Workspace

Scores: [ 6 6 6 6 6 ]

POSTER-690: A Lie Group Approach to Riemannian Normalization for SPD Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Lie Groups Riemannian Batch Normalization SPD Neural Networks

Scores: [ 6 3 8 1 8 ]

Manifold-valued measurements are ubiquitous in numerous applications within computer vision and machine learning. Recent studies have extended the Deep Neural Networks (DNNs) to manifolds, and concomitantly, normalization techniques have also been adapted to several manifolds, referred to as Riemannian normalization. Nonetheless, most of the existing Riemannian normalization methods have been derived in an \emph{ad hoc} manner and only apply to specific manifolds. This paper establishes a unified framework for Riemannian Batch Normalization (RBN) techniques on Lie groups. Our framework offers the theoretical guarantee of controlling both the Riemannian mean and variance. Empirically, we focus on the application to Symmetric Positive Definite (SPD) manifolds, which possess three distinct types of Lie group structures. By the deformation concept, we generalize the existing Lie groups on SPD manifolds into three families of parameterized Lie groups. Specific normalization layers induced by these Lie groups are then proposed for the SPD neural networks. We demonstrate the effectiveness of our approach through three sets of experiments: radar recognition, human action recognition, and electroencephalography (EEG) classification.

ORAL-40: Graph Neural Networks for Learning Equivariant Representations of Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Deep weight space Graph neural networks Transformers Permutation equivariance Implicit neural representations Networks for networks Neural graphs

Scores: [ 8 8 6 ]

Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors.However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself.In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry.Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures.We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods.

POSTER-691: Graph Lottery Ticket Automated

Primary Area: learning on graphs and other geometries & topologies

Keywords: Lottery Ticket Hypothesis Graph Neural Networks Graph Neural Tangent Kernel

Scores: [ 6 6 5 ]

Graph Neural Networks (GNNs) have emerged as the leading deep learning models for graph-based representation learning. However, the training and inference of GNNs on large graphs remain resource-intensive, impeding their utility in real-world scenarios and curtailing their applicability in deeper and more sophisticated GNN architectures. To address this issue, the Graph Lottery Ticket (GLT) hypothesis assumes that GNN with random initialization harbors a pair of core subgraph and sparse subnetwork, which can yield comparable performance and higher efficiency to that of the original dense network and complete graph. Despite that GLT offers a new paradigm for GNN training and inference, existing GLT algorithms heavily rely on trial-and-error pruning rate tuning and scheduling, and adhere to an irreversible pruning paradigm that lacks elasticity. Worse still, current methods suffer scalability issues when applied to deep GNNs, as they maintain the same topology structure across all layers. These challenges hinder the integration of GLT into deeper and larger-scale GNN contexts. To bridge this critical gap, this paper introduces an $\textbf{A}$daptive, $\textbf{D}$ynamic, and $\textbf{A}$utomated framework for identifying $\textbf{G}$raph $\textbf{L}$ottery $\textbf{T}ickets(\textbf{AdaGLT}$). Our proposed method derives its key advantages and addresses the above limitations through the following three aspects: 1) tailoring layer-adaptive sparse structures for various datasets and GNNs, thus endowing it with the capability to facilitate deeper GNNs; 2) integrating the pruning and training processes, thereby achieving a dynamic workflow encompassing both pruning and restoration; 3) automatically capturing graph lottery tickets across diverse sparsity levels, obviating the necessity for extensive pruning parameter tuning. More importantly, we rigorously provide theoretical proofs to guarantee AdaGLT to mitigate over-smoothing issues and obtain improved sparse structures in deep GNN scenarios. Extensive experiments demonstrate that AdaGLT outperforms state-of-the-art competitors across multiple graph datasets of various scales and types, particularly in scenarios involving deep GNNs.

POSTER-692: Rethinking the Power of Graph Canonization in Graph Representation Learning with Stability

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph neural networks Graph canonization Stability

Scores: [ 6 6 ]

SPOTLIGHT-125: Enabling Efficient Equivariant Operations in the Fourier Basis via Gaunt Tensor Products

Primary Area: learning on graphs and other geometries & topologies

Keywords: Equivariant Operations; Tensor Product; Change of basis; Spherical Harmonics; Fourier Basis;

Scores: [ 8 6 8 ]

Developing equivariant neural networks for the E(3) group plays a pivotal role in modeling 3D data across real-world applications. Enforcing this equivariance primarily involves the tensor products of irreducible representations (irreps). However, the computational complexity of such operations increases significantly as higher-order tensors are used. In this work, we propose a systematic approach to substantially accelerate the computation of the tensor products of irreps. We mathematically connect the commonly used Clebsch-Gordan coefficients to the Gaunt coefficients, which are integrals of products of three spherical harmonics. Through Gaunt coefficients, the tensor product of irreps becomes equivalent to the multiplication between spherical functions represented by spherical harmonics. This perspective further allows us to change the basis for the equivariant operations from spherical harmonics to a 2D Fourier basis. Consequently, the multiplication between spherical functions represented by a 2D Fourier basis can be efficiently computed via the convolution theorem and Fast Fourier Transforms. This transformation reduces the complexity of full tensor products of irreps from O(L6) to O(L3), where L is the max degree of irreps. Leveraging this approach, we introduce the Gaunt Tensor Product, which serves as a new method to construct efficient equivariant operations across different model architectures. Our experiments demonstrate both the superior efficiency and improved performance of our approach on a range of tasks. The code and models will be made publicly available.

SPOTLIGHT-126: Graph Metanetworks for Processing Diverse Neural Architectures

Primary Area: learning on graphs and other geometries & topologies

Keywords: Metanetwork graph equivariance expressivity

Scores: [ 6 6 6 ]

Neural networks efficiently encode learned information within their parameters. Consequently, many tasks can be unified by treating neural networks themselves as input data. When doing so, recent studies demonstrated the importance of accounting for the symmetries and geometry of parameter spaces. However, those works developed architectures tailored to specific networks such as MLPs and CNNs without normalization layers, and generalizing such architectures to other types of networks can be challenging. In this work, we overcome these challenges by building new metanetworks --- neural networks that take weights from other neural networks as input. Put simply, we carefully build graphs representing the input neural networks and process the graphs using graph neural networks. Our approach, Graph Metanetworks (GMNs), generalizes to neural architectures where competing methods struggle, such as multi-head attention layers, normalization layers, convolutional layers, ResNet blocks, and group-equivariant linear layers. We prove that GMNs are expressive and equivariant to parameter permutation symmetries that leave the input neural network functions unchanged. We validate the effectiveness of our method on several metanetwork tasks over diverse neural network architectures.

POSTER-693: Graph Parsing Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: GNN graph pooling parsing

Scores: [ 5 8 8 3 ]

Graph pooling compresses graph information into a compact representation. State-of-the-art graph pooling methods follow a hierarchical approach, which reduces the graph size step-by-step. These methods must balance memory efficiency with preserving node information, depending on whether they use node dropping or node clustering. Additionally, fixed pooling ratios or numbers of pooling layers are predefined for all graphs, which prevents personalized pooling structures from being captured for each individual graph. In this work, inspired by bottom-up grammar induction, we propose an efficient graph parsing algorithm to infer the pooling structure, which then drives graph pooling. The resulting Graph Parsing Network (GPN) adaptively learns personalized pooling structure for each individual graph. GPN benefits from the discrete assignments generated by the graph parsing algorithm, allowing good memory efficiency while preserving node information intact. Experimental results on standard benchmarks demonstrate that GPN outperforms state-of-the-art graph pooling methods in graph classification tasks while being able to achieve competitive performance in node classification tasks. We also conduct a graph reconstruction task to show GPN's ability to preserve node information and measure both memory and time efficiency through relevant tests.

POSTER-694: Conformal Inductive Graph Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Conformal Prediction Graph Neural Networks Inductive Node Classification

Scores: [ 5 8 6 6 ]

Conformal prediction (CP) can transform any model's output into prediction sets guaranteed to include (cover) the true label. CP requires exchangeability, a relaxation of the i.i.d. assumption, to obtain a valid distribution-free coverage guarantee. This makes it directly applicable to transductive node-classification. However, conventional CP cannot be applied in inductive settings due to the implicit shift in the (calibration) scores caused by message passing with the new nodes. We fix this issue for both cases of node and edge-exchangeable graphs, recovering the standard coverage guarantee without sacrificing statistical efficiency. We further prove that the guarantee holds independently of the prediction time, e.g. upon arrival of a new node/edge or at any subsequent moment.

POSTER-695: Label-free Node Classification on Graphs with Large Language Models (LLMs)

Primary Area: learning on graphs and other geometries & topologies

Keywords: Node classification Graph Neural Networks Large Language Models

Scores: [ 8 6 6 6 ]

In recent years, there have been remarkable advancements in node classification achieved by Graph Neural Networks (GNNs). However, they necessitate abundant high-quality labels to ensure promising performance. In contrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency on text-attributed graphs. Yet, they face challenges in efficiently processing structural data and suffer from high inference costs. In light of these observations, this work introduces a label-free node classification on graphs with LLMs pipeline, LLM-GNN. It amalgamates the strengths of both GNNs and LLMs while mitigating their limitations. Specifically, LLMs are leveraged to annotate a small portion of nodes and then GNNs are trained on LLMs' annotations to make predictions for the remaining large portion of nodes. The implementation of LLM-GNN faces a unique challenge: how can we actively select nodes for LLMs to annotate and consequently enhance the GNN training? How can we leverage LLMs to obtain annotations of high quality, representativeness, and diversity, thereby enhancing GNN performance with less cost?To tackle this challenge, we develop an annotation quality heuristic and leverage the confidence scores derived from LLMs to advanced node selection. Comprehensive experimental results validate the effectiveness of LLM-GNN. In particular, LLM-GNN can achieve an accuracy of 74.9% on a vast-scale dataset \products with a cost less than 1 dollar.

POSTER-696: Tractable Probabilistic Graph Representation Learning with Graph-Induced Sum-Product Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Deep Graph Networks Graph Neural Networks Bayesian Networks Sum-Product Networks Graph Representation Learning Missing Data Graph Classification Weak Supervision

Scores: [ 6 6 5 ]

POSTER-697: VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Knowledge Distillation Efficient Graph Learning

Scores: [ 5 8 8 6 5 ]

GNN-to-MLP distillation aims to utilize knowledge distillation (KD) to learn computationally-efficient multi-layer perceptron (student MLP) on graph data by mimicking the output representations of teacher GNN. Existing methods mainly make the MLP to mimic the GNN predictions over a few class labels. However, the class space may not be expressive enough for covering numerous diverse local graph structures, thus limiting the performance of knowledge transfer from GNN to MLP. To address this issue, we propose to learn a new powerful graph representation space by directly labeling nodes' diverse local structures for GNN-to-MLP distillation. Specifically, we propose a variant of VQ-VAE to learn a structure-aware tokenizer on graph data that can encode each node's local substructure as a discrete code. The discrete codes constitute a codebook as a new graph representation space that is able to identify different local graph structures of nodes with the corresponding code indices. Then, based on the learned codebook, we propose a new distillation target, namely soft code assignments, to directly transfer the structural knowledge of each node from GNN to MLP. The resulting framework VQGraph achieves new state-of-the-art performance on GNN-to-MLP distillation in both transductive and inductive settings across seven graph datasets. We show that VQGraph with better performance infers faster than GNNs by 828×, and also achieves accuracy improvement over GNNs and stand-alone MLPs by 3.90% and 28.05% on average, respectively.

POSTER-698: Efficient Subgraph GNNs by Learning Effective Selection Policies

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Subgraphs Expressive power Sampling

Scores: [ 6 6 8 6 ]

POSTER-699: PRES: Toward Scalable Memory-Based Dynamic Graph Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Dynamic Graph Neural Network

Scores: [ 6 6 ]

Memory-based Dynamic Graph Neural Networks (MDGNNs) are a family of dynamic graph neural networks that leverage a memory module to extract, distill, and memorize long-term temporal dependencies, leading to superior performance compared to memory-less counterparts. However, training MDGNNs faces the challenge of handling entangled temporal and structural dependencies, requiring sequential and chronological processing of data sequences to capture accurate temporal patterns. During the batch training, the temporal data points within the same batch will be processed in parallel, while their temporal dependencies are neglected. This issue is referred to as temporal discontinuity and restricts the effective temporal batch size, limiting data parallelism and reducing MDGNNs' flexibility in industrial applications. This paper studies the efficient training of MDGNNs at scale, focusing on the temporal discontinuity in training MDGNNs with large temporal batch sizes. We first conduct a theoretical study on the impact of temporal batch size on the convergence of MDGNN training. Based on the analysis, we propose PRES, an iterative prediction-correction scheme combined with a memory coherence learning objective to mitigate the effect of temporal discontinuity, enabling MDGNNs to be trained with significantly larger temporal batches without sacrificing generalization performance. Experimental results demonstrate that our approach enables up to a 4 × larger temporal batch (3.4$\times$ speed-up) during MDGNN training.

ORAL-41: Flow Matching on General Geometries

Primary Area: learning on graphs and other geometries & topologies

Keywords: general manifolds diffusion models continuous normalizing flow

Scores: [ 8 8 8 ]

We propose Riemannian Flow Matching (RFM), a simple yet powerful framework for training continuous normalizing flows on manifolds. Existing methods for generative modeling on manifolds either require expensive simulation, are inherently unable to scale to high dimensions, or use approximations for limiting quantities that result in biased training objectives. Riemannian Flow Matching bypasses these limitations and offers several advantages over previous approaches: it is simulation-free on simple geometries, does not require divergence computation, and computes its target vector field in closed-form. The key ingredient behind RFM is the construction of a relatively simple premetric for defining target vector fields, which encompasses the existing Euclidean case. To extend to general geometries, we rely on the use of spectral decompositions to efficiently compute premetrics on the fly. Our method achieves state-of-the-art performance on real-world non-Euclidean datasets, and we demonstrate tractable training on general geometries, including triangular meshes with highly non-trivial curvature and boundaries.

SPOTLIGHT-127: Consistency Training with Learnable Data Augmentation for Graph Anomaly Detection with Limited Supervision

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph anomaly detection consistency training learnable data augmentation

Scores: [ 5 8 8 ]

Graph Anomaly Detection (GAD) has surfaced as a significant field of research, predominantly due to its substantial influence in production environments. Although existing approaches for node anomaly detection have shown effectiveness, they have yet to fully address two major challenges: operating in settings with limited supervision and managing class imbalance effectively. In response to these challenges, we propose a novel model, ConsisGAD, which is tailored for GAD in scenarios characterized by limited supervision and is anchored in the principles of consistency training. Under limited supervision, ConsisGAD effectively leverages the abundance of unlabeled data for consistency training by incorporating a novel learnable data augmentation mechanism, thereby introducing controlled noise into the dataset. Moreover, ConsisGAD takes advantage of the variance in homophily distribution between normal and anomalous nodes to craft a simplified GNN backbone, enhancing its capability to distinguish effectively between these two classes. Comprehensive experiments on several benchmark datasets validate the superior performance of ConsisGAD in comparison to state-of-the-art baselines.

SPOTLIGHT-128: SaNN: Simple Yet Powerful Simplicial-aware Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Higher-order Representation Learning Simplicial Complexes Simplicial Neural Networks Weisfeiler-Lehman Isomorphism Test

Scores: [ 6 6 8 8 ]

Simplicial neural networks (SNNs) are deep models for higher-order graph representation learning. SNNs learn low-dimensional embeddings of simplices in a simplicial complex by aggregating features of their respective upper, lower, boundary, and coboundary adjacent simplices. The aggregation in SNNs is carried out during training. Since the number of simplices of various orders in a simplicial complex is significantly large, the memory and training-time requirement in SNNs is enormous. In this work, we propose a scalable simplicial-aware neural network (SaNN) model with a constant run-time and memory requirements independent of the size of the simplicial complex and the density of interactions in it. SaNN is based on pre-aggregated simplicial-aware features as inputs to a neural network, so it has a strong simplicial-structural inductive bias. We provide theoretical conditions under which SaNN is provably more powerful than the Weisfeiler-Lehman (WL) graph isomorphism test and as powerful as the simplicial Weisfeiler-Lehman (SWL) test. We also show that SaNN is permutation and orientation equivariant and satisfies simplicial-awareness of the highest order in a simplicial complex. We demonstrate via numerical experiments that despite being computationally economical, the proposed model achieves state-of-the-art performance in predicting trajectories, simplicial closures, and classifying graphs.

POSTER-700: Structural Fairness-aware Active Learning for Graph Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Active Learning Graph Neural Networks Structural Fairness

Scores: [ 5 6 6 6 ]

Graph Neural Networks (GNNs) have seen significant achievements in semi-supervised node classification. Yet, their efficacy often hinges on access to high-quality labeled node samples, which may not always be available in real-world scenarios. While active learning is commonly employed across various domains to pinpoint and label high-quality samples based on data features, graph data present unique challenges due to their intrinsic structures that render nodes non-i.i.d. Furthermore, biases emerge from the positioning of labeled nodes; for instance, nodes closer to the labeled counterparts often yield better performance. To better leverage graph structure and mitigate structural bias in active learning, we present a unified optimization framework (SCARCE), which is also easily incorporated with node features. Extensive experiments demonstrate that the proposed method not only improves the GNNs performance but also paves the way for more fair results.

Primary Area: learning on graphs and other geometries & topologies

Keywords: Link prediction; Node Local Topology; Graph Neural Network ; Cold-start Issues

Scores: [ 6 6 6 5 ]

POSTER-702: Approximately Piecewise E(3) Equivariant Point Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: E(3) equivariant networks

Scores: [ 6 8 6 ]

Integrating a notion of symmetry into point cloud neural networks is a provably effective way to improve their generalization capability. Of particular interest are E(3) equivariant point cloud networks where Euclidean transformations applied to the inputs are preserved in the outputs. Recent efforts aim to extend networks that are equivariant with respect to a single global E(3) transformation, to accommodate inputs made of multiple parts, each of which exhibits local E(3) symmetry.In practical settings, however, the partitioning into individually transforming regions is unknown a priori.Errors in the partition prediction would unavoidably map to errors in respecting the true input symmetry. Past works have proposed different ways to predict the partition, which may exhibit uncontrolled errors in their ability to maintain equivariance to the actual partition. To this end, we introduce APEN: a general framework for constructing approximate piecewise-E(3) equivariant point networks. Our framework offers an adaptable design to guaranteed bounds on the resulting piecewise E(3) equivariance approximation errors.Our primary insight is that functions which are equivariant with respect to a finer partition (compared to the unknown true partition) will also maintain equivariance in relation to the true partition. Leveraging this observation, we propose a compositional design for a partition prediction model. It initiates with a fine partition and incrementally transitions towards a coarser subpartition of the true one, consistently maintaining piecewise equivariance in relation to the current partition.As a result, the equivariance approximation error can be bounded solely in terms of (i) uncertainty quantification of the partition prediction, and (ii) bounds on the probability of failing to suggest a proper subpartition of the ground truth one.We demonstrate the practical effectiveness of APEN using two data types exemplifying part-based symmetry: (i) real-world scans of room scenes containing multiple furniture-type objects; and, (ii) human motions, characterized by articulated parts exhibiting rigid movement. Our empirical results demonstrate the advantage of integrating piecewise E(3) symmetry into network design, showing a distinct improvement in generalization over prior works in terms of generalization accuracy for both classification and segmentation tasks.

POSTER-703: Clifford Group Equivariant Simplicial Message Passing Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Clifford Algebra Geometric Algebra Graph Neural Networks Simplicial Message Passing Topological Deep Learning Geometric Deep Learning Equivariance

Scores: [ 6 6 6 ]

We introduce Clifford Group Equivariant Simplicial Message Passing Networks, a method for steerable E(n)-equivariant message passing on simplicial complexes. Our method integrates the expressivity of Clifford group-equivariant layers with simplicial message passing, which is topologically more intricate than regular graph message passing. Clifford algebras include higher-order objects such as bivectors and trivectors, which express geometric features (e.g., areas, volumes) derived from vectors. Using this knowledge, we represent simplex features through geometric products of their vertices. To achieve efficient simplicial message passing, we share the parameters of the message network across different dimensions. Additionally, we restrict the final message to an aggregation of the incoming messages from different dimensions, leading to what we term shared simplicial message passing. Experimental results show that our method is able to outperform both equivariant and simplicial graph neural networks on a variety of geometric tasks.

POSTER-704: Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Primary Area: learning on graphs and other geometries & topologies

Keywords: graph neural networks Datasets molecules molecular graphs Quantum Multi-task foundation model

Scores: [ 8 6 8 6 3 ]

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models.In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset.In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.

POSTER-705: Probabilistically Rewired Message-Passing Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: GNN MPNN graph rewiring probabilistic weisfeiler leman lehman k-subset sampling expressivity

Scores: [ 8 6 6 ]

Message-passing graph neural networks (MPNNs) emerged as powerful tools for processing graph-structured input. However, they operate on a fixed input graph structure, ignoring potential noise and missing information. Furthermore, their local aggregation mechanism can lead to problems such as over-squashing and limited expressive power in capturing relevant graph structures. Existing solutions to these challenges have primarily relied on heuristic methods, often disregarding the underlying data distribution. Hence, devising principled approaches for learning to infer graph structures relevant to the given prediction task remains an open challenge. In this work, leveraging recent progress in exact and differentiable k-subset sampling, we devise probabilistically rewired MPNNs (PR-MPNNs), which learn to add relevant edges while omitting less beneficial ones. For the first time, our theoretical analysis explores how PR-MPNNs enhance expressive power, and we identify precise conditions under which they outperform purely randomized approaches. Empirically, we demonstrate that our approach effectively mitigates issues like over-squashing and under-reaching. In addition, on established real-world datasets, our method exhibits competitive or superior predictive performance compared to traditional MPNN models and recent graph transformer architectures.

POSTER-706: VCR-Graphormer: A Mini-batch Graph Transformer via Virtual Connections

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Representation Learning Graph Transformer

Scores: [ 5 6 6 5 ]

Graph transformer has been proven as an effective graph learning method for its adoption of attention mechanism that is capable of capturing expressive representations from complex topological and feature information of graphs. Graph transformer conventionally performs dense attention (or global attention) for every pair of nodes to learn node representation vectors, resulting in quadratic computational costs that are unaffordable for large-scale graph data. Therefore, mini-batch training for graph transformers is a promising direction, but limited samples in each mini-batch can not support effective dense attention to encode informative representations.Facing this bottleneck, (1) we start by assigning each node a token list that is sampled by personalized PageRank (PPR) and then apply standard multi-head self-attention only on this list to compute its node representations. This PPR tokenization method decouples model training from complex graph topological information and makes heavy feature engineering offline and independent, such that mini-batch training of graph transformers is possible by loading each node's token list in batches. We further prove this PPR tokenization is viable as a graph convolution network with a fixed polynomial filter and jumping knowledge. However, only using personalized PageRank may limit information carried by a token list, which could not support different graph inductive biases for model training. To this end, (2) we rewire graphs by introducing multiple types of virtual connections through structure- and content-based super nodes that enable PPR tokenization to encode local and global contexts, long-range interaction, and heterophilous information into each node's token list, and then formalize our Virtual Connection Ranking based Graph Transformer (VCR-Graphormer). Overall, VCR-Graphormer only needs O(m+klogk) complexity for graph tokenization as compared to O(n3) of previous works. We also show that VCR-Graphormer outperforms the state-of-the-arts on node classification in 12 datasets.

POSTER-707: Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning

Primary Area: learning on graphs and other geometries & topologies

Keywords: large language models (LLM) feature learning text attributed graphs (TAG) graph neural networks (GNN)

Scores: [ 6 5 6 ]

Representation learning on text-attributed graphs (TAGs) has become a critical research problem in recent years. A typical example of a TAG is a paper citation graph, where the text of each paper serves as node attributes. Initial graph neural network (GNN) pipelines handled these text attributes by transforming them into shallow or hand-crafted features, such as skip-gram or bag-of-words features. Recent efforts have focused on enhancing these pipelines with language models (LMs), which typically demand intricate designs and substantial computational resources. With the advent of powerful large language models (LLMs) such as GPT or Llama2, which demonstrate an ability to reason and to utilize general knowledge, there is a growing need for techniques which combine the textual modelling abilities of LLMs with the structural learning capabilities of GNNs. Hence, in this work, we focus on leveraging LLMs to capture textual information as features, which can be used to boost GNN performance on downstream tasks. A key innovation is our use of \emph{explanations as features}: we prompt an LLM to perform zero-shot classification, request textual explanations for its decision-making process, and design an \emph{LLM-to-LM interpreter} to translate these explanations into informative features that enhance downstream GNNs. Our experiments demonstrate that our method achieves state-of-the-art results on well-established TAG datasets, including \texttt{Cora}, \texttt{PubMed}, \texttt{ogbn-arxiv}, as well as our newly introduced dataset, \texttt{arXiv-2023}. Furthermore, our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on \texttt{ogbn-arxiv}. Lastly, we believe the versatility of the proposed method extends beyond TAGs and holds the potential to enhance other tasks involving graph-text data~\footnote{Our codes and datasets are available at: \url{https://anonymous.4open.science/r/TAPE-dev}}.

POSTER-708: Learning From Simplicial Data Based on Random Walks and 1D Convolutions

Primary Area: learning on graphs and other geometries & topologies

Keywords: simplicial complex simplicial neural network random walks

Scores: [ 8 5 8 5 ]

Triggered by limitations of graph-based deep learning methods in terms of computational expressivity and model flexibility, recent years have seen a surge of interest in computational models that operate on higher-order topological domains such as hypergraphs and simplicial complexes. While the increased expressivity of these models can indeed lead to a better classification performance and a more faithful representation of the underlying system, the computational cost of these higher-order models can increase dramatically. To this end, we here explore a simplicial complex neural network learning architecture based on random walks and fast 1D convolutions (SCRaWl), in which we can adjust the increase in computational cost by varying the length and number of random walks considered while accounting for higher-order relationships. Importantly, due to the random walk-based design, the expressivity of the proposed architecture is provably incomparable to that of existing message-passing simplicial neural networks. We empirically evaluate SCRaWl on real-world datasets and show that it outperforms other simplicial neural networks.

POSTER-709: Differentiable Euler Characteristic Transforms for Shape Classification

Primary Area: learning on graphs and other geometries & topologies

Keywords: Differentiable Euler Characteristic Transforms for Shape Classification

Scores: [ 6 5 6 8 ]

The Euler Characteristic Transform (ECT) has proven to be a powerful representation, combining geometrical and topological characteristics of shapes and graphs. However, the ECT was hitherto unable to learn task-specific representations. We overcome this issue and develop a novel computational layer that enables learning the ECT in an end-to-end fashion. Our method, DECT, is fast and computationally efficient, while exhibiting performance on a par with more complex models in both graph and point cloud classification tasks. Moreover, we show that this seemingly unexpressive statistic still provides the same topological expressivity as more complex topological deep learning layers provide.

POSTER-710: Mitigating Severe Robustness Degradation on Graphs

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph neural network

Scores: [ 6 6 6 8 ]

SPOTLIGHT-129: On the Markov Property of Neural Algorithmic Reasoning: Analyses and Methods

Primary Area: learning on graphs and other geometries & topologies

Keywords: Neural Algorithmic Reasoning

Scores: [ 6 8 8 6 6 ]

Neural algorithmic reasoning is an emerging research direction that endows neural networks with the ability to mimic algorithmic executions step-by-step. A common paradigm in existing designs involves the use of historical embeddings in predicting the results of future execution steps. Our observation in this work is that such historical dependence intrinsically contradicts the Markov nature of algorithmic reasoning tasks. Based on this motivation, we present our ForgetNet, which does not use historical embeddings and thus is consistent with the Markov nature of the tasks. To address challenges in training ForgetNet at early stages, we further introduce G-ForgetNet, which uses a gating mechanism to allow for the selective integration of historical embeddings. Such an enhanced capability could provide valuable guidance during the model's early training phase. Our extensive experiments, based on the CLRS-30 algorithmic reasoning benchmark, demonstrate that both ForgetNet and G-ForgetNet achieve better generalization capability than existing methods. Furthermore, we investigate the behavior of the gating mechanism, highlighting its degree of alignment with our intuitions and its effectiveness for robust performance.

SPOTLIGHT-130: NetInfoF Framework: Measuring and Exploiting Network Usable Information

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Information Theory Heterophily Graphs

Scores: [ 6 8 8 ]

POSTER-711: Rethinking Label Poisoning for GNNs: Pitfalls and Attacks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Label Poisoning Attacks Pitfalls

Scores: [ 6 8 5 8 ]

Node labels for graphs are usually generated using an automated process or crowd-sourced from human users. This opens up avenues for malicious users to compromise the training labels, making it unwise to blindly rely on them. While robustness against noisy labels is an active area of research, there are only a handful of papers in the literature that address this for graph-based data. Even more so, the effects of adversarial label perturbations is sparsely studied. More critically, we reveal that the entire literature on label poisoning for GNNs is plagued by serious evaluation pitfalls. Thus making it hard to conclude how robust GNNs are against label perturbations. After course correcting the state of label poisoning attacks with our faithful evaluation, we identify a discrepancy in attack efficiency of 9% on average. Additionally, we introduce two new simple yet effective attacks that are significantly stronger (up to 8%) than the previous strongest attack. Our strongest proposed attack can be efficiently computed and is theoretically backed.

POSTER-712: Temporal Generalization Estimation in Evolving Graphs

Primary Area: learning on graphs and other geometries & topologies

Keywords: Generalization Estimation Evolving Graph Graph Neural Network

Scores: [ 5 8 6 5 6 ]

Graph Neural Networks (GNNs) are widely deployed in vast fields, but they often struggle to maintain accurate representations as graphs evolve. We theoretically establish a lower bound, proving that under mild conditions, representation distortion inevitably occurs over time. To estimate the temporal distortion without human annotation after deployment, one naive approach is to pre-train a recurrent model (e.g., RNN) before deployment and use this model afterwards, but the estimation is far from satisfactory. In this paper, we analyze the representation distortion from an information theory perspective, and attribute it primarily to inaccurate feature extraction during evolution. Consequently, we introduce Smart, a straightforward and effective baseline enhanced by an adaptive feature extractor through self-supervised graph reconstruction. In synthetic random graphs, we further refine the former lower bound to show the inevitable distortion over time and empirically observe that Smart achieves good estimation performance. Moreover, we observe that Smart consistently shows outstanding generalization estimation on four real-world evolving graphs. The ablation studies underscore the necessity of graph reconstruction. For example, on OGB-arXiv dataset, the estimation metric MAPE deteriorates from 2.19% to 8.00% without reconstruction.

POSTER-713: Improved algorithm and bounds for successive projection

Primary Area: learning on graphs and other geometries & topologies

Keywords: Simplex vertex hunting successive projection pseudo-points pruning hyper-spectral unmixing archetypal analysis network analysis.

Scores: [ 3 6 6 8 ]

POSTER-714: Orbit-Equivariant Graph Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: graph neural networks equivariance expressivity graph orbits

Scores: [ 6 8 6 8 ]

Equivariance is an important structural property that is captured by architectures such as graph neural networks (GNNs). However, equivariant graph functions cannot produce different outputs for similar nodes, which may be undesirable when the function is trying to optimize some global graph property. In this paper, we define orbit-equivariance, a relaxation of equivariance which allows for such functions whilst retaining important structural inductive biases. We situate the property in the hierarchy of graph functions, define a taxonomy of orbit-equivariant functions, and provide four different ways to achieve non-equivariant GNNs. For each, we analyze their expressivity with respect to orbit-equivariance and evaluate them on two novel datasets, one of which stems from a real-world use-case of designing optimal bioisosteres.

POSTER-715: Effective Structural Encodings via Local Curvature Profiles

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Discrete Curvature Rewiring Positional Encodings Structural Encodings

Scores: [ 3 6 8 6 ]

Structural and Positional Encodings can significantly improve the performance of Graph Neural Networks in downstream tasks. Recent literature has begun to systematically investigate differences in the structural properties that these approaches encode, as well as performance trade-offs between them. However, the question of which structural properties yield the most effective encoding remains open. In this paper, we investigate this question from a geometric perspective. We propose a novel structural encoding based on discrete Ricci curvature (Local Curvature Profiles, short LCP) and show that it significantly outperforms existing encoding approaches. We further show that combining local structural encodings, such as LCP, with global positional encodings improves downstream performance, suggesting that they capture complementary geometric information. Finally, we compare different encoding types with (curvature-based) rewiring techniques. Rewiring has recently received a surge of interest due to its ability to improve the performance of Graph Neural Networks by mitigating over-smoothing and over-squashing effects. Our results suggest that utilizing curvature information for structural encodings delivers significantly larger performance increases than rewiring.

POSTER-716: On the Power of the Weisfeiler-Leman Test for Graph Motif Parameters

Primary Area: learning on graphs and other geometries & topologies

Keywords: WL test graph neural networks graph motif parameters subgraph counting

Scores: [ 6 6 8 6 ]

POSTER-717: HoloNets: Spectral Convolutions do extend to Directed Graphs

Primary Area: learning on graphs and other geometries & topologies

Keywords: Spectral Graph Convolutions Directed Graphs Weighted Graphs Complex Analysis Spectral Graph Theory Heterophily Node Classification Graph Regression pectral Graph Theory Rigorous Proofs

Scores: [ 6 8 6 6 ]

Within the graph learning community, conventional wisdom dictates that spectral convolutional networks may only be deployed on undirected graphs: Only there could the existence of a well-defined graph Fourier transform be guaranteed, so that information may be translated between spatial- and spectral domains. Here we show this traditional reliance on the graph Fourier transform to be superfluous: Making use of certain advanced tools from complex analysis and spectral theory, we extend spectral convolutions to directed graphs. We provide a frequency- response interpretation of newly developed filters, investigate the influence of the basis’ used to express filters and discuss the interplay with characteristic operators on which networks are based. In order to thoroughly test the developed general theory, we conduct experiments in real world settings, showcasing that directed spectral convolutional networks provide new state of the art results for heterophilic node classification and – as opposed to baselines – may be rendered stable to resolution-scale varying topological perturbations.

POSTER-718: Simplicial Representation Learning with Neural k-Forms

Primary Area: learning on graphs and other geometries & topologies

Keywords: geometric deep learning differential forms representation learning graph learning geometry topology

Scores: [ 8 6 5 6 5 ]

\emph{Geometric deep learning} extends deep learning to incorporate information about the geometry and topology data, especially in complex domains like graphs. Despite the popularity of message passing in this field, it has limitations such as the need for graph rewiring, ambiguity in interpreting data, and over-smoothing. In this paper, we take a different approach, focusing on leveraging geometric information from simplicial complexes embedded in Rn using node coordinates. We use differential k-forms in Rn to create representations of simplices, offering interpretability and geometric consistency without message passing. This approach also enables us to apply differential geometry tools and achieve universal approximation. Our method is efficient, versatile, and applicable to various input complexes, including graphs, simplicial complexes, and cell complexes. It outperforms existing message passing neural networks in harnessing information from geometrical graphs with node features serving as coordinates.

POSTER-719: Bounding the Expected Robustness of Graph Neural Networks Subject to Node Feature Attacks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Adversarial Robustness

Scores: [ 5 8 6 8 ]

Graph Neural Networks (GNNs) have demonstrated state-of-the-art performance in various graph representation learning tasks. Recently, studies revealed their vulnerability to adversarial attacks. In this work, we theoretically define the concept of expected robustness in the context of attributed graphs and relate it to the classical definition of adversarial robustness in the graph representation learning literature. Our definition allows us to derive an upper bound of the expected robustness of Graph Convolutional Networks (GCNs) and Graph Isomorphism Networks subject to node feature attacks. Building on these findings, we connect the expected robustness of GNNs to the orthogonality of their weight matrices and consequently propose an attack-independent, more robust variant of the GCN, called the Graph Convolutional Orthogonal Robust Networks (GCORNs). We further introduce a probabilistic method to estimate the expected robustness, which allows us to evaluate the effectiveness of GCORN on several real-world datasets. Experimental experiments showed that GCORN outperforms available defense methods.

SPOTLIGHT-131: Hybrid Directional Graph Neural Network for Molecules

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks; Equivariance; Molecular model

Scores: [ 6 8 8 ]

Equivariant message passing neural networks have emerged as the prevailing approach for predicting chemical properties of molecules due to their ability to leverage translation and rotation symmetries, resulting in a strong inductive bias. However, the equivariant operations in each layer can impose excessive constraints on the function form and network flexibility. To address these challenges, we introduce a novel network called the Hybrid Directional Graph Neural Network (HDGNN), which effectively combines strictly equivariant operations with learnable modules. We evaluate the performance of HDGNN on the QM9 dataset and the IS2RE dataset of OC20, demonstrating its state-of-the-art performance on several tasks and competitive performance on others. Our code is anonymously released on https://github.com/AnonymousACode/HDGNN.

POSTER-720: Forward Learning of Graph Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: graph neural networks forward learning forward-forward algorithm

Scores: [ 6 6 6 8 ]

Graph neural networks (GNNs) have achieved remarkable success across a wide range of applications, such as recommendation, drug discovery, and question answering. Behind the success of GNNs lies the backpropagation (BP) algorithm, which is the de facto standard for training deep neural networks. However, despite its effectiveness, BP imposes several constraints, which are not only biologically implausible, but also limit the scalability, parallelism, and flexibility in learning neural networks. Examples of such constraints include the storage of neural activities computed in the forward pass for use in the subsequent backward pass, and the dependence of parameter updates on non-local signals. To address these limitations, the forward-forward algorithm (FF) was recently proposed as an alternative to BP in the image classification domain, which trains neural networks by performing two forward passes over positive and negative data. Inspired by this advance, we propose ForwardGNN in this work, a new forward learning procedure for GNNs, which avoids the constraints imposed by BP via an effective layer-wise local forward training. ForwardGNN extends the original FF to deal with graph data and GNNs, and makes it possible to operate without generating negative inputs (hence no longer forward-forward). Further, ForwardGNN enables each layer to learn from both the bottom-up and top-down signals without relying on the backpropagation of errors. Extensive experiments involving five real-world datasets and three representative GNNs show the effectiveness and generality of the proposed forward graph learning framework.

POSTER-721: Causality-Inspired Spatial-Temporal Explanations for Dynamic Graph Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Dynamic Graph Graph Explanation Graph Neural Network Causal Inference

Scores: [ 5 5 8 6 ]

Dynamic Graph Neural Networks (DyGNNs) have gained significant popularity in the research of dynamic graphs, but are limited by the low transparency, such that human-understandable insights can hardly be drawn from their predictions. Although a number of existing research have been devoted to investigating the interpretability of graph neural networks (GNNs), achieving the interpretability of DyGNNs is pivotally challenging due to the complex spatial-temporal correlations in dynamic graphs. To this end, we propose an innovative causality-inspired generative model based on structural causal model (SCM), which explores the underlying philosophies of DyGNN predictions by identifying the trivial, static, and dynamic causal relationships. To reach this goal, two critical tasks need to be accomplished including (1) disentangling the complex causal relationships, and (2) fitting the spatial-temporal explanations of DyGNNs in the SCM architecture. To tackle these challenges, the proposed method incorporates a contrastive learning module to disentangle trivial and causal relationships, and a dynamic correlating module to disentangle dynamic and static causal relationships, respectively. A dynamic VGAE-based framework is further developed, which generates causal-and-dynamic masks for spatial interpretability, and recognizes dynamic relationships along the time horizon through causal invention for temporal interpretability. Comprehensive experiments have been conducted on both synthetic and real-world datasets, where our approach yields substantial improvements, thereby demonstrating significant superiority.

Primary Area: learning on graphs and other geometries & topologies

Keywords: Link Prediction;Graph Neural Network

Scores: [ 6 6 8 5 ]

POSTER-723: Local Graph Clustering with Noisy Labels

Primary Area: learning on graphs and other geometries & topologies

Keywords: local graph clustering graph diffusion attributed graphs noisy labels

Scores: [ 6 6 8 3 ]

The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been made to the development of fast local methods (i.e. without accessing the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.

POSTER-724: Rethinking and Extending the Probabilistic Inference Capacity of GNNs

Primary Area: learning on graphs and other geometries & topologies

Keywords: graph neural networks expressiveness approximate inference

Scores: [ 5 6 8 6 8 ]

POSTER-725: A Characterization Theorem for Equivariant Networks with Point-wise Activations

Primary Area: learning on graphs and other geometries & topologies

Keywords: Geometric Deep Learning Equivariant Neural Networks Characterization Theorem Point-wise Activations

Scores: [ 8 6 6 8 ]

Equivariant neural networks have shown improved performance, expressiveness and sample complexity on symmetrical domains. But for some specific symmetries, representations, and choice of coordinates, the most common point-wise activations, such as ReLU, are not equivariant, hence they cannot be employed in the design of equivariant neural networks. The theorem we present in this paper describes all possibile combinations of representations, choice of coordinates and point-wise activations to obtain an equivariant layer, generalizing and strengthening existing characterizations.Notable cases of practical relevance are discussed as corollaries. Indeed, we prove that rotation-equivariant networks can only be invariant, as it happens for any network which is equivariant with respect to connected compact groups. Then, we discuss implications of our findings when applied to important instances of equivariant networks. First, we completely characterize permutation equivariant networks such as Invariant Graph Networks with point-wise nonlinearities and their geometric counterparts, highlighting a plethora of models whose expressive power and performance are still unknown. Second, we show that feature spaces of disentangled steerable convolutional neural networks are trivial representations.

POSTER-726: Mirage: Model-agnostic Graph Distillation for Graph Classification

Primary Area: learning on graphs and other geometries & topologies

Keywords: graph distillation graph classification frequent pattern mining

Scores: [ 6 6 6 6 ]

GNNs, like other deep learning models, are data and computation hungry. There is a pressing need to scale training of GNNs on large datasets to enable their usage on low-resource environments. Graph distillation is an effort in that direction with the aim to construct a smaller synthetic training set from the original training data without significantly compromising model performance. While initial efforts are promising, this work is motivated by two key observations: (1) Existing graph distillation algorithms themselves rely on training with the full dataset, which undermines the very premise of graph distillation. (2) The distillation process is specific to the target GNN architecture and hyper-parameters and thus not robust to changes in the modeling pipeline. We circumvent these limitations by designing a distillation algorithm called MIRAGE for graph classification. MIRAGE is built on the insight that a message-passing GNN decomposes the input graph into a multiset of computation trees. Furthermore, the frequency distribution of computation trees is often skewed in nature, enabling us to condense this data into a concise distilled summary. By compressing the computation data itself, as opposed to emulating gradient flows on the original training set—a prevalent approach to date—MIRAGE transforms into an unsupervised and architecture-agnostic distillation algorithm. Extensive benchmarking on real-world datasets underscores MIRAGE’s superiority, showcasing enhanced generalization accuracy, data compression, and distillation efficiency when compared to state-of-the-art baselines.

POSTER-727: Robust Angular Synchronization via Directed Graph Neural Networks

Primary Area: learning on graphs and other geometries & topologies

Keywords: group synchronization angular synchronization neural networks directed graphs deep learning cycle consistency

Scores: [ 8 6 6 5 ]

The angular synchronization problem aims to accurately estimate (up to a constant additive phase) a set of unknown angles θ1,,θn[0,2π) from m noisy measurements of their offsets θiθj mod 2π. Applications include, for example, sensor network localization, phase retrieval, and distributed clock synchronization. An extension of the problem to the heterogeneous setting (dubbed k-synchronization) is to estimate k groups of angles simultaneously, given noisy observations (with unknown group assignment) from each group. Existing methods for angular synchronization usually perform poorly in high-noise regimes, which are common in applications. In this paper, we leverage neural networks for the angular synchronization problem, and its heterogeneous extension, by proposing GNNSync, a theoretically-grounded end-to-end trainable framework using directed graph neural networks. In addition, new loss functions are devised to encode synchronization objectives. Experimental results on extensive data sets demonstrate that GNNSync attains competitive, and often superior, performance against a comprehensive set of baselines for the angular synchronization problem and its extension, validating the robustness of GNNSync even at high noise levels.

POSTER-728: Learning Adaptive Multiresolution Transforms via Meta-Framelet-based Graph Convolutional Network

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph neural networks graph multiresolution analysis

Scores: [ 6 8 8 6 ]

POSTER-729: Locality-Aware Graph Rewiring in GNNs

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks Message Passing Neural Networks Over-squashing Graph Rewiring

Scores: [ 8 5 3 6 ]

POSTER-730: Rayleigh Quotient Graph Neural Networks for Graph-level Anomaly Detection

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph-level anomaly detection Spectral GNN Rayleigh Quotient

Scores: [ 6 6 6 ]

Graph-level anomaly detection has gained significant attention as it finds many applications in various domains, such as cancer diagnosis and enzyme prediction. However, existing methods fail to capture the underlying properties of graph anomalies, resulting in unexplainable framework design and unsatisfying performance. In this paper, we take a step back and re-investigate the spectral differences between anomalous and normal graphs. Our main observation shows a significant disparity in the accumulated spectral energy between these two classes. Moreover, we prove that the accumulated spectral energy of the graph signal can be represented by its Rayleigh Quotient, indicating that the Rayleigh Quotient is a driving factor behind the anomalous properties of graphs. Motivated by this, we propose Rayleigh Quotient Graph Neural Network (RQGNN), the first spectral GNN for graph-level anomaly detection, providing a new perspective on exploring the inherent spectral features of anomalous graphs. Specifically, we introduce a novel framework that consists of two components: the Rayleigh Quotient learning component (RQL) and Chebyshev Wavelet GNN with RQ-pooling (CWGNN-RQ). RQL explicitly captures the Rayleigh Quotient of graphs and CWGNN-RQ implicitly explores the spectral space of graphs. Extensive experiments on 10 real-world datasets show that RQGNN outperforms the best rival by 6.74% in Macro-F1 score and 1.44% in AUC, demonstrating the effectiveness of our framework.

SPOTLIGHT-132: One For All: Towards Training One Graph Model For All Classification Tasks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Network Large Language Model In-context Learning

Scores: [ 10 6 6 6 ]

POSTER-731: Understanding Expressivity of Neural KG Reasoning from Rule Structure Learning

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graph Neural Networks KG reasoning Link prediction Rule structure learning Expressivity

Scores: [ 8 6 6 5 ]

Knowledge graph (KG) reasoning refers to the task of deducing new facts from the existing facts in KG, which has been applied in many fields. Recently, Graph Neural Networks (GNNs) with tail entity scoring achieve the state-of-the-art performance on KG reasoning. However, the theoretical understandings for these GNNs are either lacking or focusing on single-relational graphs, leaving what the kind of rule structures these GNNs can learn an open problem. We propose to fill the above gap in this paper. Specifically, GNNs with tail entity scoring are unified into a common framework. Then, we analyze their expressivity by formally describing the rule structures they can learn and theoretically demonstrating their superiority. These results further inspire us to propose a novel labeling strategy to learn more rule structures in KG reasoning. Experimental results are consistent with our theoretical findings and verify the effectiveness of our proposed method.

SPOTLIGHT-133: Improving Generalization in Equivariant Graph Neural Networks with Physical Inductive Biases

Primary Area: learning on graphs and other geometries & topologies

Keywords: Equivariant Graph Neural Network Graph Neural Network

Scores: [ 6 8 6 ]

Graph Neural Networks (GNNs) with equivariant properties have emerged as powerful tools for modeling complex dynamics of multi-object physical systems. However, their generalization ability is limited by the inadequate consideration of physical inductive biases: (1) Existing studies overlook the continuity of transitions among system states, opting to employ several discrete transformation layers to learn the direct mapping between two adjacent states; (2) Most models only account for first-order velocity information, despite the fact that many physical systems are governed by second-order motion laws. To incorporate these inductive biases, we propose the Second-order Equivariant Graph Neural Ordinary Differential Equation (SEGNO). Specifically, we show how the second-order continuity can be incorporated into GNNs while maintaining the equivariant property. Furthermore, we offer theoretical insights into SEGNO, highlighting that it can learn a unique trajectory between adjacent states, which is crucial for model generalization. Additionally, we prove that the discrepancy between this learned trajectory of SEGNO and the true trajectory is bounded. Extensive experiments on complex dynamical systems including molecular dynamics and motion capture demonstrate that our model yields a significant improvement over the state-of-the-art baselines.

POSTER-732: Polynomial Width is Sufficient for Set Representation with High-dimensional Features

Primary Area: learning on graphs and other geometries & topologies

Keywords: Set Representation; Permutation Invariance; Permutation Equivariance

Scores: [ 8 8 8 5 ]

Set representation has become ubiquitous in deep learning for modeling the inductive bias of neural networks that are insensitive to the input order. DeepSets is the most widely used neural network architecture for set representation. It involves embedding each set element into a latent space with dimension L, followed by a sum pooling to obtain a whole-set embedding, and finally mapping the whole-set embedding to the output. In this work, we investigate the impact of the dimension L on the expressive power of DeepSets. Previous analyses either oversimplified high-dimensional features to be one-dimensional features or were limited to analytic activations, thereby diverging from practical use or resulting in L that grows exponentially with the set size N and feature dimension D. To investigate the minimal value of L that achieves sufficient expressive power, we present two set-element embedding layers: (a) linear + power activation (LP) and (b) linear + exponential activations (LE). We demonstrate that L being poly(N,D) is sufficient for set representation using both embedding layers. We also provide a lower bound of L for the LP embedding layer. Furthermore, we extend our results to permutation-equivariant set functions and the complex field.

POSTER-733: Repelling Random Walks

Primary Area: learning on graphs and other geometries & topologies

Keywords: Graphs random walkers quasi-Monte Carlo kernel PageRank graphlets scalable mixing

Scores: [ 6 6 6 ]

We present a novel quasi-Monte Carlo mechanism to improve graph-based sampling, coined repelling random walks. By inducing correlations between the trajectories of an interacting ensemble such that their marginal transition probabilities are unmodified, we are able to explore the graph more efficiently, improving the concentration of statistical estimators whilst leaving them unbiased. The mechanism has a trivial drop-in implementation. We showcase the effectiveness of repelling random walks in a range of settings including estimation of graph kernels, the PageRank vector and graphlet concentrations. We provide detailed experimental evaluation and robust theoretical guarantees. To our knowledge, repelling random walks constitute the first rigorously studied quasi-Monte Carlo scheme correlating the directions of walkers on a graph, inviting new research in this exciting nascent domain.

POSTER-734: From Latent Graph to Latent Topology Inference: Differentiable Cell Complex Module

Primary Area: learning on graphs and other geometries & topologies

Keywords: Topological Deep Learning Geometric Deep Learning Latent Topology Inference Latent Graph Inference Cell Complexes

Scores: [ 8 8 8 ]

Latent Graph Inference (LGI) relaxed the reliance of Graph Neural Networks (GNNs) on a given graph topology by dynamically learning it. However, most of LGI methods assume to have a (noisy, incomplete, improvable, ...) input graph to rewire and can solely learn regular graph topologies. In the wake of the success of Topological Deep Learning (TDL), we study Latent Topology Inference (LTI) for learning higher-order cell complexes (with sparse and not regular topology) describing multi-way interactions between data points. To this aim, we introduce the Differentiable Cell Complex Module (DCM), a novel learnable function that computes cell probabilities in the complex to improve the downstream task. We show how to integrate DCM with cell complex message-passing networks layers and train it in an end-to-end fashion, thanks to a two-step inference procedure that avoids an exhaustive search across all possible cells in the input, thus maintaining scalability. Our model is tested on several homophilic and heterophilic graph datasets and it is shown to outperform other state-of-the-art techniques, offering significant improvements especially in cases where an input graph is not provided.

POSTER-735: On the Limitations of Temperature Scaling for Distributions with Overlaps

Primary Area: learning theory

Keywords: calibration temperature scaling mixup label noise

Scores: [ 6 6 8 ]

Despite the impressive generalization capabilities of deep neural networks, they have been repeatedly shown to be overconfident when they are wrong. Fixing this issue is known as model calibration, and has consequently received much attention in the form of modified training schemes and post-training calibration procedures such as temperature scaling. While temperature scaling is frequently used because of its simplicity, it is often outperformed by modified training schemes. In this work, we identify a specific bottleneck for the performance of temperature scaling. We show that for empirical risk minimizers for a general set of distributions in which the supports of classes have overlaps, the performance of temperature scaling degrades with the amount of overlap between classes, and asymptotically becomes no better than random when there are a large number of classes. On the other hand, we prove that optimizing a modified form of the empirical risk induced by the Mixup data augmentation technique can in fact lead to reasonably good calibration performance, showing that training-time calibration may be necessary in some situations. We also verify that our theoretical results reflect practice by showing that Mixup significantly outperforms empirical risk minimization (with respect to multiple calibration metrics) on image classification benchmarks with class overlaps introduced in the form of label noise.

POSTER-736: An Efficient Tester-Learner for Halfspaces

Primary Area: learning theory

Keywords: testable learning pac learning agnostic learning Massart label noise adversarial label noise distribution testing

Scores: [ 6 5 8 8 ]

We give the first efficient algorithm for learning halfspaces in the testable learning model recently defined by Rubinfeld and Vasilyan [2022]. In this model, a learner certifies that the accuracy of its output hypothesis is near optimal whenever the training set passes an associated test, and training sets drawn from some target distribution must pass the test. This model is more challenging than distribution-specific agnostic or Massart noise models where the learner is allowed to fail arbitrarily if the distributional assumption does not hold. We consider the setting where the target distribution is the standard Gaussian in d dimensions and the label noise is either Massart or adversarial (agnostic). For Massart noise, our tester-learner runs in polynomial time and outputs a hypothesis with (information-theoretically optimal) error opt+ϵ (and extends to any fixed strongly log-concave target distribution). For adversarial noise, our tester-learner obtains error O(opt)+ϵ in polynomial time. Prior work on testable learning ignores the labels in the training set and checks that the empirical moments of the covariates are close to the moments of the base distribution. Here we develop new tests of independent interest that make critical use of the labels and combine them with the moment-matching approach of Gollakota et al. [2022]. This enables us to implement a testable variant of the algorithm of Diakonikolas et al. [2020a, 2020b] for learning noisy halfspaces using nonconvex SGD.

POSTER-737: Achieving Fairness in Multi-Agent MDP Using Reinforcement Learning

Primary Area: learning theory

Keywords: Fairness Multi-Agent Reinforcement Learning Markov Decision Process

Scores: [ 8 8 8 3 ]

Fairness plays a crucial role in various multi-agent systems (e.g., communication networks, financial markets, etc.). Many multi-agent dynamical interactions can be cast as Markov Decision Processes (MDPs). While existing research has focused on studying fairness in known environments, the exploration of fairness in such systems for unknown environments remains open. In this paper, we propose a Reinforcement Learning (RL) approach to achieve fairness in multi-agent finite-horizon episodic MDPs. Instead of maximizing the sum of individual agents' value functions, we introduce a fairness function that ensures equitable rewards across agents. Since the classical Bellman's equation does not hold when the sum of individual value functions is not maximized, we cannot use traditional approaches. Instead, in order to explore, we maintain a confidence bound of the unknown environment and then propose an online convex optimization based approach to obtain a policy constrained to this confidence region. We show that such an approach achieves sub-linear regret in terms of the number of episodes. Additionally, we provide a probably approximately correct (PAC) guarantee based on the obtained regret bound. We also propose an offline RL algorithm and bound the optimality gap with respect to the optimal fair solution. To mitigate computational complexity, we introduce a policy-gradient type method for the fair objective. Simulation experiments also demonstrate the efficacy of our approach.

POSTER-738: Incentivized Truthful Communication for Federated Bandits

Primary Area: learning theory

Keywords: Contextual bandit Federated learning Truthful mechanism design

Scores: [ 6 8 6 8 ]

To enhance the efficiency and practicality of federated bandit learning, recent advances have introduced incentives to motivate communication among clients, where a client participates only when the incentive offered by the server outweighs its participation cost.However, existing incentive mechanisms naively assume the clients are truthful: they all report their true coat and thus the higher cost one participating client claims, the more the server has to pay. Such mechanisms are therefore vulnerable to strategic clients aiming to optimize their own utility by misreporting. To address this issue, we propose an incentive compatible (i.e., truthful) communication protocol, named Truth-FedBan, where the incentive for each participant is independent of its self-reported cost, and reporting the true costs is the only way to achieve the best utility for each client. Truth-FedBan also provides near-optimal theoretical guarantees on regret and communication cost. Extensive numerical studies further validate the effectiveness of our proposed solution for incentivized federated bandits.

POSTER-739: Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining

Primary Area: learning theory

Keywords: transformers in-context learning reinforcement learning learning theory

Scores: [ 6 8 6 5 ]

Large transformer models pretrained on offline reinforcement learning datasets have demonstrated remarkable in-context reinforcement learning (ICRL) capabilities, where they can make good decisions when prompted with interaction trajectories from unseen environments. However, when and how transformers can be trained to perform ICRL have not been theoretically well-understood. In particular, it is unclear which reinforcement-learning algorithms transformers can perform in context, and how distribution mismatch in offline training data affects the learned algorithms. This paper provides a theoretical framework that analyzes supervised pretraining for ICRL. This includes two recently proposed training methods --- algorithm distillation and decision-pretrained transformers. First, assuming model realizability, we prove the supervised-pretrained transformer will imitate the conditional expectation of the expert algorithm given the observed trajectory. The generalization error will scale with model capacity and a distribution divergence factor between the expert and offline algorithms. Second, we show transformers with ReLU attention can efficiently approximate near-optimal online reinforcement learning algorithms like LinUCB and Thompson sampling for stochastic linear bandits, and UCB-VI for tabular Markov decision processes. This provides the first quantitative analysis of the ICRL capabilities of transformers pretrained from offline trajectories.

SPOTLIGHT-134: Inverse Approximation Theory for Nonlinear Recurrent Neural Networks

Primary Area: learning theory

Keywords: Recurrent neural networks sequence modelling approximation theory

Scores: [ 8 8 8 5 ]

We prove an inverse approximation theorem for the approximation of nonlinear sequence-to-sequence relationships using recurrent neural networks (RNNs). This is a so-called Bernstein-type result in approximation theory, which deduces properties of a target function under the assumption that it can be effectively approximated by a hypothesis space. In particular, we show that nonlinear sequence relationships that can be stably approximated by nonlinear RNNs must have an exponential decaying memory structure - a notion that can be made precise. This extends the previously identified curse of memory in linear RNNs into the general nonlinear setting, and quantifies the essential limitations of the RNN architecture for learning sequential relationships with long-term memory. Based on the analysis, we propose a principled reparameterization method to overcome the limitations. Our theoretical results are confirmed by numerical experiments.

SPOTLIGHT-135: How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization

Primary Area: learning theory

Keywords: non-convex optimization random initialization global convergence matrix recovery matrix sensing

Scores: [ 8 8 8 8 ]

This paper rigorously shows how over-parameterization dramatically changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements.First, we consider the symmetric setting with the symmetric parameterization where MRn×n is a positive semi-definite unknown matrix of rank rn, and one uses a symmetric parameterization XX to learn M. Here XRn×k with k>r is the factor matrix. We give a novel Ω(1/T2) lower bound of randomly initialized GD for the over-parameterized case (k>r) where T is the number of iterations. This is in stark contrast to the exact-parameterization scenario (k=r) where the convergence rate is exp(Ω(T)). Next, we study asymmetric setting where MRn1×n2 is the unknown matrix of rank rmin{n1,n2}, and one uses an asymmetric parameterization FG to learn M where FRn1×k and GRn2×k. We give the first global exact convergence result of randomly initialized GD for the exact-parameterization case (k=r) with an exp(Ω(T)) rate. Furthermore, we give the first global exact convergence result for the over-parameterization case (k>r) with an exp(Ω(α2T)) rate where α is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from Ω(1/T2) to linear convergence. Therefore, we identify a surprising phenomenon: asymmetric parameterization can exponentially speed up convergence. Equally surprising is our analysis that highlights the importance of imbalance between F and G. This is in sharp contrast to prior works which emphasize balance. We further give an example showing the dependency on α in the convergence rate is unavoidable in the worst case. On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of α, recovering the rate in the exact-parameterization case. We provide empirical studies to verify our theoretical findings.

POSTER-740: Toward Student-oriented Teacher Network Training for Knowledge Distillation

Primary Area: learning theory

Keywords: Knowledge distillation Teacher-student training Empirical risk minimization

Scores: [ 5 6 5 5 ]

How to conduct teacher training for knowledge distillation is still an open problem. It has been widely observed that a best-performing teacher does not necessarily yield the best-performing student, suggesting a fundamental discrepancy between the current teacher training practice and the ideal teacher training strategy. To fill this gap, we explore the feasibility of training a teacher that is oriented toward student performance with empirical risk minimization (ERM). Our analyses are inspired by the recent findings that the effectiveness of knowledge distillation hinges on the teacher’s capability to approximate the true label distribution of training inputs. We theoretically establish that ERM minimizer can approximate the true label distribution of training data as long as the feature extractor of the learner network is Lipschitz continuous and is robust to feature transformations. In light of our theory, we propose a teacher training method SoTeacher which incorporates Lipschitz regularization and consistency regularization into ERM. Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.

POSTER-741: Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate

Primary Area: learning theory

Keywords: deep learning theory; large learning rate; oscillation of stochastic gradient descent;

Scores: [ 3 6 8 ]

In this work, we theoretically investigate the generalization property of neural networks (NN) trained by stochastic gradient descent (SGD) with \emph{large learning rate}. Under such a training regime, our finding is that, the oscillation of the NN weights caused by SGD with large learning rates turns out to be beneficial to generalization, potentially improving over the same NN trained by SGD with small learning rates that converges more smoothly. In view of the findings, we call such a phenomenon “benign oscillation”. Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small 2-norm and appear in each data point; (ii) strong features which have a large 2-norm but appear only in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate only learn the strong features but make little progress in learning the weak features. Consequently, when it comes to the new testing data points that consist of only weak features, the NN trained by oscillating SGD with large learning rates can still make correct predictions, while the NN trained by SGD with small learning rates could not. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our findings on the phenomenon of “benign oscillation".

SPOTLIGHT-136: How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

Primary Area: learning theory

Keywords: in-context learning linear regression ridge regression Bayes optimality

Scores: [ 6 8 8 5 ]

Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters. In this paper, we study ICL in one of its simplest setups: pretraining a single-layer linear attention model for linear regression with a Gaussian prior. We establish a statistical task complexity bound for the attention model pretraining, showing that effective pretraining only requires a small number of independent tasks. Furthermore, we prove that the pretrained model closely matches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, by achieving nearly Bayes optimal risk on unseen tasks under a fixed context length. These theoretical findings complement prior experimental research and shed light on the statistical foundations of ICL.

POSTER-742: Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning

Primary Area: learning theory

Keywords: model-free RL self-play memory efficiency Q-learning Nash equilibrium Markov policy

Scores: [ 8 8 6 6 ]

The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample complexity on the long horizon and the large state space, the high computational complexity, non-Markov policy, non-Nash policy, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm \emph{Memory-Efficient Nash Q-Learning (ME-Nash-QL)} for two-player zero-sum Markov games, which is a specific setting of MARL. We prove that ME-Nash-QL can output an ε-approximate Nash policy with remarkable space complexity O(SABH), sample complexity O~(H4SAB/ε2), and computational complexity O(Tpoly(AB)), where S is the number of states, {A,B} is the number of actions for the two players, H is the horizon length, and T is the number of samples. Notably, our approach outperforms in terms of space complexity compared to existing algorithms for tabular cases. It achieves the lowest computational complexity while preserving Markov policies, setting a new standard. Furthermore, our algorithm outputs a Nash policy and achieves the best sample complexity compared with the existing guarantee for long horizons, i.e. when minA,BH2. Our algorithm also achieves the best burn-in cost O(SABpoly(H)), whereas previous algorithms need at least O(S3ABpoly(H)) to attain the same level of sample complexity with ours.

SPOTLIGHT-137: How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation

Primary Area: learning theory

Keywords: Attention computation kronecker computation

Scores: [ 8 8 8 ]

In the classical transformer attention scheme, we are given three n×d size matrices Q,K,V (the query, key, and value tokens), and the goal is to compute a new n×d size matrix D1exp(QK)V where D=diag(exp(QK)1n). Here, exp() is applied entry-wise and 1n denotes a length-n vector whose entries are all ones.Intuitively, attention computation captures pairwise information between words in a sentence, but not higher-order information. Indeed, recent work \cite{sht23} has shown that attention units cannot solve simple problems about detecting triples of connected words.In this work, we study a generalization of attention which captures triple-wise correlations. The generalization is based on computations involving tensors defined by tuples of words. More formally, given five n×d size matrices Q,K1,K2,V1 and V2 (generalized query, key, and value tokens), our new goal is to compute an n×d size matrix $D^{-1} \exp( Q ( K_1 \oslash K_2)^\top ) (V_1 \oslash V_2) $ where D=diag(exp(Q(K1K2))1n2) and K1K2Rn2×d denotes the column-wise Kronecker product of K1 and K2. This generalization is indeed able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers.The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in n. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations: On the positive side, if all entries of the input matrices are bounded above by o(logn3) then we show how to approximate the tensor-type'' attention matrix in n1+o(1) time. On the negative side, we show that if the entries of the input matrices may be as large as Ω(logn3), then there is no algorithm that runs faster than n3o(1) (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory).We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries, and order of the tensor one may use for more expressive, efficient attention computation.Our constructions make use of a novel connection with a higher-order variant on the kernel density estimation problem. They combine a number of technical tools, including the polynomial method, algebraic geometry codes, and multiparty Merlin-Arthur communication protocols.

SPOTLIGHT-138: Learning to Reject for Balanced Error and Beyond

Primary Area: learning theory

Keywords: Learning to reject balanced error evaluation metrics selective classification plug-in approach long-tail learning class imbalance non-decomposable metrics

Scores: [ 8 8 8 8 ]

Learning to reject (L2R) is a classical problem where one seeks a classifier capable of abstaining on low-confidence samples. Most prior work on L2R has focused on minimizing the standard misclassification error. However, in many real-world applications, the label distribution is highly imbalanced, necessitating alternate evaluation metrics such as the balanced error or the worst-group error that enforce equitable performance across both the head and tail classes. In this paper, we establish that traditional L2R methods can be grossly sub-optimal for such metrics, and show that this is due to an intricate dependence in the objective between the label costs and the rejector. We then derive the form of the Bayes-optimal classifier and rejector for the balanced error, propose a novel plug-in approach to mimic this solution, and extend our results to general evaluation metrics. Through experiments on benchmark image classification tasks, we show that our approach yields better trade-offs in both the balanced and worst-group error compared to L2R baselines.

POSTER-743: Learning Conditional Invariances through Non-Commutativity

Primary Area: learning theory

Keywords: Invariance Learning Domain Adaptation

Scores: [ 5 8 6 ]

Invariance learning algorithms that conditionally filter out domain-specific random variables as distractors, do so based only on the data semantics, and not the target domain under evaluation. We show that a provably optimal and sample-efficient way of learning conditional invariances is by relaxing the invariance criterion to be non-commutatively directed towards the target domain. Under domain asymmetry, i.e., when the target domain contains semantically relevant information absent in the source, the risk of the encoder φ that is optimal on average across domains is strictly lower-bounded by the risk of the target-specific optimal encoder Φτ. We prove that non-commutativity steers the optimization towards Φτ instead of φ, bringing the H-divergence between domains down to zero, leading to a stricter bound on the target risk. Both our theory and experiments demonstrate that non-commutative invariance (NCI) can leverage source domain samples to meet the sample complexity needs of learning Φτ, surpassing SOTA invariance learning algorithms for domain adaptation, at times by over 2%, approaching the performance of an oracle. Implementation will be made public upon acceptance.

SPOTLIGHT-139: Generalization of Deep ResNets in the Mean-Field Regime

Primary Area: learning theory

Keywords: ResNet mean field generalization Rademacher complexity

Scores: [ 5 8 8 ]

Despite the widespread empirical success of ResNet, the generalization ability of deep ResNet is rarely explored beyond the lazy-training regime. In this work, we investigate ResNet in the limit of infinitely deep and wide neural networks, of which the gradient flow is described by a partial differential equation in the large-neural network limit, i.e., the \emph{mean-field} regime. To derive the generalization bounds under this setting, our analysis necessitates a shift from the conventional time-invariant Gram matrix employed in the lazy training regime to a time-variant, distribution-dependent version tailored to the mean-field regime. To this end, we provide a lower bound on the minimum eigenvalue of the Gram matrix under the mean-field regime. Besides, the traceability of the dynamic of Kullback-Leibler (KL) divergence is also required under the mean-field regime. We therefore establish the linear convergence of the empirical error and estimate the upper bound of the KL divergence over parameters distribution. The above two results are employed to build the uniform convergence for generalization bound via Rademacher complexity. Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime and contribute to advancing the understanding of the fundamental properties of deep neural networks.

POSTER-744: On the Analysis of GAN-based Image-to-Image Translation with Gaussian Noise Injection

Primary Area: learning theory

Keywords: Image to Image translation Noise robustness f-divergence Score matching GAN

Scores: [ 6 6 3 ]

Image-to-image (I2I) translation is vital in computer vision tasks like style transfer and domain adaptation. While recent advances in GAN have enabled high-quality sample generation, real-world challenges such as noise and distortion remain significant obstacles. Although Gaussian noise injection during training has been utilized, its theoretical underpinnings have been unclear. This work provides a robust theoretical framework elucidating the role of Gaussian noise injection in I2I translation models. We address critical questions on the influence of noise variance on distribution divergence, resilience to unseen noise types, and optimal noise intensity selection. Our contributions include connecting f-divergence and score matching, unveiling insights into the impact of Gaussian noise on aligning probability distributions, and demonstrating generalized robustness implications. We also explore choosing an optimal training noise level for consistent performance in noisy environments. Extensive experiments validate our theoretical findings, showing substantial improvements over various I2I baseline models in noisy settings. Our research rigorously grounds Gaussian noise injection for I2I translation, offering a sophisticated theoretical understanding beyond heuristic applications.

POSTER-745: The Human-AI Substitution game: active learning from a strategic labeler

Primary Area: learning theory

Keywords: active learning strategic learning

Scores: [ 6 6 6 6 ]

The standard active learning setting assumes a willing labeler, who provides labels on informative examples to speed up learning. However, if the labeler wishes to be compensated for as many labels as possible before learning finishes, the labeler may benefit from actually slowing down learning. This incentive arises for instance if the labeler is to be replaced by the ML model, once it is learned. In this paper, we initiate the study of learning from a strategic labeler, who selectively abstains from labeling to slow down learning. We first prove that strategic abstention can prolong learning, and propose novel complexity measures to analyze the query cost of the learning game. Next, we develop a near-optimal deterministic algorithm, prove its robustness to strategic labeling, and contrast it with other active learning algorithms. We also provide extensions that encompass other learning setups/goals. Finally, we characterize the query cost of multi-task active learning, with and without abstention. Our first exploration of strategic labeling aims to add to our theoretical understanding of the imitative nature of ML in human-AI interaction.

SPOTLIGHT-140: On the Provable Advantage of Unsupervised Pretraining

Primary Area: learning theory

Keywords: unsupervised pretraining; representation learning; sample complexity

Scores: [ 6 8 8 6 ]

Unsupervised pretraining, which learns a useful representation using a large amount of unlabeled data to facilitate the learning of downstream tasks, is a critical component of modern large-scale machine learning systems. Despite its tremendous empirical success, the rigorous theoretical understanding of why unsupervised pretraining generally helps remains rather limited---most existing results are restricted to particular methods or approaches for unsupervised pretraining with specialized structural assumptions. This paper studies a generic framework,where the unsupervised representation learning task is specified by an abstract class of latent variable models Φ and the downstream task is specified by a class of prediction functions Ψ. We consider a natural approach of using Maximum Likelihood Estimation (MLE) for unsupervised pretraining and Empirical Risk Minimization (ERM) for learning downstream tasks. We prove that, under a mild informative'' condition, our algorithm achieves an excess risk of tildemathcalO(C_Φ/m+C_Ψ/n) for downstream tasks, where C_Φ,C_Ψ are complexity measures of function classes Φ,Ψ, and m,n are the number of unlabeled and labeled data respectively. Comparing to the baseline of O~(C_ΦΨ/n) achieved by performing supervised learning using only the labeled data, our result rigorously shows the benefit of unsupervised pretraining when mn and C_ΦΨ>C_Ψ. This paper further shows that our generic framework covers a wide range of approaches for unsupervised pretraining, including factor models, Gaussian mixture models, and contrastive learning.

POSTER-746: Random Sparse Lifts: Construction, Analysis and Convergence of finite sparse networks

Primary Area: learning theory

Keywords: Neural networks theory of deep learning convergence guarantees random graphs algebraic topology

Scores: [ 8 5 5 8 ]

We present a framework to define a large class of neural networks for which, by construction, training by gradient flow provably reaches arbitrarily low loss when the number of parameters grows. Distinct from the fixed-space global optimality of non-convex optimization, this new form of convergence, and the techniques introduced to prove such convergence, pave the way for a usable deep learning convergence theory in the near future, without overparameterization assumptions relating the number of parameters and training samples. We define these architectures from a simple computation graph and a mechanism to lift it, thus increasing the number of parameters, generalizing the idea of increasing the widths of multi-layer perceptrons. We show that architectures similar to most common deep learning models are present in this class, obtained by sparsifying the weight tensors of usual architectures at initialization. Leveraging tools of algebraic topology and random graph theory, we use the computation graph’s geometry to propagate properties guaranteeing convergence to any precision for these large sparse models.

SPOTLIGHT-141: Linear Convergence Bounds for Diffusion Models via Stochastic Localization

Primary Area: learning theory

Keywords: diffusion models score-based generative models convergence bounds stochastic localization

Scores: [ 8 8 8 6 6 8 ]

Denoising diffusion models are a powerful method for generating approximate samples from high-dimensional data distributions. Recent results have provided polynomial bounds on their convergence rate, assuming L2-accurate scores. Until now, the tightest bounds were either superlinear in the data dimension or required strong smoothness assumptions. We provide the first convergence bounds which are linear in the data dimension (up to logarithmic factors) assuming only finite second moments of the data distribution. We show that diffusion models require at most O~(dlog2(1/δ)ε2) steps to approximate an arbitrary distribution on Rd corrupted with Gaussian noise of variance δ to within ε2 in KL divergence. Our proof extends the Girsanov-based methods of previous works. We introduce a refined treatment of the error from discretizing the reverse SDE inspired by stochastic localization theory.

POSTER-747: Learning Thresholds with Latent Values and Censored Feedback

Primary Area: learning theory

Keywords: Threshold Latent Value Censored Feedback Query Complexity

Scores: [ 6 5 5 8 ]

In this paper, we investigate a problem of actively learning threshold in latent space, where the unknown reward g(γ,v) depends on the proposed threshold γ and latent value v and it can be only achieved if the threshold is lower than or equal to the unknown latent value. This problem has broad applications in practical scenarios, e.g., reserve price optimization in online auctions, online task assignments in crowdsourcing, setting recruiting bars in hiring, etc. We first characterize the query complexity of learning a threshold with the expected reward at most ϵ smaller than the optimum and prove that the number of queries needed can be infinitely large even when g(γ,v) is monotone with respect to both γ and v. On the positive side, we provide a tight query complexity Θ~(1/ϵ3) when g is monotone and the CDF of value distribution is Lipschitz. Moreover, we show a tight Θ~(1/ϵ3) query complexity can be achieved as long as g satisfies one-sided Lipschitzness, which provides a complete characterization for this problem. Finally, we extend this model to an online learning setting and demonstrate a tight Θ(T2/3) regret bound using continuous-arm bandit techniques and the aforementioned query complexity results.

POSTER-748: Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification

Primary Area: learning theory

Keywords: inverse reinforcement learning reward learing misspecification

Scores: [ 6 6 6 8 ]

Inverse reinforcement learning (IRL) aims to infer an agent's preferences (represented as a reward function R) from their behaviour (represented as a policy π). To do this, we need a behavioural model of how π relates to R. In the current literature, the most common behavioural models are optimality, Boltzmann-rationality, and causal entropy maximisation. However, the true relationship between a human's preferences and their behaviour is much more complex than any of these behavioural models. This means that the behavioural models are misspecified, which raises the concern that they may lead to systematic errors if applied to real data. In this paper, we analyse how sensitive the IRL problem is to misspecification of the behavioural model. Specifically, we provide necessary and sufficient conditions that completely characterise how the observed data may differ from the assumed behavioural model without incurring an error above a given threshold. In addition to this, we also characterise the conditions under which a behavioural model is robust to small perturbations of the observed policy, and we analyse how robust many behavioural models are to misspecification of their parameter values (such as e.g. the discount rate). Our analysis suggests that the IRL problem is highly sensitive to misspecification, in the sense that very mild misspecification can lead to very large errors in the inferred reward function.

POSTER-749: Sampling Multimodal Distributions with the Vanilla Score: Benefits of Data-Based Initialization

Primary Area: learning theory

Keywords: sampling score matching contrastive divergence langevin dynamics

Scores: [ 8 3 8 6 ]

POSTER-750: Sample-Efficient Multi-Agent RL: An Optimization Perspective

Primary Area: learning theory

Keywords: multi-agent reinforcement learning theory general function approximation general-sum Markov Games

Scores: [ 6 6 6 6 6 ]

We study multi-agent reinforcement learning (MARL) for the general-sum Markov Games (MGs) under general function approximation. In order to find the minimum assumption for sample-efficient learning, we introduce a novel complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for general-sum MGs. Using this measure, we propose the first unified algorithmic framework that ensures sample efficiency in learning Nash Equilibrium, Coarse Correlated Equilibrium, and Correlated Equilibrium for both model-based and model-free MARL problems with low MADC. We also show that our algorithm provides comparable sublinear regret to the existing works. Moreover, our algorithm combines an equilibrium-solving oracle with a single objective optimization subprocedure that solves for the regularized payoff of each deterministic joint policy, which avoids solving constrained optimization problems within data-dependent constraints (Jin et al. 2020; Wang et al. 2023) or executing sampling procedures with complex multi-objective optimization problems (Foster et al. 2023), thus being more amenable to empirical implementation.

POSTER-751: Flat Minima in Linear Estimation and an Extended Gauss Markov Theorem

Primary Area: learning theory

Keywords: regularization regression optimization

Scores: [ 6 5 6 ]

We consider the problem of linear estimation, and establish an extension of the Gauss-Markov theorem, in which the bias operator is allowed to be non-zero but bounded with respect to a matrix norm of Schatten type. We derive simple and explicit formulas for the optimal estimator in the cases of Nuclear and Spectral norms (with the Frobenius case recovering ridge regression). Additionally, we analytically derive the generalization error in multiple random matrix ensembles, and compare with Ridge regression. Finally, we conduct an extensive simulation study, in which we show that the cross-validated Nuclear and Spectral regressors can outperform Ridge in several circumstances.

POSTER-752: Analysis of Learning a Flow-based Generative Model from Limited Sample Complexity

Primary Area: learning theory

Keywords: statistical physics flow-based generative model stochastic interpolation gaussian mixture auto-encoder

Scores: [ 8 8 3 ]

We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder, to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem. First, we provide a tight closed-form characterization of the learnt velocity field, when parametrized by a shallow denoising auto-encoder trained on a finite number n of samples from the target distribution. Building on this analysis, we provide a sharp description of the corresponding generative flow, which pushes the base Gaussian density forward to an approximation of the target density. In particular, we provide closed-form formulae for the distance between the means of the generated mixture and the mean of the target mixture, which we show decays as Θn(1n). Finally, this rate is shown to be in fact Bayes-optimal.

POSTER-753: Tight Rates in Supervised Outlier Transfer Learning

Primary Area: learning theory

Keywords: Minimax rate outlier detection transfer learning Neyman-Pearson unbalanced classification

Scores: [ 5 6 8 ]

A critical barrier to learning an accurate decision rule for outlier detection is the scarcity of outlier data. As such, practitioners often turn to the use of similar but imperfect outlier data from which they might \emph{transfer} information to the target outlier detection task. Despite the recent empirical success of transfer learning in outlier detection, a fundamental understanding of when and how knowledge can be transferred from a source to a target in outlier detection remains elusive. In this work, we adopt the traditional framework of Neyman-Pearson classification---which formalizes \emph{supervised outlier detection}, i.e., unbalanced classification---with the added assumption that we have access to both source and (some or no) target outlier data. Our main results are then as follows:We first determine the information-theoretic limits of the problem under a measure of discrepancy that extends some existing notions from traditional balanced classification; interestingly, unlike in balanced classification, seemingly very dissimilar sources can provide much information about a target, thus resulting in fast transfer.We then show that, in principle, these information-theoretic limits are achievable by \emph{adaptive} procedures, i.e., procedures with no a priori information on the discrepancy between source and target distributions.

SPOTLIGHT-142: Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation

Primary Area: learning theory

Keywords: bandits mechanism design incentive-aware learning nash equilibrium

Scores: [ 6 8 6 8 ]

We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit. This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards. Like in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-rate of each arm is chosen strategically by the arm (e.g., a host on Airbnb) in order to maximize the number of times it gets clicked. The algorithm designer does not know the post-click rewards nor the arms' actions (i.e., strategically chosen click-rates) in advance, and must learn both values over time. To solve this problem, we design an incentive-aware learning algorithm, UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm behavior under uncertainty; (b) minimizing regret by learning unknown parameters. We characterize all approximate Nash equilibria among arms under UCB-S and show a O~(KT) regret bound uniformly in every equilibrium. We also show that incentive-unaware algorithms generally fail to achieve low regret in the strategic click-bandit. Finally, we support our theoretical results by simulations of strategic arm behavior which confirm the effectiveness and robustness of our proposed incentive design.

POSTER-754: The Expressive Power of Low-Rank Adaptation

Primary Area: learning theory

Keywords: LoRA expressive power parameter-efficient fine-tuning adaptation neural networks transformer

Scores: [ 6 8 6 6 ]

Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models.Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model f to accurately represent any smaller target model f¯ if LoRA-rank (width of f)×depth of f¯depth of f, under a mild assumption. We also quantify the approximation error when the LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-(embedding size2) LoRA adapters.All our theoretical insights are validated by numerical experiments.

POSTER-755: Improving equilibrium propagation without weight symmetry through Jacobian homeostasis

Primary Area: learning theory

Keywords: Equilibrium propagation local learning weight transport jacobian regularization

Scores: [ 5 6 8 6 ]

Equilibrium propagation (EP) is a compelling alternative to the backpropagation of error algorithm (BP) for computing gradients of neural networks on biological or analog neuromorphic substrates. Still, the algorithm requires weight symmetry and infinitesimal equilibrium perturbations, i.e., nudges, to yield unbiased gradient estimates.Both requirements are challenging to implement in physical systems.Yet, whether and how weight asymmetry contributes to bias is unknown because, in practice, its contribution may be masked by a finite nudge. To address this question, we study generalized EP, which can be formulated without weight symmetry, and analytically isolate the two sources of bias.For complex-differentiable non-symmetric networks, we show that bias due to finite nudge can be avoided by estimating exact derivatives via a Cauchy integral.In contrast, weight asymmetry induces residual bias through poor alignment of EP's neuronal error vectors compared to BP resulting in low task performance.To mitigate the latter issue, we present a new homeostatic objective that directly penalizes functional asymmetries of the Jacobian at the network's fixed point. This homeostatic objective dramatically improves the network's ability to solve complex tasks such as ImageNet 32$\times$32. Our results lay the theoretical groundwork for studying and mitigating the adverse effects of imperfections of physical networks on learning algorithms that rely on the substrate's relaxation dynamics.

POSTER-756: Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

Primary Area: learning theory

Keywords: Mixture of Experts Maximum Likelihood Estimation

Scores: [ 6 6 6 ]

Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts k is known, we demonstrate that the convergence rates of density and parameter estimations are both parametric on the sample size. However, when k becomes unknown and the true model is over-specified by a Gaussian mixture of k experts where k>k, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slow due to an intrinsic interaction between the softmax gating and expert functions.

SPOTLIGHT-143: Feature emergence via margin maximization: case studies in algebraic tasks

Primary Area: learning theory

Keywords: inductive bias margin maximization feature learning mechanistic interpretability

Scores: [ 6 8 8 6 ]

Understanding the internal representations learned by neural networks is a cornerstone challenge in the science of machine learning. While there have been significant recent strides in some cases towards understanding how neural networks implement specific target functions, this paper explores a complementary question -- why do networks arrive at particular computational strategies? Our inquiry focuses on the algebraic learning tasks of modular addition, sparse parities, and finite group operations. Our primary theoretical findings analytically characterize the features learned by stylized neural networks for these algebraic tasks. Notably, our main technique demonstrates how the principle of margin maximization alone can be used to fully specify the features learned by the network. Specifically, we prove that the trained networks utilize Fourier features to perform modular addition and employ features corresponding to irreducible group-theoretic representations to perform compositions in general groups, aligning closely with the empirical observations of Nanda et al. (2023) and Chughtai et al. (2023). More generally, we hope our techniques can help to foster a deeper understanding of why neural networks adopt specific computational strategies.

SPOTLIGHT-144: Asymptotically Free Sketched Ridge Ensembles: Risks, Cross-Validation, and Tuning

Primary Area: learning theory

Keywords: asymptotic freeness sketching ensembles ridge regression generalized cross-validation tuning

Scores: [ 6 8 8 8 ]

We employ random matrix theory to establish consistency of generalized cross validation (GCV) for estimating prediction risks of sketched ridge regression ensembles, enabling efficient and consistent tuning of regularization and sketching parameters. Our results hold for a broad class of asymptotically free sketches under very mild data assumptions. For squared prediction risk, we provide a decomposition into an unsketched equivalent implicit ridge bias and a sketching-based variance, and prove that the risk can be globally optimized by only tuning sketch size in infinite ensembles. For general subquadratic prediction risk functionals, we extend GCV to construct consistent risk estimators, and thereby obtain distributional convergence of the GCV-corrected predictions in Wasserstein-2 metric. This in particular allows construction of prediction intervals with asymptotically correct coverage conditional on the training data. We also propose an "ensemble trick" whereby the risk for unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles. We empirically validate our theoretical results using both synthetic and real large-scale datasets with practical sketches including CountSketch and subsampled randomized discrete cosine transforms.

POSTER-757: Information Bottleneck Analysis of Deep Neural Networks via Lossy Compression

Primary Area: learning theory

Keywords: deep learning information bottleneck principle stochastic neural networks lossy compression

Scores: [ 6 5 6 6 ]

The Information Bottleneck (IB) principle offers an information-theoretic framework for analyzing the training process of deep neural networks (DNNs). Its essence lies in tracking the dynamics of two mutual information (MI) values: between the hidden layer output and the DNN input/target. According to the hypothesis put forth by Shwartz-Ziv & Tishby (2017), the training process consists of two distinct phases: fitting and compression. The latter phase is believed to account for the good generalization performance exhibited by DNNs. Due to the challenging nature of estimating MI between high-dimensional random vectors, this hypothesis has only been verified for NNs of tiny sizes or specific types, such as quantized NNs. In this paper, we introduce a framework for conducting IB analysis of general NNs. Our approach leverages the stochastic NN method proposed by Goldfeld et al. (2019) and incorporates a compression step to overcome the obstacles associated with high dimensionality. In other words, we estimate the MI between the compressed representations of high-dimensional random vectors. The proposed method is supported by both theoretical and practical justifications. Notably, we demonstrate the accuracy of our estimator through synthetic experiments featuring predefined MI values and comparison with MINE (Belghazi et al., 2018). Finally, we perform IB analysis on a close-to-real-scale convolutional DNN, which reveals new features of the MI dynamics.

SPOTLIGHT-145: A path-norm toolkit for modern networks: consequences, promises and challenges

Primary Area: learning theory

Keywords: ReLU neural networks path-norm generalization contraction lemma peeling

Scores: [ 6 8 8 ]

This work introduces the first toolkit around path-norms that is fully able to encompass general DAG ReLU networks with biases, skip connections and max pooling.This toolkit notably allows us to establish generalization bounds for real modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on feedforward networks compared to the product of operators' norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.

POSTER-758: Feature Collapse

Primary Area: learning theory

Keywords: deep learning theory representation learning neural collapse

Scores: [ 6 8 6 5 ]

We formalize and study a phenomenon called feature collapse that makes precise the intuitive idea that entities playing a similar role in a learning task receive similar representations. As feature collapse requires a notion of task, we leverage a synthetic task in which a learner must classify `sentences' constituted of L tokens. We start by showing experimentally that feature collapse goes hand in hand with generalization. We then prove that, in the large sample limit, distinct tokens that play identical roles in the task receive identical local feature representations in the first layer of the network. This analysis shows that a neural network trained on this task provably learns interpretable and meaningful representations in its first layer.

SPOTLIGHT-146: Convergence of Bayesian Bilevel Optimization

Primary Area: learning theory

Keywords: Hyperparameter optimization Bayesian optimization Convergence rate Bilevel optimization Learning theory

Scores: [ 8 6 6 6 ]

This paper presents the first theoretical guarantee for Bayesian bilevel optimization (BBO) that we term for the prevalent bilevel framework combining Bayesian optimization at the outer level to tune hyperparameters, including the inner-level stochastic gradient descent (SGD). We prove sublinear regret bounds suggesting simultaneous convergence of the inner-level model parameters and outer-level hyperparameters to optimal configurations for generalization capability. A pivotal, technical novelty in the proofs is modeling the excess risk of the SGD-trained parameters as evaluation noise during Bayesian optimization. Our theory implies the inner unit horizon, defined as the number of SGD iterations, shapes the convergence behavior of BBO. This suggests practical guidance on configuring the inner unit horizon to enhance training efficiency and model performance.

POSTER-759: Maximum Likelihood Estimation is All You Need for Well-Specified Covariate Shift

Primary Area: learning theory

Keywords: Covariate shift; Maximum Likelihood Estimation; Out-of-Distribution generalization;

Scores: [ 6 8 8 3 ]

A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization --- generalizing to target data whose distribution differs from those of source data. Despite its significant importance, the fundamental question of what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift.This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. This result holds for a very large class of parametric models, including but not limited to linear regression, logistic regression, and phase retrieval, and does not require any boundedness condition on the density ratio. This paper further complement the study by proving that for the misspecified setting, MLE can perform poorly, and the Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in specific scenarios, outperforming MLE.

POSTER-760: Combinatorial Bandits for Maximum Value Reward Function under Value-Index Feedback

Primary Area: learning theory

Keywords: Combinatorial multi-armed bandit k-MAX bandit value-index feedback maximum reward function

Scores: [ 6 6 6 6 ]

We consider a combinatorial multi-armed bandit problem for maximum value reward function under maximum value and index feedback. This is a new feedback structure that lies in between commonly studied semi-bandit and full-bandit feedback structures. We propose an algorithm and provide a regret bound for problem instances with stochastic arm outcomes according to arbitrary distributions with finite supports. The regret analysis rests on considering an extended set of arms, associated with values and probabilities of arm outcomes, and applying a smoothness condition. Our algorithm achieves a O((k/Δ)log(T)) distribution-dependent and a O~(T) distribution-independent regret where k is the number of arms selected in each round, Δ is a distribution-dependent reward gap and T is the horizon time. Perhaps surprisingly, the regret bound is comparable to previously-known bound under more informative semi-bandit feedback. We demonstrate the effectiveness of our algorithm through experimental results.

POSTER-761: Symmetric Single Index Learning

Primary Area: learning theory

Keywords: single-index symmetric permutation invariant gradient flow learning

Scores: [ 6 8 5 6 ]

POSTER-762: Minimum width for universal approximation using ReLU networks on compact domain

Primary Area: learning theory

Keywords: universal approximation neural networks

Scores: [ 8 8 5 ]

It has been shown that deep neural networks of a large enough width are universal approximators but they are not if the width is too small.There were several attempts to characterize the minimum width wmin enabling the universal approximation property; however, only a few of them found the exact values.In this work, we show that the minimum width for universal approximation of Lp functions from [0,1]dx to Rdy is exactly maxdx,dy,2 if an activation function is ReLU-Like (e.g., ReLU, GELU, Softplus).Compared to the known result for ReLU networks, wmin=maxdx+1,dy when the domain is Rdx, our result first shows that approximation on a compact domain requires smaller width than on Rdx.We next prove a lower bound on wmin for uniform approximation using general activation functions including ReLU: wmindy+1 if dx<dy2dx. Together with our first result, this shows a dichotomy between Lp and uniform approximations for general activation functions and input/output dimensions.

POSTER-763: Enhancing Neural Training via a Correlated Dynamics Model

Primary Area: learning theory

Keywords: dynamics modeling training analysis low dimensional model for training

Scores: [ 10 5 3 8 ]

As neural networks grow in scale, their training becomes both computationally demanding and rich in dynamics. Amidst the flourishing interest in these training dynamics, we present a novel observation: Parameters during training exhibit intrinsic correlations over time. Capitalizing on this, we introduce \emph{correlation mode decomposition} (CMD). This algorithm clusters the parameter space into groups, termed modes, that display synchronized behavior across epochs. This enables CMD to efficiently represent the training dynamics of complex networks, like ResNets and Transformers, using only a few modes. Moreover, test set generalization is enhanced. We introduce an efficient CMD variant, designed to run concurrently with training. Our experiments indicate that CMD surpasses the state-of-the-art method for compactly modeled dynamics on image classification. Our modeling can improve training efficiency and lower communication overhead, as shown by our preliminary experiments in the context of federated learning.

POSTER-764: Towards the Fundamental Limits of Knowledge Transfer over Finite Domains

Primary Area: learning theory

Keywords: knowledge transfer classification minimax optimality density estimation knowledge distillation

Scores: [ 6 6 8 6 ]

We characterize the statistical efficiency of knowledge transfer through n samples from a teacher to a probabilistic student classifier with input space S over labels A. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate |S||A|/n. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to |S||A|/n. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on A given every sampled input, thereby provably enables the student to enjoy a rate |S|/n free of |A|. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.

POSTER-765: An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression

Primary Area: learning theory

Keywords: kernel ridge regression cost of overfitting benign overfitting tempered overfitting

Scores: [ 8 6 6 6 ]

We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an agnostic'' view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or the target is outside the RKHS. We analyze the cost of overfitting under a Gaussian universality ansatz using recently derived (non-rigorous) risk estimates in terms of the task eigenstructure. Our analysis provides a more refined characterization of benign, tempered and catastrophic overfitting (cf. Mallinar et al. 2022).

POSTER-766: Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

Primary Area: learning theory

Keywords: grokking implicit bias margin kernel training dynamics generalization

Scores: [ 6 6 6 ]

Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped at a solution corresponding to a kernel predictor for a long time, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy. Even in the absence of weight decay, we show that grokking can still happen when the late phase implicit bias is driven by other regularization mechanisms, such as implicit margin maximization or sharpness reduction.

POSTER-767: Theoretical Understanding of Learning from Adversarial Perturbations

Primary Area: learning theory

Keywords: Adversarial Perturbations Adversarial Examples Adversarial Attacks Non-Robust Features

Scores: [ 6 5 6 5 ]

It is not fully understood why adversarial examples can deceive neural networks and transfer between different networks. To elucidate this, several studies hypothesized that adversarial perturbations contain data features that are imperceptible to humans but still recognizable by neural networks. Empirical evidence has shown that neural networks trained on mislabeled samples with these perturbations can generalize to natural test data. However, a theoretical understanding of this counterintuitive phenomenon is limited. In this study, assuming orthogonal training samples, we first prove that one-hidden-layer neural networks can learn natural data structures from adversarial perturbations. Our results highlight that various adversarial perturbations, even perturbations of a few pixels, encapsulate abundant data information and facilitate network generalization. Moreover, we reveal that, under mild conditions, the decision boundary from learning perturbations aligns with that from natural data, except for specific points in the input space.

POSTER-768: A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data

Primary Area: learning theory

Keywords: Wasserstein Autoencoders Statistical Analysis Error rates Intrinsic Dimension

Scores: [ 6 6 6 6 ]

Variational Autoencoders (VAEs) have gained significant popularity among researchers as a powerful tool for understanding unknown distributions based on limited samples. This popularity stems partly from their impressive performance and partly from their ability to provide meaningful feature representations in the latent space. Wasserstein Autoencoders (WAEs), a variant of VAEs, aim to not only improve model efficiency but also interpretability. However, there has been limited focus on analyzing their statistical guarantees. The matter is further complicated by the fact that the data distributions to which WAEs are applied - such as natural images - are often presumed to possess an underlying low-dimensional structure within a high-dimensional feature space, which current theory does not adequately account for, rendering known bounds inefficient. To bridge the gap between the theory and practice of WAEs, in this paper, we show that WAEs can learn the data distributions when the network architectures are properly chosen. We show that the convergence rates of the expected excess risk in the number of samples for WAEs are independent of the high feature dimension, instead relying only on the intrinsic dimension of the data distribution.

SPOTLIGHT-147: Scaling Laws for Associative Memories

Primary Area: learning theory

Keywords: scaling law associative memory mechanistic interpretability Hopfield network

Scores: [ 8 8 6 8 8 ]

Learning arguably involves the discovery and memorization of abstract rules.The aim of this paper is to study associative memory mechanisms.Our model is based on high-dimensional matrices consisting of outer products of embeddings, which relates to the inner layers of transformer language models.We derive precise scaling laws with respect to sample size and parameter size, and discuss the statistical efficiency of different estimators, including optimization-based algorithms.We provide extensive numerical experiments to validate and interpret theoretical results, including fine-grained visualizations of the stored memory associations.

SPOTLIGHT-148: Sample-Efficient Linear Representation Learning from Non-IID Non-Isotropic Data

Primary Area: learning theory

Keywords: Representation learning meta learning multi-task learning

Scores: [ 8 8 6 8 ]

SPOTLIGHT-149: Harnessing Density Ratios for Online Reinforcement Learning

Primary Area: learning theory

Keywords: reinforcement learning theory online RL offline RL hybrid RL density ratio marginalized importance weight weight function general function approximation

Scores: [ 8 6 6 ]

The theories of offline and online reinforcement learning, despite having evolved in parallel, have recently started to see unification, and algorithms/concepts in one setting often have natural counterparts in the other. However, the notion of density ratio modeling, an emerging topic in offline RL, has been largely absent from online RL, perhaps for good reason: the very existence and boundedness of density ratios relies on a dataset with good coverage, but the core challenge in online RL is to collect such an exploratory dataset without having one to start.In this work we show—perhaps surprisingly—that density ratio-based algorithms have online counterparts. Assuming the mere existence of an exploratory distribution with good coverage, a structural condition known as coverability (Xie et al., 2023), we give an algorithm (GLOW) which performs sample-efficient online exploration under value-function and density-ratio realizability. GLOW addressesunbounded density ratios via careful use of truncation, and combines this with optimism to guide exploration. GLOW is computationally inefficient; we complement it with a more efficient counterpart, HYGLOW, for the Hybrid RL setting (Song et al., 2023) in which online RL is augmented with additional offline data. HYGLOW is derived as a special case of a novel meta-algorithm, H2O, which provides a provable black-box reduction from hybrid RL to offline RL.

POSTER-769: Horizon-Free Regret for Linear Markov Decision Processes

Primary Area: learning theory

Keywords: Linear MDP; horizon-free regret analysis

Scores: [ 6 5 6 8 ]

POSTER-770: When can transformers reason with abstract symbols?

Primary Area: learning theory

Keywords: transformers language models reasoning theoretical analysis variable binding

Scores: [ 8 6 8 8 8 ]

We investigate the capability of Transformer large language models (LLMs) to generalize on unseen symbols when trained on tasks that rely on abstract symbols (e.g., variables in programming and mathematics). Such a 'variable-binding' capability has long been studied in the neuroscience literature as one of the most basic 'reasoning' capabilities. For (i) binary classification tasks, we prove that Transformers can generalize to unseen symbols but require astonishingly large training data. For (ii) tasks with labels dependent on input symbols, we show an ''inverse scaling law'': Transformers fail to generalize to unseen symbols as their embedding dimension increases. For both cases (i) and (ii), we propose a Transformer modification, adding two trainable parameters per head that can reduce the amount of data needed.

POSTER-771: Algorithms for Caching and MTS with reduced number of predictions

Primary Area: learning theory

Keywords: ML-Augmented Algorithms Caching Metrical Task Systems

Scores: [ 8 8 8 8 ]

ML-augmented algorithms utilize predictions to achieve performance beyond their worst-case bounds. Producing these predictions might be a costly operation – this motivated Im et al. [2022] to introduce the study of algorithms which use predictions parsimoniously. We design parsimonious algorithms for caching and MTS with action predictions, proposed by Antoniadis et al. [2023], focusing on the parameters of consistency (performance with perfect predictions) and smoothness (dependence of their performance on prediction error). Our algorithm for caching is 1-consistent, robust, and its smoothness deteriorates with decreasing number of available predictions. We propose an algorithm for general MTS whose consistency and smoothness both scale linearly with the decreasing number of predictions. Without restriction on the number of available predictions, both algorithms match the earlier guarantees achieved by Antoniadis et al. [2023].

POSTER-772: Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization

Primary Area: learning theory

Keywords: Neural Networks Small Initialization Gradient Flow Neuron Alignment

Scores: [ 5 8 5 8 ]

This paper studies the problem of training a two-layer ReLU network for binary classification using gradient flow with small initialization. We consider a training dataset with well-separated input vectors: Any pair of input data with the same label are positively correlated, and any pair with different labels are negatively correlated. Our analysis shows that, during the early phase of training, neurons in the first layer try to align with either the positive data or the negative data, depending on its corresponding weight on the second layer. A careful analysis of the neurons' directional dynamics allows us to provide an O(lognμ) upper bound on the time it takes for all neurons to achieve good alignment with the input data, where n is the number of data points and μ measures how well the data are separated. After the early alignment phase, the loss converges to zero at a O(1t) rate, and the weight matrix on the first layer is approximately low-rank. Numerical experiments on the MNIST dataset illustrate our theoretical findings.

POSTER-773: Learning Hierarchical Polynomials with Three-Layer Neural Networks

Primary Area: learning theory

Keywords: Hierarchical polynomials feature learning three-layer networks sample complexity gradient descent

Scores: [ 5 8 5 5 ]

We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form h=gp where p:RdR is a degree k polynomial and g:RR is a degree q polynomial. This function class generalizes the single-index model, which corresponds to k=1, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree k polynomials p, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target h up to vanishing test error in O~(dk) samples and polynomial time. This is a strict improvement over kernel methods, which require Θ~(dkq) samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of p being a quadratic. When p is indeed a quadratic, we achieve the information-theoretically optimal sample complexity O~(d2), which is an improvement over prior work (Nichani et al., 2023) requiring a sample size of Θ~(d4). Our proof proceeds by showing that during the first stage of training the network performs feature learning to recover the feature p with O~(dk) samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result learn a broad class of hierarchical functions.

SPOTLIGHT-150: Implicit bias of SGD in L2-regularized linear DNNs: One-way jumps from high to low rank

Primary Area: learning theory

Keywords: implicit bias SGD low-rank linear networks

Scores: [ 8 8 5 10 ]

The L2-regularized loss of Deep Linear Networks (DLNs) withmore than one hidden layers has multiple local minima, correspondingto matrices with different ranks. In tasks such as matrix completion,the goal is to converge to the local minimum with the smallest rankthat still fits the training data. While rank-underestimating minimacan be avoided since they do not fit the data, GD might getstuck at rank-overestimating minima. We show that with SGD, there is always a probability to jumpfrom a higher rank minimum to a lower rank one, but the probabilityof jumping back is zero. More precisely, we define a sequence of sets$B_{1}\subset B_{2}\subset\cdots\subset B_{R}$ so that $B_{r}$contains all minima of rank r or less (and not more) that are absorbingfor small enough ridge parameters λ and learning rates η:SGD has prob. 0 of leaving Br, and from any starting point thereis a non-zero prob. for SGD to go in Br.

POSTER-774: Anisotropy helps: improved statistical and computational complexity of the mean-field Langevin dynamics under structured data

Primary Area: learning theory

Keywords: neural network optimization feature learning mean-field Langevin dynamics

Scores: [ 6 6 8 ]

Recent works have shown that neural networks optimized by gradient-based methods can adapt to sparse or low-dimensional target functions through feature learning; an often studied target is the sparse parity function defined on the unit hypercube. However, such isotropic data setting does not capture the anisotropy and low intrinsic dimensionality exhibited in realistic datasets. In this work, we address this shortcoming by studying how gradient-based feature learning interacts with structured (anisotropic) input data: we consider the sparse parity problem on high-dimensional orthotope where the feature coordinates have varying magnitudes, and analyze the learning complexity of the mean-field Langevin dynamics (MFLD), which describes the noisy gradient descent update on two-layer neural network. We show that the statistical complexity (i.e. sample size) and computational complexity (i.e. width of the neural network) of MFLD can both be improved when prominent directions of the anisotropic input data aligns with the support of the target function. Moreover, by employing an anisotropic weight decay regularization determined by the gradient covariance, the problem can be efficiently learned by a constant-width neural network.

SPOTLIGHT-151: Spectrally Transformed Kernel Regression

Primary Area: learning theory

Keywords: Learning Theory Unlabeled Data Kernel Methods Semi-supervised Learning Representation Learning Label Propagation

Scores: [ 8 8 8 8 8 ]

Unlabeled data is a key component of modern machine learning.In general, the role of unlabeled data is to impose on the predictor a form of smoothness,usually from the similarity information encoded in a base kernel, such as the ϵ-neighbor kernel or the adjacency matrix of a graph.This work establishes a unified theory for learning with unlabeled data and a base kernel,by revisiting the classical idea of spectrally transformed kernel regression (STKR).First, by characterizing a universal type of "target smoothness", we show that any sufficiently smooth target function can be learned by STKR,so the theory in this work is valid for a broad class of methods, including various semi-supervised, self-supervised and representation learning algorithms.Second, we provide scalable STKR implementations for the inductive setting and a general transformation function, while prior work is mostly limited to the transductive setting.Third, we derive (near-)tight statistical guarantees for STKR in two scenarios: STKR with a known analytic transformation,and STKR with kernel PCA when the transformation is unknown.Overall, we believe that this work helps deepen our understanding of how to work with unlabeled data,and its generality and broad scope make it easier to inspire new methods.

POSTER-775: More is Better: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

Primary Area: learning theory

Keywords: overparameterization interpolation random feature regression kernel regression generalization overfitting

Scores: [ 6 8 6 ]

In our era of enormous neural networks, empirical progress has been driven by the philosophy that *more is better.*Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) optimizing to near-interpolation improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained.Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization and overfitting in random feature models.

POSTER-776: Fantastic Generalization Measures are Nowhere to be Found

Primary Area: learning theory

Keywords: overparametrization generalization

Scores: [ 8 6 8 6 ]

Numerous generalization bounds have been proposed in the literature as potential explanations for the ability of neural networks to generalize in the overparameterized setting. However, none of these bounds are tight. For instance, in their paper “Fantastic Generalization Measures and Where to Find Them”, Jiang et al. (2020) examine more than a dozen generalization bounds, and show empirically that none of them imply guarantees that can explain the remarkable performance of neural networks. This raises the question of whether tight generalization bounds are at all possible. We consider two types of generalization bounds common in the literature: (1) bounds that depend on the training set and the output of the learning algorithm. There are multiple bounds of this type in the literature (e.g., norm- and margin-based bounds), but we prove mathematically that no such bound can be uniformly tight in the overparameterized setting; (2) bounds that depend on the training set and on the learning algorithm (e.g., stability bounds). For these bounds, we show a trade-off between the algorithm's performance and the bound's tightness. Namely, under mild assumptions, if the algorithm achieves good accuracy in the overparameterized setting, then no generalization bound can be tight for it. We conclude that generalization bounds in the overparameterized setting cannot be tight without suitable assumptions on the population distribution.

SPOTLIGHT-152: Optimal Sample Complexity of Contrastive Learning

Primary Area: learning theory

Keywords: learning theory sample complexity vc dimension contrastive learning metric learning

Scores: [ 8 6 8 8 ]

Contrastive learning is a highly successful technique for learning representations of data from labeled tuples, specifying the distance relations within the tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, p-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning p-distances for integer p. For any p1, we show that Θ~(nd) labeled tuples are necessary and sufficient for learning d-dimensional representations of n-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief about a substantial gap between the statistical learning theory and the practice of deep learning.

POSTER-777: A Sublinear Adversarial Training Algorithm

Primary Area: learning theory

Keywords: Adversarial training

Scores: [ 3 6 6 8 ]

Adversarial training is a widely used strategy for making neural networks resistant to adversarial perturbations. For a neural network of width m, n input training data in d dimension, it takes Ω(mnd) time cost per training iteration for the forward and backward computation. In this paper we analyze the convergence guarantee of adversarial training procedure on a two-layer neural network with shifted ReLU activation, and shows that only o(m) neurons will be activated for each input data per iteration. Furthermore, we develop an algorithm for adversarial training with time cost o(mnd) per iteration by applying half-space reporting data structure.

SPOTLIGHT-153: Memorization Capacity of Multi-Head Attention in Transformers

Primary Area: learning theory

Keywords: Learning Theory Expressivity Multi-Head Attention Transformers

Scores: [ 8 8 8 6 ]

Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with H heads, dimension d, and context size n<d, featuring Θ(Hd2) parameters, can memorize Ω(Hn) examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator’s saturation property. We validate our findings through experiments on synthetic data.

SPOTLIGHT-154: High-dimensional SGD aligns with emerging outlier eigenspaces

Primary Area: learning theory

Keywords: stochastic gradient descent Hessian multi-layer neural networks high-dimensional classification Gaussian mixture model XOR problem

Scores: [ 10 8 8 6 8 6 ]

We rigorously study the joint evolution of training dynamics via stochastic gradient descent (SGD) and the spectra of empirical Hessian and gradient matrices. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either 1 or 2-layer neural networks, the SGD trajectory rapidly aligns with emerging low-rank outlier eigenspaces of the Hessian and gradient matrices. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training, and exhibiting rank deficiency when the SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies in the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks.

POSTER-778: A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning

Primary Area: learning theory

Keywords: regret analysis learning in non-stationary games bandit feedback

Scores: [ 6 6 6 6 ]

We investigate learning the equilibria in non-stationary multi-agent systems and address the challenges that differentiate multi-agent learning from single-agent learning. Specifically, we focus on games with bandit feedback, where testing an equilibrium can result in substantial regret even when the gap to be tested is small, and the existence of multiple optimal solutions (equilibria) in stationary games poses extra challenges. To overcome these obstacles, we propose a versatile black-box approach applicable to a broad spectrum of problems, such as general-sum games, potential games, and Markov games, when equipped with appropriate learning and testing oracles for stationary environments. Our algorithms can achieve O~(Δ1/4T3/4) regret when the degree of nonstationarity, as measured by total variation Δ, is known, and O~(Δ1/5T4/5) regret when Δ is unknown, where T is the number of rounds. Meanwhile, our algorithm inherits the favorable dependence on number of agents from the oracles. As a side contribution that may be independent of interest, we show how to test for various types of equilibria by a black-box reduction to single-agent learning, which includes Nash equilibria, correlated equilibria, and coarse correlated equilibria.

POSTER-779: ReTaSA: A Nonparametric Functional Estimation Approach for Addressing Continuous Target Shift

Primary Area: learning theory

Keywords: target shift regularization distributional shift integral equation importance weight

Scores: [ 8 8 6 6 ]

The presence of distribution shifts poses a significant challenge for deploying modern machine learning models in real-world applications. This work focuses on the target shift problem in a regression setting (Zhang et al., 2013; Nguyen et al., 2016). More specifically, the target variable y (also known as the response variable), which is continuous, has different marginal distributions in the training source and testing domain, while the conditional distribution of features x given y remains the same. While most literature focuses on classification tasks with finite target space, the regression problem has an infinite dimensional target space, which makes many of the existing methods inapplicable. In this work, we show that the continuous target shift problem can be addressed by estimating the importance weight function from an ill-posed integral equation. We propose a nonparametric regularized approach named ReTaSA to solve the ill-posed integral equation and provide theoretical justification for the estimated importance weight function. The effectiveness of the proposed method has been demonstrated with extensive numerical studies on synthetic and real-world datasets.

POSTER-780: The optimality of kernel classifiers in Sobolev space

Primary Area: learning theory

Keywords: kernel classification minimax optimality neural network classifiers reproducing kernel Hilbert Space

Scores: [ 5 6 5 6 ]

Kernel methods are widely used in machine learning, especially for classification problems. However, the theoretical analysis of kernel classification is still limited. This paper investigates the statistical performances of kernel classifiers. With some mild assumptions on the conditional probability η(x)=P(Y=1X=x), we derive an upper bound on the classification excess risk of a kernel classifier using recent advances in the theory of kernel regression. We also obtain a minimax lower bound for Sobolev spaces, which shows the optimality of the proposed classifier. Our theoretical results can be extended to the generalization error of overparameterized neural network classifiers. To make our theoretical results more applicable in realistic settings, we also propose a simple method to estimate the interpolation smoothness of 2η(x)1 and apply the method to real datasets.

POSTER-781: Optimal criterion for feature learning of two-layer linear neural network in high dimensional interpolation regime

Primary Area: learning theory

Keywords: two-layer linear neural network feature learning optimal criterion multi-output linear reggression high-dimensional setting

Scores: [ 5 8 6 6 ]

Deep neural networks with feature learning have shown surprising generalization performance in high dimensional settings, but it has not been fully understood how and when they enjoy the benefit of feature learning. In this paper, we theoretically analyze the statistical properties of the benefits from feature learning in a two-layer linear neural network with multiple outputs in a high-dimensional setting. For that purpose, we propose a new criterion that allows feature learning of a two-layer linear neural network in a high-dimensional setting. Interestingly, we can show that models with smaller values of the criterion generalize even in situations where normal ridge regression fails to generalize. This is because the proposed criterion contains a proper regularization for the feature mapping and acts as an upper bound on the predictive risk. As an important characterization of the criterion, the two-layer linear neural network that minimizes this criterion can achieve the optimal Bayes risk that is determined by the distribution of the true signals across the multiple outputs. To the best of our knowledge, this is the first study to specifically identify the conditions under which a model obtained by proper feature learning can outperform normal ridge regression in a high-dimensional multiple-output linear regression problem.

POSTER-782: Koopman-based generalization bound: New aspect for full-rank weights

Primary Area: learning theory

Keywords: generalization bound full-rank weight matrix Koopman operator

Scores: [ 6 8 5 ]

We propose a new bound for generalization of neural networks using Koopman operators. Whereas most of existing works focus on low-rank weight matrices, we focus on full-rank weight matrices. Our bound is tighter than existing norm-based bounds when the condition numbers of weight matrices are small. Especially, it is completely independent of the width of the network if the weight matrices are orthogonal. Our bound does not contradict to the existing bounds but is a complement to the existing bounds. As supported by several existing empirical results, low-rankness is not the only reason for generalization. Furthermore, our bound can be combined with the existing bounds to obtain a tighter bound. Our result sheds new light on understanding generalization of neural networks with full-rank weight matrices, and it provides a connection between operator-theoretic analysis and generalization of neural networks.

POSTER-783: Oracle Efficient Algorithms for Groupwise Regret

Primary Area: learning theory

Keywords: Oracle efficiency online learning groupwise regret sleeping experts

Scores: [ 6 8 8 3 ]

We study the problem of online prediction, in which at each time step t{1,2,T}, an individual xt arrives, whose label we must predict. Each individual is associated with various groups, defined based on their features such as age, sex, race etc., which may intersect. Our goal is to make predictions that have regret guarantees not just overall but also simultaneously on each sub-sequence comprised of the members of any single group. Previous work such as [Blum & Lykouris][1] and [Lee et al][2] provide attractive regret guarantees for these problems; however, these are computationally intractable on large model classes (e.g., the set of all linear models, as used in linear regression). We show that a simple modification of the sleeping experts technique of [Blum & Lykouris][1] yields an efficient reduction to the well-understood problem of obtaining diminishing external regret absent group considerations. Our approach gives similar regret guarantees compared to [Blum & Lykouris][1]; however, we run in time linear in the number of groups, and are oracle-efficient in the hypothesis class. This in particular implies that our algorithm is efficient whenever the number of groups is polynomially bounded and the external-regret problem can be solved efficiently, an improvement on [Blum & Lykouris][1]'s stronger condition that the model class must be small. Our approach can handle online linear regression and online combinatorial optimization problems like online shortest paths. Beyond providing theoretical regret bounds, we evaluate this algorithm with an extensive set of experiments on synthetic data and on two real data sets --- Medical costs and the Adult income dataset, both instantiated with intersecting groups defined in terms of race, sex, and other demographic characteristics. We find that uniformly across groups, our algorithm gives substantial error improvements compared to running a standard online linear regression algorithm with no groupwise regret guarantees.

SPOTLIGHT-155: SGD Finds then Tunes Features in Two-Layer Neural Networks with near-Optimal Sample Complexity: A Case Study in the XOR problem

Primary Area: learning theory

Keywords: optimization stochastic gradient descent two-layer neural network sample complexity

Scores: [ 8 8 8 6 ]

In this work, we consider the optimization process of minibatch stochastic gradient descent (SGD) on a 2-layer neural network with data separated by a quadratic ground truth function. We prove that with data drawn from the Boolean hypercube labeled by the quadratic XOR'' function y=xixj , it is possible to train to a population error o(1) with Θ(dpolylog(d)) samples. Our result considers simultaneously training both layers of the two-layer-neural network with ReLU activations via standard minibatch SGD on the logistic loss. To our knowledge, this work is the first to give a sample complexity of for efficiently learning the XOR function on isotropic data on a standard neural network with standard training. Our main technique is showing that the network evolves in two phases: a \em signal-finding \em phase where the network is small and many of the neurons evolve independently to find features, and a \em signal-heavy \em phase, where SGD maintains and balances the features. We leverage the simultaneous training of the layers to show that it is sufficient for only a small fraction of the neurons to learn features, since those neurons will be amplified by the simultaneous growth of their second layer weights.

POSTER-784: Rethinking Information-theoretic Generalization: Loss Entropy Induced PAC Bounds

Primary Area: learning theory

Keywords: information theory generalization bound learning theory minimum error entropy

Scores: [ 8 6 6 6 ]

Information-theoretic generalization analysis has achieved astonishing success in characterizing the generalization capabilities of noisy and iterative learning algorithms. However, current advancements are mostly restricted to average-case scenarios and necessitate the stringent bounded loss assumption, leaving a gap with regard to computationally tractable PAC generalization analysis, especially for long-tailed loss distributions. In this paper, we bridge this gap by introducing a novel class of PAC bounds through leveraging loss entropies. These bounds simplify the computation of key information metrics in previous PAC information-theoretic bounds to one-dimensional variables, thereby enhancing computational tractability. Moreover, our data-independent bounds provide novel insights into the generalization behavior of the minimum error entropy criterion, while our data-dependent bounds improve over previous results by alleviating the bounded loss assumption under both leave-one-out and supersample settings. Extensive numerical studies indicate strong correlations between the generalization error and the induced loss entropy, showing that the presented bounds adeptly capture the patterns of the true generalization gap under various learning scenarios.

SPOTLIGHT-156: Noisy Interpolation Learning with Shallow Univariate ReLU Networks

Primary Area: learning theory

Keywords: Interpolation Learning Benign Overfitting ReLU Networks

Scores: [ 8 8 8 ]

Understanding how overparameterized neural networks generalize despite perfect interpolation of noisy training data is a fundamental question. Mallinar et. al. (2022) noted that neural networks seem to often exhibit tempered overfitting'', wherein the population risk does not converge to the Bayes optimal error, but neither does it approach infinity, yielding non-trivial generalization. However, this has not been studied rigorously. We provide the first rigorous analysis of the overfiting behaviour of regression with minimum norm (2 of weights), focusing on univariate two-layer ReLU networks. We show overfitting is tempered (with high probability) when measured with respect to the L1 loss, but also show that the situation is more complex than suggested by Mallinar et. al., and overfitting is catastrophic with respect to the L2 loss, or when taking an expectation over the training set.

POSTER-785: Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding

Primary Area: learning theory

Keywords: Grokking Random Matrix Theory Linear Regression Representation Learning

Scores: [ 3 8 5 6 ]

Grokking is the intriguing phenomenon where a model learns to generalize long after it has fit the training data. We show both analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks in a simple teacher-student setup. In this setting, the full training dynamics is derived in terms of the expected training and generalization data covariance matrix. We present exact predictions on how the grokking time depends on input and output dimensionality, train sample size, regularization, and network parameters initialization. The key findings are that late generalization increase may not imply a transition from "memorization" to "understanding", but can simply be an artifact of the accuracy measure. We provide empirical verification for these propositions, along with preliminary results indicating that some predictions also hold for deeper networks, with non-linear activations.

POSTER-786: Minimax optimality of convolutional neural networks for infinite dimensional input-output problems and separation from kernel methods

Primary Area: learning theory

Keywords: neural network nonlinear operator learning convolutional network estimation error analysis

Scores: [ 8 6 8 ]

Recent deep learning applications, exemplified by text-to-image tasks, often involve high-dimensional inputs and outputs. While several studies have investigated the function estimation capabilities of deep learning, research on dilated convolutional neural networks (CNNs) has mainly focused on cases where input dimensions are infinite but output dimensions are one-dimensional, similar to many other studies. However, many practical deep learning tasks involve high-dimensional (or even infinite dimensional) inputs and outputs.In this paper, we investigate the optimality of dilated CNNs for estimating a map between infinite-dimensional input and output spaces by analyzing their approximation and estimation abilities. For that purpose, we first show that approximation and estimation errors depend only on the smoothness and decay rate with respect to the infinity norm of the output, and their estimation accuracy actually achieve the {\it minimax optimal} rate of convergence.Second, we demonstrate that the dilated CNNs outperform {\it any} linear estimators including kernel ridge regression and k-NN estimators in a minimax error sense, highlighting the usefulness of feature learning realized by deep neural networks.Our theoretical analysis particularly explains the success of deep learning in recent high-dimensional input-output tasks.

POSTER-787: Learning from Aggregate responses: Instance Level versus Bag Level Loss Functions

Primary Area: learning theory

Keywords: aggregate learning asymptotic analysis bias variance privacy regularization

Scores: [ 6 6 6 6 6 ]

Due to the rise of privacy concerns, in many practical applications the training data is aggregated before being shared with the learner, in order to protect privacy of users' sensitive responses. In an aggregate learning framework, the dataset is grouped into bags of samples, where each bag is available only with an aggregate response, providing a summary of individuals’ responses in that bag. In this paper, we study two natural loss functions for learning from aggregate responses: bag-level loss and the instance-level loss. In the former, the model is learnt by minimizing a loss between aggregate responses and aggregate model predictions, while in the latter the model aims to fit individual predictions to the aggregate responses. In this work, we show that the instance-level loss can be perceived as a regularized form of the bag-level loss. This observation lets us compare the two approaches with respect to bias and variance of the resulting estimators, and introduce a novel interpolating estimator which combines the two approaches. For linear regression tasks, we provide a precise characterization of the risk of the interpolating estimator in an asymptotic regime where the size of the training set grows in proportion to the features dimension. Our analysis allows us to theoretically understand the effect of different factors, such as bag size on the model prediction risk. In addition, we propose a mechanism for differentially private learning from aggregate responses and derive the optimal bag size in terms of prediction risk-privacy trade-off. We also carry out thorough experiments to corroborate our theory and show the efficacy of the interpolating estimator.

POSTER-788: BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction

Primary Area: learning theory

Keywords: Prompt;Pre-Trained;Few-shot;Debiased;Domain Abstraction

Scores: [ 8 6 6 5 ]

As a novel and effective fine-tuning paradigm based on large-scale pre-trained language models (PLMs), prompt-tuning aims to reduce the gap between downstream tasks and pre-training objectives. While prompt-tuning has yielded continuous advancements in various tasks, such an approach still remains a persistent defect: prompt-tuning methods fail to generalize to specific few-shot patterns. From the perspective of distribution analyses, we disclose that the intrinsic issues behind the phenomenon are the over-multitudinous conceptual knowledge contained in PLMs and the abridged knowledge for target downstream domains, which jointly result in that PLMs mis-locate the knowledge distributions corresponding to the target domains in the universal knowledge embedding space. To this end, we intuitively explore to approximate the unabridged target domains of downstream tasks in a debiased manner, and then abstract such domains to generate discriminative prompts, thereby providing the de-ambiguous guidance for PLMs. Guided by such an intuition, we propose a simple yet effective approach, namely BayesPrompt, to learn prompts that contain the domain discriminative information against the interference from domain-irrelevant knowledge. BayesPrompt primitively leverages known distributions to approximate the debiased factual distributions of target domains and further uniformly samples certain representative features from the approximated distributions to generate the ultimate prompts for PLMs. We provide solid theoretical analyses on classification error bounds to prove that BayesPrompt tightens the upper bound of the classification error on the downstream inference of PLMs. Empirically, our method achieves state-of-the-art performance on benchmarks.

POSTER-789: Plugin estimators for selective classification with out-of-distribution detection

Primary Area: learning theory

Keywords: Selective classification Learning to reject Abstention OOD detection SCOD Loss functions Plug-in estimators Statistical consistency

Scores: [ 6 8 8 ]

Real-world classifiers can benefit from the option of abstaining from predicting on samples where they have low confidence. Such abstention is particularly useful on samples which are close to the learned decision boundary, or which are outliers with respect to the training sample. These settings have been the subject of extensive but disjoint study in the selective classification (SC) and out-of-distribution (OOD) detection literature. Recent work on selective classification with OOD detection (SCOD) has argued for the unified study of these problems; however, the formal underpinnings of this problem are still nascent, and existing techniques are heuristic in nature. In this paper, we propose new plugin estimators for SCOD that are theoretically grounded, effective, and generalise existing approaches from the SC and OOD detection literature. In the course of our analysis, we formally explicate how naïve use of existing SC and OOD detection baselines may be inadequate for SCOD. We empirically demonstrate that our approaches yields competitive SC and OOD detection trade-offs compared to common baselines.

POSTER-790: Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space

Primary Area: learning theory

Keywords: neural network double descent classification interpretability

Scores: [ 5 5 3 5 ]

Double descent presents a counter-intuitive aspect within the machine learning domain, and researchers have observed its manifestation in various models and tasks. While some theoretical explanations have been proposed for this phenomenon in specific contexts, an accepted theory to account for its occurrence in deep learning remains yet to be established. In this study, we revisit the phenomenon of double descent and demonstrate that its occurrence is strongly influenced by the presence of noisy data. Through conducting a comprehensive analysis of the feature space of learned representations, we unveil that double descent arises in imperfect models trained with noisy data. We argue that double descent is a consequence of the model first learning the noisy data until interpolation and then adding implicit regularization via over-parameterization acquiring therefore capability to separate the information from the noise. We postulate that double descent should never occur in well-regularized models.

POSTER-791: Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data

Primary Area: learning theory

Keywords: benign overfitting grokking neural networks feature learning interpolation theory

Scores: [ 5 6 6 ]

Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize near-optimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfitting---achieving a perfect fit to training data with near-random performance on test data---before transitioning (''grokking'') to near-optimal generalization later in training. In this work, we show that both of these phenomena provably occur in two-layer ReLU networks trained by GD on XOR cluster data where a constant fraction of the training labels are flipped. In this setting, we show that after the first step of GD, the network achieves 100% training accuracy, perfectly fitting the noisy labels in the training data, but achieves near-random test accuracy. At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a ''grokking'' phenomenon. This provides the first theoretical result of benign overfitting in neural network classification when the data distribution is not linearly separable. Our proofs rely on analyzing the feature learning process under GD, which reveals that the network implements a non-generalizable linear classifier after one step and gradually learns generalizable features in later steps.

POSTER-792: Quantifying Interactions in Semi-supervised Multimodal Learning: Guarantees and Applications

Primary Area: learning theory

Keywords: multimodal learning multimodal interactions information theory self-supervised learning multimodal fusion

Scores: [ 8 6 6 ]

In many machine learning systems that jointly learn from multiple modalities, a core research question is to understand the nature of multimodal interactions: how modalities combine to provide new task-relevant information that was not present in either alone. We study this challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data and naturally co-occurring multimodal data (e.g., unlabeled images and captions, video and corresponding audio) but when labeling them is time-consuming. Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds to quantify the amount of multimodal interactions in this semi-supervised setting. We propose two lower bounds: one based on the shared information between modalities and the other based on disagreement between separately trained unimodal classifiers, and derive an upper bound through connections to approximate algorithms for min-entropy couplings. We validate these estimated bounds and show how they accurately track true interactions. Finally, we show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.

POSTER-793: Stochastic Modified Equations and Dynamics of Dropout Algorithm

Primary Area: learning theory

Keywords: Stochastic modified equations dropout noise structure flatness

Scores: [ 6 6 8 6 ]

Dropout is a widely utilized regularization technique in the training of neural networks, nevertheless, its underlying mechanism and impact on achieving good generalization abilities remain to be understood. In this work, we start by undertaking a rigorous theoretical derivation of the stochastic modified equations, with the primary aim of providing an effective approximation for the discrete iterative process of dropout. Subsequently, we empirically delve into the intricate mechanisms by which dropout facilitates the identification of flatter minima. This exploration is conducted through intuitive approximations, exploiting the structural analogies inherent in the Hessian of loss landscape and the covariance of dropout. Our empirical findings substantiate the ubiquitous presence of the Hessian-variance alignment relation throughout the training process of dropout. These dual facets of our research, encompassing both theoretical derivations and empirical observations, collectively constitute a substantial contribution towards a deeper understanding of the inherent tendency of dropout to locate flatter minima.

SPOTLIGHT-157: Out-Of-Domain Unlabeled Data Improves Generalization

Primary Area: learning theory

Keywords: Out-of-domain data Semi-supervised learing learning theory generalization bound adversarial robustness

Scores: [ 6 8 8 6 ]

We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, where scenarios involving the minimization of either i) adversarially robust or ii) non-robust loss functions have been considered. Notably, we allow the unlabeled samples to deviate slightly (in total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework on the classification problem of a mixture of two Gaussians in Rd, where in addition to the m independent and labeled samples from the true distribution, a set of n (usually with nm) out of domain and unlabeled samples are gievn as well. Using only the labeled data, it is known that the generalization error can be bounded by (d/m)1/2. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds which show substantial improvement on the generalization error compared ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.

SPOTLIGHT-158: Role of Locality and Weight Sharing in Image-Based Tasks: A Sample Complexity Separation between CNNs, LCNs, and FCNs

Primary Area: learning theory

Keywords: Deep Learning Theory Sample Complexity Convolutional Neural Networks

Scores: [ 8 8 6 8 ]

Vision-based tasks are known to exhibit the properties of locality and translation invariance. The superior performance of convolutional neural networks (CNNs) on these tasks is attributed to the inductive bias of locality and weight sharing baked into their architecture. Existing attempts at quantifying the statistical benefits of these biases in CNNs over local convolutional neural networks (LCNs) and fully connected neural networks (FCNs) fall into one of the following categories: either do not establish a gap between the performance of these architectures, or ignore optimization considerations, or consider stylized settings that are not reflective of image-like tasks, particularly translation invariance. We introduce the Dynamic Signal Distribution (DSD), a data model that is designed to capture properties of real-world images such as locality and translation invariance. In DSD, each image is modeled with k patches, with each patch of dimension d, and the label is determined by a d-sparse signal vector that can freely appear in any one of the k patches. Under this task, we show that CNNs trained using gradient descent require O~(k+d) samples, whereas LCNs require Ω(kd) samples for predicting the label, establishing the statistical advantages of weight sharing in translation invariant tasks. Additionally, LCNs need O~(k(k+d)) samples, compared to FCNs, which need Ω(k2d) samples, showcasing the benefits of locality in local tasks.

POSTER-794: General Stability Analysis for Zeroth-Order Optimization Algorithms

Primary Area: learning theory

Keywords: Stability Analysis; Zeroth-Order Optimization; Black-Box Learning

Scores: [ 8 8 6 6 ]

POSTER-795: Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

Primary Area: learning theory

Keywords: Accelerated stochastic gradient descent excess risk linear regression overparameterization

Scores: [ 6 6 6 ]

Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.

POSTER-796: Transformers vs. Message Passing GNNs: Distinguished in Uniform

Primary Area: learning theory

Keywords: Graph Neural Networks Message Passing Graph Transformers Virtual Nodes Expressivity Uniform Expressivity

Scores: [ 6 8 6 6 ]

Graph Transformers (GTs) such as SAN and GPS have been shown to be universal function approximators. We show that when extending MPGNNs and even 2-layer MLPs with the same positional encodings that GTs use, they also become universal function approximators on graphs. All these results hold in the non-uniform case where a different network may be used for every graph size. In order to show meaningful differences between GTs and MPGNNs we then consider the uniform setting where a single network needs to work for all graph sizes. First, we show that none of the above models is universal in that setting. Then, our main technical result is that there are functions that GTs can express while MPGNNs with virtual nodes cannot and vice versa, making their uniform expressivity provably different. We show this difference empirically on synthetic data and observe that on real-world data global information exchange through graph transformers and conceptually simpler MPGNNs with virtual nodes achieve similar performance gains over message passing on various datasets.

SPOTLIGHT-159: Implicit regularization of deep residual networks towards neural ODEs

Primary Area: learning theory

Keywords: deep learning theory residual networks neural ODEs optimization implicit regularization gradient flow

Scores: [ 8 6 8 6 ]

Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Łojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.

POSTER-797: Adaptive Regret for Bandits Made Possible: Two Queries Suffice

Primary Area: learning theory

Keywords: adaptive regret multi arm bandit

Scores: [ 8 5 8 8 ]

Fast changing states or volatile environments pose a significant challenge to online optimization, which needs to perform rapid adaptation under limited observation. In this paper, we give query and regret optimal bandit algorithms under the strict notion of strongly adaptive regret, which measures the maximum regret over any contiguous interval I. Due to its worst-case nature, there is an almost-linear Ω(|I|1ϵ) regret lower bound, when only one query per round is allowed [Daniely el al, ICML 2015]. Surprisingly, with just two queries per round, we give Strongly Adaptive Bandit Learner (StABL) that achieves O~(n|I|) adaptive regret for multi-armed bandits with n arms. The bound is tight and cannot be improved in general. Our algorithm leverages a multiplicative update scheme of varying stepsizes and a carefully chosen observation distribution to control the variance. Furthermore, we extend our results and provide optimal algorithms in the bandit convex optimization setting. Finally, we empirically demonstrate the superior performance of our algorithms under volatile environments and for downstream tasks, such as algorithm selection for hyperparameter optimization.

SPOTLIGHT-160: On the hardness of learning under symmetries

Primary Area: learning theory

Keywords: Equivariance statistical query lower bound computational hardness invariance symmetry neural networks

Scores: [ 8 6 8 ]

We study the problem of learning equivariant neural networks via gradient descent. The incorporation of known symmetries ("equivariance") into neural nets has empirically improved the performance of learning pipelines, in domains ranging from biology to computer vision. However, a rich yet separate line of learning theoretic research has demonstrated that actually learning shallow, fully-connected (i.e. non-symmetric) networks has exponential complexity in the correlational statistical query (CSQ) model, a framework encompassing gradient descent. In this work, we ask: are known problem symmetries sufficient to alleviate the fundamental hardness of learning neural nets with gradient descent? We answer this question in the negative. In particular, we give lower bounds for shallow graph neural networks, convolutional networks, invariant polynomials, and frame-averaged networks for permutation subgroups, which all scale either superpolynomially or exponentially in the relevant input dimension. Therefore, in spite of the significant inductive bias imparted via symmetry, actually learning the complete classes of functions represented by equivariant neural networks via gradient descent remains hard.

POSTER-798: Learning Nash equilibria in Rank-1 games: Going beyond the Minty Property

Primary Area: learning theory

Keywords: learning in games rank games Nash equilibria Minty optimistic gradient

Scores: [ 5 6 8 5 ]

Learning Nash equilibria (NE) in games has garnered significant attention, particularly in the context of training Generative Adversarial Networks (GANs) and multi-agent Reinforcement Learning. The current state-of-the-art in efficiently learning games focuses on landscapes that meet the (weak) Minty property or games characterized by a unique function, often referred to as potential games. A significant challenge in this domain is that computing Nash equilibria is a computationally intractable task [Daskalakis et al. 2009]. In this paper we focus on bimatrix games (A,B) called rank-1. These are games in which the sum of the payoff matrices A+B is a rank 1 matrix; note that standard zero-sum games are rank 0. We show that optimistic gradient descent/ascent converges to an \epsilon-approximate NE after 1/\epsilon^2 log(1/\epsilon) iterates in rank-1 games. We achieve this by leveraging structural results about the NE landscape of rank-1 games Adsul et al. 2021. Notably, our approach bypasses the fact that these games do not satisfy the MVI property.

POSTER-799: One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

Primary Area: learning theory

Keywords: Linear Self-Attention In-context learning Gradient Descent Theoretical Understanding

Scores: [ 5 8 5 6 ]

Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity (Akyurek et al., 2023), while one-layer transformers with linear self-attention and no MLP layer will learn to implement one step of gradient descent (GD) on a least-squares linear regression objective (von Oswald et al., 2022). However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pre-training loss will implement a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pre-training loss now implements a single step of pre-conditioned GD. However, if only the distribution of the responses is changed, then this does not have a large effect on the learned algorithm: even when the response comes from a more general family of nonlinear functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.

POSTER-800: Sample Relationship from Learning Dynamics Matters for Generalisation

Primary Area: learning theory

Keywords: generalisation data selection neural tangent kernel sample interaction learning dynamics

Scores: [ 8 6 6 ]

Although much research has been done on proposing new models or loss functions to improve the generalisation of artificial neural networks (ANNs), less attention has been directed to the impact of the training data on generalisation. In this work, we start from approximating the interaction between samples, i.e. how learning one sample would modify the model's prediction on other samples. Through analysing the terms involved in weight updates in supervised learning, we find that labels influence the interaction between samples. Therefore, we propose the labelled pseudo Neural Tangent Kernel (lpNTK) which takes label information into consideration when measuring the interactions between samples. We first prove that lpNTK asymptotically converges to the empirical neural tangent kernel in terms of the Frobenius norm under certain assumptions. Secondly, we illustrate how lpNTK helps to understand learning phenomena identified in previous work, specifically the learning difficulty of samples and forgetting events during learning. Moreover, we also show that using lpNTK to identify and remove poisoning training samples does not hurt the generalisation performance of ANNs.

POSTER-801: A Policy Gradient Method for Confounded POMDPs

Primary Area: learning theory

Keywords: Offline Reinforcement Learning Confounded POMDP Policy Gradient Statistical Guarantee Function Approximation

Scores: [ 8 8 8 8 ]

In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentratability coefficient and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in the gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work studying the policy gradient method for POMDPs under the offline setting.

POSTER-802: Towards Establishing Guaranteed Error for Learned Database Operations

Primary Area: learning theory

Keywords: Learned Indexing Learned Cardinality Estimation Machine learning for Data Management

Scores: [ 3 8 5 8 ]

Machine learning models have demonstrated substantial performance enhancements over non-learned alternatives in various fundamental data management operations, including indexing (locating items in an array), cardinality estimation (estimating the number of matching records in a database), and range-sum estimation (estimating aggregate attribute values for query-matched records). However, real-world systems frequently favor less efficient non-learned methods due to their ability to offer (worst-case) error guarantees — an aspect where learned approaches often fall short. The primary objective of these guarantees is to ensure system reliability, ensuring that the chosen approach consistently delivers the desired level of accuracy across all databases. In this paper, we embark on the first theoretical study of such guarantees for learned methods, presenting the necessary conditions for such guarantees to hold when using machine learning to perform indexing, cardinality estimation and range-sum estimation. Specifically, we present the first known lower bounds on the model size required to achieve the desired accuracy for these three key database operations. Our results bound the required model size for given average and worst-case errors in performing database operations, serving as the first theoretical guidelines governing how model size must change based on data size to be able to guarantee an accuracy level. More broadly, our established guarantees pave the way for the broader adoption and integration of learned models into real-world systems.

SPOTLIGHT-161: Learning to Relax: Setting Solver Parameters Across a Sequence of Linear System Instances

Primary Area: learning theory

Keywords: scientific computing data-driven algorithm design online learning multi-armed bandits contextual bandits numerical analysis learning-augmented algorithms algorithms with predictions

Scores: [ 8 8 8 8 ]

POSTER-803: Some Intriguing Aspects about Lipschitz Continuity of Neural Networks

Primary Area: learning theory

Keywords: Lipschitz continuity Double Descent Label Noise Generalization

Scores: [ 3 6 6 8 ]

Lipschitz continuity is a crucial functional property of any predictive model, that naturally governs its robustness, generalisation, as well as adversarial vulnerability. Contrary to other works that focus on obtaining tighter bounds and developing different practical strategies to enforce certain Lipschitz properties, we aim to thoroughly examine and characterise the Lipschitz behaviour of Neural Networks. Thus, we carry out an empirical investigation in a range of different settings (namely, architectures, datasets, label noise, and more) by exhausting the limits of the simplest and the most general lower and upper bounds. As a highlight of this investigation, we showcase a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz and explain the intriguing effects of label noise on function smoothness and generalisation.

POSTER-804: Towards Non-Asymptotic Convergence for Diffusion-Based Generative Models

Primary Area: learning theory

Keywords: diffusion models score-based generative modeling non-asymptotic theory reverse SDE probability flow ODE denoising diffusion probabilistic model

Scores: [ 8 8 8 6 ]

SPOTLIGHT-162: OPTIMAL ROBUST MEMORIZATION WITH RELU NEURAL NETWORKS

Primary Area: learning theory

Keywords: Memorization expressive power of network optimal robust memorization computation complexity Lipschitz constant

Scores: [ 5 5 8 ]

SPOTLIGHT-163: Generalization error of spectral algorithms

Primary Area: learning theory

Keywords: gradient descent kernel ridge regression optimal algorithm generalization asymptotic error rates power-laws

Scores: [ 8 8 8 ]

The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of kernels with a family of \emph{spectral algorithms} specified by profile h(λ), and including KRR and GD as special cases. Then, we derive the generalization error as a functional of learning profile h(λ) for two data models: high-dimensional Gaussian and low-dimensional translation-invariant model. Under power-law assumptions on the spectrum of the kernel and target, we use our framework to (i) give full loss asymptotics for both noisy and noiseless observations (ii) show that the loss localizes on certain spectral scales, giving a new perspective on the KRR saturation phenomenon (iii) conjecture, and demonstrate for the considered data models, the universality of the loss w.r.t. non-spectral details of the problem, but only in case of noisy observation.

POSTER-805: On Representation Complexity of Model-based and Model-free Reinforcement Learning

Primary Area: learning theory

Keywords: model-based and model-free RL representation complexity circuit complexity approximation error

Scores: [ 5 6 8 ]

We study the representation complexity of model-based and model-free reinforcement learning (RL) in the context of circuit complexity. We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as Q-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.

POSTER-806: Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Primary Area: learning theory

Keywords: Chain of thought language modeling circuit complexity deep learning theory

Scores: [ 8 6 8 3 8 5 ]

Generating a sequence of intermediate steps, \emph{a.k.a.}, a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length n, previous works have constant-depth transformers with finite precision poly(n) embedding size can only solve problems in TC0 without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in AC0, a proper subset of $ \mathsf{TC}^0$. However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(logn) embedding size can solve any problem solvable by boolean circuits of size T. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.

POSTER-807: Beating Price of Anarchy and Gradient Descent without Regret in Potential Games

Primary Area: learning theory

Keywords: q-replicator dynamics potential games average price of anarchy learning

Scores: [ 8 6 ]

Arguably one of the thorniest problems in game theory is that of equilibrium selection. Specifically, in the presence of multiple equilibria do self-interested learning dynamics typically select the socially optimal ones? We study a rich class of continuous-time no-regret dynamics in potential games (PGs). Our class of dynamics, Q-Replicator Dynamics (QRD), include gradient descent (GD), log-barrier and replicator dynamics (RD) as special cases. We start by establishing pointwise convergence of all QRD to Nash equilibria in almost all PGs. In the case of GD, we show a tight average case performance within a factor of two of optimal, for a class of symmetric 2×2 potential games with unbounded Price of Anarchy (PoA). Despite this positive result, we show that GD is not always the optimal choice even in this restricted setting. Specifically, GD outperforms RD, if and only if risk- and payoff-dominance equilibria coincide. Finally, we experimentally show how these insights extend to all QRD dynamics and that unbounded gaps between average case performance and PoA analysis are common even in larger settings.

POSTER-808: Sparsistency for inverse optimal transport

Primary Area: metric learning, kernel learning, and sparse coding

Keywords: Optimal transport sparsity sparsistency metric learning

Scores: [ 5 8 8 6 ]

POSTER-809: Threshold-Consistent Margin Loss for Open-World Deep Metric Learning

Primary Area: metric learning, kernel learning, and sparse coding

Keywords: Deep metric learning Open-world visual recognition Threshold consistency

Scores: [ 5 8 6 ]

Existing losses used in deep metric learning (DML) for image retrieval often lead to highly non-uniform intra-class and inter-class representation structures across test classes and data distributions. When combined with the common practice of using a fixed threshold to declare a match, this gives rise to significant performance variations in terms of false accept rate (FAR) and false reject rate (FRR) across test classes and data distributions. We define this issue in DML as threshold inconsistency. In real-world applications, such inconsistency often complicates the threshold selection process when deploying large-scale image retrieval systems. To measure this inconsistency, we propose a novel variance-based metric called Operating-Point-Inconsistency-Score (OPIS) that quantifies the variance in the operating characteristics across classes. Using the OPIS metric, we find that achieving high accuracy levels in a DML model does not automatically guarantee threshold consistency. In fact, our investigation reveals a Pareto frontier in the high-accuracy regime, where existing methods to improve accuracy often lead to degradation in threshold consistency. To address this trade-off, we introduce the Threshold-Consistent Margin (TCM) loss, a simple yet effective regularization technique that promotes uniformity in representation structures across classes by selectively penalizing hard sample pairs. Large-scale experiments demonstrate TCM's effectiveness in enhancing threshold consistency while preserving accuracy, simplifying the threshold selection process in practical DML settings.

POSTER-810: Mathematical Justification of Hard Negative Mining via Isometric Approximation Theorem

Primary Area: metric learning, kernel learning, and sparse coding

Keywords: Metric Learning Triplet Loss Hard Negative Mining

Scores: [ 8 3 5 6 ]

POSTER-811: Neural Snowflakes: Universal Latent Graph Inference via Trainable Latent Geometries

Primary Area: metric learning, kernel learning, and sparse coding

Keywords: Latent Graph Inference Representation Learning Metric Embedding Geometric Deep Learning Graph Neural Networks

Scores: [ 6 8 5 8 ]

The inductive bias of a graph neural network (GNN) is largely encoded in its specified graph. Latent graph inference relies on latent geometric representations to dynamically rewire or infer a GNN's graph to maximize the GNN's predictive downstream performance, but it lacks solid theoretical foundations in terms of embedding-based representation guarantees. This paper addresses this issue by introducing a trainable deep learning architecture, coined \textit{neural snowflake}, that can adaptively implement fractal-like metrics on Rd. We prove that any given finite weights graph can be isometrically embedded by a standard MLP encoder. Furthermore, when the latent graph can be represented in the feature space of a sufficiently regular kernel, we show that the combined neural snowflake and MLP encoder do not succumb to the curse of dimensionality by using only a low-degree polynomial number of parameters in the number of nodes. This implementation enables a low-dimensional isometric embedding of the latent graph. We conduct synthetic experiments to demonstrate the superior metric learning capabilities of neural snowflakes when compared to more familiar spaces like Euclidean space. Additionally, we carry out latent graph inference experiments on graph benchmarks. Consistently, the neural snowflake model achieves predictive performance that either matches or surpasses that of the state-of-the-art latent graph inference models. Importantly, this performance improvement is achieved without requiring random search for optimal latent geometry. Instead, the neural snowflake model achieves this enhancement in a differentiable manner.

SPOTLIGHT-164: MMD Graph Kernel: Effective Metric Learning for Graphs via Maximum Mean Discrepancy

Primary Area: metric learning, kernel learning, and sparse coding

Keywords: graph kernel graph metric learning maximum mean discrepancy

Scores: [ 8 8 6 8 ]

This paper focuses on graph metric learning. First, we present a class of maximum mean discrepancy (MMD) based graph kernels, called MMD-GK. These kernels are computed by applying MMD to the node representations of two graphs with message-passing propagation. Compared to classical graph kernels such as the Weisfeiler-Lehman kernel, our MMD-GKs have much lower computational costs and are able to exploit nodes' features of graphs effectively. Secondly, we provide a class of deep MMD-GKs that are able to learn graph metrics and implicit graph features adaptively in an unsupervised manner. Thirdly, we propose a class of supervised deep MMD-GKs that are able to utilize label information of graphs and hence yield more discriminative metrics. Besides the algorithms, we provide theoretical analysis for the proposed methods. The proposed methods are evaluated in comparison to many baselines such as graph kernels and graph neural networks in the tasks of graph clustering and graph classification. The numerical results demonstrate the effectiveness and superiority of our methods.

POSTER-812: Scalable Neural Network Kernels

Primary Area: metric learning, kernel learning, and sparse coding

Keywords: scalable kernel methods random features deep neural networks

Scores: [ 5 5 8 8 ]

We introduce the concept of scalable neural network kernels (SNNKs), the replacements of regular feedforward layers (FFLs), capable of approximating the latter, but with favorable computational properties. SNNKs effectively disentangle the inputs from the parameters of the neural network in the FFL, only to connect them in the final computation via the dot-product kernel. They are also strictly more expressive, as allowing to model complicated relationships beyond the functions of the dot-products of parameter-input vectors. We also introduce the neural network bundling process that applies SNNKs to compactify deep neural network architectures, resulting in additional compression gains. In its extreme version, it leads to the fully bundled network whose optimal parameters can be expressed via explicit formulae for several loss functions (e.g. mean squared error), opening a possibility to bypass backpropagation. As a by-product of our analysis, we introduce the mechanism of the universal random features (or URFs), applied to instantiate several SNNK variants, and interesting on its own in the context of scalable kernel methods. We provide rigorous theoretical analysis of all these concepts as well as an extensive empirical evaluation, ranging from point-wise kernel estimation to Transformers' fine-tuning with novel adapter layers inspired by SNNKs. Our mechanism provides up to 5x reduction in the number of trainable parameters, while maintaining competitive accuracy.

POSTER-813: Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders

Primary Area: metric learning, kernel learning, and sparse coding

Keywords: cross-encoder kNN retrieval nearest-neighbor search

Scores: [ 5 6 8 6 ]

Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than using dot-product with embedding-based models (dual-encoders) at estimating query-item relevance. Existing approaches perform k-NN search with cross-encoders by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall as DE generalize poorly to new domains and the test-time retrieval with DE is decoupled from the CE. While CUR-based approaches can be more accurate than the DE-based retrieve-and-rerank approach, such approaches require prohibitively large number of CE calls to compute item embeddings, thus making it impractical for deployment as scale. In this paper, we address these shortcomings with our proposed sparse-matrix factorization based method that efficiently computes latent query and item representations to approximate CE scores and performs k-NN search with the approximate CE similarity. In an offline indexing stage, we compute item embeddings by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high quality approximation while requiring only a fraction of CE similarity calls as compared to CUR-based methods, and allows for leveraging DE models to initialize the embedding space while avoiding compute- and resource-intensive finetuning of DE via distillation. At test-time, we keep item embeddings fixed and perform retrieval over multiple rounds, alternating between a) estimating the test-query embedding by minimizing error in approximating CE scores of items retrieved thus far, and b) using the updated test-query embedding for retrieving more items in the next round. Our proposed k-NN search method can achieve up to 5% and 54% improvement in k-NN recall for k=1 and 100 respectively over the widely-used DE-based retrieve-and-rerank approach. Furthermore, our proposed approach to index the items by aligning item embeddings with the CE achieves up to 100$\times$ and 5$\times$ speedup over CUR-based and dual-encoder distillation based approaches respectively while matching or improving k-NN search recall over baselines.

POSTER-814: LipSim: A Provably Robust Perceptual Similarity Metric

Primary Area: metric learning, kernel learning, and sparse coding

Keywords: Perceptual similarity metric certified defense deep learning

Scores: [ 6 5 5 ]

Recent years have seen growing interest in developing and applying perceptual similarity metrics. Research has shown the superiority of perceptual metrics over pixel-wise metrics in aligning with human perception and serving as a proxy for the human visual system.On the other hand, as perceptual metrics rely on neural networks, there is a growing concern regarding their resilience, given the established vulnerability of neural networks to adversarial attacks. It is indeed logical to infer that perceptual metrics may inherit both the strengths and shortcomings of neural networks.In this work, we demonstrate the vulnerability of state-of-the-art perceptual similarity metrics based on an ensemble of ViT-based feature extractors to adversarial attacks. We then propose a framework to train a robust perceptual similarity metric called LipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging 1-Lipschitz neural networks as the backbone, LipSim provides guarded areas around each data point and certificates for all perturbations within an 2 ball. Finally, a comprehensive set of experiments shows the performance of LipSim in terms of natural and certified scores and on the image retrieval application.

POSTER-815: Generating Pragmatic Examples to Train Neural Program Synthesizers

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: program synthesis pragmatics self-play

Scores: [ 6 8 6 6 ]

Programming-by-example is the task of synthesizing a program that is consistent with a set of user-provided input-output examples. As examples are often an under-specification of one's intent, a good synthesizer must choose the intended program from the many that are consistent with the given set of examples. Prior work frames program synthesis as a cooperative game between a listener (that synthesizes programs) and a speaker (a user choosing examples), and shows that models of computational pragmatic inference is effective in choosing the user intended programs. However, these models requires counterfactual reasoning over a large set of programs and examples, which is infeasible in realistic program spaces. In this paper, we propose a novel way to amortize this search. We sample pairs of programs and examples via self-play between listener and speaker models, and use pragmatic inference to choose informative training examples from this sample.We then use the informative dataset to train models to improve the synthesizer's ability to disambiguate user-provided examples _without human supervision. We validate our method on the challenging task of synthesizing regular expressions from example strings, finding that our method (1) outperforms models trained without choosing pragmatic examples by 23% (a 51% relative increase) (2) matches the performance of supervised learning on a dataset of pragmatic examples provided by humans, despite using no human data in training.

POSTER-816: Image Translation as Diffusion Visual Programmers

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Image translation Diffusion model

Scores: [ 6 6 6 6 ]

We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, which span RoI identification, style transfer, and position manipulation, facilitating transparent and controllable image translation processes. Extensive experiments demonstrate DVP’s remarkable performance, surpassing concurrent arts. This success can be attributed to several key features of DVP: First, DVP achieves condition-flexible translation via instance normalization, enabling the model to eliminate sensitivity caused by the manual guidance and optimally focus on textual descriptions for high-quality content generation. Second, the frame work enhances in-context reasoning by deciphering intricate high-dimensional concepts in feature spaces into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]), allowing for localized, context-free editing while maintaining overall coherence. Last but not least, DVP improves systemic controllability and explainability by offering explicit symbolic representations at each programming stage, empowering users to intuitively interpret and modify results. Our research marks a substantial step towards harmonizing artificial image translation processes with cognitive intelligence, promising broader applications.

POSTER-817: How Realistic Is Your Synthetic Data? Constraining Deep Generative Models for Tabular Data

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Synthetic Data Generation

Scores: [ 6 5 8 6 ]

Deep Generative Models (DGMs) have been shown to be powerful tools for generating tabular data, as they have been increasingly able to capture the complex distributions that characterize them. However, to generate realistic synthetic data, it is often not enough to have a good approximation of their distribution, as it also requires compliance with constraints that encode essential background knowledge on the problem at hand. In this paper, we address this limitation and show how DGMs for tabular data can be transformed into Constrained Deep Generative Models (C-DGMs), whose generated samples are guaranteed to be compliant with the given constraints. This is achieved by automatically parsing the constraints and transforming them into a Constraint Layer (CL) seamlessly integrated with the DGM. Our extensive experimental analysis with various DGMs and tasks reveals that standard DGMs often violate constraints, some exceeding 95% non-compliance, while their corresponding C-DGMs are never non-compliant. Then, we quantitatively demonstrate that, at training time, C-DGMs are able to exploit the background knowledge expressed by the constraints to outperform their standard counterparts with up to 4.5% improvement in utility and detection. Further, we show how our CL does not necessarily need to be integrated at training time, as it can be also used as a guardrail at inference time, still producing some improvements in the overall performance of the models. Finally, we show that our CL does not hinder the sample generation time of the models.

POSTER-818: NeuroBack: Improving CDCL SAT Solving using Graph Neural Networks

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Propositional satisfiability Graph Neural Networks CDCL SAT Solving Backbone Phase Prediction

Scores: [ 5 8 5 6 ]

Propositional satisfiability (SAT) is an NP-complete problem that impacts manyresearch fields, such as planning, verification, and security. Mainstream modernSAT solvers are based on the Conflict-Driven Clause Learning (CDCL) algorithm.Recent work aimed to enhance CDCL SAT solvers using Graph Neural Networks(GNNs). However, so far this approach either has not made solving more effective,or required substantial GPU resources for frequent online model inferences. Aimingto make GNN improvements practical, this paper proposes an approach calledNeuroBack, which builds on two insights: (1) predicting phases (i.e., values) ofvariables appearing in the majority (or even all) of the satisfying assignments areessential for CDCL SAT solving, and (2) it is sufficient to query the neural modelonly once for the predictions before the SAT solving starts. Once trained, theoffline model inference allows NeuroBack to execute exclusively on the CPU,removing its reliance on GPU resources. To train NeuroBack, a new dataset calledDataBack containing 120,286 data samples is created. Finally, NeuroBack is implementedas an enhancement to a state-of-the-art SAT solver called Kissat. As a result,it allowed Kissat to solve 5.2% more problems on the recent SAT competitionproblem set, SATCOMP-2022. NeuroBack therefore shows how machine learningcan be harnessed to improve SAT solving in an effective and practical manner.

POSTER-819: Guess & Sketch: Language Model Guided Transpilation

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: transpilation program translation assembly code language model neurosymbolic machine learning

Scores: [ 8 8 5 ]

Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze.Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.

POSTER-820: Learning Grounded Action Abstractions from Language

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: planning abstractions hierarchical planning library learning learning from language

Scores: [ 6 8 6 5 6 ]

Long-horizon planning is dauntingly hard -- it requires modeling relevant aspects of the environment and searching over large, complex action spaces. \textit{Hierarchical planning} approaches make complex problems more tractable using temporal \textit{action abstractions}, decomposing hard tasks into smaller abstract subproblems that can be solved modularly. However, actually learning useful action abstractions has long posed significant challenges without human expert knowledge. Here, we introduce a system that leverages background information in language to learn a \textit{library of symbolic action abstractions and accompanying low-level policies} that can be composed to solve increasingly complex tasks. Our approach queries large language models (LLMs) as a prior for proposing useful symbolic action definitions, but integrates these proposals into a formal hierarchical planning system to ground and verify proposed actions. On two language-guided interactive planning domains (\textit{Mini Minecraft} and the \textit{ALFRED Household Tasks} benchmark), our approach far outperforms other baseline approaches that use LLMs in planning, enabling far more accurate planning and enable better generalization to more complex tasks.

POSTER-821: Magnushammer: A Transformer-Based Approach to Premise Selection

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: transformers interactive theorem proving automated reasoning contrastive learning premise selection

Scores: [ 8 8 8 8 ]

This paper presents a novel approach to premise selection, a crucial reasoning task in automated theorem proving. Traditionally, symbolic methods that rely on extensive domain knowledge and engineering effort are applied to this task. In contrast, this work demonstrates that contrastive training with the transformer architecture can achieve higher-quality retrieval of relevant premises, without the knowledge or feature engineering overhead. Our method, Magnushammer, outperforms the most advanced and widely used automation tool in interactive theorem proving called Sledgehammer. On the PISA and miniF2f benchmarks Magnushammer achieves 59.5% (against 38.3%) and 34.0% (against 20.9%) success rates, respectively. By combining Magnushammer with a language-model-based automated theorem prover, we further improve the state-of-the-art proof success rate from 57.0% to 71.0% on the PISA benchmark using $4$x fewer parameters. Moreover, we develop and open source a novel dataset for premise selection, containing textual representations of (proof state, relevant premise) pairs. To the best of our knowledge, this is the largest available premise selection dataset, and the first dataset of this kind for the Isabelle proof assistant.

POSTER-822: Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Knowledge Graph Chain-of-Thought Large Language Models

Scores: [ 8 8 6 6 ]

POSTER-823: Faithful Rule Extraction for Differentiable Rule Learning Models

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Faithful Rule Extraction Rule Learning Knowledge Graph Datalog

Scores: [ 6 8 8 5 ]

POSTER-824: Logical Languages Accepted by Transformer Encoders with Hard Attention

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: transformer encoders languages AC0 first order logic linear temporal logic counting terms parity majority

Scores: [ 8 8 3 3 6 ]

We contribute to the study of formal languages that can be recognized by transformer encoders. We focus on two self-attention mechanisms: (1) UHAT (Unique Hard Attention Transformers) and (2) AHAT (Average Hard Attention Transformers). UHAT encoders are known to recognize only languages inside the circuit complexity class AC0, i.e., accepted by a family of poly-sized and depth-bounded boolean circuits with unbounded fan-ins. On the other hand, AHAT encoders can recognize languages outside AC0), but their expressive power still lies within the bigger circuit complexity class TC0, i.e., AC0-circuits extended by majority gates.We first show a negative result that there is an AC0-language that cannot be recognized by an UHAT encoder. On the positive side, we show that UHAT encoders can recognize a rich fragment of AC0-languages, namely, all languages definable in first-order logic with arbitrary unary numerical predicates. This logic, includes, for example, all regular languages from AC0. We then show that AHAT encoders can recognize all languages of our logic even when we enrich it with counting terms. We apply these results to derive new results on the expressive power of UHAT and AHAT up to permutation of letters (a.k.a. Parikh images).

POSTER-825: REFACTOR: Learning to Extract Theorems from Proofs

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: theorem extraction mathematical reasoning theorem proving

Scores: [ 8 5 8 8 ]

Human mathematicians are often good at recognizing modular and reusable theorems that make complex mathematical results within reach. In this paper, we propose a novel method called theoREm-from-prooF extrACTOR (REFACTOR) for training neural networks to mimic this ability in formal mathematical theorem proving. We show on a set of unseen proofs, REFACTOR is able to extract 19.6% of the theorems that humans would use to write the proofs. When applying the model to the existing Metamath library, REFACTOR extracted 16 new theorems. With newly extracted theorems, we show that the existing proofs in the MetaMath database can be refactored. The new theorems are used very frequently after refactoring, with an average usage of 733.5 times, and help shorten the proof lengths. Lastly, we demonstrate that the prover trained on the new-theorem refactored dataset proves more test theorems and outperforms state-of-the-art baselines by frequently leveraging a diverse set of newly extracted theorems.

POSTER-826: Modeling Boundedly Rational Agents with Latent Inference Budgets

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: neurosymbolic planning rationality

Scores: [ 5 6 8 6 ]

We study the problem of modeling a population of agents pursuing unknown goals subject to unknown computational constraints. In standard models of bounded rationality, sub-optimal decision-making is simulated by adding homoscedastic noise to optimal decisions rather than actually simulating constrained inference. In this work, we introduce a latent inference budget model (L-IBM) that models these constraints explicitly, via a latent variable (inferred jointly with a model of agents’ goals) that controls the runtime of an iterative inference algorithm. L-IBMs make it possible to learn agent models using data from diverse populations of suboptimal actors. In three modeling tasks—inferring navigation goals from routes, inferring communicative intents from human utterances, and predicting next moves in human chess games—we show that L-IBMs match or outperforms Boltzmann models of decision-making under uncertainty. Moreover, the inferred inference budgets are themselves meaningful, efficient to compute, and correlated with measures of player skill, partner skill and task difficulty.

POSTER-827: Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: mathematical reasoning autoformalization automated theorem proving quantitative reasoning

Scores: [ 3 8 8 6 ]

Large language models (LLM), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics (e.g. in Isabelle, a formal theorem proving environment), they can be prompted to translate i.e. autoformalize informal mathematical statements into formal Isabelle code --- which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting --- the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes.

POSTER-828: LILO: Learning Interpretable Libraries by Compressing and Documenting Code

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: program synthesis language models library learning compression refactoring abstraction documentation neurosymbolic

Scores: [ 6 8 6 8 ]

While large language models (LLMs) now excel at code generation, a key aspect of software development is the art of refactoring: consolidating code into libraries of reusable and readable programs. In this paper, we introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code to build libraries tailored to particular problem domains. Computationally, library learning presents a challenging optimization problem that requires formal reasoning about program structure at scale. LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch: a symbolic compression system that efficiently identifies optimal lambda abstractions across large code corpora. To make these abstractions interpretable, we introduce an auto-documentation (AutoDoc) procedure that infers natural language names and docstrings based on contextual examples of usage. In addition to improving human readability, we find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions. We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing neural and symbolic methods—including the state-of-the-art library learning algorithm DreamCoder—LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge. In sum, LILO provides a general design pattern for human-interpretable systems that build up shared libraries of program abstractions to solve complex software problems.

POSTER-829: CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Large Language Models In-context Learning Self-Verification Self-Correction Truthfulness Tool-use Interaction

Scores: [ 5 8 8 5 ]

Recent developments in large language models (LLMs) have been impressive. However, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. Unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. Inspired by this observation, we introduce a framework called CRITIC that allows LLMs, which are essentially “black boxes” to validate and progressively amend their own outputs in a manner similar to human interaction with tools. More specifically, starting with an initial output, CRITIC interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs.

POSTER-830: Incremental Randomized Smoothing Certification

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Verification Certification Approximation Robustness

Scores: [ 6 6 6 6 ]

Randomized smoothing-based certification is an effective approach for obtaining robustness certificates of deep neural networks (DNNs) against adversarial attacks. This method constructs a smoothed DNN model and certifies its robustness through statistical sampling, but it is computationally expensive, especially when certifying with a large number of samples. Furthermore, when the smoothed model is modified (e.g., quantized or pruned), certification guarantees may not hold for the modified DNN, and recertifying from scratch can be prohibitively expensive.We present the first approach for incremental robustness certification for randomized smoothing, IRS. We show how to reuse the certification guarantees for the original smoothed model to certify an approximated model with very few samples. IRS significantly reduces the computational cost of certifying modified DNNs while maintaining strong robustness guarantees. We experimentally demonstrate the effectiveness of our approach, showing up to 4.1x certification speedup over the certification that applies randomized smoothing of the approximate model from scratch.

POSTER-831: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Bioinformatics

Scores: [ 3 3 6 ]

While artificial intelligence has made remarkable strides in revealing the relationship between biological macromolecules' primary sequence and tertiary structure, designing RNA sequences based on specified tertiary structures remains challenging. Though existing approaches in protein design have thoroughly explored structure-to-sequence dependencies in proteins, RNA design still confronts difficulties due to structural complexity and data scarcity. Adding to the problem, direct transplantation of protein design methodologies into RNA design fails to achieve satisfactory outcomes although sharing similar structural components. In this study, we aim to systematically construct a data-driven RNA design pipeline. We crafted a large, well-curated benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure. More importantly, we proposed a hierarchical data-efficient representation learning framework that learns structural representations through contrastive learning at both cluster-level and sample-level to fully leverage the limited data. By constraining data representations within a limited hyperspherical space, the intrinsic relationships between data points could be explicitly imposed. Moreover, we incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process. Extensive experiments demonstrate the effectiveness of our proposed method, providing a reliable baseline for future RNA design tasks. The source code and benchmark dataset will be released publicly.

POSTER-832: Less is More: One-shot Subgraph Reasoning on Large-scale Knowledge Graphs

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: knowledge graph reasoning graph sampling

Scores: [ 5 6 3 ]

To deduce new facts on knowledge graph (KG), a reasoning system learns from the graph structure and collects local evidence to find the answer. However, existing methods suffer from a severe scalability problem due to the utilization of the whole KG for reasoning, which hinders their promise on large-scale KG and cannot be directly addressed by vanilla sampling methods. In this work, we propose the one-shot subgraph reasoning to achieve efficient as well as adaptive KG reasoning. The design principle is that, instead of directly acting on the whole KG, the reasoning procedure is decoupled into two steps, i.e., (i) extracting only one query-dependent subgraph and (ii) reasoning on this single subgraph. We reveal that the non-parametric and computation-efficient heuristics Personalized PageRank (PPR) can effectively identify the potential answers and supports to the reasoning. With the promoted efficiency, we further introduce the subgraph-based searching of optimal configurations in both data and model spaces. Empirically, our method achieves promoted efficiency and also leading performances on five large-scale benchmarks.

POSTER-833: Lemur: Integrating Large Language Models in Automated Program Verification

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Large Language Models Formal verification

Scores: [ 8 5 5 5 ]

The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that typically demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of derivation rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure, which led to practical improvements on a set of synthetic and competition benchmarks.

POSTER-834: Reinforcement Symbolic Regression Machine

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: symbolic regression reinforcement learning equation discovery

Scores: [ 6 8 6 6 ]

In nature, the behaviors of many complex systems can be described by parsimonious math equations. Automatically distilling these equations from limited data is cast as a symbolic regression (SR) process which hitherto remains a grand challenge. Keen efforts have been placed on tackling this issue and demonstrated success in { SR}. However, there still exist bottlenecks that current methods struggle to break when the discrete search space tends toward infinity and especially when the underlying math formula is intricate. To this end, we propose a novel Reinforcement Symbolic Regression Machine (RSRM) that masters the capability of uncovering complex math equations from only scarce data. The RSRM model is composed of three key modules: (1) a Monte Carlo tree search (MCTS) agent, { designed for exploration}, that explores optimal math expression trees consisting of pre-defined math operators and variables, (2) a Double Q-learning block, { designed for exploitation}, that helps reduce the feasible search space of MCTS via properly understanding the distribution of reward, and (3) a modulated sub-tree discovery block that heuristically learns and defines new math operators to improve representation ability of math expression trees. Binding of these modules yields the SOTA performance of RSRM in SR as demonstrated by multiple benchmark datasets. The RSRM shows clear superiority over several representative baseline models.

POSTER-835: Reclaiming the Source of Programmatic Policies: Programmatic versus Latent Spaces

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: programmatic policy reinforcement learning

Scores: [ 5 3 3 ]

Recent works have introduced LEAPS and HPRL, systems that learn latent spaces of domain-specific languages, which are used to define programmatic policies for partially observable Markov decision processes (POMDPs). These systems induce a latent space while optimizing losses such as the behavior loss, which aim to achieve locality in program behavior, meaning that vectors close in the latent space should correspond to similarly behaving programs. In this paper, we show that the programmatic space, induced by the domain-specific language and requiring no training, presents values for the behavior loss similar to those observed in latent spaces presented in previous work. Moreover, algorithms searching in the programmatic space significantly outperform those in LEAPS and HPRL. To explain our results, we measured the friendliness'' of the two spaces to local search algorithms. We discovered that algorithms are more likely to stop at local maxima when searching in the latent space than when searching in the programmatic space. This implies that the optimization topology of the programmatic space, induced by the reward function in conjunction with the neighborhood function, is more conducive to search than that of the latent space. This result provides an explanation for the superior performance in the programmatic space.

POSTER-836: Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Large language models Neuro-symbolic Visual Reasoning

Scores: [ 6 6 5 6 ]

Recent works have shown that Large Language Models (LLMs) could empower traditional neuro-symbolic models via programming capabilities to translate lan- guages into module descriptions, thus achieving strong visual reasoning results while maintaining the model’s transparency and efficiency. However, these mod- els usually exhaustively generate the entire code snippet given each new instance of a task, which is extremely ineffective. On the contrary, human beings grad- ually acquire knowledge that can be reused and grow into more profound skills for fast generalization to new tasks since we are an infant. Inspired by this, we propose generative neuro-symbolic visual reasoning by growing and reusing mod- ules. Specifically, our model consists of three unique stages, module initialization, module generation, and module execution. First, given a vision-language task, we adopt LLMs to examine whether we could reuse and grow over established mod- ules to handle this new task. If not, we initialize a new module needed by the task and specify the inputs and outputs of this new module. After that, the new module is created by querying LLMs to generate corresponding code snippets that match the requirements. In order to get a better sense of the new module’s ability, we treat few-shot training examples as test cases to see if our new module could pass these cases. If yes, the new module is added to the module library for future reuse. Finally, we evaluate the performance of our model on the testing set by executing the parsed programs with the newly made visual modules to get the results. We find the proposed GNSVR model possesses several advantages. First, it performs competitively on standard tasks like visual question answering and referring ex- pression comprehension; Second, the visual modules learned from one task can be seamlessly transferred to new tasks; Last but not least, it is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.

SPOTLIGHT-165: MetaPhysiCa: Improving OOD Robustness in Physics-informed Machine Learning

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: physics-informed machine learning OOD robustness meta learning causal structure discovery

Scores: [ 6 8 6 8 ]

A fundamental challenge in physics-informed machine learning (PIML) is the design of robust PIML methods for out-of-distribution (OOD) forecasting tasks. These OOD tasks require learning-to-learn from observations of the same (ODE) dynamical system with different unknown ODE parameters, and demand accurate forecasts even under out-of-support initial conditions and out-of-support ODE parameters. In this work we propose to improve the OOD robustness of PIML via a meta-learning procedure for causal structure discovery. Using three different OOD tasks, we empirically observe that the proposed approach significantly outperforms existing state-of-the-art PIML and deep learning methods (with 2× to 28× lower OOD errors).

POSTER-837: AirPhyNet: Harnessing Physics-Guided Neural Networks for Air Quality Prediction

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: air quality prediction physics-informed spatiotemporal-learning interpretability

Scores: [ 6 8 6 6 ]

Air quality prediction and modelling plays a pivotal role in public health and environment management, for individuals and authorities to make informed decisions. Although traditional data-driven models have shown promise in this domain,their long-term prediction accuracy can be limited, especially in scenarios with sparse or incomplete data and they often rely on "black-box" deep learning structures that lack solid physical foundation leading to reduced transparency and interpretability in predictions. To address these limitations, this paper presents a novel approach named Physics guided Neural Network for Air Quality Prediction (AirPhyNet). Specifically, we leverage two well-established physics principles of air particle movement (diffusion and advection) by representing them as differential equation networks. Then, we utilize a graph structure to integrate physics knowledge into a neural network architecture and exploit latent representations to capture spatio-temporal relationships within the air quality data. Experiments on two real-world benchmark datasets demonstrate that AirPhyNet outperforms state-of-the-art models for different testing scenarios including different lead time (24h, 48h, 72h), sparse data and sudden change prediction, achieving reduction in prediction errors up to 10%. Moreover, a case study further validates that our model captures underlying physical processes of particle movement and generates accurate predictions with real physical meaning. The code is available at:: https://anonymous.4open.science/r/AirPhyNet-230F/

SPOTLIGHT-166: PINNACLE: PINN Adaptive ColLocation and Experimental points selection

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Physics-informed Neural Networks PINNs adaptive training points selection

Scores: [ 8 8 8 6 ]

POSTER-838: Hypothesis Search: Inductive Reasoning with Language Models

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Inductive reasoning large language models

Scores: [ 6 3 6 5 3 6 ]

Inductive reasoning is a core problem-solving capacity: humans can identify underlying principles from a few examples, which can then be robustly generalized to novel scenarios. Recent work has evaluated large language models (LLMs) on inductive reasoning tasks by directly prompting them yielding "in context learning." This can work well for straightforward inductive tasks, but performs very poorly on more complex tasks such as the Abstraction and Reasoning Corpus (ARC). In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction: we prompt the LLM to propose multiple abstract hypotheses about the problem, in natural language, then implement the natural language hypotheses as concrete Python programs. These programs can be directly verified by running on the observed examples and generalized to novel inputs. To reduce the hypothesis search space, we explore steps to filter the set of hypotheses to be implemented as programs: we either ask the LLM to summarize them into a smaller set of hypotheses, or ask human annotators to select a subset. We verify our pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its variant 1D-ARC, and string transformation dataset SyGuS. On a random 40-problem subset of ARC, our automated pipeline using LLM summaries achieves 27.5% accuracy, significantly outperforming the direct prompting baseline (accuracy of 12.5%). With the minimal human input of selecting from LLM-generated candidates, the performance is boosted to 37.5%. Our ablation studies show that abstract hypothesis generation and concrete program representations are both beneficial for LLMs to perform inductive reasoning tasks.

POSTER-839: Neural-Symbolic Recursive Machine for Systematic Generalization

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Neuro-symbolic AI Systematic Generalization Compositional Generalization

Scores: [ 8 3 8 6 ]

Current learning models often struggle with human-like systematic generalization; learning compositional rules from limited data and extrapolating them to unseen combinations. To address this, we introduce Neural-Symbolic Recursive Machine (NSR), a model whose core representation is a Grounded Symbol System (GSS ), with its combinatorial syntax and semantics emerging entirely from the training data. The NSR adopts a modular approach, incorporating neural perception, syntactic parsing, and semantic reasoning, which are jointly learned through a deduction-abduction algorithm. We establish that NSR possesses sufficient expressiveness to handle a variety of sequence-to-sequence tasks and attains superior systematic generalization, thanks to the inductive biases of equivariance and recursiveness inherent in each module. We assess NSR ’s performance against four rigorous benchmarks designed to test systematic generalization: SCAN for semantic parsing, PCFG for string manipulation, HINT for arithmetic reasoning, and a task involving compositional machine translation. Our results indicate that NSR outperforms existing neural or hybrid models in terms of generalization and transferability.

POSTER-840: Dissecting Neural Network Robustness Proofs

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Neural Network Verification Robustness Certification

Scores: [ 6 6 6 6 ]

In recent years numerous methods have been developed to formally verify the robustness of deep neural networks (DNNs). Though the proposed techniques are effective in providing mathematical guarantees about the DNNs' behavior, it is not clear whether the proofs generated by these methods are human understandable. In this paper, we bridge this gap by developing new concepts, algorithms, and representations to generate human understandable insights into the internal workings of DNN robustness proofs. Leveraging the proposed method, we show that the robustness proofs of standard DNNs rely more on spurious input features as compared to the proofs of DNNs trained to be robust. Robustness proofs of the provably robust DNNs filter out a larger number of spurious input features as compared to adversarially trained DNNs, sometimes even leading to the pruning of semantically meaningful input features.The proofs for the DNNs combining adversarial and provably robust training tend to achieve the middle ground.

POSTER-841: ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Large Language Models Mathematical Reasoning Tool Learning

Scores: [ 6 5 8 8 ]

Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.

POSTER-842: Neurosymbolic Grounding for Compositional Generalization

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: neurosymbolic learning machine learning world modeling compositional generalization

Scores: [ 6 6 6 ]

We introduce Cosmos, a framework for object-centric world modeling that is designed for compositional generalization (CG), i.e., high performance on unseen input scenes obtained through the composition of known visual "atoms." The central insight behind Cosmos is the use of a novel form of neurosymbolic grounding. Specifically, the framework introduces two new tools: (i) neurosymbolic scene encodings, which represent each entity in a scene using a real vector computed using a neural encoder, as well as a vector of composable symbols describing attributes of the entity, and (ii) a neurosymbolic attention mechanism that binds these entities to learned rules of interaction. Cosmos is end-to-end differentiable; also, unlike traditional neurosymbolic methods that require representations to be manually mapped to symbols, it computes an entity's symbolic attributes using vision-language foundation models. Through an evaluation that considers two different forms of CG on an established blocks-pushing domain, we show that the framework establishes a new state-of-the-art for CG in world modeling.

ORAL-42: LEGO-Prover: Neural Theorem Proving with Growing Libraries

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Keywords: Theorem proving Large language model Autoformalization

Scores: [ 8 6 8 8 ]

Despite the success of large language models (LLMs), the task of theorem proving still remains one of the hardest reasoning tasks that is far from being fully solved. Prior methods using language models have demonstrated promising results, but they still struggle to prove even middle school level theorems. One common limitation of these methods is that they assume a fixed theorem library during the whole theorem proving process. However, as we all know, creating new useful theorems or even new theories is not only helpful but crucial and necessary for advancing mathematics and proving harder and deeper results.In this work, we present LEGO-Prover, which employs a growing skill library containing verified lemmas as skills to augment the capability of LLMs used in theorem proving. By constructing the proof modularly, LEGO-Prover enables LLMs to utilize existing skills retrieved from the library and to create new skills during the proving process. These skills are further evolved (by prompting an LLM) to enrich the library on another scale. Modular and reusable skills are constantly added to the library to enable tackling increasingly intricate mathematical problems. Moreover, the learned library further bridges the gap between human proofs and formal proofs by making it easier to impute missing steps. LEGO-Prover advances the state-of-the-art pass rate on miniF2F-valid (48.0% to 57.0%) and miniF2F-test (45.5% to 47.1%). During the proving process, LEGO-Prover also manages to generate over 20,000 skills (theorems/lemmas) and adds them to the growing library. Our ablation study indicates that these newly added skills are indeed helpful for proving theorems, resulting in an improvement from a success rate of 47.1% to 50.4%. We also release our code and all the generated skills.

POSTER-843: A ROBUST DIFFERENTIAL NEURAL ODE OPTIMIZER

Primary Area: optimization

Keywords: Neural Optimizer Differential Dynamic Programming Adversarial Defense Game Theory

Scores: [ 6 6 6 ]

Neural networks and neural ODEs tend to be vulnerable to adversarial attacks, rendering robust optimizers critical to curb the success of such attacks. In this regard, the key insight of this work is to interpret Neural ODE optimization as a min-max optimal control problem. More particularly, we present Game Theoretic Second-Order Neural Optimizer (GTSONO), a robust game theoretic optimizer based on the principles of min-max Differential Dynamic Programming.The proposed method exhibits significant computational benefits due to efficient matrix decompositions and provides convergence guarantees to local saddle points.Empirically, the robustness of the proposed optimizer is demonstrated through greater robust accuracy compared to benchmark optimizers when trained on clean images. Additionally, its ability to provide a performance increase when adapted to an already existing adversarial defense technique is also illustrated.Finally, the superiority of the proposed update law over its gradient based counterpart highlights the potential benefits of incorporating robust optimal control paradigms into adversarial training methods.

POSTER-844: A Quadratic Synchronization Rule for Distributed Deep Learning

Primary Area: optimization

Keywords: distributed training Local SGD local gradient methods generalization implicit bias sharpness

Scores: [ 8 5 6 8 ]

In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for H steps without synchronizing with others, hence reducing communication frequency. While H has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper H value can lead to generalization improvement. Yet, selecting a proper H is elusive. This work proposes a theory-grounded method for determining H, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting H in proportion to 1η2 as the learning rate η decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared to the standard data parallel training, QSR enables Local AdamW to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves 1.16% or 0.84% higher top-1 validation accuracy.

POSTER-845: Revisiting the Last-Iterative Convergence of Stochastic Gradient Methods

Primary Area: optimization

Keywords: Convex Optimization Stochastic Optimization Last Iterates

Scores: [ 8 6 6 ]

In the past several years, the convergence of the last iterate of the Stochastic Gradient Descent (SGD) algorithm has triggered people's great interest due to its good performance in practice but lack of theoretical understanding. For Lipschtiz and convex functions, different works have established the optimal O(log(1/δ)logT/T) or O(log(1/δ)/T) high-probability convergence rates for the final iterate, where T is the time horizon and δ is the failure probability. However, to prove these bounds, all the existing works are limited to compact domains, and almost all of them also require almost surely bounded noises. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate but without these two restrictive assumptions. Besides this important question, there are still lots of theoretical problems lacking an answer. For example, compared with the last iterate convergence of SGD for non-smooth problems, only very few results for smooth optimization have yet been developed. Additionally, the existing results are all limited to a single objective and the standard Euclidean norm. It still remains unclear whether the last-iterative convergence can be provably extended to wider composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterative convergence of stochastic gradient methods and provide the first unified way to prove the convergence rates both in expectation and in high probability to accommodate general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness and (strong) convexity simultaneously.

POSTER-846: Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

Primary Area: optimization

Keywords: mlp batch-normalization optimization depth calculus theory deep-learning non-asymptotic

Scores: [ 6 6 6 6 ]

Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves the signal propagation, by avoiding the representations from becoming collinear across the layers. However, results on mean-field theory of batch normalization also conclude that this benefit comes at the expense of exploding gradients in depth. Motivated by these two aspects of batch normalization, in this study we pose the following question: Can a batch-normalized network keep the optimal signal propagation properties, but avoid exploding gradients? We answer this question in the affirmative by giving a particular construction of an MLP with linear activations and batch-normalization that provably has bounded gradients at any depth. Based on Weingarten calculus, we develop a rigorous and non-asymptotic theory for this constructed MLP that gives a precise characterization of forward signal propagation, while proving that gradients remain bounded for linearly independent input samples, which holds in most practical settings. Inspired by our theory, we also design an activation shaping scheme that empirically achieves the same properties for non-linear activations.

SPOTLIGHT-167: Constrained Bi-Level Optimization: Proximal Lagrangian Value function Approach and Hessian-free Algorithm

Primary Area: optimization

Keywords: Bi-level Optimization Constrained Optimization Hessian-free Single-loop Value Function Convergence Analysis

Scores: [ 5 8 8 6 5 ]

POSTER-847: Correlated Noise Provably Beats Independent Noise for Differentially Private Learning

Primary Area: optimization

Keywords: differentially private optimization stochastic gradient descent linear regression theory private deep learning

Scores: [ 1 8 8 ]

Differentially private learning algorithms inject noise into the learning process. While the most common private learning algorithm, DP-SGD, adds independent Gaussian noise in each iteration, recent work on matrix factorization mechanisms has shown empirically that introducing correlations in the noise can greatly improve their utility. We characterize the asymptotic learning utility for any choice of the correlation function, giving precise analytical bounds for linear regression and as the solution to a convex program for general convex functions. We show, using these bounds, how correlated noise provably improves upon vanilla DP-SGD as a function of problem parameters such as the effective dimension and condition number. Moreover, our analytical expression for the near-optimal correlation function circumvents the cubic complexity of the semi-definite program used to optimize the noise correlation matrix in previous work. We validate these theoretical results with experiments on private deep learning. Our work matches or outperforms prior work while being efficient both in terms of computation and memory.

POSTER-848: A Restoration Network as an Implicit Prior

Primary Area: optimization

Keywords: computational imaging inverse problems deep learning plug-and-play priors

Scores: [ 5 6 8 6 ]

Image denoisers have been shown to be powerful priors for solving inverse problems in imaging. In this work, we introduce a generalization of these methods that allows any image restoration network to be used as an implicit prior. The proposed method uses priors specified by deep neural networks pre-trained as general restoration operators. The method provides a principled approach for adapting state-of-the-art restoration models for other inverse problems. Our theoretical result analyzes its convergence to a stationary point of a global functional associated with the restoration operator. Numerical results show that the method using a super-resolution prior achieves state-of-the-art performance both quantitatively and qualitatively. Overall, this work offers a step forward for solving inverse problems by enabling the use of powerful pre-trained restoration models as priors.

POSTER-849: On the Effect of Batch Size in Byzantine-Robust Distributed Learning

Primary Area: optimization

Keywords: distributed learning Byzantine robustness batch size

Scores: [ 6 6 6 ]

Byzantine-robust distributed learning (BRDL), in which computing devices are likely to behave abnormally due to accidental failures or malicious attacks, has recently become a hot research topic. However, even in the independent and identically distributed (i.i.d.) case, existing BRDL methods will suffer a significant drop on model accuracy due to the large variance of stochastic gradients. Increasing batch sizes is a simple yet effective way to reduce the variance. However, when the total number of gradient computation is fixed, a too-large batch size will lead to a too-small iteration number (update number), which may also degrade the model accuracy. In view of this challenge, we mainly study the effect of batch size when the total number of gradient computation is fixed in this work. In particular, we show that when the total number of gradient computation is fixed, the optimal batch size corresponding to the tightest theoretical upper bound in BRDL increases with the fraction of Byzantine workers. Therefore, compared to the case without attacks, a larger batch size is preferred when under Byzantine attacks. Motivated by the theoretical finding, we propose a novel method called Byzantine-robust stochastic gradient descent with normalized momentum (ByzSGDnm) in order to further increase model accuracy in BRDL. We theoretically prove the convergence of ByzSGDnm for general non-convex cases under Byzantine attacks. Empirical results show that when under Byzantine attacks, compared to the cases of small batch sizes, setting a relatively large batch size can significantly increase the model accuracy, which is consistent with our theoretical results. Moreover, ByzSGDnm can achieve higher model accuracy than existing BRDL methods when under deliberately crafted attacks. In addition, we empirically show that increasing batch sizes has the bonus of training acceleration.

POSTER-850: Grokking as the transition from lazy to rich training dynamics

Primary Area: optimization

Keywords: Grokking Feature Learning Neural Tangent Kernel Generalization Training Dynamics

Scores: [ 8 8 3 5 ]

We propose that the grokking phenomenon, where the train loss of a neural network decreases much earlier than its test loss, can arise due to a neural network transitioning from lazy training dynamics to a rich, feature learning regime. To illustrate this mechanism, we study the simple setting of vanilla gradient descent on a polynomial regression problem with a two layer neural network which exhibits grokking without regularization in a way that cannot be explained by existing theories. We identify sufficient statistics for the test loss of such a network, and tracking these over training reveals that grokking arises in this setting when the network first attempts to fit a kernel regression solution with its initial features, followed by late-time feature learning where a generalizing solution is identified after train loss is already low. We find that the key determinants of grokking are the rate of feature learning---which can be controlled precisely by parameters that scale the network output---and the alignment of the initial features with the target function y(x). We argue this delayed generalization arises when (1) the top eigenvectors of the initial neural tangent kernel and the task labels y(x) are misaligned, but (2) the dataset size is large enough so that it is possible for the network to generalize eventually, but not so large that train loss perfectly tracks test loss at all epochs, and (3) the network begins training in the lazy regime so does not learn features immediately. We conclude with evidence that this transition from lazy (linear model) to rich training (feature learning) can control grokking in more general settings, like on MNIST, one-layer Transformers, and student-teacher networks.

POSTER-851: SYMBOL: Generating Flexible Black-Box Optimizers through Symbolic Equation Learning

Primary Area: optimization

Keywords: Black-Box Optimization Meta-Black-Box Optimization Deep Reinforcement Learning Symbolic Equation Learning

Scores: [ 8 6 6 6 ]

Recent Meta-learning for Black-Box Optimization (MetaBBO) methods harness neural networks to meta-learn configurations of traditional black-box optimizers. Despite their success, they are inevitably restricted by the limitations of predefined hand-crafted optimizers. In this paper, we present SYMBOL, a novel framework that promotes the automated discovery of black-box optimizers through symbolic equation learning. Specifically, we propose a Symbolic Equation Generator (SEG) that allows closed-form optimization rules to be dynamically generated for specific tasks and optimization steps. Within SYMBOL, we then develop three distinct strategies based on reinforcement learning, so as to meta-learn the SEG efficiently. Extensive experiments reveal that the optimizers generated by SYMBOL not only surpass the state-of-the-art BBO and MetaBBO baselines, but also exhibit exceptional zero-shot generalization abilities across entirely unseen tasks with different problem dimensions, population sizes, and optimization horizons. Furthermore, we conduct in-depth analyses of our SYMBOL framework and the optimization rules that it generates, underscoring its desirable flexibility and interpretability.

ORAL-43: Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming

Primary Area: optimization

Keywords: clustering Burer-Monteiro semidefinite programming

Scores: [ 8 3 8 8 ]

K-means clustering is a widely used machine learning method for identifying patterns in large datasets. Semidefinite programming (SDP) relaxations have recently been proposed for solving the K-means optimization problem that enjoy strong statistical optimality guarantees, but the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. By contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm that is widely used by machine learning practitioners, but without a solid statistical underpinning nor rigorous guarantees. In this paper, we describe an NMF-like algorithm that works by solving a \emph{nonnegative} low-rank restriction of the SDP relaxed K-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is just as simple and scalable as state-of-the-art NMF algorithms, while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves substantially smaller mis-clustering errors compared to the existing state-of-the-art.

POSTER-852: A Discretization Framework for Robust Contextual Stochastic Optimization

Primary Area: optimization

Keywords: Robust Optimization Stochastic Optimization End-to-End learning

Scores: [ 8 6 8 3 ]

We study contextual stochastic optimization problems. Optimization problems have uncertain parameters stemming from unknown, context-dependent, distributions. Due to the inherent uncertainty in these problems, one is often interested not only in minimizing expected cost, but also to be robust and protect against worst case scenarios. We propose a novel method that combines the learning stage with knowledge of the downstream optimization task. The method prescribes decisions which aim to maximize the likelihood that the cost is below a (user-controlled) threshold. The key idea is (1) to discretize the feasible region into subsets so that the uncertain objective function can be well approximated deterministically within each subset, and (2) devise a secondary optimization problem to prescribe decisions by integrating the individual approximations determined in step (1). We provide theoretical guarantees bounding the underlying regret of decisions proposed by our method. In addition, experimental results demonstrate that our approach is competitive in terms of average regret and yields more robust solutions than other methods proposed in the literature, including up to 20 times lower worst-case cost on a real-world electricity generation problem.

POSTER-853: Fast Equilibrium of SGD in Generic Situations

Primary Area: optimization

Keywords: optimization fast mixing fast equilibrium SDE SGD large deviation principle

Scores: [ 6 8 6 6 ]

Normalization layers are ubiquitous in deep learning, greatly accelerating optimization. However, they also introduce many unexpected phenomena during training, for example, the Fast Equilibrium conjecture proposed by (Li et al.,2020), which states that the scale-invariant normalized network, when trained by SGD with η learning rate and λ weight decay, mixes to an equilibrium in O~(1/ηλ) steps, as opposed to classical eO(η1) mixing time. Recent works by Wang & Wang (2022); Li et al. (2022c) proved this conjecture under different sets of assumptions. This paper aims to answer the fast equilibrium conjecture in full generality by removing the non-generic assumptions of Wang & Wang (2022); Li et al. (2022c) that the minima are isolated, that the region near minima forms a unique basin, and that the set of minima is an analytic set. Our main technical contribution is to show that with probability close to 1, in exponential time trajectories will not escape the attracting basin containing its initial position.

POSTER-854: DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training

Primary Area: optimization

Keywords: gradient-free learning zeroth-order optimization gradient sparsity

Scores: [ 5 3 6 8 8 6 ]

POSTER-855: Tractable MCMC for Private Learning with Pure and Gaussian Differential Privacy

Primary Area: optimization

Keywords: Pure Differential Privacy Monte Carlo sampling Gaussian Differential Privacy Exponential Mechanism

Scores: [ 8 6 6 ]

Posterior sampling, i.e., exponential mechanism to sample from the posterior distribution, provides ε-pure differential privacy (DP) guarantees and does not suffer from potentially unbounded privacy breach introduced by (ε,δ)-approximate DP. In practice, however, one needs to apply approximate sampling methods such as Markov chain Monte Carlo (MCMC), thus re-introducing the unappealing δ-approximation error into the privacy guarantees. To bridge this gap, we propose the Approximate SAample Perturbation (abbr. ASAP) algorithm which perturbs an MCMC sample with noise proportional to its Wasserstein-infinity (W) distance from a reference distribution that satisfies pure DP or pure Gaussian DP (i.e., δ=0). We then leverage a Metropolis-Hastings algorithm to generate the sample and prove that the algorithm converges in W$_\infty$ distance. We show that by combining our new techniques with a careful localization step, we obtain the first nearly linear-time algorithm that achieves the optimal rates in the DP-ERM problem with strongly convex and smooth losses.

POSTER-856: Near-Optimal Quantum Algorithm for Minimizing the Maximal Loss

Primary Area: optimization

Keywords: Quantum Algorithms Quantum Query Complexity Convex Optimization Minimizing Loss

Scores: [ 6 8 5 5 ]

The problem of minimizing the maximum of N convex, Lipschitz functions plays significant roles in optimization and machine learning. It has a series of results, with the most recent one requiring O(Nϵ2/3+ϵ8/3) queries to a first-order oracle to compute an ϵ-suboptimal point. On the other hand, quantum algorithms for optimization are rapidly advancing with speedups shown on many important optimization problems. In this paper, we conduct a systematic study of quantum algorithms and lower bounds for minimizing the maximum of N convex, Lipschitz functions. On one hand, we develop quantum algorithms with an improved complexity bound of O~(Nϵ5/3+ϵ8/3). On the other hand, we prove that quantum algorithms must take Ω~(Nϵ2/3) queries to a first-order quantum oracle, showing that our dependence on N is optimal up to poly-logarithmic factors.

POSTER-857: Advancing the Lower Bounds: an Accelerated, Stochastic, Second-order Method with Optimal Adaptation to Inexactness

Primary Area: optimization

Keywords: Second-order methods convex optimization stochastic optimization Cubic Newton Method high-order methods tensor methods lower bounds

Scores: [ 6 6 1 8 ]

We present a new accelerated stochastic second-order method that is robust to both gradient and Hessian inexactness, typical in machine learning. We establish theoretical lower bounds and prove that our algorithm achieves optimal convergence in both gradient and Hessian inexactness in this key setting. We further introduce a tensor generalization for stochastic higher-order derivatives. When the oracles are non-stochastic, the proposed tensor algorithm matches the global convergence of Nesterov Accelerated Tensor method. Both algorithms allow for approximate solutions of their auxiliary subproblems with verifiable conditions on the accuracy of the solution.

POSTER-858: Communication-Efficient Federated Non-Linear Bandit Optimization

Primary Area: optimization

Keywords: federated optimization communication cost non-linear bandit bandit optimization cumulative regret

Scores: [ 6 6 6 6 ]

Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function class with bounded RKHS norm, which severely hinders its practical usage. In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are distributed regression oracle and individual confidence set construction, which can be of independent interests. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm.

ORAL-44: An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment

Primary Area: optimization

Keywords: feature-metric image alignment

Scores: [ 8 6 8 ]

Direct image alignment is a widely used technique for relative 6DoF pose estimation between two images, but its accuracy strongly depends on pose initialization.Therefore, recent end-to-end frameworks focused on training objectives, such as the Gauss-Newton loss, which increase the convergence basin of the learned feature descriptors.However, the training data may be biased toward a specific type of motion and pose initialization,thus limiting the generalization of these methods.In this work, we derive a closed-form solution to the expected optimum of the Gauss-Newton loss. The solution is agnostic to the underlying feature representation and allows us to dynamically adjust the basin of convergence according to our assumptions about the uncertainty in the current estimates. This offers effective control over the convergence properties of the algorithm.Despite using self-supervised feature embeddings, our solution achieves compelling accuracy w.r.t. the state-of-the-art direct image alignment methods trained end-to-end with pose supervision, and exhibits improved robustness to pose initialization.Our analytical solution provides insight into the inherent limitations of end-to-end learning with the Gauss-Newton loss and establishes an intriguing connection between direct image alignment and feature-matching approaches.

POSTER-859: EControl: Fast Distributed Optimization with Compression and Error Control

Primary Area: optimization

Keywords: Optimization Federated Learning Contractive Compression

Scores: [ 6 6 6 8 ]

SPOTLIGHT-168: TRAM: Bridging Trust Regions and Sharpness Aware Minimization

Primary Area: optimization

Keywords: sharpness-aware minimization sam trust region optimization cross-lingual transfer language modeling

Scores: [ 5 6 8 8 ]

By reducing the curvature of the loss surface in the parameter space,Sharpness-aware minimization (SAM) yields widespread robustness improvementunder domain transfer. Instead of focusing on parameters, however, this workconsiders the transferability of representations as the optimizationtarget for out-of-domain generalization in a fine-tuning setup. To encourage theretention of transferable representations, we consider trust region-basedfine-tuning methods, which exploit task-specific skills without forgettingtask-agnostic representations from pre-training. We unify parameter- andrepresentation-space smoothing approaches by using trust region bounds to informSAM-style regularizers on both of these optimization surfaces. We proposeTrust Region Aware Minimization (TRAM), a fine-tuning algorithm thatoptimizes for flat minima and smooth, informative representations withoutforgetting pre-trained structure. We find that TRAM outperforms bothsharpness-aware and trust region-based optimization methods on cross-domainlanguage modeling and cross-lingual transfer, where robustness to domaintransfer and representation generality are critical for success. TRAMestablishes a new standard in training generalizable models with minimaladditional computation.

POSTER-860: Local Composite Saddle Point Optimization

Primary Area: optimization

Keywords: Saddle Point Optimization Distributed Optimization Composite Optimization Dual Extrapolation Mirror Prox Convex Optimization Bregman Divergence

Scores: [ 6 5 6 ]

Distributed optimization (DO) approaches for saddle point problems (SPP) have recently gained in popularity due to the critical role they play in machine learning (ML). Existing works mostly target smooth unconstrained objectives in Euclidean space, whereas ML problems often involve constraints or non-smooth regularization, which results in a need for composite optimization. Moreover, although non-smooth regularization often serves to induce structure (e.g., sparsity), standard aggregation schemes in distributed optimization break this structure. Addressing these issues, we propose Federated Dual Extrapolation (FeDualEx), an extra-step primal-dual algorithm with local updates, which is the first of its kind to encompass both saddle point optimization and composite objectives under the distributed paradigm. Using a generalized notion of Bregman divergence, we analyze its convergence and communication complexity in the homogeneous setting. Furthermore, the empirical evaluation demonstrates the effectiveness of FeDualEx for inducing structure in these challenging settings.

POSTER-861: Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization

Primary Area: optimization

Keywords: neural network optimization progressive sharpening edge of stability adaptive gradient methods batch normalization

Scores: [ 8 5 3 6 ]

POSTER-862: Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling

Primary Area: optimization

Keywords: Deep Reinforcement Learning Graph Neural Network Job Shop Scheduling Combinatorial Optimization

Scores: [ 8 8 8 6 ]

Recent studies in using deep reinforcement learning (DRL) to solve Job-shop scheduling problems (JSSP) focus on construction heuristics. However, their performance is still far from optimality, mainly because the underlying graph representation scheme is unsuitable for modelling partial solutions at each construction step. This paper proposes a novel DRL-guided improvement heuristic for solving JSSP, where graph representation is employed to encode complete solutions. We design a Graph-Neural-Network-based representation scheme, consisting of two modules to effectively capture the information of dynamic topology and different types of nodes in graphs encountered during the improvement process. To speed up solution evaluation during improvement, we present a novel message-passing mechanism that can evaluate multiple solutions simultaneously. We prove that the computational complexity of our method scales linearly with problem size. Experiments on classic benchmarks show that the improvement policy learned by our method outperforms state-of-the-art DRL-based methods by a large margin.

SPOTLIGHT-169: Stochastic Controlled Averaging for Federated Learning with Communication Compression

Primary Area: optimization

Keywords: federated learning communication compression data heterogeneity controlled averaging

Scores: [ 8 8 8 ]

POSTER-863: Scaling Convex Neural Networks with Burer-Monteiro Factorization

Primary Area: optimization

Keywords: burer-monteiro convex optimization neural networks stationary points global optima relu activation

Scores: [ 8 5 6 6 8 ]

It has been demonstrated that the training problem for a variety of (non) linear two-layer neural networks (such as two-layer perceptrons, convolutional networks, and self-attention) can be posed as equivalent convex optimization problems, with an induced regularizer which encourages low rank. However, this regularizer becomes prohibitively expensive to compute at moderate scales, impeding training convex neural networks. To this end, we propose applying the Burer-Monteiro factorization to convex neural networks, which for the first time enables a Burer-Monteiro perspective on neural networks with non-linearities. This factorization leads to an equivalent yet computationally tractable non-convex alternative with no spurious local minima. We develop a novel relative optimality bound of stationary points of the Burer-Monteiro factorization, providing verifiable conditions under which any stationary point is a global optimum. Further, for the first time, we show that linear self-attention with sufficiently many heads has no spurious local minima. Our experiments validate the novel relative optimality bound and the utility of the Burer-Monteiro factorization for scaling convex neural networks.

POSTER-864: On the Hardness of Online Nonconvex Optimization with Single Oracle Feedback

Primary Area: optimization

Keywords: online nonconvex optimization local regret optimal rate analysis

Scores: [ 6 6 8 6 ]

Online nonconvex optimization has been an active area of research recently. Previous studies either considered the global regret with full information about the objective functions, or studied the local regret with window-smoothed objective functions, which required access to unlimited number of gradient oracles per time step. In this paper, we focus on the more challenging and practical setting, where access to only a single oracle is allowed per time step, and take the local regret of the original (i.e., unsmoothed) objective functions as the performance metric. Specifically, for both settings respectively with a single exact and stochastic gradient oracle feedback, we derive lower bounds on the local regret and show that the classical online (stochastic) gradient descent algorithms are optimal. Moreover, for the more challenging setting with a single function value oracle feedback, we develop an online algorithm based on a one-point running difference gradient estimator, and show that such an algorithm achieves a local regret that a generic stochastic gradient oracle can best achieve.

POSTER-865: FedP3: Federated Personalized and Privacy-friendly Network Pruning under Model Heterogeneity

Primary Area: optimization

Keywords: Network Pruning Federated Learning Model Heterogeneity

Scores: [ 8 3 6 5 8 6 ]

The interest in federated learning has surged in recent research due to its unique ability to train a global model using privacy-secured information held locally on each client. This paper pays particular attention to the issue of client-side model heterogeneity, a pervasive challenge in the practical implementation of FL that escalates its complexity. Assuming a scenario where each client possesses varied memory storage, processing capabilities and network bandwidth - a phenomenon referred to as system heterogeneity - there is a pressing need to customize a unique model for each client. In response to this, we present an effective and adaptable federated framework FedP3, representing Federated Personalized and Privacy-friendly network Pruning, tailored for model heterogeneity scenarios. Our proposed methodology can incorporate and adapt well-established techniques to its specific instances.

POSTER-866: Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates

Primary Area: optimization

Keywords: Federated Learning Variational Inequality Minimax Optimization Communication Complexity

Scores: [ 6 5 6 8 ]

Distributed and federated learning algorithms and techniques associated primarily with minimization problems. However, with the increase of minimax optimization and variational inequality problems in machine learning, the necessity of designing efficient distributed/federated learning approaches for these problems is becoming more apparent. In this paper, we provide a unified convergence analysis of communication-efficient local training methods for distributed variational inequality problems (VIPs). Our approach is based on a general key assumption on the stochastic estimates that allows us to propose and analyze several novel local training algorithms under a single framework for solving a class of structured non-monotone VIPs. We present the first local gradient descent-accent algorithms with provable improved communication complexity for solving distributed variational inequalities on heterogeneous data. The general algorithmic framework recovers state-of-the-art algorithms and their sharp convergence guarantees when the setting is specialized to minimization or minimax optimization problems. Finally, we demonstrate the strong performance of the proposed algorithms compared to state-of-the-art methods when solving federated minimax optimization problems.

POSTER-867: On the Parameterization of Second-Order Optimization Effective towards the Infinite Width

Primary Area: optimization

Keywords: Deep learning Second-order optimization K-FAC Feature learning Infinite width Maximum update parameterization

Scores: [ 6 8 8 ]

Second-order optimization has been developed to accelerate the training of deep neural networks and it is being applied to increasingly larger-scale models. In this study, towards training on further larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even if the network width increases significantly. Inspired by a maximal update parametrization, we consider a one-step update of the gradient and reveal the appropriate scales of hyperparameters including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parametrization achieves higher generalization performance in feature learning.In particular, it enables us to transfer the hyperparameters across models with different widths.

POSTER-868: Adaptive Federated Learning with Auto-Tuned Clients

Primary Area: optimization

Keywords: federated learning adaptive optimization distributed optimization

Scores: [ 6 8 6 6 ]

Federated learning (FL) is a distributed machine learning framework where the global model of a central server is trained via multiple collaborative steps by participating clients without sharing their data. While being a flexible framework, where the distribution of local data, participation rate, and computing power of each client can greatly vary, such flexibility gives rise to many new challenges, especially in the hyperparameter tuning on the client side. We propose Δ-SGD, a simple step size rule for SGD that enables each client to use its own step size by adapting to the local smoothness of the function each client is optimizing. We provide theoretical and empirical results where the benefit of the client adaptivity is shown in various FL scenarios.

POSTER-869: Global Optimality for Non-linear Constrained Restoration Problems via Invexity

Primary Area: optimization

Keywords: Constrained optimization Invexity Quasi-invexity Global optima

Scores: [ 6 3 5 ]

Signal restoration is an important constrained optimization problem with significant applications in various domains. Although non-convex constrained optimization problems have been shown to perform better than convex counterparts in terms of reconstruction quality, convex constrained optimization problems have been preferably for its global optima guarantees. Despite the success of non-convex methods in a large number of applications, it is not an overstatement to say that there is little or no hope for non-convex problems to ensure global optima. In this paper, for the first time, we develop invex constrained optimization theory to mitigate the loss of guarantees for global optima in non-convex constrained inverse problems, where the invex function is a mapping where any critical point is a global minimizer. We also develop relevant theories to extend the global optima guarantee to a set of quasi-invex functions - the largest optimizable mappings. More specifically, we propose a family of invex/quasi-invex of functions for handling constrained inverse problems using the non-convex setting along with guarantees for their global optima. Our experimental evaluation shows that the proposed approach is very promising and can aid in extending existing convex optimization algorithms, such as the alternating direction method of multipliers, and accelerated proximal gradient methods.

POSTER-870: Ito Diffusion Approximation of Universal Ito Chains for Sampling, Optimization and Boosting

Primary Area: optimization

Keywords: markov chains diffusion processes

Scores: [ 3 6 8 ]

POSTER-871: New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

Primary Area: optimization

Keywords: zeroth-order gradient hard-thresholding variance reduction

Scores: [ 6 6 3 6 5 6 ]

Hard-thresholding is an important type of algorithm in machine learning that is used to solve 0 constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, which normally can be approximated by zeroth-order (ZO) methods. SZOHT algorithm is the only algorithm tackling 0 sparsity constraints with zeroth-order gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a portfolio optimization problem as well as black-box adversarial attacks.

POSTER-872: Near-Optimal Solutions of Constrained Learning Problems

Primary Area: optimization

Keywords: Constrained Learning Convex Optimization Duality Constrained Optimization Fairness

Scores: [ 6 8 6 3 6 ]

With the widespread adoption of machine learning systems, the need to curtail their behavior has become increasingly apparent. This is evidenced by recent advancements towards developing models that satisfy robustness, safety and fairness requirements. Imposing these requirements leads to constrained learning problems, which can be tackled with dual ascent methods. However, convergence guarantees for dual ascent algorithms typically involve a randomized or averaged sequence of primal iterates. These solutions are impractical, since they require storing an ever growing sequence of models. Although it has been observed that final iterates perform well in practice, theoretical guarantees for their optimality and feasibility have remained elusive. In this work, we characterize the infeasibility of Lagrangian minimizers associated with optimal dual variables, which leads to a sub-optimality bound for best primal iterates. To do this, we leverage the fact that constrained learning problems are parametrized versions of convex functional programs. This bound sheds light on how the richness of the parametrization and the curvature of the objective impact the convergence of primal iterates. We empirically validate this finding in learning problems with fairness constraints.

SPOTLIGHT-170: Lion Secretly Solves a Constrained Optimization: As Lyapunov Predicts

Primary Area: optimization

Keywords: Lion Optimization Lyapunov Analysis

Scores: [ 8 6 8 8 ]

POSTER-873: On the Scalability and Memory Efficiency of Semidefinite Programs for Lipschitz Constant Estimation of Neural Networks

Primary Area: optimization

Keywords: Semidefinite programming Lipschitz constant Deep learning

Scores: [ 5 6 6 8 ]

Lipschitz constant estimation plays an important role for understanding generalization, robustness, and fairness in deep learning. Unlike naive bounds based on the network weight norm product, semidefinite programs (SDPs) have shown great promise in providing less conservative Lipschitz bounds with polynomial-time complexity guarantees. However, due to the memory consumption and running speed, standard SDP algorithms cannot scale to modern neural network structures. In this paper, we transform the SDPs for Lipschitz constant estimation into an eigenvalue problem, which aligns with the modern large optimization paradigms based on first-order methods. This is amenable to autodiff frameworks such as PyTorch and TensorFlow, requiring significantly less memory than standard SDP algorithms. The transformation also allows us to leverage various existing numerical techniques for eigenvalue optimization, opening the way for further memory improvement and computational speedup. The essential technique of our eigenvalue-problem transformation is to introduce redundant quadratic constraints and then utilize both Lagrangian and Shor's SDP relaxations. Numerical examples demonstrate that our technique is more scalable than existing approaches. For networks that existing SDP solvers cannot handle, we improve the Lipschitz constant estimation by up to 58% compared to the weight matrix norm product bound.

POSTER-874: Relaxing the Additivity Constraints in Decentralized No-Regret High-Dimensional Bayesian Optimization

Primary Area: optimization

Keywords: Bayesian Optimization Online Learning Black-box Optimization

Scores: [ 5 8 5 5 ]

Bayesian Optimization (BO) is typically used to optimize an unknown function f that is noisy and costly to evaluate, by exploiting an acquisition function that must be maximized at each optimization step. Even if provably asymptotically optimal BO algorithms are efficient at optimizing low-dimensional functions, scaling them to high-dimensional spaces remains an open problem, often tackled by assuming an additive structure for f. By doing so, BO algorithms typically introduce additional restrictive assumptions on the additive structure that reduce their applicability domain. This paper contains two main contributions: (i) we relax the restrictive assumptions on the additive structure of f, at the expense of weakening the maximization guarantees of the acquisition function, and (ii) we address the over-exploration problem for decentralized BO algorithms. To these ends, we propose DumBO, an asymptotically optimal decentralized BO algorithm that achieves very competitive performance against state-of-the-art BO algorithms, especially when the additive structure of f comprises high-dimensional factors.

ORAL-45: Small-scale proxies for large-scale Transformer training instabilities

Primary Area: optimization

Keywords: Small Transformers Training Stability

Scores: [ 8 8 8 8 ]

Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the MuParam (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model characteristics such as activation and gradient norms.

POSTER-875: Adaptive Stochastic Gradient Algorithm for Black-box Multi-Objective Learning

Primary Area: optimization

Keywords: Multi-Objective Optimization Black-Box Optimization Black-Box Multi-Objective Optimization

Scores: [ 6 6 6 6 ]

Multi-objective optimization (MOO) has become an influential framework for various machine learning problems, including reinforcement learning and multi-task learning. In this paper, we study the black-box multi-objective optimization problem, where we aim to optimize multiple potentially conflicting objectives with function queries only. To address this challenging problem and find a Pareto optimal solution or the Pareto stationary solution, we propose a novel adaptive stochastic gradient algorithm for black-box MOO, called ASMG. Specifically, we use the stochastic gradient approximation method to obtain the gradient for the distribution parameters of the Gaussian smoothed MOO with function queries only. Subsequently, an adaptive weight is employed to aggregate all stochastic gradients to optimize all objective functions effectively. Theoretically, we explicitly provide the connection between the original MOO problem and the corresponding Gaussian smoothed MOO problem and prove the convergence rate for the proposed ASMG algorithm in both convex and non-convex scenarios.Empirically, the proposed ASMG method achieves competitive performance on multiple numerical benchmark problems. Additionally, the state-of-the-art performance on the black-box multi-task learning problem demonstrates the effectiveness of the proposed ASMG method.

SPOTLIGHT-171: Efficient Distributed Training with Full Communication-Computation Overlap

Primary Area: optimization

Keywords: Distributed Training Data Parallelism Local Updating Asynchronous Communication

Scores: [ 8 6 6 8 6 8 ]

The fundamental success of large language models hinges upon the efficacious implementation of large-scale distributed training techniques. Nevertheless, building a vast, high-performance cluster featuring high-speed communication interconnectivity is prohibitively costly, and accessible only to prominent entities. In this work, we aim to lower this barrier and democratize large-scale training with limited bandwidth clusters. We propose a new approach called CO2 that introduces local-updating and asynchronous communication to the distributed data-parallel training, thereby facilitating the full overlap of COmunication with COmputation. CO2 is able to attain a remarkable 100% scalability even on extensive multi-node clusters constrained by very limited communication bandwidth. We further propose the staleness gap penalty and outer momentum clipping techniques together with CO2 to bolster its convergence and training stability. Besides, CO2 exhibits seamless integration with well-established ZeRO-series optimizers which mitigate memory consumption of model states with large model training. We also provide a mathematical proof of convergence, accompanied by the establishment of a stringent upper bound. Furthermore, we validate our findings through an extensive set of practical experiments encompassing a wide range of tasks in the fields of computer vision and natural language processing. These experiments serve to demonstrate the capabilities of CO2 in terms of convergence, generalization, and scalability when deployed across configurations comprising up to 128 A100 GPUs. The outcomes emphasize the outstanding capacity of CO2 to achieve perfect 100% scalability, no matter on clusters with 800Gbps RDMA or 80Gbps TCP/IP inter-node connections.

SPOTLIGHT-172: A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging

Primary Area: optimization

Keywords: federated learning partial client participation adaptation aggregation weights

Scores: [ 6 5 6 10 ]

In federated learning (FL), clients usually have diverse participation statistics that are unknown a priori, which can significantly harm the performance of FL if not handled properly. Existing works aiming at addressing this problem are usually based on global variance reduction, which requires a substantial amount of additional memory in a multiplicative factor equal to the total number of clients. An important open problem is to find a lightweight method for FL in the presence of clients with unknown participation rates. In this paper, we address this problem by adapting the aggregation weights in federated averaging (FedAvg) based on the participation history of each client. We first show that, with heterogeneous participation statistics, FedAvg with non-optimal aggregation weights can diverge from the optimal solution of the original FL objective, indicating the need of finding optimal aggregation weights. However, it is difficult to compute the optimal weights when the participation statistics are unknown. To address this problem, we present a new algorithm called FedAU, which improves FedAvg by adaptively weighting the client updates based on online estimates of the optimal weights without knowing the statistics of client participation. We provide a theoretical convergence analysis of FedAU using a novel methodology to connect the estimation error and convergence. Our theoretical results reveal important and interesting insights, while showing that FedAU converges to an optimal solution of the original objective and has desirable properties such as linear speedup. Our experimental results also verify the advantage of FedAU over baseline methods with various participation patterns.

POSTER-876: Det-CGD: Compressed Gradient Descent with Matrix Stepsizes for Non-Convex Optimization

Primary Area: optimization

Keywords: Optimization First-order optimization Non-convex optimization Distributed optimization

Scores: [ 3 6 6 ]

This paper introduces a new method for minimizing matrix-smooth non-convex objectives through the use of novel Compressed Gradient Descent (CGD) algorithms enhanced with a matrix-valued stepsize. The proposed algorithms are theoretically analyzed first in the single-node and subsequently in the distributed settings. Our theoretical results reveal that the matrix stepsize in CGD can capture the objective’s structure and lead to faster convergence compared to a scalar stepsize. As a byproduct of our general results, we emphasize the importance of selecting the compression mechanism and the matrix stepsize in a layer-wise manner, taking advantage of model structure. Moreover, we provide theoretical guarantees for free compression, by designing specific layer-wise compressors for the non-convex matrix smooth objectives. Our findings are supported with empirical evidence.

SPOTLIGHT-173: Symmetric Mean-field Langevin Dynamics for Distributional Minimax Problems

Primary Area: optimization

Keywords: mean-field Langevin dynamics minimax optimization zero-sum games Markov games

Scores: [ 8 6 6 8 6 ]

In this paper, we extend mean-field Langevin dynamics to minimax optimization over probability distributions for the first time with symmetric and provably convergent updates. We propose \emph{mean-field Langevin averaged gradient} (MFL-AG), a single-loop algorithm that implements gradient descent ascent in the distribution space with a novel weighted averaging, and establish average-iterate convergence to the mixed Nash equilibrium. We also study both time and particle discretization regimes and prove a new uniform-in-time propagation of chaos result which accounts for the dependency of the particle interactions on all previous distributions. Furthermore, we propose \emph{mean-field Langevin anchored best response} (MFL-ABR), a symmetric double-loop algorithm based on best response dynamics with linear last-iterate convergence. Finally, we study applications to zero-sum Markov games and conduct simulations to demonstrate the long-term optimality of MFL-AG and MFL-ABR.

SPOTLIGHT-174: Leveraging augmented-Lagrangian techniques for differentiating over infeasible quadratic programs in machine learning

Primary Area: optimization

Keywords: Machine Learning Optimization Differentiable Optimization Optimization layers

Scores: [ 6 8 8 6 ]

POSTER-877: A Precise Characterization of SGD Stability Using Loss Surface Geometry

Primary Area: optimization

Keywords: SGD linear dynamics sharpness implicit regularization

Scores: [ 6 8 3 6 ]

Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding. Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates. Several studies have investigated the linear stability property of SGD in the vicinity of a stationary point as a predictive proxy for sharpness and generalization error in overparameterized neural networks (Wu et al., 2022; Jastrzebski et al., 2019; Cohen et al., 2021). In this paper, we delve deeper into the relationship between linear stability and sharpness. More specifically, we meticulously delineate the necessary and sufficient conditions for linear stability, contingent on hyperparameters of SGD and the sharpness at the optimum. Towards this end, we introduce a novel coherence measure of the loss Hessian that encapsulates pertinent geometric properties of the loss function that are relevant to the linear stability of SGD. It enables us to provide a simplified sufficient condition for identifying linear instability at an optimum. Notably, compared to previous works, our analysis relies on significantly milder assumptions and is applicable for a broader class of loss functions than known before, encompassing not only mean-squared error but also cross-entropy loss.

POSTER-878: Momentum Benefits Non-iid Federated Learning Simply and Provably

Primary Area: optimization

Keywords: federated learning momentum data heterogeneity non-convex optimization

Scores: [ 5 5 8 5 ]

Federated learning is a powerful paradigm for large-scale machine learning, but itfaces significant challenges due to unreliable network connections, slow commu-nication, and substantial data heterogeneity across clients. FedAvg and SCAFFOLD are two prominent algorithms to address these challenges. In particular,FedAvg employs multiple local updates before communicating with a centralserver, while SCAFFOLD maintains a control variable on each client to compen-sate for “client drift” in its local updates. Various methods have been proposedto enhance the convergence of these two algorithms, but they either make imprac-tical adjustments to algorithmic structure, or rely on the assumption of boundeddata heterogeneity. This paper explores the utilization of momentum to enhancethe performance of FedAvg and SCAFFOLD. When all clients participate in thetraining process, we demonstrate that incorporating momentum allows FedAvgto converge without relying on the assumption of bounded data heterogeneity evenusing a constant local learning rate. This is novel and fairly suprising as existinganalyses for FedAvg require bounded data heterogeneity even with diminishinglocal learning rates. In partial client participation, we show that momentum en-ables SCAFFOLD to converge provably faster without imposing any additionalassumptions. Furthermore, we use momentum to develop new variance-reducedextensions of FedAvg and SCAFFOLD, which exhibit state-of-the-art conver-gence rates. Our experimental results support all theoretical findings.

POSTER-879: Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles

Primary Area: optimization

Keywords: Learng from human feedback zeroth-order optimization Stable Diffusion ranking and preferences

Scores: [ 6 6 6 ]

In this study, we delve into an emerging optimization challenge involving a black-box objective function that can only be gauged via a ranking oracle—a situation frequently encountered in real-world scenarios, especially when the function is evaluated by human judges. A prominent instance of such a situation is Reinforcement Learning with Human Feedback (RLHF), an approach recently employed to enhance the performance of Large Language Models (LLMs) using human guidance [Ouyang et al. 2022, Liu et al. 2023, OpenAI et al. 2022, Bai et al. 2022}. We introduce ZO-RankSGD, an innovative zeroth-order optimization algorithm designed to tackle this optimization problem, accompanied by theoretical assurances. Our algorithm utilizes a novel rank-based random estimator to determine the descent direction and guarantees convergence to a stationary point. Moreover, ZO-RankSGD is readily applicable to policy optimization problems in Reinforcement Learning (RL), particularly when only ranking oracles for the episode reward are available. Last but not least, we demonstrate the effectiveness of ZO-RankSGD in a novel application: improving the quality of images generated by a diffusion generative model with human ranking feedback. Throughout experiments, we found that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback. Overall, our work advances the field of zeroth-order optimization by addressing the problem of optimizing functions with only ranking feedback, and offers a new and effective approach for aligning Artificial Intelligence (AI) with human intentions.

SPOTLIGHT-175: Distributionally Robust Optimization with Bias & Variance Reduced Gradients

Primary Area: optimization

Keywords: stochastic optimization convex optimization distributionally robust learning spectral risk measures incremental optimization

Scores: [ 8 8 8 8 ]

We consider the distributionally robust (DR) optimization problem with spectral risk-based uncertainty set and f-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized condition value-at-risk (CVaR) and average top-k loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3x faster than baselines such as SGD and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.

POSTER-880: Neur2RO: Neural Two-Stage Robust Optimization

Primary Area: optimization

Keywords: robust optimization deep learning discrete optimization

Scores: [ 8 6 6 ]

Robust optimization provides a mathematical framework for modeling and solving decision-making problems under worst-case uncertainty. This work addresses two-stage robust optimization (2RO) problems (also called adjustable robust optimization), wherein first-stage and second-stage decisions are made before and after uncertainty is realized, respectively. This results in a nested min-max-min optimization problem which is extremely challenging computationally, especially when the decisions are discrete. We propose Neur2RO, an efficient machine learning-driven instantiation of column-and-constraint generation (CCG), a classical iterative algorithm for 2RO. Specifically, we learn to estimate the value function of the second-stage problem via a novel neural network architecture that is easy to optimize over by design. Embedding our neural network into CCG yields high-quality solutions quickly as evidenced by experiments on two 2RO benchmarks, knapsack and capital budgeting. On small or easy instances, Neur2RO recovers solutions of nearly the same quality as state-of-the-art methods but is most advantageous on large-scale instances, where it finds better solutions on average.

POSTER-881: A Primal-Dual Approach to Solving Variational Inequalities with General Constraints

Primary Area: optimization

Keywords: variational inequalities optimization general constraints primal-dual interior-point method projection-free

Scores: [ 6 6 8 6 ]

Yang et al. (2023) recently showed how to use first-order gradient methods to solve general variational inequalities (VIs) under a limiting assumption that analytic solutions of specific subproblems are available. In this paper, we circumvent this assumption via a warm-starting technique where we solve subproblems approximately and initialize variables with the approximate solution found at the previous iteration. We prove the convergence of this method and show that the gap function of the last iterate of the method decreases at a rate of O(1K) when the operator is L-Lipschitz and monotone. In numerical experiments, we show that this technique can converge much faster than its exact counterpart. Furthermore, for the cases when the inequality constraints are simple, we introduce an alternative variant of ACVI and establish its convergence under the same conditions.Finally, we relax the smoothness assumptions in Yang et al., yielding, to our knowledge, the first convergence result for VIs with general constraints that does not rely on the assumption that the operator is L-Lipschitz.

POSTER-882: Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms

Primary Area: optimization

Keywords: Sketching Residual error Low-rank approximation sparse recovery

Scores: [ 5 8 8 6 ]

We study the problem of residual error estimation for matrix and vector norms using a linear sketch. Such estimates can be used, for example, to quickly assess how useful a more expensive low-rank approximation computation will be. The matrix case concerns the Frobenius norm and the task is to approximate the k-residual AAkF of the input matrix A within a (1+ϵ)-factor, where Ak is the optimal rank-k approximation. We provide a tight bound of Θ(k2/ϵ4) on the size of bilinear sketches, which have the form of a matrix product SAT. This improves the previous O(k2/ϵ6) upper bound in (Andoni et al. SODA 2013) and gives the first non-trivial lower bound, to the best of our knowledge. In our algorithm, our sketching matrices S and T can both be sparse matrices, allowing for a very fast update time. We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work. For the vector case, we consider the p-norm for p>2, where the task is to approximate the k-residual xxkp up to a constant factor, where xk is the optimal k-sparse approximation to x. Such vector norms are frequently studied in the data stream literature and are useful for finding frequent items or so-called heavy hitters. We establish an upper bound of O(k2/pn12/ppoly(logn)) for constant ϵ on the dimension of a linear sketch for this problem. Our algorithm can be extended to the p sparse recovery problem with the same sketching dimension, which seems to be the first such bound for p>2. We also show an Ω(k2/pn12/p) lower bound for the sparse recovery problem, which is tight up to a poly(logn) factor.

POSTER-883: Efficient Instance-Optimal Finite-Sum Minimization

Primary Area: optimization

Keywords: Finite Sum Minimization Variance Reduction Optimization

Scores: [ 6 8 6 5 ]

Given a sequence of functions f1,,fn with fi:DR, finite-sum minimization seeks a point xD minimizing j=1nfj(x)/n. In this work, we propose a key twist into the finite-sum minimization, dubbed as instance-optimal finite-sum minimization, that asks for a sequence of points x1,,xnD such that each xiD minimizes the prefix-sum j=1ifj(x)/i. Assuming that each prefix-sum is strongly convex, we develop a first-order stochastic instance optimal gradient method SIOPTGrad producing an ϵ-optimal sequence with O~(n/ϵ1/3+1/ϵ) overall first-order oracles (FO). An FO corresponds to the computation of a single gradient fj(x) at a given xD for some j[n]. Our approach significantly improves upon the O(n/ϵ) FOs that StochasticGradientDescent requires and the O(n2log(1/ϵ)) FOs that state-of-the-art variance reduction methods such as Katyusha require. We also prove that there is no natural first-order method with O(n/ϵα) gradient complexity for α<1/4, establishing that the first-order complexity of our method is nearly tight.

POSTER-884: Weaker MVI Condition: Extragradient Methods with Multi-Step Exploration

Primary Area: optimization

Keywords: extragradient minimax nonconvex-nonconcave variational inequilities saddle point problem

Scores: [ 8 6 5 5 ]

This paper proposes a new framework of algorithms that is extended from the celebrated extragradient algorithm. Min-max problem has attracted increasing attention because of application in machine learning missions such as generative adversarial network (GAN) training. While there has been exhaustive researches on convex-concave setting, problem on nonconvex-nonconcave setting faces many challenges, such as convergence to limit cycles. Since general min-max optimization is proved intractable, recent research focus has been put on structured problems. One of these follows the weak Minty variational inequality (weak MVI), which is motivated by relaxing Minty variational inequality without compromising convergence guarantee of extragradient algorithm. Existing extragradient-type algorithms involve one exploration step and one update step per iteration. We analyze the algorithms with multiple exploration steps and show that current assumption can be further relaxed when more exploration is introduced. Furthermore, we design an adaptive algorithm that explores until the optimal improvement is achieved. This process exploit information from the whole trajectory and effectively tackle cyclic behaviors.

POSTER-885: Quadratic models for understanding neural network dynamics

Primary Area: optimization

Keywords: quadratic models wide neural networks catapult phase optimization dynamics

Scores: [ 5 8 5 8 6 ]

While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. In this work we show that recently proposed Neural Quadratic Models can exhibit the "catapult phase" Lewkowycz et al. (2020) that arises when training such models with large learning rates. We then empirically show that the behaviour of quadratic models parallels that of neural networks in generalization, especially in the catapult phase regime. Our analysis further demonstrates that quadratic models are an effective tool for analysis of neural networks.

POSTER-886: Learning to Solve Bilevel Programs with Binary Tender

Primary Area: optimization

Keywords: Deep Learning Bilevel Program Binary Tender Enhanced Sampling Input Supermodular Neural Network

Scores: [ 3 6 6 ]

Bilevel programs (BPs) find a wide range of applications in fields such as energy, transportation, and machine learning. As compared to BPs with continuous (linear/convex) optimization problems in both levels, the BPs with discrete decision variables have received much less attention, largely due to the ensuing computational intractability and the incapability of gradient-based algorithms for handling discrete optimization formulations. In this paper, we develop deep learning techniques to address this challenge. Specifically, we consider a BP with binary tender, wherein the upper and lower levels are linked via binary variables. We train a neural network to approximate the optimal value of the lower-level problem, as a function of the binary tender. Then, we obtain a single-level reformulation of the BP through a mixed-integer representation of the value function. Furthermore, we conduct a comparative analysis between two types of neural networks: general neural networks and the novel input supermodular neural networks, studying their representational capacities. To solve high-dimensional BPs, we introduce an enhanced sampling method to generate higher-quality samples and implement an iterative process to refine solutions. We demonstrate the performance of these approaches through extensive numerical experiments, whose lower-level problems are linear and mixed-integer programs, respectively.

POSTER-887: Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold

Primary Area: optimization

Keywords: Decentralized optimization Stiefel manifold Conjugate gradient method

Scores: [ 8 6 5 ]

The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than the second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there is little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold. The numerical experiments reveal that the advantages of our DRCGD.

SPOTLIGHT-176: PILOT: An O(1/T)-Convergent Approach for Policy Evaluation with Nonlinear Function Approximation

Primary Area: optimization

Keywords: min-max optimization adaptive batch size policy evaluation.

Scores: [ 8 6 8 6 ]

Learning an accurate value function for a given policy is a critical step in solving reinforcement learning (RL) problems. So far, however, the convergence speed and sample complexity performances of most existing policy evaluation algorithms remain unsatisfactory, particularly with non-linear function approximation. This challenge motivates us to develop a new path-integrated primal-dual stochastic gradient (PILOT) method, that is able to achieve a fast convergence speed for RL policy evaluation with nonlinear function approximation. To further alleviate the periodic full gradient evaluation requirement, we further propose an enhanced method with an adaptive-batch adjustment called PILOT$+$. The main advantages of our methods include: i) PILOT allows the use of {\em{constant}} step sizes and achieves the O(1/K) convergence rate to first-order stationary points of non-convex policy evaluation problems; ii) PILOT is a generic {\em{single}}-timescale algorithm that is also applicable for solving a large class of non-convex strongly-concave minimax optimization problems; iii) By adaptively adjusting the batch size via historical stochastic gradient information, PILOT$+$ is more sample-efficient empirically without loss of theoretical convergence rate. Our extensive numerical experiments verify our theoretical findings and showcase the high efficiency of the proposed PILOT and PILOT$^+$ algorithms compared with the state-of-the-art methods.

POSTER-888: Doubly Robust Instance-Reweighted Adversarial Training

Primary Area: optimization

Keywords: adversarial training distributionally robust optimization bilevel optimization instance reweighting

Scores: [ 8 6 6 ]

POSTER-889: FOSI: Hybrid First and Second Order Optimization

Primary Area: optimization

Keywords: convex optimization nonconvex optimization first order optimization second order optimization deep learning

Scores: [ 6 8 5 6 ]

POSTER-890: Towards Imitation Learning to Branch for MIP: A Hybrid Reinforcement Learning based Sample Augmentation Approach

Primary Area: optimization

Keywords: hybrid RL Sample Augmentation Learning to branch Imitation learning

Scores: [ 6 6 5 ]

Branch-and-bound (B&B) has long been favored for tackling complex Mixed Integer Programming (MIP) problems, where the choice of branching strategy plays a pivotal role. Recently, Imitation Learning (IL)-based policies have emerged as potent alternatives to traditional rule-based approaches. However, it is nontrivial of acquiring high-quality training samples, and IL often converges to suboptimal variable choices for branching, restricting the overall performance. In response to these challenges, we propose a novel hybrid online and offline reinforcement learning (RL) approach to enhance the branching policy by cost-effective training sample augmentation. In online phase, we train an online RL agent to dynamically decide the sample generation processes, drawing from either the learning-based policy or the expert policy. The objective here is to strike an optimal balance between the exploration and exploitation of the sample generation process. In offline phase, a value function is trained to fit the cumulative reward for each decision and to filter the samples with high cumulative returns. This dual-purpose function not only reduces training complexity but also enhances the quality of the samples. To assess the efficacy of our proposed data augmentation mechanism, we conduct comprehensive evaluations across a range of MIP problems. The results consistently show that our method excels in making superior branching decisions compared to state-of-the-art learning-based models and the open-source solver SCIP. Notably, it even often outperforms the commercial solver Gurobi.

POSTER-891: Low Rank Matrix Completion via Robust Alternating Minimization in Nearly Linear Time

Primary Area: optimization

Keywords: matrix completion

Scores: [ 6 6 6 ]

Given a matrix MRm×n, the low rank matrix completion problem asks us to find a rank-k approximation of M as UV for URm×k and VRn×k by only observing a few entries specified by a set of entries Ω[m]×[n]. In particular, we examine an approach that is widely used in practice --- the alternating minimization framework. Jain, Netrapalli and Sanghavi showed that if M has incoherent rows and columns, then alternating minimization provably recovers the matrix M by observing a nearly linear in n number of entries. While the sample complexity has been subsequently improved, alternating minimization steps are required to be computed exactly. This hinders the development of more efficient algorithms and fails to depict the practical implementation of alternating minimization, where the updates are usually performed approximately in favor of efficiency.In this paper, we take a major step towards a more efficient and error-robust alternating minimization framework. To this end, we develop an analytical framework for alternating minimization that can tolerate moderate amount of errors caused by approximate updates. Moreover, our algorithm runs in time O~(|Ω|k), which is nearly linear in the time to verify the solution while preserving the sample complexity. This improves upon all prior known alternating minimization approaches which require O~(|Ω|k2) time.

ORAL-46: LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models

Primary Area: optimization

Keywords: quantization compression large language models NLP machine learning low rank

Scores: [ 8 8 6 ]

Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning (Dettmers et al., 2023). In this work we focus on the scenario where quantization and LoRA fine- tuning are applied together on a pre-trained model. In such cases it is common to observe a consistent gap in the performance on downstream tasks between full fine-tuning and quantization plus LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrep- ancy between the quantized and full-precision model and significantly improves the generalization in downstream tasks. We evaluate our method on natural lan- guage understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and out- performs existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed precision regimes. We will release our code.

SPOTLIGHT-177: Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis

Primary Area: optimization

Keywords: Bilevel Optimization Unbounded Smoothness Deep Learning

Scores: [ 6 6 6 8 8 ]

Bilevel optimization is an important formulation for many machine learning problems, such as meta-learning and hyperparameter optimization. Current bilevel optimization algorithms assume that the gradient of the upper-level function is Lipschitz (i.e., the upper-level function has a bounded smoothness parameter). However, recent studies reveal that certain neural networks such as recurrent neural networks (RNNs) and long-short-term memory networks (LSTMs) exhibit potential unbounded smoothness, rendering conventional bilevel optimization algorithms unsuitable for these neural networks. In this paper, we design a new bilevel optimization algorithm, namely BO-REP, to address this challenge. This algorithm updates the upper-level variable using normalized momentum and incorporates two novel techniques for updating the lower-level variable: \textit{initialization refinement} and \textit{periodic updates}. Specifically, once the upper-level variable is initialized, a subroutine is invoked to obtain a refined estimate of the corresponding optimal lower-level variable, and the lower-level variable is updated only after every specific period instead of each iteration. When the upper-level problem is nonconvex and unbounded smooth, and the lower-level problem is strongly convex, we prove that our algorithm requires O~(1/ϵ4) \footnote{Here O~() compresses logarithmic factors of 1/ϵ and 1/δ, where δ(0,1) denotes the failure probability.} iterations to find an ϵ-stationary point in the stochastic setting, where each iteration involves calling a stochastic gradient or Hessian-vector product oracle. Notably, this result matches the state-of-the-art complexity results under the bounded smoothness setting up to logarithmic factors. Our proof relies on novel technical lemmas for the periodically updated lower-level variable, which are of independent interest. Our experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification tasks demonstrate the effectiveness of our proposed algorithm.

POSTER-892: JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention

Primary Area: optimization

Keywords: multilayer transformer training dynamics theoretical analysis self-attention interpretability neural network understanding

Scores: [ 5 6 6 6 ]

We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of MLP layers only. JoMA removes unrealistic assumptions in previous analysis (e.g., lack of residual connection), and predicts that the attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) in the presence of nonlinear activations, while in the linear case, it is consistent with existing works. We leverage JoMA to qualitatively explains how tokens are combined to form hierarchies in multilayer Transformers, when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained from real-world dataset (Wikitext2/Wikitext103) and various pre- trained models (OPT, Pythia) verify our theoretical findings.

ORAL-47: Improving Convergence and Generalization Using Parameter Symmetries

Primary Area: optimization

Keywords: Symmetry optimization generalization

Scores: [ 6 8 8 8 ]

In overparametrized models, different values of the parameters may result in the same loss value. Parameter space symmetries are loss-invariant transformations that change the model parameters. Teleportation applies such transformations to accelerate optimization. However, the exact mechanism behind this algorithm's success is not well understood. In this paper, we show that teleportation not only speeds up optimization in the short-term, but gives overall faster time to convergence. Additionally, teleporting to minima with different curvatures improves generalization, which suggests a connection between the curvature of the minima and generalization ability. Finally, we show that integrating teleportation into a wide range of optimization algorithms and optimization-based meta-learning improves convergence. Our results showcase the versatility of teleportation and demonstrate the potential of incorporating symmetry in optimization.

POSTER-893: Accelerating Sinkhorn algorithm with sparse Newton iterations

Primary Area: optimization

Keywords: Optimal transport Convex optimization Quasi-Newton methods Non-asymptotic analysis Extremal combinatorics

Scores: [ 8 8 6 6 ]

Computing the optimal transport distance between statistical distributions is a fundamental task in machine learning. One remarkable recent advancement is entropic regularization and the Sinkhorn algorithm, which utilizes only matrix scaling and guarantees an approximated solution with near-linear runtime. Despite the success of the Sinkhorn algorithm, its runtime may still be slow due to the potentially large number of iterations needed for convergence. To achieve possibly super-exponential convergence, we introduce Sinkhorn-Newton-Sparse (SNS), an extension to the Sinkhorn algorithm, by introducing early stopping for the matrix scaling steps and a second stage featuring a Newton-type subroutine. Adopting the variational viewpoint that the Sinkhorn algorithm maximizes a concave Lyapunov potential, we offer the insight that the Hessian matrix of the potential function is approximately sparse. Sparsification of the Hessian results in a fast O(n2) per-iteration complexity, the same as the Sinkhorn algorithm. In terms of total iteration count, we observe that the SNS algorithm converges orders of magnitude faster across a wide range of practical cases, including optimal transportation between empirical distributions and calculating the Wasserstein W1,W2 distance of discretized continuous densities. The empirical performance is corroborated by a rigorous bound on the approximate sparsity of the Hessian matrix.

POSTER-894: Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

Primary Area: optimization

Keywords: Deep Learning Theory Hyperparameter Transfer Training Dynamics Residual Networks

Scores: [ 8 6 8 8 ]

The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of 1/depth in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and vision transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit.

POSTER-895: Private Zeroth-Order Nonsmooth Nonconvex Optimization

Primary Area: optimization

Keywords: optimization differential privacy non-convex non-smooth

Scores: [ 8 5 6 6 ]

We introduce a new zeroth-order algorithm for private stochastic optimization on nonconvex and nonsmooth objectives.Given a dataset of size M, our algorithm ensures (α,αρ2/2)-Renyi differential privacy and finds a (δ,ϵ)-stationary point so long as M=Ω~(dδϵ3+d3/2ρδϵ2).This matches the optimal complexity found in its non-private zeroth-order analog. Notably, although the objective is not smooth, we have privacy for free'' when ρdϵ.

POSTER-896: Efficiently Computing Similarities to Private Datasets

Primary Area: optimization

Keywords: differential privacy distance estimation sublinear algorithms high dimensional data analysis kernels

Scores: [ 8 8 6 8 ]

Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data. We abstract out this common subroutine and study the following fundamental algorithmic problem: Given a similarity function f and a large high-dimensional private dataset XRd, output a differentially private (DP) data-structure which approximates xXf(x,y) for any query y. We consider the cases where f is a kernel function, such as f(x,y)=exy22/σ2 (also known as DP kernel density estimation), or a distance function such as f(x,y)=xy2, among others. Our theoretical results improve upon prior work and give better privacy-utility trade-offs as well as faster query times for a wide range of kernels and distance functions. The unifying approach behind our results is leveraging `low-dimensional structures' present in the specific functions f that we study, using tools such as provable dimensionality reduction, approximation theory, and one-dimensional decomposition of the functions. Our algorithms empirically exhibit improved query times and accuracy over prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD based approaches for comparable accuracy.

SPOTLIGHT-178: On Penalty Methods for Nonconvex Bilevel Optimization and First-Order Stochastic Approximation

Primary Area: optimization

Keywords: Bilevel-Optimization Penalty Methods Landscape Analysis Non-Asymptotic Analysis First-Order Methods

Scores: [ 6 6 6 5 8 ]

In this work, we study first-order algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper- and lower-level objectives are combined in a weighted sum with penalty parameter σ>0. In particular, we establish a strong connection between the penalty function and the hyper-objective by explicitly characterizing the conditions under which the values and derivatives of the two must be O(σ)-close. A by-product of our analysis is the explicit formula for the gradient of hyper-objective when the lower-level problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as O(σ)-approximation of the original BO, we propose first-order algorithms that find an ϵ-stationary solution by optimizing the penalty formulation with σ=O(ϵ). When the perturbed lower-level problem uniformly satisfies the {\it small-error} proximal error-bound (EB) condition, we propose a first-order algorithm that converges to an ϵ-stationary point of the penalty function using in total O(ϵ7) accesses to first-order stochastic gradient oracles. Under an additional assumption on stochastic oracles, we show that the algorithm can be implemented in a fully {\it single-loop} manner, {\it i.e.,} with O(1) samples per iteration, and achieves the improved oracle-complexity of O(ϵ5).

POSTER-897: Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants

Primary Area: optimization

Keywords: error feedback greedy sparsification distributed optimization communication complexity machine cloning weighted error feedback quadratic mean arithmetic mean large stepsizes

Scores: [ 6 8 3 6 ]

POSTER-898: Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise

Primary Area: optimization

Keywords: optimization theory convergence analysis heavy-ball momentum learning rate schedule near-optimal convergence rate

Scores: [ 8 6 6 5 ]

Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. In contrast to its empirical popularity, the understanding of its theoretical property is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with step decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide O~(κ) accelerated convergence of the bias term of SGD while still achieving near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in the large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice.

POSTER-899: A Fast and Provable Algorithm for Sparse Phase Retrieval

Primary Area: optimization

Keywords: Sparse Phase Retrieval Nonconvex Optimization Quadratic Convergence

Scores: [ 6 8 6 8 ]

POSTER-900: An improved analysis of per-sample and per-update clipping in federated learning

Primary Area: optimization

Keywords: optimization clipping federated learning decentralized learning distributed optimization

Scores: [ 6 6 8 5 5 ]

ORAL-48: Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks

Primary Area: optimization

Keywords: Distributed Learning Self-Repellent Random Walk Token Algorithm Central Limit Theorem Asymptotic Analysis

Scores: [ 8 6 8 6 ]

We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random-walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token by one which follows a non-linear Markov chain - namely the Self-Repellent Radom Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar alpha, is less likely to transition to states that were highly visited in the past, thus the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves O(1/alpha) decrease in the asymptotic variance for sampling. We propose the use of a `generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate O(1/alpha2) - the performance benefit of using SRRW thereby amplified in the stochastic optimization context. Empirical results support our theoretical findings.

POSTER-901: A Multi-Level Framework for Accelerating Training Transformer Models

Primary Area: optimization

Keywords: Large Model Transformer Multi-Level Training Acceleration

Scores: [ 6 6 5 6 ]

The fast growing capabilities of large-scale deep learning models, such as Bert, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model that can be trained for fast convergence and the trained parameters provides high-qualities intermediate solutions for the next level larger network. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g. Bert, GPT) as well as a vision model (e.g. DeiT) prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.

POSTER-902: One Forward is Enough for Neural Network Training via Likelihood Ratio Method

Primary Area: optimization

Keywords: stochastic optimization gradient estimation

Scores: [ 6 6 8 8 ]

While backpropagation (BP) is the mainstream approach for gradient computation in neural network training, its heavy reliance on the chain rule of differentiation constrains the designing flexibility of network architecture and training pipelines. We avoid the recursive computation in BP and develop a unified likelihood ratio (ULR) method for gradient estimation with just one forward propagation. Not only can ULR be extended to train a wide variety of neural network architectures, but the computation flow in BP can also be rearranged by ULR for better device adaptation. Moreover, we propose several variance reduction techniques to further accelerate the training process. Our experiments offer numerical results across diverse aspects, including various neural network training scenarios, computation flow rearrangement, and fine-tuning of pre-trained models. All findings demonstrate that ULR effectively enhances the flexibility of neural network training by permitting localized module training without compromising the global objective and significantly boosts the network robustness.

POSTER-903: Time Fairness in Online Knapsack Problems

Primary Area: optimization

Keywords: fairness online knapsack learning-augmented algorithm Pareto-optimality robustness consistency

Scores: [ 6 6 8 ]

The online knapsack problem is a classic problem in the field of online algorithms. Its canonical version asks how to pack items of different values and weights arriving online into a capacity-limited knapsack so as to maximize the total value of the admitted items. Although optimal competitive algorithms are known for this problem, they may be fundamentally unfair, i.e., individual items may be treated inequitably in different ways. We formalize a practically-relevant notion of time fairness which effectively models a trade off between static and dynamic pricing in a motivating application such as cloud resource allocation, and show that existing algorithms perform poorly under this metric. We propose a parameterized deterministic algorithm where the parameter precisely captures the Pareto-optimal trade-off between fairness (static pricing) and competitiveness (dynamic pricing). We show that randomization is theoretically powerful enough to be simultaneously competitive and fair; however, it does not work well in experiments. To further improve the trade-off between fairness and competitiveness, we develop a nearly-optimal learning-augmented algorithm which is fair, consistent, and robust (competitive), showing substantial performance improvements in numerical experiments.

POSTER-904: Combining Axes Preconditioners through Kronecker Approximation for Deep Learning

Primary Area: optimization

Keywords: Optimization

Scores: [ 8 6 8 ]

Adaptive regularization based optimization methods such as full-matrix Adagrad which use gradient second-moment information hold significant potential for fast convergence in deep neural network (DNN) training, but are memory intensive and computationally demanding for large neural nets. We develop a technique called Combining AxeS PReconditioners (CASPR), which optimizes matrix-shaped DNN parameters by finding different preconditioners for each mode/axis of the parameter and combining them using a Kronecker-sum based approximation. We show tighter convergence guarantees in stochastic optimization compared to a Kronecker product based preconditioner, Shampoo, which arises as a special case of CASPR. Furthermore, our experiments demonstrates that CASPR approximates the gradient second-moment matrix in full-matrix Adagrad more accurately, and shows significant improvement in training and generalization performance compared to existing practical adaptive regularization based methods such as Shampoo and Adam in a variety of tasks including graph neural network on OGBG-molpcba, Transformer on a universal dependencies dataset and auto-regressive large language modeling on C4 dataset.

POSTER-905: Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization

Primary Area: optimization

Keywords: Robust Aggregation Distributed Training Failure Augmented Byzantine Resilience

Scores: [ 6 6 6 ]

Modern ML applications increasingly rely on complex deep learning models and large datasets. There has been an exponential growth in the amount of computation needed to train the largest models. Therefore, to scale computation and data, these models are inevitably trained in a distributed manner in clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We define the quality of workers as reconstruction ratios (0,1], and formulate aggregation as a Maximum Likelihood Estimation procedure using Beta densities. We show that the Regularized form of log-likelihood wrt subspace can be approximately solved using iterative least squares solver, and provide convergence guarantees using recent Convex Optimization landscape results. Our empirical findings demonstrate that our approach significantly enhances the robustness of state-of-the-art Byzantine resilient aggregators. We evaluate our method in a distributed setup with a parameter server, and show simultaneous improvements in communication efficiency and accuracy across various tasks.

POSTER-906: Two-timescale Extragradient for Finding Local Minimax Points

Primary Area: optimization

Keywords: Minimax optimization Nonconvex-nonconcave optimization Extragradient method Dynamical systems

Scores: [ 6 8 8 6 ]

Minimax problems are notoriously challenging to optimize. However, we present that two-timescale extragradient can be a viable solution. By utilizing dynamical systems theory, we show that it converges to points that satisfy the second-order necessary condition of local minimax points, under mild conditions. This work provably improves upon all previous results as we eliminate a crucial assumption that the Hessian, with respect to the maximization variable, is nondegenerate.

POSTER-907: Plug-and-Play Posterior Sampling under Mismatched Measurement and Prior Models

Primary Area: optimization

Keywords: Inverse problems plug-and-play priors posterior sampling unadjusted Langevin algorithm

Scores: [ 6 6 5 6 ]

Posterior sampling has been shown to be a powerful Bayesian approach for solving imaging inverse problems. The recent plug-and-play unadjusted Langevin algorithm (PnP-ULA) has emerged as a promising method for Monte Carlo sampling and minimum mean squared error (MMSE) estimation by combining physical measurement models with deep-learning priors specified using image denoisers. However, the intricate relationship between the sampling distribution of PnP-ULA and the mismatched data-fidelity and denoiser has not been theoretically analyzed. We address this gap by proposing a posterior-L2 pseudometric and using it to quantify an explicit error bound for PnP-ULA under mismatched posterior distribution. We numerically validate our theory on several inverse problems such as sampling from Gaussian mixture models and image deblurring. Our results suggest that the sensitivity of the sampling distribution of PnP-ULA to a mismatch in the measurement model and the denoiser can be precisely characterized.

POSTER-908: Contextual Bandits with Online Neural Regression

Primary Area: optimization

Keywords: Contextual Bandits Regret Bounds Deep Learning Neural Bandits

Scores: [ 8 6 5 5 6 ]

Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption \citep{foster2020beyond,foster2021efficient}. In this work, we investigate the use of neural networks for such online regression and associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a O(T) regret for online regression with square loss, which via the reduction implies a O(KT3/4) regret for NeuCBs. Departing from this standard approach, we first show a O(logT) regret for online regression with almost convex losses that satisfy QG (Quadratic Growth) condition, a generalization of the PL (Polyak-\L ojasiewicz) condition, and that have a unique minima. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with unique minima. Based on such a perturbed prediction, we show a O(logT) regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to O~(KT) and O~(KL+K) regret for NeuCB, where L is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are Ω(T) or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms.

POSTER-909: Tackling the Data Heterogeneity in Asynchronous Federated Learning with Cached Update Calibration

Primary Area: optimization

Keywords: Federated learning Nonconvex optimization

Scores: [ 6 8 8 ]

Asynchronous federated learning, which enables local clients to send their model update asynchronously to the server without waiting for others, has recently emerged for its improved efficiency and scalability over traditional synchronized federated learning. In this paper, we study how the asynchronous delay affects the convergence of asynchronous federated learning under non-i.i.d. distributed data across clients. Through the theoretical convergence analysis of one representative asynchronous federated learning algorithm under standard nonconvex stochastic settings, we show that the asynchronous delay can largely slow down the convergence, especially with high data heterogeneity. To further improve the convergence of asynchronous federated learning under heterogeneous data distributions, we propose a novel asynchronous federated learning method with a cached update calibration. Specifically, we let the server cache the latest update for each client and reuse these variables for calibrating the global update at each round. We theoretically prove the convergence acceleration for our proposed method under nonconvex stochastic settings. Extensive experiments on several vision and language tasks demonstrate our superior performances compared to other asynchronous federated learning baselines.

POSTER-910: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Primary Area: optimization

Keywords: large language models pretraining optimization in deep learning

Scores: [ 8 8 6 8 ]

Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time.

POSTER-911: Why is SAM Robust to Label Noise?

Primary Area: optimization

Keywords: generalization sharpness robustness SAM

Scores: [ 6 6 6 6 ]

Sharpness-Aware Minimization (SAM) is most known for achieving state-of the-art performances on natural image and language tasks. However, its most pronounced improvements (of tens of percent) is rather in the presence of label noise. Understanding SAM's label noise robustness requires a departure from characterizing the robustness of minimas lying in flatter'' regions of the loss landscape. In particular, the peak performance occurs with early stopping, far before the loss converges. We decompose SAM's robustness into two effects: one induced by changes to the logit term and the other induced by changes to the network Jacobian. The first can be observed in linear logistic regression where SAM provably upweights the gradient contribution from clean examples. Although this explicit upweighting is also observable in neural networks, when we intervene and modify SAM to remove this effect, surprisingly, we see no visible degradation in performance. We infer that SAM's effect in deeper networks is instead explained entirely by the effect SAM has on the network Jacobian. We theoretically derive the explicit regularization induced by this Jacobian effect in two layer linear networks. Motivated by our analysis, we see that cheaper alternatives to SAM that explicitly induce these regularization effects largely recover the benefits even in deep networks trained on real-world datasets.

POSTER-912: The Marginal Value of Momentum for Small Learning Rate SGD

Primary Area: optimization

Keywords: momentum SGD dynamics

Scores: [ 5 6 6 5 ]

Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.

POSTER-913: Light-MILPopt: Solving Large-scale Mixed Integer Linear Programs with Small-scale Optimizer and Small Training Dataset

Primary Area: optimization

Keywords: Large-scale MILP Learning for Optimization Lightweight Optimization Framework

Scores: [ 3 5 6 6 ]

Machine Learning (ML)-based optimization approaches emerge as a promising technique for solving large-scale Mixed Integer Linear Programs (MILPs). However, existing ML-based frameworks suffer from high model computation complexity, weak problem reduction, and reliance on large-scale optimizers and large training datasets, resulting in performance bottlenecks for large-scale MILPs. This paper proposes Light-MILPopt, a lightweight large-scale optimization framework that only uses a small-scale optimizer and small training dataset to solve large-scale MILPs. Specifically, Light-MILPopt can be divided into four stages: Problem Formulation for problem division to reduce model computational costs, Model-based Initial Solution Prediction for predicting and constructing the initial solution using a small-scale training dataset, Problem Reduction for both variable and constraint reduction, and Data-driven Optimization for current solution improvement employing a small-scale optimizer. Experimental evaluations on four large-scale benchmark MILPs and a real-world case study demonstrate that Light-MILPopt, leveraging a small-scale optimizer and small training dataset, outperforms the state-of-the-art ML-based optimization framework and advanced large-scale solvers (e.g. Gurobi, SCIP). The results and further analyses substantiate the ML-based framework's feasibility and effectiveness in solving large-scale MILPs.

POSTER-914: Simple Minimax Optimal Byzantine Robust Algorithm for Nonconvex Objectives with Uniform Gradient Heterogeneity

Primary Area: optimization

Keywords: Byzantine Robustness Nonconvex Optimization Federated Learning

Scores: [ 6 6 3 ]

In this study, we consider nonconvex federated learning problems with the existence of Byzantine workers. We propose a new simple Byzantine robust algorithm called Momentum Screening. The algorithm is adaptive to the Byzantine fraction, i.e., all its hyperparameters do not depend on the number of Byzantine workers. We show that our method achieves the best optimization error of O(δ2ζmax2) for nonconvex smooth local objectives satisfying ζmax-uniform gradient heterogeneity condition under δ-Byzantine fraction, which can be better than the best known error rate of O(δζmean2) for local objectives satisfying ζmean-mean heterogeneity condition when δ(ζmax/ζmean)2. Furthermore, we derive an algorithm independent lower bound for local objectives satisfying ζmax-uniform gradient heterogeneity condition and show the minimax optimality of our proposed method on this class. In numerical experiments, we validate the superiority of our method over the existing robust aggregation algorithms and verify our theoretical results.

POSTER-915: Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality

Primary Area: optimization

Keywords: dataset distillation dataset condensation

Scores: [ 8 5 6 ]

Dataset distillation aims to reduce the time and memory requirement of training deep networks on large datasets by synthesizing a small number of synthetic images that can provide a similar generalization performance to that of the full data. Despite the recent efforts, existing dataset distillation methods suffer from a significant performance gap compared to training on the original data. In this work, we argue that distilling the entire data into one synthetic subset cannot achieve a superior generalization performance. This is because the training dynamics of deep networks drastically change during the training. Hence, multiple synthetic subsets are required to capture the training dynamics at different phases of training. To improve the distillation performance, we propose progressive dataset distillation (PDD), which synthesizes multiple small sets of synthetic images conditioned on the previous ones and trains the model on the union of the subsets generated so far. Our extensive experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%. In addition, our method for the first time enable generating considerably larger synthetic datasets.

POSTER-916: Feature Learning in Infinite Depth Neural Networks

Primary Area: optimization

Keywords: Tensor Programs mup deep learning optimization optimal hyperparameter transfer

Scores: [ 8 8 5 6 8 ]

Empirical studies have consistently demonstrated that increasing the size of neural networks often yields superior performance in practical applications. However, there is a lack of consensus regarding the appropriate scaling strategy, particularly when it comes to increasing the depth of neural networks. In practice, excessively large depths can lead to model performance degradation. In this paper, we introduce Depth-$\mu$P, a principled approach for depth scaling, allowing for the training of arbitrarily deep architectures while maximizing feature learning and diversity among nearby layers. Our method involves dividing the contribution of each residual block and the parameter update by the square root of the depth. Through the use of Tensor Programs, we rigorously establish the existence of a limit for infinitely deep neural networks under the proposed scaling scheme. This scaling strategy ensures more stable training for deep neural networks and guarantees the transferability of hyperparameters from shallow to deep models. To substantiate the efficacy of our scaling method, we conduct empirical validation on neural networks with depths up to 210.

POSTER-917: Linear attention is (maybe) all you need (to understand Transformer optimization)

Primary Area: optimization

Keywords: Transformer optimization adam clipping heavy-tailed noise directional smoothness

Scores: [ 6 8 6 6 ]

Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.

POSTER-918: Variance Reduced Halpern Iteration for Finite-Sum Monotone Inclusions

Primary Area: optimization

Keywords: finite-sum problems monotone inclusion operator norm residual stochastic Halpern iteration last iterate convergence variance reduction min-max optimization

Scores: [ 6 6 6 ]

Machine learning approaches relying on such criteria as adversarial robustness or multi-agent settings have raised the need for solving game-theoretic equilibrium problems. Of particular relevance to these applications are methods targeting finite-sum structure, which generically arises in empirical variants of learning problems in these contexts. Further, methods with computable approximation errors are highly desirable, as they provide verifiable exit criteria. Motivated by these applications, we study finite-sum monotone inclusion problems, which model broad classes of equilibrium problems. Our main contributions are variants of the classical Halpern iteration that employ variance reduction to obtain improved complexity guarantees in which n component operators in the finite sum are on average'' either cocoercive or Lipschitz continuous and monotone, with parameter L. The resulting oracle complexity of our methods, which provide guarantees for the last iterate and for a (computable) operator norm residual, is O~(n+nLε1), which improves upon existing methods by a factor up to n. This constitutes the first variance reduction-type result for general finite-sum monotone inclusions and for more specific problems such as convex-concave optimization when operator norm residual is the optimality measure. We further argue that, up to poly-logarithmic factors, this complexity is unimprovable in the monotone Lipschitz setting; i.e., the provided result is near-optimal.

SPOTLIGHT-179: Dropout Enhanced Bilevel Training

Primary Area: optimization

Keywords: Bilevel Optimization Overfitting

Scores: [ 6 8 8 6 ]

Bilevel optimization problems appear in many widely used machine learning tasks. Bilevel optimization models are sensitive to small changes, and bilevel training tasks typically involve limited datasets. Therefore, overfitting is a common challenge in bilevel training tasks. This paper considers the use of dropout to address this problem. We propose a bilevel optimization model that depends on the distribution of dropout masks. We investigate how the dropout rate affects the hypergradient of this model. We propose a dropout bilevel method to solve the dropout bilevel optimization model. Subsequently, we analyze the resulting dropout bilevel method from an optimization perspective. Analyzing the optimization properties of methods with dropout is essential because it provides convergence guarantees for methods using dropout. However, there has been limited investigation in this research direction. We provide the complexity of the resulting dropout bilevel method in terms of reaching an ϵ stationary point of the proposed stochastic bilevel model. Empirically, we demonstrate that overfitting occurs in data cleaning problems, and the method proposed in this work mitigates this issue.

POSTER-919: Fast and unified path gradient estimators for normalizing flows

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Normalizing Flows Gradient Estimators Lattice Field Theory Variational Infernce

Scores: [ 8 6 8 8 ]

Recent work shows that path gradient estimators for normalizing flows have lower variance compared to standard estimators, resulting in improved training. However, they are often prohibitively more expensive from a computational point of view and cannot be applied to maximum likelihood training in a scalable manner, which severely hinders their widespread adoption. In this work, we overcome these crucial limitations. Specifically, we propose a fast path gradient estimator which works for all normalizing flow architectures of practical relevance for sampling from an unnormalized target distribution. We then show that this estimator can also be applied to maximum likelihood training and empirically establish its superior performance for several natural sciences applications.

SPOTLIGHT-180: Pre-Training and Fine-Tuning Generative Flow Networks

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Generative Flow Network (GFlowNets) Pre-train Goal-conditioned

Scores: [ 8 6 8 6 ]

Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects from a given unnormalized reward distribution.They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, as they are typically trained from a given extrinsic reward function, it remains an important open challenge about how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks.Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. By framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcomes, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model can allow for a direct extraction of a policy capable of sampling from any new reward functions in downstream tasks.Nonetheless, adapting OC-GFN on a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor enabling efficient fine-tuning.Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN, and its ability to swiftly adapt to downstream tasks and discover modes more efficiently.This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.

POSTER-920: Chain of Log-Concave Markov Chains

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Langevin diffusion log-concave distribution kernel smoothing

Scores: [ 8 6 8 6 ]

We introduce a theoretical framework for sampling from unnormalized densities based on a smoothing scheme that uses an isotropic Gaussian kernel with a single fixed noise scale. We prove one can decompose sampling from a density (minimal assumptions made on the density) into a sequence of sampling from log-concave conditional densities via accumulation of noisy measurements with equal noise levels. Our construction is unique in that it keeps track of a history of samples, making it non-Markovian as a whole, but it is lightweight algorithmically as the history only shows up in the form of a running empirical mean of samples. Our sampling algorithm generalizes walk-jump sampling (Saremi & Hyvärinen, 2019). The "walk" phase becomes a (non-Markovian) chain of (log-concave) Markov chains. The "jump" from the accumulated measurements is obtained by empirical Bayes. We study our sampling algorithm quantitatively using the 2-Wasserstein metric and compare it with various Langevin MCMC algorithms. We also report a remarkable capacity of our algorithm to "tunnel" between modes of a distribution.

POSTER-921: TACTiS-2: Better, Faster, Simpler Attentional Copulas for Multivariate Time Series

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: time series forecasting probabilistic multivariate copula transformer density estimation

Scores: [ 8 5 6 5 ]

We introduce a new model for multivariate probabilistic time series prediction, designed to flexibly address a range of tasks including forecasting, interpolation, and their combinations. Building on copula theory, we propose a simplified objective for the recently-introduced transformer-based attentional copulas (TACTiS), wherein the number of distributional parameters now scales linearly with the number of variables instead of factorially. The new objective requires the introduction of a training curriculum, which goes hand-in-hand with necessary changes to the original architecture. We show that the resulting model has significantly better training dynamics and achieves state-of-the-art performance across diverse real-world forecasting tasks, while maintaining the flexibility of prior work, such as seamless handling of unaligned and unevenly-sampled time series.

SPOTLIGHT-181: Subtractive Mixture Models via Squaring: Representation and Learning

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: tractable inference distribution estimation probabilistic circuits tensor networks

Scores: [ 8 8 6 8 6 ]

Mixture models are traditionally represented and learned by adding several distributions as components. Allowing mixtures to subtract probability mass or density can drastically reduce the number of components needed to model complex distributions. However, learning such subtractive mixtures while ensuring they still encode a non-negative function is challenging. We investigate how to learn and perform inference on deep subtractive mixtures by squaring them. We do this in the framework of probabilistic circuits, which enable us to represent tensorized mixtures and generalize several other subtractive models. We theoretically prove that the class of squared circuits allowing subtractions can be exponentially more expressive than traditional additive mixtures; and, we empirically show this increased expressiveness on a series of real-world distribution estimation tasks.

POSTER-922: Enhancing Transfer Learning with Flexible Nonparametric Posterior Sampling

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Transfer Learning Nonparametric Learning Bayesian Neural Network

Scores: [ 6 6 6 6 ]

Transfer learning has recently shown significant performance across various tasks involving deep neural networks. In these transfer learning scenarios, the prior distribution for downstream data becomes crucial in Bayesian model averaging (BMA). While previous works proposed the prior over the neural network parameters centered around the pre-trained solution, such strategies have limitations when dealing with distribution shifts between upstream and downstream data. This paper introduces nonparametric transfer learning (NPTL), a flexible posterior sampling method to address the distribution shift issue within the context of nonparametric learning. The nonparametric learning (NPL) method is a recent approach that employs a nonparametric prior for posterior sampling, efficiently accounting for model misspecification scenarios, which is suitable for transfer learning scenarios that may involve the distribution shift between upstream and downstream tasks. Through extensive empirical validations, we demonstrate that our approach surpasses other baselines in BMA performance.

POSTER-923: Faster Sampling from Log-Concave Densities over Polytopes via Efficient Linear Solvers

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Logconcave sampling Dikin walk Markov chain Monte Carlo interior point methods

Scores: [ 3 8 8 8 ]

We consider the problem of sampling from a logconcave distribution π(θ)ef(θ) constrained to a polytope K:={θRd:Aθb}, where ARm×d and bRm. The fastest-known algorithm for the setting when f is O(1)-Lipschitz or O(1)-smooth runs in roughly O(md×mdω1) arithmetic operations, where the mdω1 term arises because each Markov chain step requires computing a matrix inversion and determinant (ω2.37 is the matrix multiplication constant). We present a nearly-optimal implementation of this Markov chain with per-step complexity that is roughly the number of non-zero entries of A while the number of Markov chain steps remains the same. The key technical ingredients are 1) to show that the matrices that arise in this Dikin walk change slowly, 2) to deploy efficient linear solvers which can leverage this slow change to speed up matrix inversion by using information computed in previous steps, and 3) to speed up the computation of the determinantal term in the Metropolis filter step via a randomized Taylor series-based estimator. This result directly improves the runtime for applications that involve sampling from Gibbs distributions constrained to polytopes that arise in Bayesian statistics and private optimization.

POSTER-924: Expected flow networks in stochastic environments and two-player zero-sum games

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: generative flow networks GFlowNets protein design game theory self-play adversarial learning stochastic environments quantal response equilibrium Luce agents sequential decision making

Scores: [ 6 3 8 5 ]

Generative flow networks (GFlowNets) are sequential sampling models trained to match a given distribution. GFlowNets have been successfully applied to various structured object generation tasks, sampling a diverse set of high-reward objects quickly. We propose expected flow networks (EFlowNets), which extend GFlowNets to stochastic environments. We show that EFlowNets outperform other GFlowNet formulations in stochastic tasks such as protein design. We then extend the concept of EFlowNets to adversarial environments, proposing adversarial flow networks (AFlowNets) for two-player zero-sum games. We show that AFlowNets learn to find above 80% of optimal moves in Connect-4 via self-play and outperform AlphaZero in tournaments.

POSTER-925: Fiber Monte Carlo

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Stochastic Methods

Scores: [ 6 8 6 ]

Integrals with discontinuous integrands are ubiquitous, arising from discrete structure in applications like topology optimization, graphics, and computational geometry. These integrals are often part of a forward model in an inverse problem where it is necessary to reason backwards about the parameters, ideally using gradient-based optimization. Monte Carlo methods are widely used to estimate the value of integrals, but this results in a non-differentiable approximation that is amenable to neither conventional automatic differentiation nor reparameterization-based gradient methods. This significantly disrupts efforts to integrate machine learning methods in areas that exhibit these discontinuities: physical simulation and robotics, design, graphics, and computational geometry. Although bespoke domain-specific techniques can handle special cases, a general methodology to wield automatic differentiation in these discrete contexts is wanting. We introduce a differentiable variant of the simple Monte Carlo estimator which samples line segments rather than points from the domain. We justify our estimator analytically as conditional Monte Carlo and demonstrate the diverse functionality of the method as applied to image stylization, topology optimization, and computational geometry.

POSTER-926: Conformal Prediction via Regression-as-Classification

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Conformal Prediction Probability Function Regression Neural Networks

Scores: [ 3 8 8 6 ]

Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in reality, such approaches can be sensitive to estimation error and yield unstable intervals. Here, we circumvent the challenges by converting regression to a classification problem and then use CP for classification to obtain CP sets for regression. To preserve the ordering of the continuous-output space, we design a new loss function and present necessary modifications to the CP classification techniques. Empirical results on many benchmarks show that this simple approach gives surprisingly good results on many practical problems.

SPOTLIGHT-182: Variational Inference for SDEs Driven by Fractional Noise

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: variational inference neural sdes stochastic differential equations brownian motion fractional noise fractional brownian motion markov approximation markov representation

Scores: [ 8 8 5 8 ]

We present a novel variational framework for performing inference in (neural) stochastic differential equations (SDEs) driven by Markov-approximate fractional Brownian motion (fBM). SDEs offer a versatile tool for modeling real-world continuous-time dynamic systems with inherent noise and randomness. Combining SDEs with the powerful inference capabilities of variational methods, enables the learning of representative function distributions through stochastic gradient descent. However, conventional SDEs typically assume the underlying noise to follow a Brownian motion (BM), which hinders their ability to capture long-term dependencies. In contrast, fractional Brownian motion (fBM) extends BM to encompass non-Markovian dynamics, but existing methods for inferring fBM parameters are either computationally demanding or statistically inefficient. In this paper, building upon the Markov approximation of fBM, we derive the evidence lower bound essential for efficient variational inference of posterior path measures, drawing from the well-established field of stochastic analysis. Additionally, we provide a closed-form expression to determine optimal approximation coefficients. Furthermore, we propose the use of neural networks to learn the drift, diffusion and control terms within our variational posterior, leading to the variational training of neural-SDEs. In this framework, we also optimize the Hurst index, governing the nature of our fractional noise. Beyond validation on synthetic data, we contribute a novel architecture for variational latent video prediction,—an approach that, to the best of our knowledge, enables the first variational neural-SDE application to video perception.

POSTER-927: Conformal Language Modeling

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: conformal prediction uncertainty estimation language models generative models confidence prediction sets sampling

Scores: [ 8 8 3 6 ]

In this paper, we propose a novel approach to conformal prediction for language models (LMs) in which we produce prediction sets with performance guarantees. LM responses are typically sampled from a predicted distribution over the large, combinatorial output space of language. Translating this to conformal prediction, we calibrate a stopping rule for sampling LM outputs that get added to a growing set of candidates until we are confident that the set covers at least one acceptable response. Since some samples may be low-quality, we also simultaneously calibrate a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we can prove that the final output set obeys certain desirable distribution-free guarantees. Within these sets of candidate responses, we also show that we can also identify subsets of individual components---such as phrases or sentences---that are each independently correct (e.g., that are not hallucinations''), again with guarantees. Our method can be applied to any LM API that supports sampling. Furthermore, we empirically demonstrate that we can achieve many desired coverage levels within a limited number of total samples when applying our method to multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants.

POSTER-928: AmortizedPeriod: Attention-based Amortized Inference for Periodicity Identification

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Periodicity Identification Probabilistic Models Amortized Inference Attention Time Series Analysis

Scores: [ 6 6 6 5 ]

POSTER-929: Copula Conformal prediction for multi-step time series prediction

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Conformal Prediction time series uncertainty quantification calibration RNN

Scores: [ 5 8 6 6 ]

POSTER-930: A Unified Framework for Bayesian Optimization under Contextual Uncertainty

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian optimization Gaussian processes

Scores: [ 8 5 5 6 ]

Bayesian optimization under contextual uncertainty (BOCU) is a family of BO problems in which the learner makes a decision prior to observing the context and must manage the risks involved. Distributionally robust BO (DRBO) is a subset of BOCU that affords robustness against context distribution shift, and includes the optimization of expected values and worst-case values as special cases. By considering the first derivatives of the DRBO objective, we generalize DRBO to one that includes several other uncertainty objectives studied in the BOCU literature such as worst-case sensitivity (and thus notions of risk such as variance, range, and conditional value-at-risk) and mean-risk tradeoffs. We develop a general Thompson sampling algorithm that is able to optimize any objective within the BOCU framework, analyze its theoretical properties, and compare it to suitable baselines across different experimental settings and uncertainty objectives.

POSTER-931: Entropy-MCMC: Sampling from Flat Basins with Ease

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: MCMC Bayesian Deep Learning Flatness-aware Learning

Scores: [ 5 6 6 8 6 ]

Bayesian deep learning counts on the quality of posterior distribution estimation. However, the posterior of deep neural networks is highly multi-modal in nature, with local modes exhibiting varying generalization performances. Given a practical budget, sampling from the original posterior can lead to suboptimal performances, as some samples may become trapped in "bad" modes and suffer from overfitting. Leveraging the observation that "good" modes with low generalization error often reside in flat basins of the energy landscape, we propose to bias the sampling on the posterior toward these flat regions. Specifically, we introduce an auxiliary guiding variable, the stationary distribution of which resembles a smoothed posterior free from sharp modes, to lead the MCMC sampler to flat basins. We prove the convergence of our method and further show that it converges faster than several existing flatness-aware methods in the strongly convex setting. Empirical results demonstrate that our method can successfully sample from flat basins of the posterior, and outperforms all compared baselines on multiple benchmarks including classification, calibration and out-of-distribution detection.

POSTER-932: Leveraging Uncertainty Estimates To Improve Classifier Performance

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Uncertainty estimation binary classification imbalanced classification score recalibration uncertainty based decision making classification decision boundary bin packing estimation bias posterior networks

Scores: [ 3 6 6 ]

Binary classification typically involves predicting the label of an instance based on whether the model score for the positive class exceeds a threshold chosen based on the application requirements (e.g., maximizing recall for a precision bound). However, model scores are often not aligned with true positivity rate. This is especially true when the training involves a differential sampling of classes or there is distributional drift between train and test settings. In this paper, we provide theoretical analysis and empirical evidence of the dependence of estimation bias on both uncertainty and model score. Further, we formulate the decision boundary selection using both model score and uncertainty, prove that it is NP-hard, and present algorithms based on dynamic programming and isotonic regression. Evaluation of the proposed algorithms on three real-world datasets yield 25%-40% improvement in recall at high precision bounds over the traditional approach of using model score alone, highlighting the benefits of leveraging uncertainty.

SPOTLIGHT-183: Input-gradient space particle inference for neural network ensembles

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: deep ensembles diversity input gradient robustness covariate shift particle variational inference

Scores: [ 6 8 6 8 ]

Deep Ensembles (DEs) demonstrate improved accuracy, calibration and robustness to perturbations over single neural networks partly due to their functional diversity. Particle-based variational inference (ParVI) methods enhance diversity by formalizing a repulsion term based on a network similarity kernel. However, weight-space repulsion is inefficient due to over-parameterization, while direct function-space repulsion has been found to produce little improvement over DEs. To sidestep these difficulties, we propose First-order Repulsive Deep Ensemble (FoRDE), an ensemble learning method based on ParVI, which performs repulsion in the space of first-order input gradients. As input gradients uniquely characterize a function up to translation and are much smaller in dimension than the weights, this method guarantees that ensemble members are functionally different. Intuitively, diversifying the input gradients encourages each network to learn different features, which is expected to improve the robustness of an ensemble. Experiments on image classification datasets and transfer learning tasks show that FoRDE significantly outperforms the gold-standard DEs and other ensemble methods in accuracy and calibration under covariate shift due to input perturbations.

ORAL-49: Monte Carlo guided Denoising Diffusion models for Bayesian linear inverse problems.

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Monte Carlo Denoising Diffusion model score-based generative models Sequential Monte Carlo Bayesian Inverse Problems Generative Models.

Scores: [ 6 10 8 10 ]

Ill-posed linear inverse problems arise frequently in various applications, from computational photography to medical imaging.A recent line of research exploits Bayesian inference with informative priors to handle the ill-posedness of such problems.Amongst such priors, score-based generative models (SGM) have recently been successfully applied to several different inverse problems.In this study, we exploit the particular structure of the prior defined by the SGM to define a sequence of intermediate linear inverse problems. As the noise level decreases, the posteriors of these inverse problems get closer to the target posterior of the original inverse problem. To sample from this sequence of posteriors, we propose the use of Sequential Monte Carlo (SMC) methods.The proposed algorithm, \algo, is shown to be theoretically grounded and we provide numerical simulations showing that it outperforms competing baselines when dealing with ill-posed inverse problems in a Bayesian setting.

SPOTLIGHT-184: A Hierarchical Bayesian Model for Few-Shot Meta Learning

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian models Meta learning Few-shot learning

Scores: [ 6 6 8 6 6 8 ]

SPOTLIGHT-185: On Diffusion Modeling for Anomaly Detection

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Diffusion based models Anomaly detection Probabilistic Inference

Scores: [ 6 8 6 8 ]

Known for their impressive performance in generative modeling, diffusion models are attractive candidates for density-based anomaly detection. This paper investigates different variations of diffusion modeling for unsupervised and semi-supervised anomaly detection. In particular, we find that Denoising Diffusion Probability Models (DDPM) are performant on anomaly detection benchmarks yet computationally expensive. By simplifying DDPM in application to anomaly detection, we are naturally led to an alternative approach called Diffusion Time Estimation (DTE). DTE estimates the distribution over diffusion time for a given input and uses the mode or mean of this distribution as the anomaly score. We derive an analytical form for this density and leverage a deep neural network to improve inference efficiency. Through empirical evaluations on the ADBench benchmark, we demonstrate that all diffusion-based anomaly detection methods perform competitively for both semi-supervised and unsupervised settings. Notably, DTE achieves orders of magnitude faster inference time than DDPM, while outperforming it on this benchmark. These results establish diffusion-based anomaly detection as a scalable alternative to traditional methods and recent deep-learning techniques for standard unsupervised and semi-supervised anomaly detection settings.

POSTER-933: Reverse Diffusion Monte Carlo

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Posterior Sampling Multi-modal sampling

Scores: [ 8 6 8 6 ]

The efficacy of modern generative models is commonly contingent upon the precision of score estimation along the diffusion path, with a focus on diffusion models and their ability to generate high-quality data samples. This study delves into the application of reverse diffusion to Monte Carlo sampling. It is shown that score estimation can be transformed into a mean estimation problem via the decomposition of the transition kernel. By estimating the mean of the posterior distribution, we derive a novel Monte Carlo sampling algorithm from the reverse diffusion process, which is distinct from traditional Markov Chain Monte Carlo (MCMC) methods. We calculate the error requirements and sample size for the posterior distribution, and use the result to derive an algorithm that can approximate the target distribution to any desired accuracy. Additionally, by estimating the log-Sobolev constant of the posterior distribution, we show under suitable conditions the problem of sampling from the posterior can be easier than direct sampling from the target distribution using traditional MCMC techniques. For Gaussian mixture models, we demonstrate that the new algorithm achieves significant improvement over the traditional Langevin-style MCMC sampling methods both theoretically and practically. Our algorithm offers a new perspective and solution beyond classical MCMC algorithms for challenging complex distributions.

POSTER-934: Non-Exchangeable Conformal Risk Control

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: conformal prediction conformal risk control non-exchangeable data uncertainty

Scores: [ 6 5 5 8 ]

Split conformal prediction has recently sparked great interest due to its ability to provide formally guaranteed uncertainty sets or intervals for predictions made by black-box neural models, ensuring a predefined probability of containing the actual ground truth. While the original formulation assumes data exchangeability, some extensions handle non-exchangeable data, which is often the case in many real-world scenarios. In parallel, some progress has been made in conformal methods that provide statistical guarantees for a broader range of objectives, such as bounding the best F1-score or minimizing the false negative rate in expectation. In this paper, we leverage and extend these two lines of work by proposing non-exchangeable conformal risk control, which allows controlling the expected value of any monotone loss function when the data is not exchangeable. Our framework is flexible, makes very few assumptions, and allows weighting the data based on its statistical similarity with the test examples; a careful choice of weights may result on tighter bounds, making our framework useful in the presence of change points, time series, or other forms of distribution drift. Experiments with both synthetic and real world data show the usefulness of our method.

POSTER-935: Kernelised Normalising Flows

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Machine Learning Maximum Likelihood Density Estimation Statistics Kernels

Scores: [ 8 6 5 8 ]

Normalising Flows are non-parametric statistical models known for their dual capabilities of density estimation and generation. They are distinguished by their inherently invertible architecture. However, the requirement of invertibility imposes constraints on their expressiveness, necessitating a large number of parameters and innovative architectural designs to achieve satisfactory outcomes. Whilst flow-based models predominantly rely on neural-network-based transformations for expressive designs, alternative transformation methods have received limited attention. In this work, we present Ferumal flow, a novel kernelised normalising flow paradigm that integrates kernels into the framework. Our results demonstrate that a kernelised flow can yield competitive or superior results compared to neural network-based flows whilst maintaining parameter efficiency. Kernelised flows excel especially in the low-data regime, enabling flexible non-parametric density estimation in applications with sparse data availability.

POSTER-936: Langevin Monte Carlo for strongly log-concave distributions: Randomized midpoint revisited

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Markov Chain Monte Carlo Kinetic Langevin Langevin algorithm Midpoint method Mixing rate

Scores: [ 6 5 8 6 ]

We revisit the problem of sampling from a target distribution that has a smooth strongly log-concave density everywhere in Rp. In this context, if no additional density information is available, the randomized midpoint discretization for the kinetic Langevin diffusion is known to be the most scalable method in high dimensions with large condition numbers. Our main result is a nonasymptotic and easy to compute upper bound on the W2-error of this method. To provide a more thorough explanation of our method for establishing the computable upper bound, we conduct an analysis of the midpoint discretization for the vanilla Langevin process. This analysis helps to clarify the underlying principles and provides valuable insights that we use to establish an improved upper bound for the kinetic Langevin process with the midpoint discretization. Furthermore, by applying these techniques we establish new guarantees for the kinetic Langevin process with Euler discretization, which have a better dependence on the condition number than existing upper bounds

POSTER-937: Improved sampling via learned diffusions

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Schrödinger bridge sampling from densities stochastic optimal control diffusion-based generative modeling

Scores: [ 6 6 6 8 ]

Recently, a series of papers proposed deep learning-based approaches to sample from unnormalized target densities using controlled diffusion processes. In this work, we identify these approaches as special cases of the Schrödinger bridge problem, seeking the most likely stochastic evolution between a given prior distribution and the specified target, and propose the perspective from measures on path space as a unifying framework. The optimal controls of such entropy-constrained optimal transport problems can then be described by systems of partial differential equations and corresponding backward stochastic differential equations. Building on these optimality conditions and exploiting the path measure perspective, we obtain variational formulations of the respective approaches and recover the objectives which can be approached via gradient descent. Our formulations allow to introduce losses different from the typically employed reverse Kullback-Leibler divergence that is known to suffer from mode collapse. In particular, we propose the so-called log-variance loss, which exhibits favorable numerical properties and leads to significantly improved performance across all considered approaches.

POSTER-938: Efficient Backpropagation with Variance Controlled Adaptive Sampling

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: efficient training algorithms stochastic gradient descent importance sampling variance reduction

Scores: [ 6 6 6 8 ]

Sampling-based algorithms, which eliminate "unimportant" computations during forward and/or backpropagation (BP), offer potential solutions to accelerate neural network training. However, since sampling introduces approximations to training, such algorithms may not consistently maintain accuracy across various tasks. In this work, we introduce a variance-controlled adaptive sampling (VCAS) method designed to minimize the computational load of BP. VCAS computes an unbiased stochastic gradient with fine-grained layerwise importance sampling in data dimension for activation gradient calculation and leverage score sampling in token dimension for weight gradient calculation. To preserve accuracy, we control the additional variance introduced by learning the sample ratio jointly with model parameters during training. We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy with an up to 73.87% FLOPs reduction of BP and 49.58% FLOPs reduction of the whole training process.

POSTER-939: Stochastic Gradient Descent for Gaussian Processes Done Right

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Gaussian process stochastic gradient descent

Scores: [ 8 6 8 5 5 ]

We study the optimisation problem associated with Gaussian process regression using squared loss. The most common approach to this problem is to apply an exact solver, such as conjugate gradient descent, either directly on the problem or on a reduced-order version of it. However, stochastic gradient descent has recently gained traction in the Gaussian process literature, driven largely by its successes in deep learning. In this paper, we show that this approach when done right---by which we mean using specific insights from the optimisation and kernel communities---is highly effective.We thus introduce a particular stochastic dual gradient descent algorithm, conveniently implementable with a few lines of code using any deep learning framework. We explain our design decisions by illustrating their advantage against alternatives with ablation studies.We then show that the new method is highly competitive: our evaluations on standard regression benchmarks and a Bayesian optimisation task set our approach apart from conjugate gradients, variational Gaussian process approximations, and a prior version of stochastic gradient descent tailored for Gaussian processes. On a molecular binding affinity prediction task, our method places Gaussian process regression on par in terms of performance with graph neural networks.

POSTER-940: Closing the Curious Case of Neural Text Degeneration

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: softmax bottleneck truncation sampling nucleus top-k theory linear programming linear algebra decoding generation autoregressive language model NLP open-ended generation

Scores: [ 8 8 6 8 ]

Despite their ubiquity in language generation, it remains unknown why truncation sampling heuristics like nucleus sampling are so effective. We provide a theoretical explanation for the effectiveness of the truncation sampling by proving that truncation methods that discard tokens below some probability threshold (the most common type of truncation) can guarantee that all sampled tokens have nonzero true probability. However, thresholds are a coarse heuristic, and necessarily discard some tokens with nonzero true probability as well. In pursuit of a more precise sampling strategy, we show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability, without relying on a threshold. Based on our findings, we develop an experimental truncation strategy and the present pilot studies demonstrating the promise of this type of algorithm. Our evaluations show that our method outperforms its threshold-based counterparts under automatic and human evaluation metrics for low-entropy (i.e., close to greedy) open-ended text generation. Our theoretical findings and pilot experiments provide both insight into why truncation sampling works, and make progress toward more expressive sampling algorithms that better surface the generative capabilities of large language models.

POSTER-941: Uncertainty Quantification via Stable Distribution Propagation

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: propagating distributions uncertainty stable distributions input uncertainty

Scores: [ 8 6 6 8 6 ]

We propose a new approach for propagating stable probability distributions through neural networks. Our method is based on local linearization, which we show to be an optimal approximation in terms of total variation distance for the ReLU non-linearity. This allows propagating Gaussian and Cauchy input uncertainties through neural networks to quantify their output uncertainties. To demonstrate the utility of propagating distributions, we apply the proposed method to predicting calibrated confidence intervals and selective prediction on out-of-distribution data. The results demonstrate a broad applicability of propagating distributions and show the advantages of our method over other approaches such as moment matching.

POSTER-942: Implicit Maximum a Posteriori Filtering via Adaptive Optimization

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian filtering optimization neural networks

Scores: [ 8 6 6 ]

Bayesian filtering approximates the true underlying behavior of a time-varying system by inverting an explicit generative model to convert noisy measurements into state estimates. This process typically requires matrix storage, inversion, and multiplication or Monte Carlo estimation, none of which are practical in high-dimensional state spaces such as the weight spaces of artificial neural networks. Here, we consider the standard Bayesian filtering problem as optimization over a time-varying objective. Instead of maintaining matrices for the filtering equations or simulating particles, we specify an optimizer that defines the Bayesian filter implicitly. In the linear-Gaussian setting, we show that every Kalman filter has an equivalent formulation using K steps of gradient descent. In the nonlinear setting, our experiments demonstrate that our framework results in filters that are effective, robust, and scalable to high-dimensional systems, comparing well against the standard toolbox of Bayesian filtering solutions. We suggest that it is easier to fine-tune an optimizer than it is to specify the correct filtering equations, making our framework an attractive option for high-dimensional filtering problems.

POSTER-943: Entropy Coding of Unordered Data Structures

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: graph compression entropy coding neural compression bits-back coding lossless compression generative models information theory probabilistic models graph neural networks multiset compression asymmetric numeral systems compression entropy shuffle coding

Scores: [ 8 6 6 5 ]

We present shuffle coding, a general method for optimal compression of sequences of unordered objects using bits-back coding. Data structures that can be compressed using shuffle coding include multisets, graphs, hypergraphs, and others. We release an implementation that can easily be adapted to different data types and statistical models, and demonstrate that our implementation achieves state-of-the-art compression rates on a range of graph datasets including molecular data.

POSTER-944: On the Posterior Distribution in Denoising: Application to Uncertainty Quantification

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Gaussian Denoising Posterior Moments Estimation Uncertainty Quantification Uncertainty Visualization

Scores: [ 6 3 6 8 ]

POSTER-945: Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian model Tensor decomposition Gaussian Processes

Scores: [ 5 6 8 5 ]

Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there were finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, many real-world data are not naturally posed in the setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose FunBaT: Functional Bayesian Tucker Decomposition. We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions, and then convert the GPs into state-space priors by constructing equivalent stochastic differential equations (SDE) to reduce computational cost. An efficient inference algorithm is further developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications.

POSTER-946: Posterior Sampling Based on Gradient Flows of the MMD with Negative Distance Kernel

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian inverse Problems MMD Gradient Flows Deep Learning

Scores: [ 6 5 8 5 ]

We propose conditional flows of the maximum mean discrepancy (MMD) with the negative distance kernel for posterior sampling and conditional generative modelling. This MMD, which is also known as energy distance, has several advantageous properties like efficient computation via slicing and sorting. We approximate the joint distribution of the ground truth and the observations using discrete Wasserstein gradient flows and establish an error bound for the posterior distributions. Further, we prove that our particle flow is indeed a Wasserstein gradient flow of an appropriate functional. The power of our method is demonstrated by numerical examples including conditional image generation and inverse problems like superresolution, inpainting and computed tomography in low-dose and limited-angle settings.

POSTER-947: Implicit Gaussian process representation of vector fields over arbitrary latent manifolds

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Gaussian processes neuroscience vector field tangent bundle connection Laplacian EEG Alzheimer's disease

Scores: [ 6 8 8 ]

POSTER-948: Generative Sliced MMD Flows with Riesz Kernels

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: sliced maximum mean discrepancy energy distance gradient flows Riesz kernels generative modelling

Scores: [ 6 8 5 5 ]

Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations.In this paper, we show that MMD flows with Riesz kernels K(x,y)=xyr, $r \in (0,2)$have exceptional properties which allow their efficient computation.We prove that the MMD of Riesz kernels, which isalso known as energy distance, coincides with the MMD of their sliced version.As a consequence, the computation of gradients of MMDs can be performed in the one-dimensional setting.Here, for r=1, a simple sorting algorithm can be applied to reduce the complexityfrom O(MN+N2) to O((M+N)log(M+N)) for two measures with M and N support points.As another interesting follow-up result, the MMD of compactly supported measurescan be estimated from above and below by the Wasserstein-1 distance.For the implementations we approximate the gradient of the sliced MMD by using only a finite number P of slices. We show that the resulting error has complexity O(d/P), where d is the data dimension. These results enable us to train generative models by approximating MMD gradient flows by neural networks evenfor image applications. We demonstrate the efficiency of our model by image generation on MNIST, FashionMNIST and CIFAR10.

POSTER-949: Order-Preserving GFlowNets

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: probabilistic sampling multi-objective optimization GFlowNet

Scores: [ 6 6 8 6 6 ]

POSTER-950: Training Bayesian Neural Networks with Sparse Subspace Variational Inference

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian neural networks sparse Bayesian learning variational inference

Scores: [ 6 8 6 ]

Bayesian neural networks (BNNs) offer uncertainty quantification but come with the downside of substantially increased training and inference costs. Sparse BNNs have been investigated for efficient inference, typically by either slowly introducing sparsity throughout the training or by post-training compression of dense BNNs. The dilemma of how to cut down massive training costs remains, particularly given the requirement to learn about the uncertainty. To solve this challenge, we introduce Sparse Subspace Variational Inference (SSVI), the first fully sparse BNN framework that maintains a consistently sparse Bayesian model throughout the training and inference phases. Starting from a randomly initialized low-dimensional sparse subspace, our approach alternately optimizes the sparse subspace basis selection and its associated parameters. While basis selection is characterized as a non-differentiable problem, we approximate the optimal solution with a removal-and-addition strategy, guided by novel criteria based on weight distribution statistics. Our extensive experiments show that SSVI sets new benchmarks in crafting sparse BNNs, achieving, for instance, a 10-20× compression in model size with under 3% performance drop, and up to 20× FLOPs reduction during training. Remarkably, SSVI also demonstrates enhanced robustness to hyperparameters, reducing the need for intricate tuning in VI and occasionally even surpassing VI-trained dense BNNs.

SPOTLIGHT-186: Variational Bayesian Last Layers

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: bayesian deep learning variational methods bayesian last layers neural linear models

Scores: [ 8 10 6 8 ]

We introduce a deterministic variational formulation for training Bayesian last layer neural networks. This yields a sampling-free, single-pass model and loss that effectively improves uncertainty estimation. Our variational Bayesian last layer (VBLL) can be trained and evaluated with only quadratic complexity in last layer width, and is thus (nearly) computationally free to add to standard architectures. We experimentally investigate VBLLs, and show that they improve predictive accuracy, calibration, and out of distribution detection over baselines across both regression and classification. Finally, we investigate combining VBLL layers with variational Bayesian feature learning, yielding a lower variance collapsed variational inference method for Bayesian neural networks.

SPOTLIGHT-187: R-EDL: Relaxing Nonessential Settings of Evidential Deep Learning

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: uncertainty quantification evidential deep learning subjective logic single-forward-pass uncertainty method

Scores: [ 6 6 8 8 8 ]

A newly-arising uncertainty estimation method named Evidential Deep Learning (EDL), which can obtain reliable predictive uncertainty in a single forward pass, has garnered increasing interest. Guided by the subjective logic theory, EDL obtains Dirichlet concentration parameters from deep neural networks, thus constructing a Dirichlet probability density function (PDF) to model the distribution of class probabilities. Despite its great success, we argue that EDL keeps nonessential settings in both stages of model construction and optimization.In this work, our analysis indicates that (1) in the construction of the Dirichlet PDF, a commonly ignored parameter termed prior weight governs the balance between leveraging the proportion of evidence and its magnitude in deriving predictive scores, and (2) in model optimization, a variance-minimized regularization term adopted by traditional EDL encourages the Dirichlet PDF to approach a Dirac delta function, potentially exacerbating overconfidence. Therefore, we propose the R-EDL (Relaxed-EDL) method by relaxing these nonessential settings. Specifically, R-EDL treats the prior weight as an adjustable hyper-parameter instead of a fixed scalar, and directly optimizes the expectation of the Dirichlet PDF provided to deprecate the variance-minimized regularization term. Extensive experiments and SOTA performances demonstrate the effectiveness of our method. Source codes are provided in Supplementary Material.

POSTER-951: A Study of Bayesian Neural Network Surrogates for Bayesian Optimization

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian Optimization Gaussian Processes Bayesian Neural Networks

Scores: [ 6 8 8 ]

Bayesian optimization is a highly efficient approach to optimizing objective functions which are expensive to query. These objectives are typically represented by Gaussian process (GP) surrogate models which are easy to optimize and support exact inference. While standard GP surrogates have been well-established in Bayesian optimization, Bayesian neural networks (BNNs) have recently become practical function approximators, with many benefits over standard GPs such as the ability to naturally handle non-stationarity and learn representations for high-dimensional data. In this paper, we study BNNs as alternatives to standard GP surrogates for optimization. We consider a variety of approximate inference procedures for finite-width BNNs, including high-quality Hamiltonian Monte Carlo, low-cost stochastic MCMC, and heuristics such as deep ensembles. We also consider infinite-width BNNs, linearized Laplace approximations, and partially stochastic models such as deep kernel learning. We evaluate this collection of surrogate models on diverse problems with varying dimensionality, number of objectives, non-stationarity, and discrete and continuous inputs. We find: (i) the ranking of methods is highly problem dependent, suggesting the need for tailored inductive biases; (ii) HMC is the most successful approximate inference procedure for fully stochastic BNNs; (iii) full stochasticity may be unnecessary as deep kernel learning is relatively competitive; (iv) deep ensembles perform relatively poorly; (v) infinite-width BNNs are particularly promising, especially in high dimensions.

POSTER-952: t3-Variational Autoencoder: Learning Heavy-tailed Data with Student's t and Power Divergence

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Variational autoencoder Information geometry Heavy-tail learning Generative model

Scores: [ 8 6 8 ]

The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t3$VAE, a modified VAE framework that incorporates Student's t-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of a KL divergence between two statistical manifolds and replacing with γ-power divergence, a natural alternative for power families. $t3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that our model excels at capturing rare features through real-data experiments on CelebA and imbalanced CIFAR datasets.

POSTER-953: Transport meets Variational Inference: Controlled Monte Carlo Diffusions

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: SDEs Diffusion Models Optimal Transport Annealed Importance Sampling Schroedinger Bridges Variational Inference

Scores: [ 6 8 6 8 8 ]

Connecting optimal transport and variational inference, we present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of the Controlled Monte Carlo Diffusion sampler (CMCD) for Bayesian computation, a score-based annealing technique that crucially adapts both forward and backward dynamics in a diffusion model. On the way, we clarify the relationship between the EM-algorithm and iterative proportional fitting (IPF) for Schroedinger bridges, deriving as well a regularised objective that bypasses the iterative bottleneck of standard IPF-updates. Finally, we show that CMCD has a strong foundation in the Jarzinsky and Crooks identities from statistical physics, and that it convincingly outperforms competing approaches across a wide array of experiments.

ORAL-50: Amortizing intractable inference in large language models

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: large language models LLMs Bayesian inference chain-of-thought reasoning latent variable models generative flow networks GFlowNets

Scores: [ 8 8 10 5 ]

Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest---including sequence continuation, infilling, and other forms of constrained generation---involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.

POSTER-954: Estimating Conditional Mutual Information for Dynamic Feature Selection

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: dynamic feature selection adaptive feature selection mutual information information theory

Scores: [ 8 6 8 ]

Dynamic feature selection, where we sequentially query features to make accurate predictions with a minimal budget, is a promising paradigm to reduce feature acquisition costs and provide transparency into the prediction process. The problem is challenging, however, as it requires both making predictions with arbitrary feature sets and learning a policy to identify the most valuable selections. Here, we take an information-theoretic perspective and prioritize features based on their mutual information with the response variable. The main challenge is implementing this policy, and we design a new approach that estimates the mutual information in a discriminative rather than a generative fashion. Building on our learning approach, we introduce several further improvements: allowing variable feature budgets across samples, enabling non-uniform costs between features, incorporating prior information, and exploring modern architectures to handle partial input information. We find that our method provides consistent gains over recent state-of-the-art methods across a variety of datasets.

POSTER-955: Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: probabilistic inference sampling stochastic optimal control gflownets

Scores: [ 8 8 8 6 ]

SPOTLIGHT-188: A General Framework for User-Guided Bayesian Optimization

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian Optimization Hyperparameter Optimization Gaussian Processes

Scores: [ 6 8 6 8 ]

The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO's ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading.

POSTER-956: Image Inpainting via Tractable Steering of Diffusion Models

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Probabilistic Methods Tractable Probabilistic Models Diffusion Models

Scores: [ 5 6 6 5 ]

Diffusion models are the current state of the art for generating photorealistic images. Controlling the sampling process for constrained image generation tasks such as inpainting, however, remains challenging since exact conditioning on such constraints is intractable. While existing methods use various techniques to approximate the constrained posterior, this paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior, and to leverage this signal to steer the denoising process of diffusion models. Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs). Building upon prior advances, we further scale up PCs and make them capable of guiding the image generation process of diffusion models. Empirical results suggest that our approach can consistently improve the overall quality and semantic coherence of inpainted images across three natural image datasets (i.e., CelebA-HQ, ImageNet, and LSUN) with only ~10% additional computational overhead brought by the TPM. Further, with the help of an image encoder and decoder, our method can readily accept semantic constraints on specific regions of the image, which opens up the potential for more controlled image generation tasks. In addition to proposing a new framework for constrained image generation, this paper highlights the benefit of more tractable models and motivates the development of expressive TPMs.

POSTER-957: Fast Ensembling with Diffusion Schr"odinger Bridge

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Deep Ensemble Diffusion Schr"odinger Bridge Bayesian

Scores: [ 6 8 6 8 5 ]

Deep Ensemble approach is a straightforward technique used to enhance the performance of deep neural networks by training them from different initial points, converging towards various local optima. However, a limitation of this methodology lies in its high computational overhead for inference, arising from the necessity to store numerous learned parameters and execute individual forward passes for each parameter during the inference stage.We propose a novel approach called Diffusion Bridge Network to address this challenge. Based on the theory of Schr"odinger bridge, this method directly learns to simulate an Stochastic Differential Equation (SDE) that connects the output distribution of a single ensemble member to the output distribution of the ensembled model, allowing us to obtain ensemble prediction without having to invoke forward pass through all the ensemble models. By substituting the heavy ensembles with this lightweight neural network constructing DBN, we achieved inference with reduced computational cost while maintaining accuracy and uncertainty scores on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImageNet.

POSTER-958: Delta-AI: Local objectives for amortized inference in sparse graphical models

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: amortized inference variational inference graphical models Markov random fields generative flow networks GFlowNets

Scores: [ 8 6 6 ]

We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call Δ-amortized inference (Δ-AI). Our approach is based on the observation that when the sampling of variables in a PGM is seen as a sequence of actions taken by an agent, sparsity of the PGM enables local credit assignment in the agent's policy learning objective. This yields a local constraint that can be turned into a local loss in the style of generative flow networks (GFlowNets) that enables off-policy training but avoids the need to instantiate all the random variables for each parameter update, thus speeding up training considerably. The Δ-AI objective matches the conditional distribution of a variable given its Markov blanket in a tractable learned sampler, which has the structure of a Bayesian network, with the same conditional distribution under the target PGM. As such, the trained sampler recovers marginals and conditional distributions of interest and enables inference of partial subsets of variables. We illustrate Δ-AI's effectiveness for sampling from synthetic PGMs and training latent variable models with sparse factor structure.

SPOTLIGHT-189: Forward χ2 Divergence Based Variational Importane Sampling

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Importance sampling χ2 divergency latent variable models

Scores: [ 8 8 6 5 ]

POSTER-959: Bayesian low-rank adaptation for large language models

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Large language models Bayesian deep learning Laplace approximation uncertainty calibration

Scores: [ 8 6 6 6 ]

Parameter-efficient fine-tuning (PEFT) has emerged as a new paradigm for cost-efficient fine-tuning of large language models (LLMs), with low-rank adaptation (LoRA) being a widely adopted choice. However, fine-tuned LLMs often become overconfident especially when fine-tuned on small datasets. Bayesian methods, with their inherent ability to estimate uncertainty, serve as potent tools to mitigate overconfidence and enhance calibration. In this work, we introduce Laplace-LoRA, a straightforward yet effective Bayesian method, which applies the Laplace approximation to the LoRA parameters and, considerably boosts the calibration of fine-tuned LLMs.

POSTER-960: Leveraging Previous Tasks in Optimizing Risk Measures with Gaussian Processes

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: risk measure value-at-risk conditional value-at-risk

Scores: [ 8 6 6 ]

Research on optimizing the risk measure of a blackbox function using Gaussian processes, especially Bayesian optimization (BO) of risk measures, has become increasingly important due to the inevitable presence of uncontrollable variables in real-world applications. Nevertheless, existing works on BO of risk measures start the optimization from scratch for every new task without considering the results of previous tasks. In contrast, its vanilla BO counterpart has received a thorough investigation on utilizing previous tasks to speed up the current task through the body of works on meta-BO which, however, have not considered risk measures. To bridge this gap, this paper presents the first algorithm for meta-BO of risk measures (i.e., value-at-risk (VaR) and the conditional VaR) by introducing a novel adjustment to the upper confidence bound acquisition function. Our proposed algorithm exhibits two desirable properties: (i) invariance to scaling and vertical shifting of the blackbox function and (ii) robustness to previous harmful tasks. We provide a theoretical performance guarantee for our algorithm and empirically demonstrate its performance using several synthetic function benchmarks and real-world objective functions.

POSTER-961: Optimistic Bayesian Optimization with Unknown Constraints

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian optimization black box constraint decoupled query

Scores: [ 6 8 5 8 ]

Though some research efforts have been dedicated to constrained Bayesian optimization (BO), there remains a notable absence of a principled approach with a theoretical performance guarantee in the decoupled setting. Such a setting involves independent evaluations of the objective function and constraints at different inputs, and is hence a relaxation of the commonly-studied coupled setting where functions must be evaluated together. As a result, the decoupled setting requires an adaptive selection between evaluating either the objective function or a constraint, in addition to selecting an input (in the coupled setting). This paper presents a novel constrained BO algorithm with a provable performance guarantee that can address the above relaxed setting. Specifically, it considers the fundamental trade-off between exploration and exploitation in constrained BO, and, interestingly, affords a noteworthy connection to active learning. The performance of our proposed algorithms is also empirically evaluated using several synthetic and real-world optimization problems.

POSTER-962: Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Marked Temporal Point Process Neural ODE

Scores: [ 8 5 6 ]

A Marked Temporal Point Process (MTPP) is a stochastic process whose realization is a set of event-time data. MTPP is often used to understand complex dynamics of asynchronous temporal events such as money transaction, social media, healthcare, etc. Recent studies have utilized deep neural networks to capture complex temporal dependencies of events and generate embedding that aptly represent the observed events. While most previous studies focus on the inter-event dependencies and their representations, how individual events influence the overall dynamics over time has been under-explored. In this regime, we propose a Decoupled MTPP framework that disentangles characterization of a stochastic process into a set of evolving influences from different events. Our approach employs Neural Ordinary Differential Equations (Neural ODEs) to learn flexible continuous dynamics of these influences while simultaneously addressing multiple inference problems, such as density estimation and survival rate computation. We emphasize the significance of disentangling the influences by comparing our framework with state-of-the-art methods on real-life datasets, and provide analysis on the model behavior for potential applications.

POSTER-963: LogicMP: A Neuro-symbolic Approach for Encoding First-order Logic Constraints

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Variational Inference

Scores: [ 6 5 8 5 ]

Integrating first-order logic constraints (FOLCs) with neural networks is a crucial but challenging problem since it involves modeling intricate correlations to satisfy the constraints. This paper proposes a novel neural layer, LogicMP, which performs mean-field variational inference over an MLN. It can be plugged into any off-the-shelf neural network to encode FOLCs while retaining modularity and efficiency.By exploiting the structure and symmetries in MLNs, we theoretically demonstrate that our well-designed, efficient mean-field iterations greatly mitigate the difficulty of MLN inference, reducing the inference from sequential calculation to a series of parallel tensor operations.Empirical results in three kinds of tasks over images, graphs, and text show that LogicMP outperforms advanced competitors in both performance and efficiency.

POSTER-964: Bayesian Optimization through Gaussian Cox Process Models for Spatio-temporal Data

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian optimization Gaussian Cox process

Scores: [ 8 8 6 ]

POSTER-965: Uncertainty-aware Graph-based Hyperspectral Image Classification

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Uncertainty Quantification Graph Hyperspectral Image Classification

Scores: [ 5 6 6 5 6 ]

POSTER-966: Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: adversarial adaptive sampling optimal transport neural network approximation of PDEs

Scores: [ 8 10 6 5 ]

Solving partial differential equations (PDEs) is a central task in scientific computing. Recently, neural network approximation of PDEs has received increasing attention due to its flexible meshless discretization and its potential for high-dimensional problems. One fundamental numerical difficulty is that random samples in the training set introduce statistical errors into the discretization of the loss functional which may become the dominant error in the final approximation, and therefore overshadow the modeling capability of the neural network. In this work, we propose a new minmax formulation to optimize simultaneously the approximate solution, given by a neural network model, and the random samples in the training set, provided by a deep generative model. The key idea is to use a deep generative model to adjust the random samples in the training set such that the residual induced by the neural network model can maintain a smooth profile in the training process. Such an idea is achieved by implicitly embedding the Wasserstein distance between the residual-induced distribution and the uniform distribution into the loss, which is then minimized together with the residual. A nearly uniform residual profile means that its variance is small for any normalized weight function such that the Monte Carlo approximation error of the loss functional is reduced significantly for a certain sample size. The adversarial adaptive sampling (AAS) approach proposed in this work is the first attempt to formulate two essential components, minimizing the residual and seeking the optimal training set, into one minmax objective functional for the neural network approximation of PDEs.

SPOTLIGHT-190: Graphical Multioutput Gaussian Process with Attention

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Gaussian process regression Multioutput Gaussian process Attention mechanism

Scores: [ 6 8 8 8 ]

Integrating information while recognizing dependence from multiple data sources and enhancing the predictive performance of the multi-output regression are challenging tasks. Multioutput Gaussian Process (MOGP) methods offer outstanding solutions with tractable predictions and uncertainty quantification.However, their practical applications are hindered by high computational complexity and storage demand. Additionally, there exist model mismatches in existing MOGP models when dealing with non-Gaussian data. To improve the model representation ability in terms of flexibility, optimality, and scalability,this paper introduces a novel multi-output regression framework, termed Graphical MOGP (GMOGP), which is empowered by:(i) generating flexible Gaussian process priors consolidated from identified parents, (ii) providing dependent processes with attention-based graphical representations, and (iii) achieving Pareto optimal solutions via a distributed learning framework. Numerical results confirm that the proposed GMOGP significantly outperforms state-of-the-art MOGP alternatives in predictive performance, as well as in time and memory efficiency, across various synthetic and real datasets.Our code and datasets are available at https://anonymous.4open.science/r/GMOGP-5ED3/.

POSTER-967: Implicit Neural Representation Inference for Low-Dimensional Bayesian Deep Learning

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: Bayesian Deep Learning Implicit neural representations Probabilistic machine learning Hypernetworks

Scores: [ 8 5 5 5 ]

Bayesian inference is the standard for providing full predictive distributions with well calibrated uncertainty estimates. However, scaling to a modern, overparameterized deep learning setting typically comes at the cost of severe and restrictive approximations, sacrificing model predictive strength. With our approach, we factor model parameters as a function of deterministic and probabilistic components; the model is solved by combining maximum a posteriori estimation of the former, with inference over a low-dimensional, Implicit Neural Representation of the latter. This results in a solution that combines both predictive accuracy and good calibration, as it entails inducing stochasticity over the full set of model weights while being comparatively cheap to compute. Experimentally, our approach compares favorably to the state of the art, including much more expensive methods as well as less expressive posterior approximations over full network parameters.

POSTER-968: PAC Prediction Sets Under Label Shift

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: prediction set label shift distribution-free uncertainty quantification probably approximately correct Clopper-Pearson binomial interval rejection sampling

Scores: [ 8 6 6 6 6 ]

Prediction sets capture uncertainty by predicting sets of labels rather than individual labels, enabling downstream decisions to conservatively account for all plausible outcomes. Conformal inference algorithms construct prediction sets guaranteed to contain the true label with high probability. These guarantees fail to hold in the face of distribution shift, which is precisely when reliable uncertainty quantification can be most useful. We propose a novel algorithm for constructing prediction sets with PAC guarantees in the label shift setting. It estimates importance weights, then propagates uncertainty in these estimates through a Gaussian elimination algorithm to compute confidence intervals that contain the importance weights, and finally uses these intervals to construct prediction sets. We evaluate our approach on four datasets: the CIFAR-10 and ChestX-Ray image datasets, the tabular CDC Heart Dataset, and the AGNews text dataset. Our algorithm satisfies the PAC guarantee while producing smaller prediction set sizes compared to several baselines.

SPOTLIGHT-191: Conformal Risk Control

Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

Keywords: conformal prediction uncertainty quantification

Scores: [ 8 6 8 8 6 6 ]

We extend conformal prediction to control the expected value of any monotone loss function. The algorithm generalizes split conformal prediction together with its coverage guarantee. Like conformal prediction, the conformal risk control procedure is tight up to an