Morning Session (chair: Emtiyaz Khan)
Abstract: Many researchers have pondered the same existential questions in this day and age: Is scale really all you need? Will the future of machine learning rely exclusively on foundation models? Should we all drop our current research agenda and work on the next large language model instead? In this talk, I will try to make the case that the answer to all these questions should be a resounding “no” and that now, maybe more than ever, should be the time to focus on fundamental questions in machine learning again. I will provide evidence for this by presenting three modern use cases of Bayesian deep learning in the areas of interpretable additive modeling, neural network sparsification, and subspace inference for fine-tuning. Together, these will show that the research field of Bayesian deep learning is very much alive and thriving and that its potential for valuable real-world impact is only just unfolding.
Bio: Vincent Fortuin is a tenure-track research group leader at Helmholtz AI in Munich, leading the group for Efficient Learning and Probabilistic Inference for Science (ELPIS). He is also junior faculty at the Technical University of Munich, a Fellow of the Konrad Zuse School for Reliable AI, affiliated with the Munich Center of Machine Learning, and a Branco Weiss Fellow. His research focuses on reliable and data-efficient AI approaches leveraging Bayesian deep learning, deep generative modeling, meta-learning, and PAC-Bayesian theory. Before that, he did his PhD in Machine Learning at ETH Zürich and was a Research Fellow at the University of Cambridge. He is a member and unit faculty of ELLIS, a regular reviewer and area chair for all major machine learning conferences, and a co-organizer of the Symposium on Advances in Approximate Bayesian Inference (AABI) and the ICBINB initiative.
Abstract: In the era of large-scale foundation models, Bayesian deep learning remains an indispensable tool due to its capability to quantify uncertainty in a principled manner and adapt continuously to dynamic environments. However, the well-known challenges of Bayesian learning—difficulty in posterior inference, high cost of Bayesian model averaging (BMA), and selecting appropriate prior distributions—are even more pronounced in modern AI models. In this talk, I will present our recent work addressing these challenges. First, for more efficient posterior inference in Bayesian neural networks (BNNs), I will discuss how meta-learning can enhance the mixing of stochastic gradient MCMC (SGMCMC) algorithms for various BNNs. Second, I will introduce our newly developed algorithm that reduces the cost of BMA using diffusion-based distribution matching techniques. Finally, I will present our work on meta-learning stochastic processes, which can serve as priors for a range of downstream tasks.
Bio: Dr. Juho Lee is an associate professor at the Kim Jaechul Graduate School of AI at the Korea Advanced Institute of Science and Technology (KAIST). He earned his Ph.D. in Computer Science & Engineering from Pohang University of Science & Technology (POSTECH) and did his postdoc in the Computational Statistics & Machine Learning group at the University of Oxford, working with Professor François Caron. His research primarily focuses on Bayesian deep learning, with significant contributions to Bayesian nonparametrics, meta-learning, and generative modeling.
Afternoon Session (chair: Vincent Fortuin)
Abstract: If you predict a label y of a new object with y_pred, how confident are you that "y = y_pred"? The conformal prediction method provides an elegant framework for answering such a question by establishing a confidence set for an unobserved response of a feature vector, based on previous similar observations of responses and features. This is performed without assumptions about the distribution of the data. In this presentation, I will discuss this approach and evaluate its validity, strengths, and limitations. Last but not least, is there a "hidden / implicit" link with the classic Bayesian approach?
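As a concrete illustration (a standard split conformal construction for regression, not necessarily the variant covered in the talk):

```latex
% Nonconformity scores on a calibration set of size n, for a model
% \hat\mu fit on separate training data:
s_i = | y_i - \hat\mu(x_i) |, \qquad i = 1, \dots, n
% Empirical quantile: the \lceil (n+1)(1-\alpha) \rceil-th smallest score
\hat{q} = s_{(\lceil (n+1)(1-\alpha) \rceil)}
% Confidence set for the unobserved response of a new feature vector:
C(x_{n+1}) = \big[ \hat\mu(x_{n+1}) - \hat{q}, \; \hat\mu(x_{n+1}) + \hat{q} \big]
% Marginal coverage holds under exchangeability alone, with no further
% distributional assumptions:
\Pr\big( y_{n+1} \in C(x_{n+1}) \big) \ge 1 - \alpha
```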
Bio: I am currently a researcher in the Machine Learning Group @Apple in Paris. I focus mainly on optimization and uncertainty quantification. I was previously a postdoctoral researcher at both the Georgia Institute of Technology (USA) and RIKEN AIP (Japan). I hold a PhD in Applied Mathematics from the University of Paris-Saclay (France). My doctoral thesis focused on the design and analysis of faster and safer optimization algorithms for variable selection and hyperparameter calibration in high dimensions.
Abstract: Conformal prediction methodologies have significantly advanced the quantification of uncertainties in predictive models. Yet, the construction of confidence regions for model parameters presents a notable challenge, often necessitating stringent assumptions regarding data distribution or merely providing asymptotic guarantees. We introduce a novel approach termed CCR, which employs a combination of conformal prediction intervals for the model outputs to establish confidence regions for model parameters. We present coverage guarantees that hold under minimal assumptions on the noise and are valid in the finite-sample regime. Our approach is applicable to both split conformal predictions and black-box methodologies, including full or cross-conformal approaches. In the specific case of linear models, the derived confidence region manifests as the feasible set of a Mixed-Integer Linear Program (MILP), facilitating the deduction of confidence intervals for individual parameters and enabling robust optimization. We empirically compare CCR to recent advancements in challenging settings such as heteroskedastic and non-Gaussian noise.
Bio: I am currently a researcher in the Machine Learning Group @Apple in Paris. I focus mainly on optimization and uncertainty quantification. I was previously a postdoctoral researcher at both the Georgia Institute of Technology (USA) and RIKEN AIP (Japan). I hold a PhD in Applied Mathematics from the University of Paris-Saclay (France). My doctoral thesis focused on the design and analysis of faster and safer optimization algorithms for variable selection and hyperparameter calibration in high dimensions.
Abstract: Choosing optimal hyperparameters for deep learning can be highly expensive due to trial-and-error procedures and required expertise. Bayesian model selection can help to overcome such issues using gradient-based optimization and does not require a held-out validation set. However, it requires estimation and differentiation of the marginal likelihood, which is inherently intractable. In my talk, I will first discuss how scalable Laplace approximations enable Bayesian model selection for advanced applications. Further, I will demonstrate how to derive faster approximations using lower bounds and dualities.
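For background (the standard Laplace approximation of the marginal likelihood that this line of work scales up; notation is mine):

```latex
% Laplace approximation around a MAP estimate \theta_* of a model with
% d parameters:
\log p(\mathcal{D}) \approx \log p(\mathcal{D} \mid \theta_*)
  + \log p(\theta_*) + \tfrac{d}{2} \log 2\pi - \tfrac{1}{2} \log |H|,
\qquad H = -\nabla_\theta^2 \log p(\mathcal{D}, \theta) \big|_{\theta = \theta_*}
% Scalable variants replace H with structured curvature approximations
% (e.g., diagonal or Kronecker-factored), making the expression cheap to
% evaluate and differentiable with respect to hyperparameters.
```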
Bio: I am a last-year PhD student at ETH Zürich, Max Planck Institute for Intelligent Systems, and Google Research. I work on probabilistic deep learning with a focus on Bayesian model selection and scientific applications.
Morning Session (chair: Alexander Immer)
Abstract: I'll discuss how to use Variational Bayes (VB) to correct approximations, rather than using VB to obtain the approximations themselves.
Bio: Haavard Rue has been a professor of Statistics at the CEMSE Division of the King Abdullah University of Science and Technology in Saudi Arabia since 2017, and was before that a professor at the Department of Mathematical Sciences at the Norwegian University of Science and Technology. He was named a Highly Cited Researcher by the Web of Science Group in the years 2019--2021, gave the Bahadur Memorial Lectures at the University of Chicago in 2018, and in 2021 was awarded the Royal Statistical Society (RSS) Guy Medal in Silver for his work on Integrated Nested Laplace Approximations (INLA) and the Stochastic Partial Differential Equation (SPDE) approach to representing and computing with Gaussian fields. His research is mainly centred around the "R-INLA project", see www.r-inla.org.
Abstract: In this tutorial, I will motivate duality through various applications in machine learning. I will also give a gentle and self-contained introduction to convex duality, Fenchel conjugate functions and how dual variables are related to sensitivities to perturbations of the problem's parameters or data.
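For orientation (the standard definitions the tutorial builds on):

```latex
% Fenchel conjugate of f:
f^*(y) = \sup_x \; \langle x, y \rangle - f(x)
% Fenchel--Young inequality, valid for all x, y:
f(x) + f^*(y) \ge \langle x, y \rangle
% with equality iff y \in \partial f(x); this is the sense in which dual
% variables measure sensitivity to perturbations of the problem data.
```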
Bio: Thomas Möllenhoff received his PhD in Informatics from the Technical University of Munich in 2020. From 2020 to 2023, he was a post-doc in the Approximate Bayesian Inference Team at RIKEN. Since 2023, he has been a tenured research scientist at RIKEN. His current research focuses on optimization and Bayesian deep learning and has received several awards, including the Best Paper Honorable Mention award at CVPR 2016 and first place at the NeurIPS 2021 Challenge on Approximate Inference.
Afternoon Session (chair: Eugene Ndiaye)
Abstract: I argue that information processing has a deep-rooted connection with Bayes' rule, and therefore all good algorithms must have Bayesian roots. I will discuss a Bayesian idea, which I call conjugate computations, where information processing reduces to a simple addition. In general, such computations are realized through the Bayesian learning rule (BLR), and a wide variety of algorithms can be derived from it (I will briefly outline the derivation of RMSprop and Adam). Time permitting, I will discuss the "dual" view of the BLR, which is the starting point for Bayes-duality.
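To make the "simple addition" point concrete, here is a textbook conjugate example together with the general rule, as I read it from the abstract (notation mine):

```latex
% Conjugate computation: a Beta(\alpha, \beta) prior with n Bernoulli
% observations y_1, ..., y_n updates by pure addition of parameters:
(\alpha, \beta) \;\longrightarrow\;
  \big( \alpha + \textstyle\sum_i y_i, \;\; \beta + n - \textstyle\sum_i y_i \big)
% The Bayesian learning rule extends this to non-conjugate settings: for
% an exponential-family posterior q_\lambda with natural parameter
% \lambda, expectation parameter \mu, and loss \bar\ell (including the
% negative log-prior),
\lambda \;\leftarrow\; (1 - \rho)\,\lambda
  \;-\; \rho \, \nabla_\mu \, \mathbb{E}_{q_\lambda}\big[ \bar\ell(\theta) \big]
% Specific choices of q_\lambda and of the expectation approximation
% recover algorithms such as RMSprop and Adam.
```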
Bio: Available here
Abstract: We propose a novel approach to sequential Bayesian inference based on variational Bayes. The standard variational loss is a sum of expected negative log-likelihood and KL divergence from the prior. Our key insight is that, in the online setting, we can drop the KL term and instead perform a single step of natural gradient descent on the expected NLL, starting from the prior predictive (which comes from the posterior at the previous timestep). Thus, instead of explicitly regularizing to the prior, we do so implicitly. We prove this method recovers exact Bayesian updating if the model is conjugate, and empirically outperforms other online VB methods in the non-conjugate case, such as online learning for neural networks, especially when controlling for computational costs.
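Schematically, as I read the abstract (the authors' notation may differ):

```latex
% Standard online VB at time t, regularizing explicitly to the previous
% posterior q_{t-1}:
q_t = \arg\min_q \; \mathbb{E}_q\big[ -\log p(y_t \mid \theta) \big]
      + \mathrm{KL}\big( q \,\|\, q_{t-1} \big)
% Proposed implicit scheme: drop the KL term and take a single natural
% gradient step on the expected NLL, starting from the previous
% posterior's natural parameter \lambda_{t-1}:
\lambda_t = \lambda_{t-1} - \rho \, \widetilde{\nabla}_\lambda \,
  \mathbb{E}_{q_{\lambda_{t-1}}}\big[ -\log p(y_t \mid \theta) \big]
% Initializing at q_{t-1} regularizes toward it implicitly; for
% conjugate models this recovers exact Bayesian updating.
```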
Bio: Professor of Psychology at the University of Colorado and recent Visiting Faculty at Google Brain/DeepMind
Morning Session (chair: Haavard Rue)
Abstract: In this talk, I shall present several generalizations of Bregman divergences for machine learning with algorithmic and geometric considerations. In particular, I will describe duality structures and a few algorithms on Bregman manifolds, introduce the Bregman duo pseudo-divergences, and present a generalization of convexity which yields conformal Bregman divergences.
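For reference (the standard definition these generalizations start from):

```latex
% Bregman divergence generated by a strictly convex, differentiable F:
D_F(x, y) = F(x) - F(y) - \langle \nabla F(y), \, x - y \rangle
% F(x) = \tfrac{1}{2}\|x\|^2 recovers the squared Euclidean distance;
% F(x) = \sum_i x_i \log x_i recovers the Kullback--Leibler divergence.
```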
Bio: Frank Nielsen prepared his PhD on computational geometry (1996) at INRIA Sophia-Antipolis (France). He is a Senior Researcher and Fellow of Sony Computer Science Laboratories Inc. (Sony CSL, Tokyo), where he currently conducts research on Structures, Dynamics, and Geometric Computing for AI and Information Theory. He serves on the editorial boards of the following journals: Information Geometry (Springer), Transactions on Information Theory (IEEE), and Entropy (MDPI). Frank Nielsen co-organizes, with Frederic Barbaresco, the biennial conference Geometric Science of Information.
Abstract: Prevalent setups in continual learning often fall short of practical applicability, relying on unrealistic assumptions and constraints. First, I will try to address these gaps by presenting various continual learning setups (mostly for class-incremental learning) that are more realistic and feasible, on which we would need to evaluate our algorithms. Then, I will discuss one of our approaches to continual learning (e.g., class-incremental learning) in a more realistic setup: online continuous data streams and beyond.
Bio: Jonghyun Choi received the B.S. and M.S. degrees in electrical engineering and computer science from Seoul National University, Seoul, South Korea in 2003 and 2008, respectively. He received a Ph.D. degree from the University of Maryland, College Park in 2015, under the supervision of Prof. Larry S. Davis. He is currently an associate professor at Seoul National University, Seoul, South Korea. During his PhD, he worked as a research intern in a number of research labs, including the US Army Research Lab (2012), Adobe Research (2013), Disney Research Pittsburgh (2014), and Microsoft Research Redmond (2014). He was a senior researcher at Comcast Applied AI Research, Washington, DC from 2015 to 2016. He was a research scientist at the Allen Institute for Artificial Intelligence (AI2), Seattle, WA from 2016 to 2018 and remains an affiliated research scientist there. He was an associate professor at Yonsei University, Seoul, South Korea from 2022 to 2024, and an assistant professor at GIST, Gwangju, South Korea from 2018 to 2022. He serves as an area chair for the IEEE/CVF conferences CVPR and WACV, as well as for NeurIPS and BMVC, was a paper review chair at CoLLAs 2023, and is an associate editor of IEEE Transactions on PAMI. His research interests include visual recognition using weakly supervised data for semantic understanding of images and videos (e.g., continual learning, unlearning) and visual understanding for edge devices and household robots.
Afternoon Session
Abstract: Machine learning studies the design of models and training algorithms in order to learn how to solve tasks from data. Whereas traditional machine learning concentrates on predefined training datasets, the renaissance of continual learning also takes into account that the world is constantly evolving. In this tutorial, I will focus on the challenge of catastrophic interference that arises when attempting such continual updates and summarize mechanisms to counteract it. To this end, I will survey three pillars of approaches, spanning the perspectives of data, optimization, and choice of model. Finally, I will highlight how the challenge of sequential updates relates to broader machine learning and which further elements are required on the road to true lifelong learning systems.
Bio: Martin is an independent research group leader at TU Darmstadt and hessian.AI, where he leads the Open World Lifelong Learning lab. He is also a board member at the non-profit ContinualAI, a core-organizer of Queer in AI, was the Diversity & Inclusion chair at AAAI-24, and currently serves as Review Process Chair for CoLLAs 2024. Previously, he was an interim professor and postdoctoral researcher at TU Darmstadt, obtained a CS PhD from Goethe University Frankfurt, and holds a Master's degree in Physics. The main vision behind his research is to transcend static machine learning systems towards adaptive and sustainable lifecycles.
Abstract: Bayesian experimental design (BED) provides a powerful and general framework for optimizing the design of experiments. However, its deployment often poses substantial computational challenges that can undermine its practical use. In this tutorial, I will outline the Bayesian experimental design framework and explain how recent advances have transformed our ability to overcome these challenges and thus utilize BED effectively, before discussing some key areas for future development in the field. Related review paper
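For orientation (the standard design objective at the heart of BED; notation mine):

```latex
% Expected information gain of a design d: the mutual information
% between parameters \theta and outcome y under d,
\mathrm{EIG}(d) = \mathbb{E}_{p(\theta)\, p(y \mid \theta, d)}
  \big[ \log p(y \mid \theta, d) - \log p(y \mid d) \big],
\qquad d^* = \arg\max_d \, \mathrm{EIG}(d)
% The nested marginal p(y | d) = E_{p(\theta)}[ p(y | \theta, d) ] makes
% estimating the EIG doubly intractable, which is the main computational
% challenge that recent advances target.
```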
Bio: I am a Senior Researcher in Machine Learning (and from September, Associate Professor) in the Department of Statistics at the University of Oxford, where I run the RainML Research Lab. My research covers a wide range of topics in and around machine learning and experimental design, with areas of particular interest including Bayesian experimental design, deep learning, representation learning, generative models, Monte Carlo methods, active learning, probabilistic programming, and variational inference.
Morning Session (chair: Emtiyaz Khan)
Abstract: Deep models tend to fall victim to catastrophic forgetting when updated sequentially or transferred to new scenarios. Continual learning seeks to provide a remedy to this phenomenon. To this end, prevalent approaches construct a relevant memory of the past, typically by retaining what is considered the most important data. The common assumption is that the solution to continual learning lies in the ability to accumulate prior knowledge. In this talk, I challenge this assumption with two examples. I will first detail how knowledge can be distilled from a differentiable classifier into a second model, even if we don’t have access to any training data or model internals. Conversely, I will show that models can perform catastrophically in the presence of confounders over time, even if they are equipped with perfect memory of all previously observed data. Together, these examples challenge our present view of memory solutions for continual learning and call for new techniques to be developed.
Bio: Martin is an independent research group leader at TU Darmstadt and hessian.AI, where he leads the Open World Lifelong Learning lab. He is also a board member at the non-profit ContinualAI, a core-organizer of Queer in AI, was the Diversity & Inclusion chair at AAAI-24, and currently serves as Review Process Chair for CoLLAs 2024. Previously, he was an interim professor and postdoctoral researcher at TU Darmstadt, obtained a CS PhD from Goethe University Frankfurt, and holds a Master's degree in Physics. The main vision behind his research is to transcend static machine learning systems towards adaptive and sustainable lifecycles.
Abstract: In federated learning, data is split heterogeneously over local clients, and our goal is to train a global model while minimising communication cost between the global server and local clients. Techniques usually suffer from client drift, where optimising on a local client leads to overfitting on its data. We introduce FedLap, a method derived from a Bayesian approach to federated learning. FedLap enjoys desirable properties of existing algorithms, including recent algorithms that overcome client drift. Additionally, our Bayesian perspective (i) provides interpretations of hyperparameters, and (ii) allows us to improve FedLap by improving the Bayesian approximation. Specifically, we improve FedLap by including Gaussian covariances and function-space information, linking our method to both Bayesian techniques and Federated Distillation methods.
Bio: Siddharth Swaroop is a postdoctoral fellow at the Data to Actionable Knowledge Lab at Harvard University. His current research focusses on probabilistic deep learning for a broad range of knowledge adaptation problems (such as continual learning, federated learning, and knowledge distillation), as well as reinforcement learning problems that arise in human-AI settings. He also works on machine learning interpretability, privacy, and Large Language Model biases. Siddharth obtained his PhD in machine learning at the University of Cambridge, supervised by Professor Richard Turner, where he was awarded a Microsoft Research EMEA PhD Award and an Honorary Vice-Chancellor’s Award from the Cambridge Trust.
Afternoon Session (chair: Kenichi Bannai)
Abstract: Modern deep neural networks can have billions of parameters, which makes conventional variational learning impractical. With the advent of the Improved Variational Online Newton (IVON) optimizer, variational learning on large models with billions of parameters is now tractable. However, there are numerous challenges when training at such large scale. I will describe some of these challenges in this talk.
Bio: Rio Yokota is a Professor at the Global Scientific Information and Computing Center, Tokyo Institute of Technology. His research interests lie at the intersection of high performance computing, linear algebra, and machine learning. He is the developer of numerous libraries for fast multipole methods (ExaFMM), hierarchical low-rank algorithms (Hatrix), and information matrices in deep learning (ASDL) that scale to the full system on the largest supercomputers today.
Morning Session (chair: Martin Mundt)
The day will feature talks covering many recent works from the CREST team.
Abstract: Modern data-driven AI systems can work extremely well but also fail miserably for unknown reasons. Fixing such failures is known to be extremely hard, and designing systems that are transparent and trustworthy is even harder. I argue that this is possible, but we have to fundamentally change the way we currently train our AI systems. I will discuss some of the insights that arise from our work on Bayesian principles, where the key idea is to design training methods that help us understand the uncertainties and sensitivities of the AI building blocks. I will keep the discussion highly non-technical but can go into more detailed techniques if the need arises.
Bio: Available here
Abstract: We present extensive evidence against the common belief that variational Bayesian learning is ineffective for large neural networks. First, we show that a recent deep learning method called sharpness-aware minimization (SAM) solves an optimal convex relaxation of the variational Bayesian objective. Then, we demonstrate that a direct optimization of the variational objective with an Improved Variational Online Newton method (IVON) can consistently match or outperform Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam's, but its predictive uncertainty is better. We show several new use cases of variational learning where we improve fine-tuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data.
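For context (the standard variational objective; how SAM and IVON relate to it is the subject of the talk):

```latex
% Variational learning trains a distribution q over weights instead of
% a point estimate, by maximizing the evidence lower bound (ELBO):
\max_q \;\; \mathbb{E}_{q(\theta)}\big[ \log p(\mathcal{D} \mid \theta) \big]
  - \mathrm{KL}\big( q(\theta) \,\|\, p(\theta) \big)
% With a factorized Gaussian q(\theta) = \mathcal{N}(m, \mathrm{diag}(\sigma^2)),
% natural-gradient updates of (m, \sigma) can be arranged to cost about
% the same as Adam, which is the regime IVON operates in.
```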
Bio: Thomas Möllenhoff received his PhD in Informatics from the Technical University of Munich in 2020. From 2020 to 2023, he was a post-doc in the Approximate Bayesian Inference Team at RIKEN. Since 2023, he has been a tenured research scientist at RIKEN. His current research focuses on optimization and Bayesian deep learning and has received several awards, including the Best Paper Honorable Mention award at CVPR 2016 and first place at the NeurIPS 2021 Challenge on Approximate Inference.
Afternoon Session (chair: Kenichi Bannai)
Abstract: We will discuss progress made on the above three topics. The discussion will be based on the following papers:
Morning Session (chair: Adam White)
Abstract: Continual/Lifelong Learning is a topic that has been gaining momentum in the last few years; however, its impact on the machine learning community is not yet clear. In this talk I will rely on an optimization perspective to think about continual learning, briefly describing two distinct problems that have been studied in this space: catastrophic forgetting and loss of plasticity. I will argue that loss of plasticity can be understood from a signal propagation point of view, and show some recent observations we have made on the potential causes of this phenomenon. Additionally, taking inspiration from online learning, I will present a simple alteration of stochastic gradient descent that allows a form of soft reset of parameters, which can be used to alleviate the problem. Finally, I will discuss why catastrophic forgetting, viewed as interference during learning, is problematic even if the goal is not to revisit tasks, and how it could be seen as a source of inefficiency in learning. I will conclude with some final thoughts on the impact that CL is having, and could have, on machine learning.
Bio: Razvan Pascanu has been a research scientist at Google DeepMind since 2014. Before this, he did his PhD in Montréal with Prof. Yoshua Bengio, working on understanding deep networks, recurrent models, and optimization. Since joining Google DeepMind, he has also made significant contributions to deep reinforcement learning, continual learning, meta-learning, and graph neural networks, while continuing his research agenda of understanding deep learning, recurrent models, and optimization. Please see his scholar page for specific contributions. He also actively promotes AI research and education as a main organizer of the Conference on Lifelong Learning Agents (CoLLAs), the Eastern European Machine Learning Summer School (EEML), and various workshops at NeurIPS, ICML, and ICLR.
Abstract: Current AI systems predominantly learn offline, that is, based on pre-existing data or simulations, using vast computational resources. In continual learning, we take the position that the world always gives the learner new information and situations to learn from, making offline learning never sufficient. Hence, unlike current AI systems, a continual learning system learns in real time from a perpetual stream of data while being used in deployment.
A significant challenge of real-time learning is the often severe computational constraints of deployed systems. In this presentation, I focus on onboard reinforcement learning of control for mobile robots, where the resource constraints of continual learning are among the most extreme. Using robot learning examples, I demonstrate how existing deep reinforcement learning algorithms are unsuitable for onboard learning due to the computational limitations of onboard devices even for single-task learning, let alone continual learning. I argue that the bottleneck is an indispensable part of continual learning and deep reinforcement learning algorithms—namely, a buffer to replay and learn from past samples. I then present my recent works on continual learning that make significant steps to achieve deep learning with small or no replay buffers.
Bio: Rupam Mahmood is an Assistant Professor in the Department of Computing Science at the University of Alberta, a Canada CIFAR AI Chair, and a Fellow at the Alberta Machine Intelligence Institute (Amii). His work intersects reinforcement learning, continual learning, and robot learning. Mahmood focuses on developing deep representation learning algorithms to address issues of continual learning such as scalability, catastrophic forgetting, and loss of plasticity, with the ultimate goal of building autonomous continual learning systems such as freely-living robots. Before joining UAlberta, Mahmood led the AI Research team at the robotics company Kindred AI, where he developed SenseAct, the first open-source toolkit and benchmark task suite for reproducible real-time learning with various physical robots.
Afternoon Session (chair: Razvan Pascanu)
Abstract: In this talk, I will provide an overview of my lab’s recent works on tackling the lifelong learning problem through modern optimization tools. I will discuss how to design new optimizers that are more suitable for lifelong learning with less forgetting. I will also talk about the connections between pre-training, sharpness of the loss landscape, linear mode connectivity, and catastrophic forgetting. I will then talk about how to use ideas from optimization to decide when to expand a neural network during lifelong learning. Finally, I will touch on the problem of plasticity and discuss whether loss of plasticity is data-specific, architecture-specific, or both.
Bio: Sarath Chandar is an Associate Professor at Polytechnique Montreal where he leads the Chandar Research Lab. He is also a core faculty member at Mila, the Quebec AI Institute. Sarath holds a Canada CIFAR AI Chair and the Canada Research Chair in Lifelong Machine Learning. His research interests include lifelong learning, deep learning, optimization, reinforcement learning, and natural language processing. To promote research in lifelong learning, Sarath created the Conference on Lifelong Learning Agents (CoLLAs) in 2022 and served as a program chair for 2022 and 2023. He received his PhD from the University of Montreal and MS by research from the Indian Institute of Technology Madras. Webpage: http://sarathchandar.in/.
Abstract: In this talk I will discuss whether the standard assumption of treating all the parameters in a Bayesian neural network stochastically is actually justified. I will explain how our recent work has given compelling theoretical and empirical evidence that this standard construction may be unnecessary. From a theoretical perspective, I will demonstrate that expressive predictive distributions require only small amounts of stochasticity. From an empirical perspective, I will explain how our investigations found no systematic benefit of full stochasticity across four different inference modalities and eight datasets. I will try to leave plenty of time at the end of the talk for debate on the subject!
Bio: I am a Senior Researcher in Machine Learning (and from September, Associate Professor) in the Department of Statistics at the University of Oxford, where I run the RainML Research Lab. My research covers a wide range of topics in and around machine learning and experimental design, with areas of particular interest including Bayesian experimental design, deep learning, representation learning, generative models, Monte Carlo methods, active learning, probabilistic programming, and variational inference.
Abstract: In many real-world problems the agent is much smaller than the vast world in which it must operate. In such scenarios, the world appears non-stationary to the agent, and thus we require agents capable of stable, non-convergent, never-ending learning. Successful agents must balance specializing their learning to the current situation with the need to learn many things over time which can be combined to learn yet new things, a concept known as scaffolding. This talk will review my lab's recent work on architectures and algorithms for learning many things.
Bio: Adam is an assistant professor in the University of Alberta’s Department of Computing Science, a Canada CIFAR AI Chair, and the Director of Scientific Operations at the Alberta Machine Intelligence Institute (Amii). He is also a Principal Investigator in the Reinforcement Learning & Artificial Intelligence (RLAI) Lab. Adam co-created the Reinforcement Learning Specialization, taken by over 85,000 students on Coursera. Adam's research is focused on understanding the fundamental principles of learning in continual learning settings, both simulated worlds and real-world industrial-control applications. Adam's group is deeply passionate about good empirical practices and new methodologies to help determine if our algorithms are ready for deployment in the real world.
Morning Session (chair: Sarath Chandar)
Abstract: This quick tutorial aims to provide a concise overview of optimization techniques and heuristics widely used in training deep neural networks, offering insights into their motivation and effectiveness. Due to time constraints, this will not be an in-depth treatment but rather an initial exposure to motivate further exploration. The tutorial broadly covers challenges associated with two types of local structure in the loss surface that influence the trainability of deep networks: curvature and gradient. We begin with curvature, understanding its impact on gradient descent convergence and how to address challenges associated with ill-conditioned Hessians in deep learning. Approaches discussed include batch normalization and residual networks. We then shift to gradient challenges, namely their instability, which may lead to vanishing or exploding gradients, and discuss the effects of initialization, activation functions, and gradient clipping on the quality of training.
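As a small runnable illustration of one of the gradient-stabilization heuristics covered (a sketch only; model, loss_fn, and batch are placeholders, and PyTorch is just one possible framework):

```python
import torch

def training_step(model, loss_fn, batch, optimizer, max_norm=1.0):
    """One optimization step with global-norm gradient clipping."""
    optimizer.zero_grad()
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # If the L2 norm of the full gradient vector exceeds max_norm, rescale
    # it down to max_norm: this guards against exploding gradients while
    # preserving the gradient direction.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```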
Bio: Hossein Mobahi is a senior research scientist at Google DeepMind, where his interests focus on the intersection of optimization and generalization in deep neural networks. Before joining Google in 2016, he was a postdoctoral researcher at MIT's CSAIL. He holds a PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC).
Abstract: Recent work has shown that methods like SAM, which either explicitly or implicitly penalize second-order information, can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition of the Hessian can be quantitatively interpreted as separating feature exploitation from feature exploration. The feature exploration, which can be described by the Nonlinear Modeling Error matrix (NME), is commonly neglected in the literature since it vanishes at interpolation. Our work shows that the NME is in fact important, as it can explain why gradient penalties are sensitive to the choice of activation function. Using this insight, we design interventions to improve performance. We also provide evidence that challenges the long-held equivalence of weight noise and gradient penalties. This equivalence relies on the assumption that the NME can be ignored, which we find does not hold for modern networks since they involve significant feature learning. We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.
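In symbols (the standard Hessian decomposition the abstract refers to; notation mine):

```latex
% For a loss L(\theta) = \sum_i \ell( f(x_i; \theta), y_i ) with network
% outputs f and Jacobians J_i = \partial f(x_i; \theta) / \partial \theta,
% the Hessian splits into two terms:
\nabla^2_\theta L
 = \underbrace{\sum_i J_i^\top \big( \nabla_f^2 \ell \big) J_i}_{\text{Gauss--Newton (feature exploitation)}}
 + \underbrace{\sum_i \sum_o \frac{\partial \ell}{\partial f_o} \, \nabla^2_\theta f_o(x_i; \theta)}_{\text{NME (feature exploration)}}
% At interpolation \partial \ell / \partial f = 0, so the NME vanishes,
% which is why it has commonly been neglected.
```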
Bio: Hossein Mobahi is a senior research scientist at Google DeepMind, where his interests focus on the intersection of optimization and generalization in deep neural networks. Before joining Google in 2016, he was a postdoctoral researcher at MIT's CSAIL. He holds a PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC).
Afternoon Session
Abstract: We consider the high-dimensional geometry associated with the first- and second-order structure of deep learning (DL) losses. We briefly review ongoing work which is starting to highlight the importance of such structure. We then introduce three lines of advances that utilize such structure along with randomness. First, we revisit the challenges of differentially private (DP) DL, and demonstrate how DP-DL with gradient projections using the first-order geometry holds considerable promise. Second, we consider online optimization with DL and, surprisingly, establish O(log T) regret bounds even though the problem does not fall under the standard online convex optimization paradigm. The development utilizes the second-order geometry of DL, along with some unique use of random perturbations. Third, we introduce a new approach to generalization bounds which directly uses the geometry in terms of the Gaussian width of gradients, which is shown to be related to the eigenvalue decay of the loss Hessian as well as the Gaussian width of the featurizer. Thus, learning algorithms which lead to fast-decaying Hessian eigenvalues, or to feature selection in the sense of a small number of active features while maintaining low training loss, are expected to generalize well.
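For reference (the standard definition; its application to gradient sets is the subject of the talk):

```latex
% Gaussian width of a set A \subset \mathbb{R}^d:
w(A) = \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}
  \Big[ \sup_{a \in A} \langle g, a \rangle \Big]
% A small width of the relevant set of gradients (e.g., due to fast decay
% of the Hessian eigenvalues) translates into tighter generalization bounds.
```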
Bio: Arindam Banerjee is a Founder Professor at the Department of Computer Science, University of Illinois Urbana-Champaign. His research interests are in machine learning. His current research focuses on computational and statistical aspects of over-parameterized models including deep learning, spatial and temporal data analysis, generative models, and sequential decision making problems. His work also focuses on applications of machine learning in complex real-world and scientific domains including problems in climate science and ecology. He has won several awards, including the NSF CAREER award (2010), the IBM Faculty Award (2013), and seven best paper awards in top-tier venues.
Morning Session (chair: Zelda Mariet)
Abstract: One of my research dreams is to build a high-resolution video generation model that enables granular control over, e.g., the scene appearance and the interactions between objects. I tried, and then realised that my need to invent deep learning tricks for this goal stems from the issue of non-identifiability in my sequential deep generative models. In this talk I will discuss our research towards developing identifiable deep generative models in sequence modelling, and share some recent and ongoing works on switching dynamic models. Throughout the talk I will highlight the balance between the causality "Theorist" and the deep learning "Alchemist", and discuss my opinions on the future of causal deep generative modelling research.
Bio: Dr Yingzhen Li is a Senior Lecturer in Machine Learning at Imperial College London, UK. Before that she worked at Microsoft Research Cambridge and Disney Research. She received her PhD from the University of Cambridge. Yingzhen is passionate about building reliable machine learning systems with probabilistic methods, and her published work has been applied in industrial systems and implemented in popular deep learning frameworks. Her work on Bayesian ML was recognised in the AAAI 2023 New Faculty Highlights, and she gave an invited tutorial on approximate inference at NeurIPS 2020. She regularly serves as (Senior) Area Chair for ICML, ICLR and NeurIPS, and she is a Program Chair for AISTATS 2024. When not at work, Yingzhen enjoys reading, hiking, video games, and following news on the latest technology developments.
Abstract: In this talk, I will introduce our new problem called Learning Transfer, which can be seen as a complementary problem to Transfer Learning. This problem is formulated as the transfer of a learning trajectory from one initialization to another initialization. We leverage the permutation symmetry of neural networks and provide theoretical evidence for the existence of appropriate permutations for learning transfer in the case of overparameterized two-layer MLPs. We also derive the first algorithm to solve this problem approximately in an efficient way, and demonstrate its application to reducing the cost of updating the underlying pre-trained model of fine-tuned models.
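To illustrate the symmetry being exploited (a minimal sketch, not the paper's algorithm; all names are mine): permuting the hidden units of a two-layer MLP, together with the matching rows and columns of its weights, leaves the network function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    # Two-layer MLP with ReLU hidden units.
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

P = rng.permutation(d_hidden)  # a random relabeling of the hidden units
x = rng.normal(size=d_in)
# Permute the rows of (W1, b1) and the columns of W2 consistently:
assert np.allclose(mlp(x, W1, b1, W2, b2),
                   mlp(x, W1[P], b1[P], W2[:, P], b2))
```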
Bio: Daiki Chijiwa is a researcher at NTT, Computer and Data Science Laboratories in Tokyo, where he is currently working on both theoretical and empirical research in deep learning. Before joining NTT in 2019, he received an MSc degree in Mathematical Sciences from The University of Tokyo, where he worked on algebraic geometry and Hodge theory.
Afternoon Session (chair: Yingzhen Li)
Abstract: Ensembles are a straightforward, remarkably effective method for improving the accuracy, calibration, and robustness of models on classification tasks; yet the reasons that underlie their success remain an active area of research. We build upon the extension of the bias-variance decomposition by Pfau (2013) to gain crucial insights into the behavior of ensembles of classifiers. Introducing a dual reparameterization of the bias-variance tradeoff, we first derive generalized laws of total expectation and variance for nonsymmetric losses typical of classification tasks. Comparing conditional and bootstrap bias/variance estimates, we then show that conditional estimates necessarily incur an irreducible error. Next, we show that ensembling in dual space reduces the variance and leaves the bias unchanged, whereas standard ensembling can arbitrarily affect the bias. Empirically, standard ensembling reduces the bias, leading us to hypothesize that ensembles of classifiers may perform well in part because of this unexpected reduction. We conclude with an empirical analysis of recent deep learning methods that ensemble over hyperparameters, revealing that these techniques indeed favor bias reduction. This suggests that, contrary to classical wisdom, targeting bias reduction may be a promising direction for classifier ensembles.
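In symbols, the decomposition being extended (Pfau's Bregman bias-variance decomposition, written in my own notation for a label Y and an independent predictor \hat{Y}):

```latex
% Primal mean \bar{y} = E[Y]; dual mean
% \tilde{y} = (\nabla F)^{-1}\big( E[ \nabla F(\hat{Y}) ] \big).
\mathbb{E}\big[ D_F(Y, \hat{Y}) \big]
 = \underbrace{\mathbb{E}\big[ D_F(Y, \bar{y}) \big]}_{\text{noise}}
 + \underbrace{D_F(\bar{y}, \tilde{y})}_{\text{bias}}
 + \underbrace{\mathbb{E}\big[ D_F(\tilde{y}, \hat{Y}) \big]}_{\text{variance}}
% Ensembling in dual space averages \nabla F(\hat{Y}) (log-probabilities
% in the KL case), which reduces the variance term and leaves the bias
% unchanged; standard (primal) ensembling can move the bias as well.
```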
Bio: Zelda Mariet is a cofounder of and Principal Research Scientist at Bioptimus, working on multi-scale foundation models for biology. Prior to joining, Zelda was at Google DeepMind, where she worked on the theory and practice of reliability for deep learning. Zelda obtained her PhD from MIT, with a dissertation centered on the beautiful world of negatively dependent measures and strongly Rayleigh polynomials.
Abstract: We first provide insights into how modern neural networks, such as transformers, learn representations and incorporate a memory mechanism in their architecture. Next, we present a general approach to learning such representations from a trained network and transferring those representations into a second network with a different size or architecture. Specifically, our strategy relies on a non-linear variant of PCA via Bregman divergences. Finally, we discuss some possible future directions for enhancing representation and memory in deep neural networks.
Bio: Ehsan Amid is a Research Scientist at Google DeepMind (formerly Google Brain). He received his PhD in Computer Science (with a focus on Machine Learning) from the University of California, Santa Cruz, and an MSc degree in Machine Learning and Data Mining from Aalto University, Finland. He works on machine learning theory, robust learning, optimization, and dimensionality reduction techniques. He is a Gemini core contributor for training multimodal large language models.