Logistics

Date: Friday, April 18th, 2025

Location: Northwestern University, Mudd Library, 3rd floor (Room 3514), 2233 Tech Dr, Evanston, IL 60208.

Parking: Attendees driving to the workshop can park in the North Campus Parking Garage, 2311 N Campus Dr #2300, Evanston, IL 60208 (https://maps.northwestern.edu/txt/facility/646). You’ll exit the garage on the opposite side from the car entrance and see Mudd Library directly in front of you across a grassy lawn. Take the elevator to your right in the library lobby to the 3rd floor.

Parking passes for free parking in the designated NU parking garage will be provided at the workshop. Please remember to ask for a pass before leaving the workshop.

Registration: https://docs.google.com/forms/d/e/1FAIpQLSfX8dBNJdrab01oSff8MiRbdUU1Q9PMTcvizcyU-xIdym1OlA/viewform 

 

Streaming Link: 

Description:

There have been rapid advances in the practice of generative AI and deep learning. This workshop will bring together researchers from across Chicago and the wider research community to cover recent progress in understanding the mechanisms behind deep learning, and related aspects of generalization and optimization. The event is part of the Special Program on Deep Learning and Optimization hosted by the Institute for Data, Econometrics, Algorithms, and Learning (IDEAL).

 
Schedule:  

Time | Event

9:30 – 9:45 | Breakfast and opening remarks

9:45 – 10:30 | Talk 1: Sitan Chen (Harvard University) on Gradient dynamics for low-rank fine-tuning beyond kernels

10:30 – 11:00 | Coffee break

11:00 – 11:45 | Talk 2: Kaifeng Lyu (Simons Institute, UC Berkeley) on A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

11:45 – 1:00 | Lunch

1:00 – 2:00 | Short Talks Session: Chang-Han Rhee (Northwestern), David Yunis (TTIC), Bravim Purohit (IIT), Dravy Sharma (TTIC), Anxin Guo (Northwestern)

2:00 – 2:45 | Talk 3: Frederic Koehler (University of Chicago) on Inductive Bias in Generative Modeling

2:45 – 3:15 | Coffee break

3:15 – 4:00 | Talk 4: Mengdi Wang (Princeton University) on From Genome to Theorem: Can Large Language Models Do Science?

4:00 – 5:00 | Poster session and networking 

Abstracts: 

Speaker: Sitan Chen

Title:  Gradient dynamics for low-rank fine-tuning beyond kernels

Abstract: LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, from a mathematical perspective it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this talk I describe a toy model for studying low-rank fine-tuning. We are given the weights of a two-layer base model f, as well as i.i.d. samples (x,f∗(x)) where x is Gaussian and f∗ is the teacher model given by perturbing the weights of f by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of f are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of f, the training dynamics are nonlinear. Nevertheless, in this regime we prove under mild assumptions that a student model which is initialized at the base model and trained with online gradient descent will converge to the teacher in dk^{O(1)} iterations, where k is the number of neurons in f. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation’s Hermite expansion. We also prove that in our setting, learning the teacher model “from scratch” can require significantly more iterations. Based on joint work with Kerem Dayi.
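
To make the setup concrete, here is a minimal NumPy sketch (not the paper's code) of the toy model above: a two-layer base model f, a teacher f* obtained by adding a rank-1 perturbation of comparable norm to the first-layer weights, and a student initialized at f and trained with online gradient descent on fresh Gaussian samples. The activation, loss, step size, and sample budget are illustrative assumptions.

```python
# A minimal sketch of the toy low-rank fine-tuning setup described above.
# The activation, loss, learning rate, and iteration count are illustrative
# assumptions, not details from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5                      # input dimension and number of neurons in f

# Base model f(x) = a^T sigma(W x); the second layer a is held fixed here.
W_base = rng.normal(size=(k, d)) / np.sqrt(d)
a = rng.normal(size=k) / np.sqrt(k)
sigma = np.tanh                   # stand-in smooth activation

def forward(W, x):
    return a @ sigma(W @ x)

# Teacher f*: base weights plus a rank-1 perturbation u v^T, scaled to be
# comparable in norm to W_base (the nonlinear, beyond-kernel regime).
u = rng.normal(size=k); u /= np.linalg.norm(u)
v = rng.normal(size=d); v /= np.linalg.norm(v)
W_teacher = W_base + np.linalg.norm(W_base) * np.outer(u, v)

# Student: initialized at the base model, trained by online gradient descent
# on fresh samples (x, f*(x)) with Gaussian x and squared loss.
W_student = W_base.copy()
lr = 0.01
for step in range(50_000):
    x = rng.normal(size=d)
    err = forward(W_student, x) - forward(W_teacher, x)
    # Gradient of 0.5 * err^2 w.r.t. the student's first-layer weights
    # (using tanh'(z) = 1 - tanh(z)^2).
    grad = err * np.outer(a * (1.0 - sigma(W_student @ x) ** 2), x)
    W_student -= lr * grad

print("initial distance to teacher:", np.linalg.norm(W_base - W_teacher))
print("final   distance to teacher:", np.linalg.norm(W_student - W_teacher))
```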

Speaker: Mengdi Wang

Title: From Genome to Theorem: Can Large Language Models Do Science?

Abstract: Large Language Models (LLMs) are increasingly being explored as tools for scientific reasoning — not just in language tasks, but across disciplines such as math, biology, and genomics. In this talk, I’ll discuss recent developments in AI for science, including genome language models, an AI gene-editing co-scientist, and LLMs for math. I’ll highlight both the capabilities and current limitations of LLMs, and discuss key gaps between AI and science, such as overoptimism about AI’s capabilities and the lack of benchmarks and rigorous evaluation. As we push toward AI systems that can assist with discovery, the question remains: can LLMs truly do science — or are we still in the early stages of bridging that divide?

Speaker: Frederic Koehler

Title: On Inductive Bias in Generative Modeling

Abstract: There has been a lot of work on the inductive bias of gradient descent and other learning algorithms, for example in supervised settings such as linearized neural networks, matrix factorization, and logistic regression. There are, relatively speaking, fewer such examples that have been worked out in the case of generative modeling/density estimation. I will discuss one such example, for variational autoencoders, that we were able to rigorously analyze, and the role that the data distribution plays in this setting.

Speaker: Kaifeng Lyu

Title: A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

Abstract: Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this talk, I will present our recent work that proposes an empirical law to describe how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al., 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency. Joint work with Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, and Wenguang Chen. Paper Link: https://arxiv.org/abs/2503.12811
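
For a concrete feel of what such a law looks like, below is a toy Python sketch of a multi-power-style predictor: a power law in the cumulative learning-rate sum, minus a saturating power-law credit for learning-rate decay. The specific functional form, constants, and names (predicted_loss, L0, A, etc.) are illustrative assumptions, not the fitted law from the paper; see the arXiv link above for the exact formulation.

```python
# Toy sketch of a multi-power-law-style loss predictor. The functional form
# and constants below are simplified placeholders, not the fitted law from
# the paper (https://arxiv.org/abs/2503.12811).
import numpy as np

def predicted_loss(lrs, L0=2.0, A=0.6, alpha=0.4, B=0.4, beta=0.5, C=5.0):
    """Predict the loss after training with the learning-rate schedule `lrs`."""
    lrs = np.asarray(lrs, dtype=float)
    S = lrs.sum()                                   # cumulative LR ("area" under the schedule)
    base = L0 + A * S ** (-alpha)                   # power law in the LR sum
    drops = np.maximum(lrs[:-1] - lrs[1:], 0.0)     # per-step LR decay
    tail = np.cumsum(lrs[::-1])[::-1][1:]           # LR sum remaining after each step
    # Loss-reduction credit for decay, saturating as a power law in the LR spent afterwards.
    decay = B * np.sum(drops / lrs[0] * (1.0 - (1.0 + C * tail) ** (-beta)))
    return base - decay

# Evaluate the toy law along a cosine schedule at a few checkpoints.
steps = 10_000
cosine = 3e-4 * 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, steps)))
for t in (1_000, 5_000, 10_000):
    print(f"step {t:>6}: predicted loss = {predicted_loss(cosine[:t]):.4f}")
```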

 

Parking visual for NU:

 

Organizers:

  • Zhiyuan Li (TTIC)
  • Aravindan Vijayaraghavan (Northwestern University) 
  • Yutong Wang (IIT)
