Publications
2023
- Should We Learn Most Likely Functions or Parameters? Shikai Qiu, Tim G. J. Rudner, Sanyam Kapoor, and Andrew Gordon Wilson. Advances in Neural Information Processing Systems, 2023.
Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insomuch as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generally correspond to the most likely function induced by the parameter posterior. In fact, we can re-parametrize a model such that any setting of parameters can maximize the parameter posterior. As an alternative, we investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We show that this procedure leads to pathological solutions when using neural networks, prove conditions under which it is well-behaved, and derive a scalable approximation. Under these conditions, we find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.
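A minimal sketch of the distinction, assuming for simplicity an invertible reparameterization \(f = g(\theta)\) between spaces of equal dimension (the neural-network setting studied in the paper requires a generalized Jacobian):

\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta}\; p(\mathcal{D} \mid \theta)\, p(\theta),
\qquad
f_{\mathrm{MAP}} = \arg\max_{f}\; p_f(f \mid \mathcal{D}),
\]
\[
p_f\big(g(\theta) \mid \mathcal{D}\big) = p(\theta \mid \mathcal{D}) \left| \det \frac{\partial g(\theta)}{\partial \theta} \right|^{-1}.
\]

Because of the Jacobian factor, the two maximizers generally differ, and a reparameterization can be chosen so that essentially any parameter setting maximizes the parameter posterior.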
- Large Language Models Are Zero-Shot Time Series Forecasters. Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Advances in Neural Information Processing Systems, 2023.
By encoding time series as a string of numerical digits, we can frame time series forecasting as next-token prediction in text. Developing this approach, we find that large language models (LLMs) such as GPT-3 and LLaMA-2 can surprisingly zero-shot extrapolate time series at a level comparable to or exceeding the performance of purpose-built time series models trained on the downstream tasks. To facilitate this performance, we propose procedures for effectively tokenizing time series data and converting discrete distributions over tokens into highly flexible densities over continuous values. We argue the success of LLMs for time series stems from their ability to naturally represent multimodal distributions, in conjunction with biases for simplicity and repetition, which align with the salient features in many time series, such as repeated seasonal trends. We also show how LLMs can naturally handle missing data without imputation through non-numerical text, accommodate textual side information, and answer questions to help explain predictions. While we find that increasing model size generally improves performance on time series, we show that GPT-4 can perform worse than GPT-3 because of how it tokenizes numbers and because of its poor uncertainty calibration, which is likely the result of alignment interventions such as RLHF.
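A minimal sketch of the serialization idea described above, with rescaling and tokenizer-specific handling omitted; the function names are illustrative, not from the paper's code:

```python
# Hedged sketch: serialize a numeric series as space-separated digits so a
# GPT-3-style tokenizer sees one token per digit; commas separate time steps.
# Rescaling and tokenizer-specific handling (e.g. for LLaMA) are omitted.

def encode_series(values, precision=2):
    """[0.64, 1.23] -> '0 6 4 , 1 2 3' (fixed precision, decimal point dropped)."""
    steps = []
    for v in values:
        digits = f"{abs(v):.{precision}f}".replace(".", "")
        steps.append(("-" if v < 0 else "") + " ".join(digits))
    return " , ".join(steps)

def decode_series(text, precision=2):
    """Invert encode_series back to a list of floats."""
    out = []
    for step in text.split(","):
        digits = step.replace(" ", "")
        sign = -1.0 if digits.startswith("-") else 1.0
        out.append(sign * int(digits.lstrip("-")) / 10**precision)
    return out

history = encode_series([0.64, 1.23, 4.05])   # goes into the LLM prompt
# Sampled continuations are decoded back into numbers with decode_series.
```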
- Simple and Fast Group Robustness by Automatic Feature Reweighting. Shikai Qiu, Andres Potapczynski, Pavel Izmailov, and Andrew Gordon Wilson. International Conference on Machine Learning (ICML), 2023.
A major challenge to out-of-distribution generalization is reliance on spurious features – patterns that are predictive of the class label in the training data distribution, but not causally related to the target. Standard methods for reducing the reliance on spurious features typically assume that we know what the spurious feature is, which is rarely true in the real world. Methods that attempt to alleviate this limitation are complex, hard to tune, and lead to a significant computational overhead compared to standard training. In this paper, we propose Automatic Feature Reweighting (AFR), an extremely simple and fast method for updating the model to reduce the reliance on spurious features. AFR retrains the last layer of a standard ERM-trained base model with a weighted loss that emphasizes the examples where the ERM model predicts poorly, automatically upweighting the minority group without group labels. With this simple procedure, we improve upon the best reported results among competing methods trained without spurious attributes on several vision and natural language classification benchmarks, using only a fraction of their compute.
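A minimal PyTorch sketch of the two-stage recipe, assuming frozen features from the ERM backbone have already been extracted; the exponential weighting follows the idea described above, while hyperparameters and the paper's additional regularization details are simplified:

```python
# Hedged sketch of last-layer reweighting: retrain only a linear head on frozen
# features, with per-example weights that emphasize points the base ERM model
# predicts poorly. Hyperparameters and regularization are simplified.
import torch
import torch.nn.functional as F

def afr_last_layer(features, labels, base_logits, num_classes,
                   gamma=4.0, lr=1e-2, steps=500):
    """features:    [N, D] frozen embeddings from the ERM-trained backbone
    labels:      [N] integer class labels
    base_logits: [N, C] base ERM model logits on the same (held-out) points"""
    with torch.no_grad():
        # Weight ~ exp(-gamma * p_correct): confidently correct examples are
        # downweighted, poorly predicted (often minority-group) ones upweighted.
        p_correct = F.softmax(base_logits, dim=-1).gather(1, labels[:, None]).squeeze(1)
        weights = torch.exp(-gamma * p_correct)
        weights = weights / weights.sum()

    head = torch.nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        per_example = F.cross_entropy(head(features), labels, reduction="none")
        (weights * per_example).sum().backward()
        opt.step()
    return head
```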
- Parton Labeling without Matching: Unveiling Emergent Labelling Capabilities in Regression Models. Shikai Qiu, Shuo Han, Xiangyang Ju, Benjamin Nachman, and Haichen Wang. The European Physical Journal C, 2023.
Parton labeling methods are widely used when reconstructing collider events with top quarks or other massive particles. State-of-the-art techniques are based on machine learning and require training data with events that have been matched using simulations with truth information. In nature, there is no unique matching between partons and final state objects due to the properties of the strong force and due to acceptance effects. We propose a new approach to parton labeling that circumvents these challenges by recycling regression models. The final state objects that are most relevant for a regression model to predict the properties of a particular top quark are assigned to said parent particle without having any parton-matched training data. This approach is demonstrated using simulated events with top quarks and outperforms the widely used \(\chi^2\) method.
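A minimal sketch of "recycling" a trained regression model to label objects. Here relevance is measured with input-gradient saliency, which is an illustrative assumption rather than necessarily the paper's exact relevance measure:

```python
# Hedged sketch: assign final-state jets to a top quark by asking a trained
# regression model which inputs its prediction depends on most. Input-gradient
# saliency is an illustrative relevance measure, not necessarily the paper's.
import torch

def assign_jets_to_top(model, jets, k=3):
    """model: maps [1, n_jets, n_features] -> predicted top four-momentum [1, 4]
    jets:  [n_jets, n_features] reconstructed final-state objects
    Returns indices of the k jets most relevant to the prediction."""
    jets = jets.clone().requires_grad_(True)
    pred = model(jets.unsqueeze(0)).squeeze(0)            # (E, px, py, pz)
    grads = torch.autograd.grad(pred.sum(), jets)[0]      # [n_jets, n_features]
    relevance = grads.norm(dim=-1)                        # aggregate per jet
    return torch.topk(relevance, k=min(k, jets.shape[0])).indices
```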
- Holistic approach to predicting top quark kinematic properties with the covariant particle transformer. Shikai Qiu, Shuo Han, Xiangyang Ju, Benjamin Nachman, and Haichen Wang. Physical Review D, 2023.
Precise reconstruction of top quark properties is a challenging task at the Large Hadron Collider due to combinatorial backgrounds and missing information. We introduce a physics-informed neural network architecture called the Covariant Particle Transformer (CPT) for directly predicting the top quark kinematic properties from reconstructed final state objects. This approach is permutation invariant and partially Lorentz covariant and can account for a variable number of input objects. In contrast to previous machine learning-based reconstruction methods, CPT is able to predict top quark four-momenta regardless of the jet multiplicity in the event. Using simulations, we show that the CPT performs favorably compared with other machine learning top quark reconstruction approaches.
- Model-independent search for the presence of new physics in events including \(H \to \gamma\gamma\) with \(\sqrt{s} = 13\) TeV pp data recorded by the ATLAS detector at the LHC. ATLAS Collaboration and others. Journal of High Energy Physics, 2023.
A model-independent search for new physics leading to final states containing \(H \to \gamma\gamma\) decays is performed with 139 fb\(^{-1}\) of \(\sqrt{s} = 13\) TeV pp collision data recorded by the ATLAS detector at the Large Hadron Collider at CERN. This search examines 22 final states categorized by the objects that are produced in association with the Higgs boson. These objects include isolated electrons or muons, hadronically decaying \(\tau\)-leptons, additional photons, missing transverse momentum, and hadronic jets, as well as jets that are tagged as containing a \(b\)-hadron. No significant excesses above Standard Model expectations are observed, and limits on the production cross section at 95% confidence level are set. Detector efficiencies are reported for all 22 signal regions, which can be used to convert the detector-level cross-section limits reported in this paper to particle-level cross-section constraints.
2024
- Compute Better Spent: Replacing Dense Layers with Structured Matrices. Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, and Andrew Gordon Wilson. International Conference on Machine Learning (ICML), 2024.
Dense linear layers are the dominant computational bottleneck in foundation models. Identifying more efficient alternatives to dense matrices has enormous potential for building more compute-efficient models, as exemplified by the success of convolutional networks in the image domain. In this work, we systematically explore structured matrices as replacements for dense matrices. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance, especially as models scale. Using insights from the Maximal Update Parameterization, we determine the optimal scaling for initialization and learning rates of these unconventional layers. Finally, we measure the scaling laws of different structures to compare how quickly their performance improves with compute. We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks. On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. BTT matches dense ViT-S/32 performance on ImageNet-1k with 3.8 times less compute and is more efficient than dense for training small GPT-2 language models.
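As a rough illustration of the kind of structured layer involved, the sketch below implements a Monarch-like matrix (one member of the family discussed above) as two block-diagonal factors; the exact BTT parameterization and the μP-derived initialization and learning-rate scaling are simplified, and the class name is illustrative:

```python
# Hedged sketch of a Monarch-like structured layer: two block-diagonal factors
# with a "transpose" between them, giving O(d^1.5) parameters and FLOPs for a
# d x d map instead of O(d^2). Initialization here is naive fan-in; the paper
# shows structure-aware (muP-style) init and learning-rate scaling is crucial.
import math
import torch
import torch.nn as nn

class MonarchLikeLinear(nn.Module):
    """Maps dimension d -> d with d = m * m using 2 * m^3 parameters."""

    def __init__(self, m: int):
        super().__init__()
        self.m = m
        std = 1.0 / math.sqrt(m)
        self.R = nn.Parameter(torch.randn(m, m, m) * std)  # m blocks of size m x m
        self.L = nn.Parameter(torch.randn(m, m, m) * std)  # m blocks of size m x m

    def forward(self, x):                                  # x: [batch, m * m]
        b = x.shape[0]
        x = x.view(b, self.m, self.m)
        z = torch.einsum("bij,ijk->bik", x, self.R)        # block-diagonal factor 1
        y = torch.einsum("bik,kil->blk", z, self.L)        # factor 2 on the other axis
        return y.reshape(b, self.m * self.m)

layer = MonarchLikeLinear(m=32)        # replaces a dense 1024 x 1024 layer
out = layer(torch.randn(8, 1024))      # ~65k parameters vs ~1M for dense
```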