Recent UW theory papers

Theory   |   Crypto   |   CSE   |   MSR (ALG,   ML+O )
Theory   |   CSE   |   Courses   |   Papers   |   News  
  • 2024-02-24 POPSTAR: Lightweight Threshold Reporting with Reduced Leakage. Hanjun Li, Sela Navot, and Stefano Tessaro.

    Summary: This paper proposes POPSTAR, a new lightweight protocol for the private computation of heavy hitters, also known as a private threshold reporting system. In such a protocol, the users provide input measurements, and a report server learns which measurements appear more than a pre-specified threshold. POPSTAR follows the same architecture as STAR (Davidson et al, CCS 2022) by relying on a helper randomness server in addition to a main server computing the aggregate heavy hitter statistics. While STAR is extremely lightweight, it leaks a substantial amount of information, consisting of an entire histogram of the provided measurements (but only reveals the actual measurements that appear beyond the threshold). POPSTAR shows that this leakage can be reduced at a modest cost ($\sim$7$\times$ longer aggregation time). Our leakage is closer to that of Poplar (Boneh et al, S&P 2021), which relies however on distributed point functions and a different model which requires interactions of two non-colluding servers (with equal workloads) to compute the heavy hitters.

  • 2024-02-21 Private Gradient Descent for Linear Regression: Tighter Error Bounds and Instance-Specific Uncertainty Estimation. Gavin Brown, Krishnamurthy Dvijotham, Georgina Evans, Daogao Liu, Adam Smith, and Abhradeep Thakurta.

    Summary: We provide an improved analysis of standard differentially private gradient descent for linear regression under the squared error loss. Under modest assumptions on the input, we characterize the distribution of the iterate at each time step. Our analysis leads to new results on the algorithm's accuracy: for a proper fixed choice of hyperparameters, the sample complexity depends only linearly on the dimension of the data. This matches the dimension-dependence of the (non-private) ordinary least squares estimator as well as that of recent private algorithms that rely on sophisticated adaptive gradient-clipping schemes (Varshney et al., 2022; Liu et al., 2023). Our analysis of the iterates' distribution also allows us to construct confidence intervals for the empirical optimizer which adapt automatically to the variance of the algorithm on a particular data set. We validate our theorems through experiments on synthetic data.

  • 2024-02-13 On black-box separations of quantum digital signatures from pseudorandom states. Andrea Coladangelo and Saachi Mutreja.

    Summary: It is well-known that digital signatures can be constructed from one-way functions in a black-box way. While one-way functions are essentially the minimal assumption in classical cryptography, this is not the case in the quantum setting. A variety of qualitatively weaker and inherently quantum assumptions (e.g. EFI pairs, one-way state generators, and pseudorandom states) are known to be sufficient for non-trivial quantum cryptography. While it is known that commitments, zero-knowledge proofs, and even multiparty computation can be constructed from these assumptions, it has remained an open question whether the same is true for quantum digital signatures schemes (QDS). In this work, we show that there $\textit{does not}$ exist a black-box construction of a QDS scheme with classical signatures from pseudorandom states with linear, or greater, output length. Our result complements that of Morimae and Yamakawa (2022), who described a $\textit{one-time}$ secure QDS scheme with classical signatures, but left open the question of constructing a standard $\textit{multi-time}$ secure one.

  • 2024-02-02 Large language models cannot replace human participants because they cannot portray identity groups. Angelina Wang, Jamie Morgenstern, and John P. Dickerson.

    Summary: Large language models (LLMs) are increasing in capability and popularity, propelling their application in new domains -- including as replacements for human participants in computational social science, user testing, annotation tasks, and more. Traditionally, in all of these settings survey distributors are careful to find representative samples of the human population to ensure the validity of their results and understand potential demographic differences. This means in order to be a suitable replacement, LLMs will need to be able to capture the influence of positionality (i.e., relevance of social identities like gender and race). However, we show that there are two inherent limitations in the way current LLMs are trained that prevent this. We argue analytically for why LLMs are doomed to both misportray and flatten the representations of demographic groups, then empirically show this to be true on 4 LLMs through a series of human studies with 3200 participants across 16 demographic identities. We also discuss a third consideration about how identity prompts can essentialize identities. Throughout, we connect each of these limitations to a pernicious history that shows why each is harmful for marginalized demographic groups. Overall, we urge caution in use cases where LLMs are intended to replace human participants whose identities are relevant to the task at hand. At the same time, in cases where the goal is to supplement rather than replace (e.g., pilot studies), we provide empirically-better inference-time techniques to reduce, but not remove, these harms.

  • 2024-02-02 Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence. Yichuan Deng, Zhao Song, and Chiwun Yang.

    Summary: Based on SGD, previous works have proposed many algorithms that have improved convergence speed and generalization in stochastic optimization, such as SGDm, AdaGrad, Adam, etc. However, their convergence analysis under non-convex conditions is challenging. In this work, we propose a unified framework to address this issue. For any first-order methods, we interpret the updated direction $g_t$ as the sum of the stochastic subgradient $\nabla f_t(x_t)$ and an additional acceleration term $\frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2} v_t$, thus we can discuss the convergence by analyzing $\langle v_t, \nabla f_t(x_t) \rangle$. Through our framework, we have discovered two plug-and-play acceleration methods: \textbf{Reject Accelerating} and \textbf{Random Vector Accelerating}, we theoretically demonstrate that these two methods can directly lead to an improvement in convergence rate.

  • 2024-01-18 Layout Graphs, Random Walks and the t-wise Independence of SPN Block Ciphers. Tianren Liu, Angelos Pelecanos, Stefano Tessaro, and Vinod Vaikuntanathan.

    Summary: We continue the study of $t$-wise independence of substitution-permutation networks (SPNs) initiated by the recent work of Liu, Tessaro, and Vaikuntanathan (CRYPTO 2021). Our key technical result shows that when the S-boxes are randomly and independently chosen and kept secret, an $r$-round SPN with input length $n = b \cdot k$ is $2^{-\Theta(n)}$-close to $t$-wise independent within $r = O(\min\{k, \log t\})$ rounds for any $t$ almost as large as $2^{b/2}$. Here, $b$ is the input length of the S-box and we assume that the underlying mixing achieves maximum branch number. We also analyze the special case of AES parameters (with random S-boxes), and show it is $2^{-128}$-close to pairwise independent in $7$ rounds. Central to our result is the analysis of a random walk on what we call the layout graph, a combinatorial abstraction that captures equality and inequality constraints among multiple SPN evaluations. We use our technical result to show concrete security bounds for SPNs with actual block cipher parameters and small-input $S$-boxes. (This is in contrast to the large body of results on ideal-model analyses of SPNs.) For example, for the censored-AES block cipher, namely AES with most of the mixing layers removed, we show that 192 rounds suffice to attain $2^{-128}$-closeness to pairwise independence. The prior such result for AES (Liu, Tessaro and Vaikuntanathan, CRYPTO 2021) required more than 9000 rounds.

  • 2024-01-10 Quantum Time-Space Tradeoffs for Matrix Problems. Paul Beame, Niels Kornerup, and Michael Whitmeyer.

    Summary: We consider the time and space required for quantum computers to solve a wide variety of problems involving matrices, many of which have only been analyzed classically in prior work. Our main results show that for a range of linear algebra problems -- including matrix-vector product, matrix inversion, matrix multiplication and powering -- existing classical time-space tradeoffs, several of which are tight for every space bound, also apply to quantum algorithms. For example, for almost all matrices $A$, including the discrete Fourier transform (DFT) matrix, we prove that quantum circuits with at most $T$ input queries and $S$ qubits of memory require $T=\Omega(n^2/S)$ to compute matrix-vector product $Ax$ for $x \in \{0,1\}^n$. We similarly prove that matrix multiplication for $n\times n$ binary matrices requires $T=\Omega(n^3 / \sqrt{S})$. Because many of our lower bounds match deterministic algorithms with the same time and space complexity, we show that quantum computers cannot provide any asymptotic advantage for these problems with any space bound. We obtain matching lower bounds for the stronger notion of quantum cumulative memory complexity -- the sum of the space per layer of a circuit. We also consider Boolean (i.e. AND-OR) matrix multiplication and matrix-vector products, improving the previous quantum time-space tradeoff lower bounds for $n\times n$ Boolean matrix multiplication to $T=\Omega(n^{2.5}/S^{1/4})$ from $T=\Omega(n^{2.5}/S^{1/2})$. Our improved lower bound for Boolean matrix multiplication is based on a new coloring argument that extracts more from the strong direct product theorem used in prior work. Our tight lower bounds for linear algebra problems require adding a new bucketing method to the recording-query technique of Zhandry that lets us apply classical arguments to upper bound the success probability of quantum circuits.

  • 2023-12-28 Variable Neighborhood Searching Rerandomization. Jiuyao Lu and Daogao Liu.

    Summary: Rerandomization discards undesired treatment assignments to ensure covariate balance in randomized experiments. However, rerandomization based on acceptance-rejection sampling is computationally inefficient, especially when numerous independent assignments are required to perform randomization-based statistical inference. Existing acceleration methods are suboptimal and are not applicable in structured experiments, including stratified experiments and experiments with clusters. Based on metaheuristics in combinatorial optimization, we propose a novel variable neighborhood searching rerandomization(VNSRR) method to draw balanced assignments in various experiments efficiently. We derive the unbiasedness and a lower bound for the variance reduction of the treatment effect estimator under VNSRR. Simulation studies and a real data example indicate that our method maintains the appealing statistical properties of rerandomization and can sample thousands of treatment assignments within seconds, even in cases where existing methods require an hour to complete the task.

  • 2023-12-21 LERNA: Secure Single-Server Aggregation via Key-Homomorphic Masking. Hanjun Li, Huijia Lin, Antigoni Polychroniadou, and Stefano Tessaro.

    Summary: This paper introduces LERNA, a new framework for single-server secure aggregation. Our protocols are tailored to the setting where multiple consecutive aggregation phases are performed with the same set of clients, a fraction of which can drop out in some of the phases. We rely on an initial secret sharing setup among the clients which is generated once-and-for-all, and reused in all following aggregation phases. Compared to prior works [Bonawitz et al. CCS’17, Bell et al. CCS’20], the reusable setup eliminates one round of communication between the server and clients per aggregation—i.e., we need two rounds for semi-honest security (instead of three), and three rounds (instead of four) in the malicious model. Our approach also significantly reduces the server’s computational costs by only requiring the reconstruction of a single secret-shared value (per aggregation). Prior work required reconstructing a secret-shared value for each client involved in the computation. We provide instantiations of LERNA based on both the Decisional Composite Residuosity (DCR) and (Ring) Learning with Rounding ((R)LWR) assumptions respectively and evaluate a version based on the latter assumption. In addition to savings in round-complexity (which result in reduced latency), our experiments show that the server computational costs are reduced by two orders of magnitude in comparison to the state-of-the-art. In settings with a large number of clients, we also reduce the computational costs up to twenty-fold for most clients, while a small set of “heavy clients” is subject to a workload that is still smaller than that of prior work.

  • 2023-12-19 Initializing Services in Interactive ML Systems for Diverse Users. Avinandan Bose, Mihaela Curmei, Daniel L. Jiang, Jamie Morgenstern, Sarah Dean, Lillian J. Ratliff, and Maryam Fazel.

    Summary: This paper studies ML systems that interactively learn from users across multiple subpopulations with heterogeneous data distributions. The primary objective is to provide specialized services for different user groups while also predicting user preferences. Once the users select a service based on how well the service anticipated their preference, the services subsequently adapt and refine themselves based on the user data they accumulate, resulting in an iterative, alternating minimization process between users and services (learning dynamics). Employing such tailored approaches has two main challenges: (i) Unknown user preferences: Typically, data on user preferences are unavailable without interaction, and uniform data collection across a large and diverse user base can be prohibitively expensive. (ii) Suboptimal Local Solutions: The total loss (sum of loss functions across all users and all services) landscape is not convex even if the individual losses on a single service are convex, making it likely for the learning dynamics to get stuck in local minima. The final outcome of the aforementioned learning dynamics is thus strongly influenced by the initial set of services offered to users, and is not guaranteed to be close to the globally optimal outcome. In this work, we propose a randomized algorithm to adaptively select very few users to collect preference data from, while simultaneously initializing a set of services. We prove that under mild assumptions on the loss functions, the expected total loss achieved by the algorithm right after initialization is within a factor of the globally optimal total loss with complete user preference data, and this factor scales only logarithmically in the number of services. Our theory is complemented by experiments on real as well as semi-synthetic datasets.

  • 2023-12-13 Fair Active Learning in Low-Data Regimes. Romain Camilleri, Andrew Wagenmaker, Jamie Morgenstern, Lalit Jain, and Kevin Jamieson.

    Summary: In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize marginal accuracy gains of small amounts of labeled data. However, existing applications of active learning for fairness fail to deliver on this, typically requiring large labeled datasets, or failing to ensure the desired fairness tolerance is met on the population distribution. To address such limitations, we introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine. We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability. We evaluate our proposed approach using well-established real-world benchmark datasets and compare it against state-of-the-art methods, demonstrating its effectiveness in producing fair models, and improvement over existing methods.

  • 2023-12-05 XOR Lemmas for Communication via Marginal Information. Siddharth Iyer and Anup Rao.

    Summary: We define the $\textit{marginal information}$ of a communication protocol, and use it to prove XOR lemmas for communication complexity. We show that if every $C$-bit protocol has bounded advantage for computing a Boolean function $f$, then every $\tilde \Omega(C \sqrt{n})$-bit protocol has advantage $\exp(-\Omega(n))$ for computing the $n$-fold xor $f^{\oplus n}$. We prove exponentially small bounds in the average case setting, and near optimal bounds for product distributions and for bounded-round protocols.

  • 2023-11-29 Sparsifying generalized linear models. Arun Jambulapati, James R. Lee, Yang P. Liu, and Aaron Sidford.

    Summary: We consider the sparsification of sums $F : \mathbb{R}^n \to \mathbb{R}$ where $F(x) = f_1(\langle a_1,x\rangle) + \cdots + f_m(\langle a_m,x\rangle)$ for vectors $a_1,\ldots,a_m \in \mathbb{R}^n$ and functions $f_1,\ldots,f_m : \mathbb{R} \to \mathbb{R}_+$. We show that $(1+\varepsilon)$-approximate sparsifiers of $F$ with support size $\frac{n}{\varepsilon^2} (\log \frac{n}{\varepsilon})^{O(1)}$ exist whenever the functions $f_1,\ldots,f_m$ are symmetric, monotone, and satisfy natural growth bounds. Additionally, we give efficient algorithms to compute such a sparsifier assuming each $f_i$ can be evaluated efficiently. Our results generalize the classic case of $\ell_p$ sparsification, where $f_i(z) = |z|^p$, for $p \in (0, 2]$, and give the first near-linear size sparsifiers in the well-studied setting of the Huber loss function and its generalizations, e.g., $f_i(z) = \min\{|z|^p, |z|^2\}$ for $0 < p \leq 2$. Our sparsification algorithm can be applied to give near-optimal reductions for optimizing a variety of generalized linear models including $\ell_p$ regression for $p \in (1, 2]$ to high accuracy, via solving $(\log n)^{O(1)}$ sparse regression instances with $m \le n(\log n)^{O(1)}$, plus runtime proportional to the number of nonzero entries in the vectors $a_1, \dots, a_m$.

  • 2023-11-28 Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, and Eric Horvitz.

    Summary: Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.

  • 2023-11-22 Positional Description Matters for Transformers Arithmetic. Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, and Yi Zhang.

    Summary: Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities --which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in (direct, no scratchpad) 15 digits multiplication and essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4 digits multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii) extrapolation from 10 digits to testing on 12 digits numbers while usual training would have no extrapolation, and for (iii) almost perfect accuracy up to 5 digits while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).

  • 2023-11-17 A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions. Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford.

    Summary: We design algorithms for minimizing $\max_{i\in[n]} f_i(x)$ over a $d$-dimensional Euclidean or simplex domain. When each $f_i$ is $1$-Lipschitz and $1$-smooth, our method computes an $\epsilon$-approximate solution using $\widetilde{O}(n \epsilon^{-1/3} + \epsilon^{-2})$ gradient and function evaluations, and $\widetilde{O}(n \epsilon^{-4/3})$ additional runtime. For large $n$, our evaluation complexity is optimal up to polylogarithmic factors. In the special case where each $f_i$ is linear -- which corresponds to finding a near-optimal primal strategy in a matrix game -- our method finds an $\epsilon$-approximate solution in runtime $\widetilde{O}(n (d/\epsilon)^{2/3} + nd + d\epsilon^{-2})$. For $n>d$ and $\epsilon=1/\sqrt{n}$ this improves over all existing first-order methods. When additionally $d = \omega(n^{8/11})$ our runtime also improves over all known interior point methods. Our algorithm combines three novel primitives: (1) A dynamic data structure which enables efficient stochastic gradient estimation in small $\ell_2$ or $\ell_1$ balls. (2) A mirror descent algorithm tailored to our data structure implementing an oracle which minimizes the objective over these balls. (3) A simple ball oracle acceleration framework suitable for non-Euclidean geometry.

  • 2023-11-16 Ghost Value Augmentation for $k$-ECSS and $k$-ECSM. D Ellis Hershkowitz, Nathan Klein, and Rico Zenklusen.

    Summary: We give a poly-time algorithm for the $k$-edge-connected spanning subgraph ($k$-ECSS) problem that returns a solution of cost no greater than the cheapest $(k+10)$-ECSS on the same graph. Our approach enhances the iterative relaxation framework with a new ingredient, which we call ghost values, that allows for high sparsity in intermediate problems. Our guarantees improve upon the best-known approximation factor of $2$ for $k$-ECSS whenever the optimal value of $(k+10)$-ECSS is close to that of $k$-ECSS. This is a property that holds for the closely related problem $k$-edge-connected spanning multi-subgraph ($k$-ECSM), which is identical to $k$-ECSS except edges can be selected multiple times at the same cost. As a consequence, we obtain a $\left(1+O\left(\frac{1}{k}\right)\right)$-approximation algorithm for $k$-ECSM, which resolves a conjecture of Pritchard and improves upon a recent $\left(1+O\left(\frac{1}{\sqrt{k}}\right)\right)$-approximation algorithm of Karlin, Klein, Oveis Gharan, and Zhang. Moreover, we present a matching lower bound for $k$-ECSM, showing that our approximation ratio is tight up to the constant factor in $O\left(\frac{1}{k}\right)$, unless $P=NP$.

  • 2023-11-15 From Trees to Polynomials and Back Again: New Capacity Bounds with Applications to TSP. Leonid Gurvits, Nathan Klein, and Jonathan Leake.

    Summary: We give simply exponential lower bounds on the probabilities of a given strongly Rayleigh distribution, depending only on its expectation. This resolves a weak version of a problem left open by Karlin-Klein-Oveis Gharan in their recent breakthrough work on metric TSP, and this resolution leads to a minor improvement of their approximation factor for metric TSP. Our results also allow for a more streamlined analysis of the algorithm. To achieve these new bounds, we build upon the work of Gurvits-Leake on the use of the productization technique for bounding the capacity of a real stable polynomial. This technique allows one to reduce certain inequalities for real stable polynomials to products of affine linear forms, which have an underlying matrix structure. In this paper, we push this technique further by characterizing the worst-case polynomials via bipartitioned forests. This rigid combinatorial structure yields a clean induction argument, which implies our stronger bounds. In general, we believe the results of this paper will lead to further improvement and simplification of the analysis of various combinatorial and probabilistic bounds and algorithms.

  • 2023-11-13 How to Use Quantum Indistinguishability Obfuscation. Andrea Coladangelo and Sam Gunn.

    Summary: Quantum copy protection, introduced by Aaronson, enables giving out a quantum program-description that cannot be meaningfully duplicated. Despite over a decade of study, copy protection is only known to be possible for a very limited class of programs. As our first contribution, we show how to achieve "best-possible" copy protection for all programs. We do this by introducing quantum state indistinguishability obfuscation (qsiO), a notion of obfuscation for quantum descriptions of classical programs. We show that applying qsiO to a program immediately achieves best-possible copy protection. Our second contribution is to show that, assuming injective one-way functions exist, qsiO is concrete copy protection for a large family of puncturable programs -- significantly expanding the class of copy-protectable programs. A key tool in our proof is a new variant of unclonable encryption (UE) that we call coupled unclonable encryption (cUE). While constructing UE in the standard model remains an important open problem, we are able to build cUE from one-way functions. If we additionally assume the existence of UE, then we can further expand the class of puncturable programs for which qsiO is copy protection. Finally, we construct qsiO relative to an efficient quantum oracle.

  • 2023-11-07 Computing Approximate $\ell_p$ Sensitivities. Swati Padmanabhan, David P. Woodruff, and Qiuyi Zhang.

    Summary: Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating $\ell_p$ sensitivities, which we show is equivalent to approximate $\ell_p$ regression, are known for only the $\ell_2$ setting, in which they are termed leverage scores. In this work, we provide efficient algorithms for approximating $\ell_p$ sensitivities and related summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i.e. the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant factor approximation to the total sensitivity at the cost of roughly $O(\sqrt{d})$ sensitivity computations. Furthermore, we estimate the maximum $\ell_1$ sensitivity, up to a $\sqrt{d}$ factor, using $O(d)$ sensitivity computations. We generalize all these results to $\ell_p$ norms for $p > 1$. Lastly, we experimentally show that for a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have low intrinsic effective dimensionality.

  • 2023-11-07 User-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal Rates. Hilal Asi and Daogao Liu.

    Summary: We study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires super-polynomial runtime [Ghazi et al. (2023)] or requires the number of users to grow polynomially with the dimensionality of the problem with additional strict assumptions [Bassily et al. (2023)]. We develop new algorithms for user-level DP-SCO that obtain optimal rates for both convex and strongly convex functions in polynomial time and require the number of users to grow only logarithmically in the dimension. Moreover, our algorithms are the first to obtain optimal rates for non-smooth functions in polynomial time. These algorithms are based on multiple-pass DP-SGD, combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.

  • 2023-11-03 A Lower Bound for the Max Entropy Algorithm for TSP. Billy Jin, Nathan Klein, and David P. Williamson.

    Summary: One of the most famous conjectures in combinatorial optimization is the four-thirds conjecture, which states that the integrality gap of the subtour LP relaxation of the TSP is equal to $\frac43$. For 40 years, the best known upper bound was 1.5, due to Wolsey (1980). Recently, Karlin, Klein, and Oveis Gharan (2022) showed that the max entropy algorithm for the TSP gives an improved bound of $1.5 - 10^{-36}$. In this paper, we show that the approximation ratio of the max entropy algorithm is at least 1.375, even for graphic TSP. Thus the max entropy algorithm does not appear to be the algorithm that will ultimately resolve the four-thirds conjecture in the affirmative, should that be possible.

  • 2023-10-27 Minimax Optimal Submodular Optimization with Bandit Feedback. Artin Tajdini, Lalit Jain, and Kevin Jamieson.

    Summary: We consider maximizing a monotonic, submodular set function $f: 2^{[n]} \rightarrow [0,1]$ under stochastic bandit feedback. Specifically, $f$ is unknown to the learner but at each time $t=1,\dots,T$ the learner chooses a set $S_t \subset [n]$ with $|S_t| \leq k$ and receives reward $f(S_t) + \eta_t$ where $\eta_t$ is mean-zero sub-Gaussian noise. The objective is to minimize the learner's regret over $T$ times with respect to ($1-e^{-1}$)-approximation of maximum $f(S_*)$ with $|S_*| = k$, obtained through greedy maximization of $f$. To date, the best regret bound in the literature scales as $k n^{1/3} T^{2/3}$. And by trivially treating every set as a unique arm one deduces that $\sqrt{ {n \choose k} T }$ is also achievable. In this work, we establish the first minimax lower bound for this setting that scales like $\mathcal{O}(\min_{i \le k}(in^{1/3}T^{2/3} + \sqrt{n^{k-i}T}))$. Moreover, we propose an algorithm that is capable of matching the lower bound regret.

  • 2023-10-27 Structured Semidefinite Programming for Recovering Structured Preconditioners. Arun Jambulapati, Jerry Li, Christopher Musco, Kirankumar Shiragur, Aaron Sidford, and Kevin Tian.

    Summary: We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star,\epsilon^{-1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $\Omega(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $\Omega(d^{\omega})$ where $\omega > 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery.

  • 2023-10-25 Detecting Pretraining Data from Large Language Models. Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer.

    Summary: Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios, copyrighted book detection, contaminated downstream example detection and privacy auditing of machine unlearning, and find it a consistently effective solution.

  • 2023-10-25 Fast Algorithms for Separable Linear Programs. Sally Dong, Gramoz Goranci, Lawrence Li, Sushant Sachdeva, and Guanghao Ye.

    Summary: In numerical linear algebra, considerable effort has been devoted to obtaining faster algorithms for linear systems whose underlying matrices exhibit structural properties. A prominent success story is the method of generalized nested dissection~[Lipton-Rose-Tarjan'79] for separable matrices. On the other hand, the majority of recent developments in the design of efficient linear program (LP) solves do not leverage the ideas underlying these faster linear system solvers nor consider the separable structure of the constraint matrix. We give a faster algorithm for separable linear programs. Specifically, we consider LPs of the form $\min_{\mathbf{A}\mathbf{x}=\mathbf{b}, \mathbf{l}\leq\mathbf{x}\leq\mathbf{u}} \mathbf{c}^\top\mathbf{x}$, where the graphical support of the constraint matrix $\mathbf{A} \in \mathbb{R}^{n\times m}$ is $O(n^\alpha)$-separable. These include flow problems on planar graphs and low treewidth matrices among others. We present an $\tilde{O}((m+m^{1/2 + 2\alpha}) \log(1/\epsilon))$ time algorithm for these LPs, where $\epsilon$ is the relative accuracy of the solution. Our new solver has two important implications: for the $k$-multicommodity flow problem on planar graphs, we obtain an algorithm running in $\tilde{O}(k^{5/2} m^{3/2})$ time in the high accuracy regime; and when the support of $\mathbf{A}$ is $O(n^\alpha)$-separable with $\alpha \leq 1/4$, our algorithm runs in $\tilde{O}(m)$ time, which is nearly optimal. The latter significantly improves upon the natural approach of combining interior point methods and nested dissection, whose time complexity is lower bounded by $\Omega(\sqrt{m}(m+m^{\alpha\omega}))=\Omega(m^{3/2})$, where $\omega$ is the matrix multiplication constant. Lastly, in the setting of low-treewidth LPs, we recover the results of [DLY,STOC21] and [GS,22] with significantly simpler data structure machinery.

  • 2023-10-19 Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights. Yichuan Deng, Zhao Song, Shenghao Xie, and Chiwun Yang.

    Summary: In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.

  • 2023-10-18 Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention. Yichuan Deng, Zhao Song, and Tianyi Zhou.

    Summary: Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of softmax function. Conversely, linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation when compared to the traditional softmax attention mechanism. In this paper, we bridge the gap in our theoretical understanding of the reasons behind the practical performance gap between softmax and linear attention. By conducting a comprehensive comparative analysis of these two attention mechanisms, we shed light on the underlying reasons for why softmax attention outperforms linear attention in most scenarios.

  • 2023-10-15 FiLM: Fill-in Language Models for Any-Order Generation. Tianxiao Shen, Hao Peng, Ruoqi Shen, Yao Fu, Zaid Harchaoui, and Yejin Choi.

    Summary: Language models have become the backbone of today's AI systems. However, their predominant left-to-right generation limits the use of bidirectional context, which is essential for tasks that involve filling text in the middle. We propose the Fill-in Language Model (FiLM), a new language modeling approach that allows for flexible generation at any position without adhering to a specific generation order. Its training extends the masked language modeling objective by adopting varying mask probabilities sampled from the Beta distribution to enhance the generative capabilities of FiLM. During inference, FiLM can seamlessly insert missing phrases, sentences, or paragraphs, ensuring that the outputs are fluent and are coherent with the surrounding context. In both automatic and human evaluations, FiLM outperforms existing infilling methods that rely on left-to-right language models trained on rearranged text segments. FiLM is easy to implement and can be either trained from scratch or fine-tuned from a left-to-right language model. Notably, as the model size grows, FiLM's perplexity approaches that of strong left-to-right language models of similar sizes, indicating FiLM's scalability and potential as a large language model.

  • 2023-10-12 Stronger Coreset Bounds for Kernel Density Estimators via Chaining. Rainie Bozzai and Thomas Rothvoss.

    Summary: We apply the discrepancy method and a chaining approach to give improved bounds on the coreset complexity of a wide class of kernel functions. Our results give randomized polynomial time algorithms to produce coresets of size $O\big(\frac{\sqrt{d}}{\varepsilon}\sqrt{\log\log \frac{1}{\varepsilon}}\big)$ for the Gaussian and Laplacian kernels in the case that the data set is uniformly bounded, an improvement that was not possible with previous techniques. We also obtain coresets of size $O\big(\frac{1}{\varepsilon}\sqrt{\log\log \frac{1}{\varepsilon}}\big)$ for the Laplacian kernel for $d$ constant. Finally, we give the best known bounds of $O\big(\frac{\sqrt{d}}{\varepsilon}\sqrt{\log(2\max\{1,\alpha\})}\big)$ on the coreset complexity of the exponential, Hellinger, and JS Kernels, where $1/\alpha$ is the bandwidth parameter of the kernel.

  • 2023-10-12 On the Rational Degree of Boolean Functions and Applications. Vishnu Iyer, Siddhartha Jain, Matt Kovacs-Deak, Vinayak M. Kumar, Luke Schaeffer, Daochen Wang, and Michael Whitmeyer.

    Summary: We study a natural complexity measure of Boolean functions known as the (exact) rational degree. For total functions $f$, it is conjectured that $\mathrm{rdeg}(f)$ is polynomially related to $\mathrm{deg}(f)$, where $\mathrm{deg}(f)$ is the Fourier degree. Towards this conjecture, we show that symmetric functions have rational degree at least $\mathrm{deg}(f)/2$ and monotone functions have rational degree at least $\sqrt{\mathrm{deg}(f)}$. We observe that both of these lower bounds are tight. In addition, we show that all read-once depth-$d$ Boolean formulae have rational degree at least $\Omega(\mathrm{deg}(f)^{1/d})$. Furthermore, we show that almost every Boolean function on $n$ variables has rational degree at least $n/2 - O(\sqrt{n})$. In contrast to total functions, we exhibit partial functions that witness unbounded separations between rational and approximate degree, in both directions. As a consequence, we show that for quantum computers, post-selection and bounded-error are incomparable resources in the black-box model.

  • 2023-10-03 Learning quantum Hamiltonians at any temperature in polynomial time. Ainesh Bakshi, Allen Liu, Ankur Moitra, and Ewin Tang.

    Summary: We study the problem of learning a local quantum Hamiltonian $H$ given copies of its Gibbs state $\rho = e^{-\beta H}/\textrm{tr}(e^{-\beta H})$ at a known inverse temperature $\beta>0$. Anshu, Arunachalam, Kuwahara, and Soleimanifar (arXiv:2004.07266) gave an algorithm to learn a Hamiltonian on $n$ qubits to precision $\epsilon$ with only polynomially many copies of the Gibbs state, but which takes exponential time. Obtaining a computationally efficient algorithm has been a major open problem [Alhambra'22 (arXiv:2204.08349)], [Anshu, Arunachalam'22 (arXiv:2204.08349)], with prior work only resolving this in the limited cases of high temperature [Haah, Kothari, Tang'21 (arXiv:2108.04842)] or commuting terms [Anshu, Arunachalam, Kuwahara, Soleimanifar'21]. We fully resolve this problem, giving a polynomial time algorithm for learning $H$ to precision $\epsilon$ from polynomially many copies of the Gibbs state at any constant $\beta > 0$. Our main technical contribution is a new flat polynomial approximation to the exponential function, and a translation between multi-variate scalar polynomials and nested commutators. This enables us to formulate Hamiltonian learning as a polynomial system. We then show that solving a low-degree sum-of-squares relaxation of this polynomial system suffices to accurately learn the Hamiltonian.

  • 2023-09-11 Textbooks Are All You Need II: phi-1.5 technical report. Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee.

    Summary: We continue the investigation into the power of smaller Transformer-based language models as initiated by \textbf{TinyStories} -- a 10 million parameter model that can produce coherent English -- and the follow-up work on \textbf{phi-1}, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate ``textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the ``Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named \textbf{phi-1.5}, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, \textbf{phi-1.5} exhibits many of the traits of much larger LLMs, both good -- such as the ability to ``think step by step" or perform some rudimentary in-context learning -- and bad, including hallucinations and the potential for toxic and biased generations -- encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source \textbf{phi-1.5} to promote further research on these urgent topics.

  • 2023-08-28 Faster Min-Cost Flow on Bounded Treewidth Graphs. Sally Dong and Guanghao Ye.

    Summary: We present a $\widetilde{O}(m\sqrt{\tau}+n\tau)$ time algorithm for finding a minimum-cost flow in graphs with $n$ vertices and $m$ edges, given a tree decomposition of width $\tau$ and polynomially bounded integer costs and capacities. This improves upon the current best algorithms for general linear programs bounded by treewidth which run in $\widetilde{O}(m \tau^{(\omega+1)/2})$ time by [Dong-Lee-Ye,21] and [Gu-Song,22], where $\omega \approx 2.37$ is the matrix multiplication exponent. Our approach leverages recent advances in structured linear program solvers and robust interior point methods. As a corollary, for any graph $G$ with $n$ vertices, $m$ edges, and treewidth $\tau$, we obtain a $\widetilde{O}(\tau^3 \cdot m)$ time algorithm to compute a tree decomposition of $G$ with width $O(\tau \cdot \log n)$.

  • 2023-08-21 Clustered Linear Contextual Bandits with Knapsacks. Yichuan Deng, Michalis Mamakos, and Zhao Song.

    Summary: In this work, we study clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models. The arms are divided in clusters, with the cluster memberships being unknown to an algorithm. Pulling an arm in a time period results in a reward and in consumption for each one of multiple resources, and with the total consumption of any resource exceeding a constraint implying the termination of the algorithm. Thus, maximizing the total reward requires learning not only models about the reward and the resource consumption, but also cluster memberships. We provide an algorithm that achieves regret sublinear in the number of time periods, without requiring access to all of the arms. In particular, we show that it suffices to perform clustering only once to a randomly selected subset of the arms. To achieve this result, we provide a sophisticated combination of techniques from the literature of econometrics and of bandits with constraints.

  • 2023-08-18 Do you know what q-means?. João F. Doriguello, Alessandro Luongo, and Ewin Tang.

    Summary: Clustering is one of the most important tools for analysis of large datasets, and perhaps the most popular clustering algorithm is Lloyd's iteration for $k$-means. This iteration takes $N$ vectors $v_1,\dots,v_N\in\mathbb{R}^d$ and outputs $k$ centroids $c_1,\dots,c_k\in\mathbb{R}^d$; these partition the vectors into clusters based on which centroid is closest to a particular vector. We present an overall improved version of the "$q$-means" algorithm, the quantum algorithm originally proposed by Kerenidis, Landman, Luongo, and Prakash (2019) which performs $\varepsilon$-$k$-means, an approximate version of $k$-means clustering. This algorithm does not rely on the quantum linear algebra primitives of prior work, instead only using its QRAM to prepare and measure simple states based on the current iteration's clusters. The time complexity is $O\big(\frac{k^{2}}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ and maintains the polylogarithmic dependence on $N$ while improving the dependence on most of the other parameters. We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^{2}}{\varepsilon^2}(kd + \log(Nd))\big)$ time. Notably, this classical algorithm matches the polylogarithmic dependence on $N$ attained by the quantum algorithms.

  • 2023-08-16 Convergence of Two-Layer Regression with Nonlinear Units. Yichuan Deng, Zhao Song, and Shenghao Xie.

    Summary: Large language models (LLMs), such as ChatGPT and GPT4, have shown outstanding performance in many human life task. Attention computation plays an important role in training LLMs. Softmax unit and ReLU unit are the key structure in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we calculate a close form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuous and the PSDness of the Hessian. Then, we introduce an greedy algorithm based on approximate Newton method, which converges in the sense of the distance to optimal solution. Last, We relax the Lipschitz condition and prove the convergence in the sense of loss value.

  • 2023-08-11 A Better-Than-1.6-Approximation for Prize-Collecting TSP. Jannis Blauth, Nathan Klein, and Martin Nägele.

    Summary: Prize-Collecting TSP is a variant of the traveling salesperson problem where one may drop vertices from the tour at the cost of vertex-dependent penalties. The quality of a solution is then measured by adding the length of the tour and the sum of all penalties of vertices that are not visited. We present a polynomial-time approximation algorithm with an approximation guarantee slightly below $1.6$, where the guarantee is with respect to the natural linear programming relaxation of the problem. This improves upon the previous best-known approximation ratio of $1.774$. Our approach is based on a known decomposition for solutions of this linear relaxation into rooted trees. Our algorithm takes a tree from this decomposition and then performs a pruning step before doing parity correction on the remainder. Using a simple analysis, we bound the approximation guarantee of the proposed algorithm by $(1+\sqrt{5})/2 \approx 1.618$, the golden ratio. With some additional technical care we further improve it to $1.599$.

  • 2023-08-02 Optimal Online Discrepancy Minimization. Janardhan Kulkarni, Victor Reis, and Thomas Rothvoss.

    Summary: We prove that there exists an online algorithm that for any sequence of vectors $v_1,\ldots,v_T \in \mathbb{R}^n$ with $\|v_i\|_2 \leq 1$, arriving one at a time, decides random signs $x_1,\ldots,x_T \in \{ -1,1\}$ so that for every $t \le T$, the prefix sum $\sum_{i=1}^t x_iv_i$ is $10$-subgaussian. This improves over the work of Alweiss, Liu and Sawhney who kept prefix sums $O(\sqrt{\log (nT)})$-subgaussian, and gives a $O(\sqrt{\log T})$ bound on the discrepancy $\max_{t \in T} \|\sum_{i=1}^t x_i v_i\|_\infty$. Our proof combines a generalization of Banaszczyk's prefix balancing result to trees with a cloning argument to find distributions rather than single colorings. We also show a matching $\Omega(\sqrt{\log T})$ strategy for an oblivious adversary.

  • 2023-07-30 Polytopes with Bounded Integral Slack Matrices Have Sub-Exponential Extension Complexity. Sally Dong and Thomas Rothvoss.

    Summary: We show that any bounded integral function $f : A \times B \mapsto \{0,1, \dots, \Delta\}$ with rank $r$ has deterministic communication complexity $\Delta^{O(\Delta)} \cdot \sqrt{r} \cdot \log^2 r$, where the rank of $f$ is defined to be the rank of the $A \times B$ matrix whose entries are the function values. As a corollary, we show that any $n$-dimensional polytope that admits a slack matrix with entries from $\{0,1,\dots,\Delta\}$ has extension complexity at most $\exp(\Delta^{O(\Delta)} \cdot \sqrt{n} \cdot \log^2 n)$.

  • 2023-07-17 Zero-th Order Algorithm for Softmax Attention Optimization. Yichuan Deng, Zhihang Li, Sridhar Mahadevan, and Zhao Song.

    Summary: Large language models (LLMs) have brought about significant transformations in human society. Among the crucial computations in LLMs, the softmax unit holds great importance. Its helps the model generating a probability distribution on potential subsequent words or phrases, considering a series of input words. By utilizing this distribution, the model selects the most probable next word or phrase, based on the assigned probabilities. The softmax unit assumes a vital function in LLM training as it facilitates learning from data through the adjustment of neural network weights and biases. With the development of the size of LLMs, computing the gradient becomes expensive. However, Zero-th Order method can approximately compute the gradient with only forward passes. In this paper, we present a Zero-th Order algorithm specifically tailored for Softmax optimization. We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale LLMs. By leveraging the Zeroth-Order method, our work contributes to the advancement of optimization techniques in the context of complex language models.

  • 2023-07-07 Scalable Membership Inference Attacks via Quantile Regression. Martin Bertran, Shuai Tang, Michael Kearns, Jamie Morgenstern, Aaron Roth, and Zhiwei Steven Wu.

    Summary: Membership inference attacks are designed to determine, using black box access to trained models, whether a particular example was used in training or not. Membership inference can be formalized as a hypothesis testing problem. The most effective existing attacks estimate the distribution of some test statistic (usually the model's confidence on the true label) on points that were (and were not) used in training by training many \emph{shadow models} -- i.e. models of the same architecture as the model being attacked, trained on a random subsample of data. While effective, these attacks are extremely computationally expensive, especially when the model under attack is large. We introduce a new class of attacks based on performing quantile regression on the distribution of confidence scores induced by the model under attack on points that are not used in training. We show that our method is competitive with state-of-the-art shadow model attacks, while requiring substantially less compute because our attack requires training only a single model. Moreover, unlike shadow model attacks, our proposed attack does not require any knowledge of the architecture of the model under attack and is therefore truly ``black-box". We show the efficacy of this approach in an extensive series of experiments on various datasets and model architectures.

  • 2023-06-20 Textbooks Are All You Need. Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li.

    Summary: We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of ``textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

  • 2023-06-13 Efficient Algorithm for Solving Hyperbolic Programs. Yichuan Deng, Zhao Song, Lichen Zhang, and Ruizhe Zhang.

    Summary: Hyperbolic polynomials is a class of real-roots polynomials that has wide range of applications in theoretical computer science. Each hyperbolic polynomial also induces a hyperbolic cone that is of particular interest in optimization due to its generality, as by choosing the polynomial properly, one can easily recover the classic optimization problems such as linear programming and semidefinite programming. In this work, we develop efficient algorithms for hyperbolic programming, the problem in each one wants to minimize a linear objective, under a system of linear constraints and the solution must be in the hyperbolic cone induced by the hyperbolic polynomial. Our algorithm is an instance of interior point method (IPM) that, instead of following the central path, it follows the central Swath, which is a generalization of central path. To implement the IPM efficiently, we utilize a relaxation of the hyperbolic program to a quadratic program, coupled with the first four moments of the hyperbolic eigenvalues that are crucial to update the optimization direction. We further show that, given an evaluation oracle of the polynomial, our algorithm only requires $O(n^2d^{2.5})$ oracle calls, where $n$ is the number of variables and $d$ is the degree of the polynomial, with extra $O((n+m)^3 d^{0.5})$ arithmetic operations, where $m$ is the number of constraints.

  • 2023-06-02 An Empirical Study on Challenging Math Problem Solving with GPT-4. Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang.

    Summary: Employing Large Language Models (LLMs) to address mathematical problems is an intriguing research endeavor, considering the abundance of math problems expressed in natural language across numerous science and engineering fields. While several prior works have investigated solving elementary mathematics using LLMs, this work explores the frontier of using GPT-4 for solving more complex and challenging math problems. We evaluate various ways of using GPT-4. Some of them are adapted from existing work, and one is MathChat, a conversational problem-solving framework newly proposed in this work. We perform the evaluation on difficult high school competition problems from the MATH dataset, which shows the advantage of the proposed conversational approach.

  • 2023-06-01 Faster Robust Tensor Power Method for Arbitrary Order. Yichuan Deng, Zhao Song, and Junze Yin.

    Summary: Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data. \emph{Tensor power method} (TPM) is one of the widely-used techniques in the decomposition of tensors. This paper presents a novel tensor power method for decomposing arbitrary order tensors, which overcomes limitations of existing approaches that are often restricted to lower-order (less than $3$) tensors or require strong assumptions about the underlying data structure. We apply sketching method, and we are able to achieve the running time of $\widetilde{O}(n^{p-1})$, on the power $p$ and dimension $n$ tensor. We provide a detailed analysis for any $p$-th order tensor, which is never given in previous works.

  • 2023-05-31 Doubly Constrained Fair Clustering. John Dickerson, Seyed A. Esmaeili, Jamie Morgenstern, and Claire Jie Zhang.

    Summary: The remarkable attention which fair clustering has received in the last few years has resulted in a significant number of different notions of fairness. Despite the fact that these notions are well-justified, they are often motivated and studied in a disjoint manner where one fairness desideratum is considered exclusively in isolation from the others. This leaves the understanding of the relations between different fairness notions as an important open problem in fair clustering. In this paper, we take the first step in this direction. Specifically, we consider the two most prominent demographic representation fairness notions in clustering: (1) Group Fairness (GF), where the different demographic groups are supposed to have close to population-level representation in each cluster and (2) Diversity in Center Selection (DS), where the selected centers are supposed to have close to population-level representation of each group. We show that given a constant approximation algorithm for one constraint (GF or DS only) we can obtain a constant approximation solution that satisfies both constraints simultaneously. Interestingly, we prove that any given solution that satisfies the GF constraint can always be post-processed at a bounded degradation to the clustering cost to additionally satisfy the DS constraint while the reverse is not true. Furthermore, we show that both GF and DS are incompatible (having an empty feasibility set in the worst case) with a collection of other distance-based fairness notions. Finally, we carry experiments to validate our theoretical findings.

  • 2023-05-25 Learning across Data Owners with Joint Differential Privacy. Yangsibo Huang, Haotian Jiang, Daogao Liu, Mohammad Mahdian, Jieming Mao, and Vahab Mirrokni.

    Summary: In this paper, we study the setting in which data owners train machine learning models collaboratively under a privacy notion called joint differential privacy [Kearns et al., 2018]. In this setting, the model trained for each data owner $j$ uses $j$'s data without privacy consideration and other owners' data with differential privacy guarantees. This setting was initiated in [Jain et al., 2021] with a focus on linear regressions. In this paper, we study this setting for stochastic convex optimization (SCO). We present an algorithm that is a variant of DP-SGD [Song et al., 2013; Abadi et al., 2016] and provides theoretical bounds on its population loss. We compare our algorithm to several baselines and discuss for what parameter setups our algorithm is more preferred. We also empirically study joint differential privacy in the multi-class classification problem over two public datasets. Our empirical findings are well-connected to the insights from our theoretical results.

  • 2023-05-15 Sparsifying sums of norms. Arun Jambulapati, James R. Lee, Yang P. Liu, and Aaron Sidford.

    Summary: For any norms $N_1,\ldots,N_m$ on $\mathbb{R}^n$ and $N(x) := N_1(x)+\cdots+N_m(x)$, we show there is a sparsified norm $\tilde{N}(x) = w_1 N_1(x) + \cdots + w_m N_m(x)$ such that $|N(x) - \tilde{N}(x)| \leq \epsilon N(x)$ for all $x \in \mathbb{R}^n$, where $w_1,\ldots,w_m$ are non-negative weights, of which only $O(\epsilon^{-2} n \log(n/\epsilon) (\log n)^{2.5} )$ are non-zero. Additionally, if $N$ is $\mathrm{poly}(n)$-equivalent to the Euclidean norm on $\mathbb{R}^n$, then such weights can be found with high probability in time $O(m (\log n)^{O(1)} + \mathrm{poly}(n)) T$, where $T$ is the time required to evaluate a norm $N_i$. This immediately yields analogous statements for sparsifying sums of symmetric submodular functions. More generally, we show how to sparsify sums of $p$th powers of norms when the sum is $p$-uniformly smooth.

  • 2023-05-15 Linear-Sized Sparsifiers via Near-Linear Time Discrepancy Theory. Arun Jambulapati, Victor Reis, and Kevin Tian.

    Summary: Discrepancy theory provides powerful tools for producing higher-quality objects which "beat the union bound" in fundamental settings throughout combinatorics and computer science. However, this quality has often come at the price of more expensive algorithms. We introduce a new framework for bridging this gap, by allowing for the efficient implementation of discrepancy-theoretic primitives. Our framework repeatedly solves regularized optimization problems to low accuracy to approximate the partial coloring method of [Rot17], and simplifies and generalizes recent work of [JSS23] on fast algorithms for Spencer's theorem. In particular, our framework only requires that the discrepancy body of interest has exponentially large Gaussian measure and is expressible as a sublevel set of a symmetric, convex function. We combine this framework with new tools for proving Gaussian measure lower bounds to give improved algorithms for a variety of sparsification and coloring problems. As a first application, we use our framework to obtain an $\widetilde{O}(m \cdot \epsilon^{-3.5})$ time algorithm for constructing an $\epsilon$-approximate spectral sparsifier of an $m$-edge graph, matching the sparsity of [BSS14] up to constant factors and improving upon the $\widetilde{O}(m \cdot \epsilon^{-6.5})$ runtime of [LeeS17]. We further give a state-of-the-art algorithm for constructing graph ultrasparsifiers and an almost-linear time algorithm for constructing linear-sized degree-preserving sparsifiers via discrepancy theory; in the latter case, such sparsifiers were not known to exist previously. We generalize these results to their analogs in sparsifying isotropic sums of positive semidefinite matrices. Finally, to demonstrate the versatility of our technique, we obtain a nearly-input-sparsity time constructive algorithm for Spencer's theorem (where we recover a recent result of [JSS23]).

  • 2023-05-05 On Optimization and Counting of Non-Broken Bases of Matroids. Dorna Abdolazimi, Kasper Lindberg, and Shayan Oveis Gharan.

    Summary: Given a matroid $M=(E,{\cal I})$, and a total ordering over the elements $E$, a broken circuit is a circuit where the smallest element is removed and an NBC independent set is an independent set in ${\cal I}$ with no broken circuit. The set of NBC independent sets of any matroid $M$ define a simplicial complex called the broken circuit complex which has been the subject of intense study in combinatorics. Recently, Adiprasito, Huh and Katz showed that the face of numbers of any broken circuit complex form a log-concave sequence, proving a long-standing conjecture of Rota. We study counting and optimization problems on NBC bases of a generic matroid. We find several fundamental differences with the independent set complex: for example, we show that it is NP-hard to find the max-weight NBC base of a matroid or that the convex hull of NBC bases of a matroid has edges of arbitrary large length. We also give evidence that the natural down-up walk on the space of NBC bases of a matroid may not mix rapidly by showing that for some family of matroids it is NP-hard to count the number of NBC bases after certain conditionings.

  • 2023-05-04 Automatic Prompt Optimization with "Gradient Descent" and Beam Search. Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng.

    Summary: Large Language Models (LLMs) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand written with onerous trial-and-error effort. We propose a simple and nonparametric solution to this problem, Automatic Prompt Optimization (APO), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that Automatic Prompt Optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.

  • 2023-05-04 A Hyperbolic Extension of Kadison-Singer Type Results. Ruizhe Zhang and Xinzhi Zhang.

    Summary: In 2013, Marcus, Spielman, and Srivastava resolved the famous Kadison-Singer conjecture. It states that for $n$ independent random vectors $v_1,\cdots, v_n$ that have expected squared norm bounded by $\epsilon$ and are in the isotropic position in expectation, there is a positive probability that the determinant polynomial $\det(xI - \sum_{i=1}^n v_iv_i^\top)$ has roots bounded by $(1 + \sqrt{\epsilon})^2$. An interpretation of the Kadison-Singer theorem is that we can always find a partition of the vectors $v_1,\cdots,v_n$ into two sets with a low discrepancy in terms of the spectral norm (in other words, rely on the determinant polynomial). In this paper, we provide two results for a broader class of polynomials, the hyperbolic polynomials. Furthermore, our results are in two generalized settings: $\bullet$ The first one shows that the Kadison-Singer result requires a weaker assumption that the vectors have a bounded sum of hyperbolic norms. $\bullet$ The second one relaxes the Kadison-Singer result's distribution assumption to the Strongly Rayleigh distribution. To the best of our knowledge, the previous results only support determinant polynomials [Anari and Oveis Gharan'14, Kyng, Luh and Song'20]. It is unclear whether they can be generalized to a broader class of polynomials. In addition, we also provide a sub-exponential time algorithm for constructing our results.

  • 2023-04-20 Attention Scheme Inspired Softmax Regression. Yichuan Deng, Zhihang Li, and Zhao Song.

    Summary: Large language models (LLMs) have made transformed changes for human society. One of the key computation in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a distribution over possible next words or phrases, given a sequence of input words. This distribution is then used to select the most likely next word or phrase, based on the probabilities assigned by the model. The softmax unit plays a crucial role in training LLMs, as it allows the model to learn from the data by adjusting the weights and biases of the neural network. In the area of convex optimization such as using central path method to solve linear programming. The softmax function has been used a crucial tool for controlling the progress and stability of potential function [Cohen, Lee and Song STOC 2019, Brand SODA 2020]. In this work, inspired the softmax unit, we define a softmax regression problem. Formally speaking, given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to use greedy type algorithm to solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2. \end{align*} In certain sense, our provable convergence result provides theoretical support for why we can use greedy algorithm to train softmax function in practice.

  • 2023-04-16 Thin trees for laminar families. Nathan Klein and Neil Olver.

    Summary: In the laminar-constrained spanning tree problem, the goal is to find a minimum-cost spanning tree which respects upper bounds on the number of times each cut in a given laminar family is crossed. This generalizes the well-studied degree-bounded spanning tree problem, as well as a previously studied setting where a chain of cuts is given. We give the first constant-factor approximation algorithm; in particular we show how to obtain a multiplicative violation of the crossing bounds of less than 22 while losing less than a factor of 5 in terms of cost. Our result compares to the natural LP relaxation. As a consequence, our results show that given a $k$-edge-connected graph and a laminar family $\mathcal{L} \subseteq 2^V$ of cuts, there exists a spanning tree which contains only an $O(1/k)$ fraction of the edges across every cut in $\mathcal{L}$. This can be viewed as progress towards the Thin Tree Conjecture, which (in a strong form) states that this guarantee can be obtained for all cuts simultaneously.

  • 2023-04-13 Solving Tensor Low Cycle Rank Approximation. Yichuan Deng, Yeqi Gao, and Zhao Song.

    Summary: Large language models have become ubiquitous in modern life, finding applications in various domains such as natural language processing, language translation, and speech recognition. Recently, a breakthrough work [Zhao, Panigrahi, Ge, and Arora Arxiv 2023] explains the attention model from probabilistic context-free grammar (PCFG). One of the central computation task for computing probability in PCFG is formulating a particular tensor low rank approximation problem, we can call it tensor cycle rank. Given an $n \times n \times n$ third order tensor $A$, we say that $A$ has cycle rank-$k$ if there exists three $n \times k^2$ size matrices $U , V$, and $W$ such that for each entry in each \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1) } \end{align*} for all $a \in [n], b \in [n], c \in [n]$. For the tensor classical rank, tucker rank and train rank, it has been well studied in [Song, Woodruff, Zhong SODA 2019]. In this paper, we generalize the previous ``rotation and sketch'' technique in page 186 of [Song, Woodruff, Zhong SODA 2019] and show an input sparsity time algorithm for cycle rank.

  • 2023-04-10 Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension. Yichuan Deng, Sridhar Mahadevan, and Zhao Song.

    Summary: Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interests in theory. Recently the static computation and dynamic maintenance of attention matrix has been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both algorithmic perspective and hardness perspective. In this work, we consider the sparsification of the attention problem. We make one simplification which is the logit matrix is symmetric. Let $n$ denote the length of sentence, let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0,0.1)$, then we aim for finding $Y \in \mathbb{R}^{n \times m}$ (where $m\ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r) \end{align*} We provide two results for this problem. $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega} ) $ time, has $1-\delta$ succeed probability, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of non-zero entries in $X$. We use $\omega$ to denote the exponent of matrix multiplication. Currently $\omega \approx 2.373$. $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denote the $i$-th column of matrix $X$. Our main findings have the following implication for applied LLMs task: for any super large feature dimension, we can reduce it down to the size nearly linear in length of sentence.

  • 2023-04-07 Convex Minimization with Integer Minima in $\widetilde O(n^4)$ Time. Haotian Jiang, Yin Tat Lee, Zhao Song, and Lichen Zhang.

    Summary: Given a convex function $f$ on $\mathbb{R}^n$ with an integer minimizer, we show how to find an exact minimizer of $f$ using $O(n^2 \log n)$ calls to a separation oracle and $O(n^4 \log n)$ time. The previous best polynomial time algorithm for this problem given in [Jiang, SODA 2021, JACM 2022] achieves $O(n^2\log\log n/\log n)$ oracle complexity. However, the overall runtime of Jiang's algorithm is at least $\widetilde{\Omega}(n^8)$, due to expensive sub-routines such as the Lenstra-Lenstra-Lov\'asz (LLL) algorithm [Lenstra, Lenstra, Lov\'asz, Math. Ann. 1982] and random walk based cutting plane method [Bertsimas, Vempala, JACM 2004]. Our significant speedup is obtained by a nontrivial combination of a faster version of the LLL algorithm due to [Neumaier, Stehl\'e, ISSAC 2016] that gives similar guarantees, the volumetric center cutting plane method (CPM) by [Vaidya, FOCS 1989] and its fast implementation given in [Jiang, Lee, Song, Wong, STOC 2020]. For the special case of submodular function minimization (SFM), our result implies a strongly polynomial time algorithm for this problem using $O(n^3 \log n)$ calls to an evaluation oracle and $O(n^4 \log n)$ additional arithmetic operations. Both the oracle complexity and the number of arithmetic operations of our more general algorithm are better than the previous best-known runtime algorithms for this specific problem given in [Lee, Sidford, Wong, FOCS 2015] and [Dadush, V\'egh, Zambelli, SODA 2018, MOR 2021].

  • 2023-03-27 Revisiting Area Convexity: Faster Box-Simplex Games and Spectrahedral Generalizations. Arun Jambulapati and Kevin Tian.

    Summary: We investigate different aspects of area convexity [Sherman '17], a mysterious tool introduced to tackle optimization problems under the challenging $\ell_\infty$ geometry. We develop a deeper understanding of its relationship with more conventional analyses of extragradient methods [Nemirovski '04, Nesterov '07]. We also give improved solvers for the subproblems required by variants of the [Sherman '17] algorithm, designed through the lens of relative smoothness [Bauschke-Bolte-Teboulle '17, Lu-Freund-Nesterov '18]. Leveraging these new tools, we give a state-of-the-art first-order algorithm for solving box-simplex games (a primal-dual formulation of $\ell_\infty$ regression) in a $d \times n$ matrix with bounded rows, using $O(\log d \cdot \epsilon^{-1})$ matrix-vector queries. As a consequence, we obtain improved complexities for approximate maximum flow, optimal transport, min-mean-cycle, and other basic combinatorial optimization problems. We also develop a near-linear time algorithm for a matrix generalization of box-simplex games, capturing a family of problems closely related to semidefinite programs recently used as subroutines in robust statistics and numerical linear algebra.

  • 2023-03-26 The Subspace Flatness Conjecture and Faster Integer Programming. Victor Reis and Thomas Rothvoss.

    Summary: In a seminal paper, Kannan and Lov\'asz (1988) considered a quantity $\mu_{KL}(\Lambda,K)$ which denotes the best volume-based lower bound on the covering radius $\mu(\Lambda,K)$ of a convex body $K$ with respect to a lattice $\Lambda$. Kannan and Lov\'asz proved that $\mu(\Lambda,K) \leq n \cdot \mu_{KL}(\Lambda,K)$ and the Subspace Flatness Conjecture by Dadush (2012) claims a $O(\log(2n))$ factor suffices, which would match the lower bound from the work of Kannan and Lov\'asz. We settle this conjecture up to a constant in the exponent by proving that $\mu(\Lambda,K) \leq O(\log^{3}(2n)) \cdot \mu_{KL} (\Lambda,K)$. Our proof is based on the Reverse Minkowski Theorem due to Regev and Stephens-Davidowitz (2017). Following the work of Dadush (2012, 2019), we obtain a $(\log(2n))^{O(n)}$-time randomized algorithm to solve integer programs in $n$ variables. Another implication of our main result is a near-optimal flatness constant of $O(n \log^{3}(2n))$.

  • 2023-03-22 Sparks of Artificial General Intelligence: Early experiments with GPT-4. Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang.

    Summary: Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

  • 2023-03-14 Testing Causality for High Dimensional Data. Arun Jambulapati, Hilaf Hasson, Youngsuk Park, and Yuyang Wang.

    Summary: Determining causal relationship between high dimensional observations are among the most important tasks in scientific discoveries. In this paper, we revisited the \emph{linear trace method}, a technique proposed in~\citep{janzing2009telling,zscheischler2011testing} to infer the causal direction between two random variables of high dimensions. We strengthen the existing results significantly by providing an improved tail analysis in addition to extending the results to nonlinear trace functionals with sharper confidence bounds under certain distributional assumptions. We obtain our results by interpreting the trace estimator in the causal regime as a function over random orthogonal matrices, where the concentration of Lipschitz functions over such space could be applied. We additionally propose a novel ridge-regularized variant of the estimator in \cite{zscheischler2011testing}, and give provable bounds relating the ridge-estimated terms to their ground-truth counterparts. We support our theoretical results with encouraging experiments on synthetic datasets, more prominently, under high-dimension low sample size regime.

  • 2023-03-13 An Improved Sample Complexity for Rank-1 Matrix Sensing. Yichuan Deng, Zhihang Li, and Zhao Song.

    Summary: Matrix sensing is a problem in signal processing and machine learning that involves recovering a low-rank matrix from a set of linear measurements. The goal is to reconstruct the original matrix as accurately as possible, given only a set of linear measurements obtained by sensing the matrix [Jain, Netrapalli and Shanghavi, 2013]. In this work, we focus on a particular direction of matrix sensing, which is called rank-$1$ matrix sensing [Zhong, Jain and Dhillon, 2015]. We present an improvement over the original algorithm in [Zhong, Jain and Dhillon, 2015]. It is based on a novel analysis and sketching technique that enables faster convergence rates and better accuracy in recovering low-rank matrices. The algorithm focuses on developing a theoretical understanding of the matrix sensing problem and establishing its advantages over previous methods. The proposed sketching technique allows for efficiently extracting relevant information from the linear measurements, making the algorithm computationally efficient and scalable. Our novel matrix sensing algorithm improves former result [Zhong, Jain and Dhillon, 2015] on in two senses: $\bullet$ We improve the sample complexity from $\widetilde{O}(\epsilon^{-2} dk^2)$ to $\widetilde{O}(\epsilon^{-2} (d+k^2))$. $\bullet$ We improve the running time from $\widetilde{O}(md^2 k^2)$ to $\widetilde{O}(m d^2 k)$. The proposed algorithm has theoretical guarantees and is analyzed to provide insights into the underlying structure of low-rank matrices and the nature of the linear measurements used in the recovery process. It advances the theoretical understanding of matrix sensing and provides a new approach for solving this important problem.

  • 2023-03-08 Streaming Kernel PCA Algorithm With Small Space. Yichuan Deng, Zhao Song, Zifan Wang, and Han Zhang.

    Summary: Principal Component Analysis (PCA) is a widely used technique in machine learning, data analysis and signal processing. With the increase in the size and complexity of datasets, it has become important to develop low-space usage algorithms for PCA. Streaming PCA has gained significant attention in recent years, as it can handle large datasets efficiently. The kernel method, which is commonly used in learning algorithms such as Support Vector Machines (SVMs), has also been applied in PCA algorithms. We propose a streaming algorithm for Kernel PCA problems based on the traditional scheme by Oja. Our algorithm addresses the challenge of reducing the memory usage of PCA while maintaining its accuracy. We analyze the performance of our algorithm by studying the conditions under which it succeeds. Specifically, we show that, when the spectral ratio $R := \lambda_1/\lambda_2$ of the target covariance matrix is lower bounded by $C \cdot \log n\cdot \log d$, the streaming PCA can be solved with $O(d)$ space cost. Our proposed algorithm has several advantages over existing methods. First, it is a streaming algorithm that can handle large datasets efficiently. Second, it employs the kernel method, which allows it to capture complex nonlinear relationships among data points. Third, it has a low-space usage, making it suitable for applications where memory is limited.

  • 2023-03-07 Complete Log Concavity of Coverage-Like Functions. Dorna Abdolazimi and Shayan Oveis Gharan.

    Summary: We introduce an expressive subclass of non-negative almost submodular set functions, called strongly 2-coverage functions which include coverage and (sums of) matroid rank functions, and prove that the homogenization of the generating polynomial of any such function is completely log-concave, taking a step towards characterizing the coefficients of (homogeneous) completely log-concave polynomials. As a consequence we obtain that the "level sets" of any such function form an ultra-log concave sequence.

  • 2023-03-02 An Improved Classical Singular Value Transformation for Quantum Machine Learning. Ainesh Bakshi and Ewin Tang.

    Summary: We study quantum speedups in quantum machine learning (QML) by analyzing the quantum singular value transformation (QSVT) framework. QSVT, introduced by [GSLW, STOC'19, arXiv:1806.01838], unifies all major types of quantum speedup; in particular, a wide variety of QML proposals are applications of QSVT on low-rank classical data. We challenge these proposals by providing a classical algorithm that matches the performance of QSVT in this regime up to a small polynomial overhead. We show that, given a matrix $A \in \mathbb{C}^{m\times n}$, a vector $b \in \mathbb{C}^{n}$, a bounded degree-$d$ polynomial $p$, and linear-time pre-processing, we can output a description of a vector $v$ such that $\|v - p(A) b\| \leq \varepsilon\|b\|$ in $\widetilde{\mathcal{O}}(d^{11} \|A\|_{\mathrm{F}}^4 / (\varepsilon^2 \|A\|^4 ))$ time. This improves upon the best known classical algorithm [CGLLTW, STOC'20, arXiv:1910.06151], which requires $\widetilde{\mathcal{O}}(d^{22} \|A\|_{\mathrm{F}}^6 /(\varepsilon^6 \|A\|^6 ) )$ time, and narrows the gap with QSVT, which, after linear-time pre-processing to load input into a quantum-accessible memory, can estimate the magnitude of an entry $p(A)b$ to $\varepsilon\|b\|$ error in $\widetilde{\mathcal{O}}(d\|A\|_{\mathrm{F}}/(\varepsilon \|A\|))$ time. Our key insight is to combine the Clenshaw recurrence, an iterative method for computing matrix polynomials, with sketching techniques to simulate QSVT classically. We introduce several new classical techniques in this work, including (a) a non-oblivious matrix sketch for approximately preserving bi-linear forms, (b) a new stability analysis for the Clenshaw recurrence, and (c) a new technique to bound arithmetic progressions of the coefficients appearing in the Chebyshev series expansion of bounded functions, each of which may be of independent interest.

  • 2023-02-28 A CS guide to the quantum singular value transformation. Ewin Tang and Kevin Tian.

    Summary: We present a simplified exposition of some pieces of [Gily\'en, Su, Low, and Wiebe, STOC'19, arXiv:1806.01838], which introduced a quantum singular value transformation (QSVT) framework for applying polynomial functions to block-encoded matrices. The QSVT framework has garnered substantial recent interest from the quantum algorithms community, as it was demonstrated by [GSLW19] to encapsulate many existing algorithms naturally phrased as an application of a matrix function. First, we posit that the lifting of quantum singular processing (QSP) to QSVT is better viewed not through Jordan's lemma (as was suggested by [GSLW19]) but as an application of the cosine-sine decomposition, which can be thought of as a more explicit and stronger version of Jordan's lemma. Second, we demonstrate that the constructions of bounded polynomial approximations given in [GSLW19], which use a variety of ad hoc approaches drawing from Fourier analysis, Chebyshev series, and Taylor series, can be unified under the framework of truncation of Chebyshev series, and indeed, can in large part be matched via a bounded variant of a standard meta-theorem from [Trefethen, 2013]. We hope this work finds use to the community as a companion guide for understanding and applying the powerful framework of [GSLW19].

  • 2023-02-27 Query-optimal estimation of unitary channels in diamond distance. Jeongwan Haah, Robin Kothari, Ryan O'Donnell, and Ewin Tang.

    Summary: We consider process tomography for unitary quantum channels. Given access to an unknown unitary channel acting on a $\textsf{d}$-dimensional qudit, we aim to output a classical description of a unitary that is $\varepsilon$-close to the unknown unitary in diamond norm. We design an algorithm achieving error $\varepsilon$ using $O(\textsf{d}^2/\varepsilon)$ applications of the unknown channel and only one qudit. This improves over prior results, which use $O(\textsf{d}^3/\varepsilon^2)$ [via standard process tomography] or $O(\textsf{d}^{2.5}/\varepsilon)$ [Yang, Renner, and Chiribella, PRL 2020] applications. To show this result, we introduce a simple technique to "bootstrap" an algorithm that can produce constant-error estimates to one that can produce $\varepsilon$-error estimates with the Heisenberg scaling. Finally, we prove a complementary lower bound showing that estimation requires $\Omega(\textsf{d}^2/\varepsilon)$ applications, even with access to the inverse or controlled versions of the unknown unitary. This shows that our algorithm has both optimal query complexity and optimal space complexity.

  • 2023-02-24 Quantum trapdoor functions from classical one-way functions. Andrea Coladangelo.

    Summary: We formalize and study the notion of a quantum trapdoor function. This is an efficiently computable unitary that takes as input a "public" quantum state and a classical string $x$, and outputs a quantum state. This map is such that (i) it is hard to invert, in the sense that it is hard to recover $x$ given the output state (and many copies of the public state), and (ii) there is a classical trapdoor that allows efficient inversion. We show that a quantum trapdoor function can be constructed from any quantum-secure one-way function. A direct consequence of this result is that, assuming just the existence of quantum-secure one-way functions, there exists a public-key encryption scheme with a (pure) quantum public key.

  • 2023-02-21 $k$NN-Adapter: Efficient Domain Adaptation for Black-Box Language Models. Yangsibo Huang, Daogao Liu, Zexuan Zhong, Weijia Shi, and Yin Tat Lee.

    Summary: Fine-tuning a language model on a new domain is standard practice for domain adaptation. However, it can be infeasible when it comes to modern large-scale language models such as GPT-3, which can only be accessed through APIs, making it difficult to access the internal parameters of the model. In this paper, we propose $k$NN-Adapter, a method to effectively adapt these black-box large language models (LLMs) to a new domain. The $k$NN-Adapter builds on top of the retrieval-augmented language model, and adaptively learns to interpolate the output of the language model with retrieval results from a datastore consisting of the target domain data. Our experiments on four different domains demonstrate that $k$NN-Adapter significantly improves perplexity, and works particularly well in settings with limited access to LLMs. Additionally, we show that $k$NN-Adapter is more effective than fine-tuning when the amount of training data is limited. We also release a dataset to encourage further study.

  • 2023-02-20 Private (Stochastic) Non-Convex Optimization Revisited: Second-Order Stationary Points and Excess Risks. Arun Ganesh, Daogao Liu, Sewoong Oh, and Abhradeep Thakurta.

    Summary: We consider the problem of minimizing a non-convex objective while preserving the privacy of the examples in the training data. Building upon the previous variance-reduced algorithm SpiderBoost, we introduce a new framework that utilizes two different kinds of gradient oracles. The first kind of oracles can estimate the gradient of one point, and the second kind of oracles, less precise and more cost-effective, can estimate the gradient difference between two points. SpiderBoost uses the first kind periodically, once every few steps, while our framework proposes using the first oracle whenever the total drift has become large and relies on the second oracle otherwise. This new framework ensures the gradient estimations remain accurate all the time, resulting in improved rates for finding second-order stationary points. Moreover, we address a more challenging task of finding the global minima of a non-convex objective using the exponential mechanism. Our findings indicate that the regularized exponential mechanism can closely match previous empirical and population risk bounds, without requiring smoothness assumptions for algorithms with polynomial running time. Furthermore, by disregarding running time considerations, we show that the exponential mechanism can achieve a good population risk bound and provide a nearly matching lower bound.

  • 2023-02-13 Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal Sampler. Sivakanth Gopi, Yin Tat Lee, Daogao Liu, Ruoqi Shen, and Kevin Tian.

    Summary: The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time on our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.

  • 2023-01-21 From Pseudorandomness to Multi-Group Fairness and Back. Cynthia Dwork, Daniel Lee, Huijia Lin, and Pranay Tankala.

    Summary: We identify and explore connections between the recent literature on multi-group fairness for prediction algorithms and the pseudorandomness notions of leakage-resilience and graph regularity. We frame our investigation using new, statistical distance-based variants of multicalibration that are closely related to the concept of outcome indistinguishability. Adopting this perspective leads us naturally not only to our graph theoretic results, but also to new, more efficient algorithms for multicalibration in certain parameter regimes and a novel proof of a hardcore lemma for real-valued functions.

  • 2023-01-13 Cumulative Memory Lower Bounds for Randomized and Quantum Computation. Paul Beame and Niels Kornerup.

    Summary: Cumulative memory -- the sum of space used per step over the duration of a computation -- is a fine-grained measure of time-space complexity that was introduced to analyze cryptographic applications like password hashing. It is a more accurate cost measure for algorithms that have infrequent spikes in memory usage and are run in environments such as cloud computing that allow dynamic allocation and de-allocation of resources during execution, or when many multiple instances of an algorithm are interleaved in parallel. We prove the first lower bounds on cumulative memory complexity for both sequential classical computation and quantum circuits. Moreover, we develop general paradigms for bounding cumulative memory complexity inspired by the standard paradigms for proving time-space tradeoff lower bounds that can only lower bound the maximum space used during an execution. The resulting lower bounds on cumulative memory that we obtain are just as strong as the best time-space tradeoff lower bounds, which are very often known to be tight. Although previous results for pebbling and random oracle models have yielded time-space tradeoff lower bounds larger than the cumulative memory complexity, our results show that in general computational models such separations cannot follow from known lower bound techniques and are not true for many functions. Among many possible applications of our general methods, we show that any classical sorting algorithm with success probability at least $1/\text{poly}(n)$ requires cumulative memory $\tilde \Omega(n^2)$, any classical matrix multiplication algorithm requires cumulative memory $\Omega(n^6/T)$, any quantum sorting circuit requires cumulative memory $\Omega(n^3/T)$, and any quantum circuit that finds $k$ disjoint collisions in a random function requires cumulative memory $\Omega(k^3n/T^2)$.

  • 2023-01-10 Elastic Cash. Anup Rao.

    Summary: Elastic Cash is a new decentralized mechanism for regulating the money supply. The mechanism operates by modifying the supply so that an interest rate determined by a public market is kept approximately fixed. It can be incorporated into the conventional monetary system to improve the elasticity of the US Dollar, and it can be used to design new elastic cryptocurrencies that remain decentralized.

  • 2023-01-01 ReSQueing Parallel and Private Stochastic Convex Optimization. Yair Carmon, Arun Jambulapati, Yujia Jin, Yin Tat Lee, Daogao Liu, Aaron Sidford, and Kevin Tian.

    Summary: We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $\epsilon_{\text{opt}}$ with $d^{1/3}\epsilon_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}\epsilon_{\text{opt}}^{-2/3} + \epsilon_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $\epsilon_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, BFGT20, AFKT21, KLL21] established that if $n \gtrsim d \epsilon_{\text{dp}}^{-2}$, $(\epsilon_{\text{dp}}, \delta)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gtrsim d^2 \epsilon_{\text{dp}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime.

  • 2022-12-14 Learning threshold neurons via the "edge of stability". Kwangjun Ahn, Sébastien Bubeck, Sinho Chewi, Yin Tat Lee, Felipe Suarez, and Yi Zhang.

    Summary: Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.

  • 2022-12-13 A (Slightly) Improved Deterministic Approximation Algorithm for Metric TSP. Anna R. Karlin, Nathan Klein, and Shayan Oveis Gharan.

    Summary: We show that the max entropy algorithm can be derandomized (with respect to a particular objective function) to give a deterministic $3/2-\epsilon$ approximation algorithm for metric TSP for some $\epsilon > 10^{-36}$. To obtain our result, we apply the method of conditional expectation to an objective function constructed in prior work which was used to certify that the expected cost of the algorithm is at most $3/2-\epsilon$ times the cost of an optimal solution to the subtour elimination LP. The proof in this work involves showing that the expected value of this objective function can be computed in polynomial time (at all stages of the algorithm's execution).

  • 2022-12-03 Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping. Jiyan He, Xuechen Li, Da Yu, Huishuai Zhang, Janardhan Kulkarni, Yin Tat Lee, Arturs Backurs, Nenghai Yu, and Jiang Bian.

    Summary: Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $\epsilon=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.

  • 2022-11-30 On Disperser/Lifting Properties of the Index and Inner-Product Functions. Paul Beame and Sajin Koroth.

    Summary: Query-to-communication lifting theorems, which connect the query complexity of a Boolean function to the communication complexity of an associated `lifted' function obtained by composing the function with many copies of another function known as a gadget, have been instrumental in resolving many open questions in computational complexity. Several important complexity questions could be resolved if we could make substantial improvements in the input size required for lifting with the Index function, from its current near-linear size down to polylogarithmic in the number of inputs $N$ of the original function or, ideally, constant. The near-linear size bound was shown by Lovett, Meka, Mertz, Pitassi and Zhang using a recent breakthrough improvement on the Sunflower Lemma to show that a certain graph associated with the Index function of near-linear size is a disperser. They also stated a conjecture about the Index function that is essential for further improvements in the size required for lifting with Index using current techniques. In this paper we prove the following; 1) The conjecture of Lovett et al. is false when the size of the Index gadget is $\log N-\omega(1)$. 2) Also, the Inner-Product function, which satisfies the disperser property at size $O(\log N)$, does not have this property when its size is $\log N-\omega(1)$. 3) Nonetheless, using Index gadgets of size at least 4, we prove a lifting theorem for a restricted class of communication protocols in which one of the players is limited to sending parities of its inputs. 4) Using the ideas from this lifting theorem, we derive a strong lifting theorem from decision tree size to parity decision tree size. We use this to derive a general lifting theorem in proof complexity from tree-resolution size to tree-like $Res(\oplus)$ refutation size, which yields many new exponential lower bounds on such proofs.

  • 2022-11-27 Dynamic Kernel Sparsifiers. Yichuan Deng, Wenyu Jin, Zhao Song, Xiaorui Sun, and Omri Weinstein.

    Summary: A geometric graph associated with a set of points $P= \{x_1, x_2, \cdots, x_n \} \subset \mathbb{R}^d$ and a fixed kernel function $\mathsf{K}:\mathbb{R}^d\times \mathbb{R}^d\to\mathbb{R}_{\geq 0}$ is a complete graph on $P$ such that the weight of edge $(x_i, x_j)$ is $\mathsf{K}(x_i, x_j)$. We present a fully-dynamic data structure that maintains a spectral sparsifier of a geometric graph under updates that change the locations of points in $P$ one at a time. The update time of our data structure is $n^{o(1)}$ with high probability, and the initialization time is $n^{1+o(1)}$. Under certain assumption, we can provide a fully dynamic spectral sparsifier with the robostness to adaptive adversary. We further show that, for the Laplacian matrices of these geometric graphs, it is possible to maintain random sketches for the results of matrix vector multiplication and inverse-matrix vector multiplication in $n^{o(1)}$ time, under updates that change the locations of points in $P$ or change the query vector by a sparse difference.

  • 2022-11-21 Upper and Lower Bounds on the Smoothed Complexity of the Simplex Method. Sophie Huiberts, Yin Tat Lee, and Xinzhi Zhang.

    Summary: The simplex method for linear programming is known to be highly efficient in practice, and understanding its performance from a theoretical perspective is an active research topic. The framework of smoothed analysis, first introduced by Spielman and Teng (JACM '04) for this purpose, defines the smoothed complexity of solving a linear program with $d$ variables and $n$ constraints as the expected running time when Gaussian noise of variance $\sigma^2$ is added to the LP data. We prove that the smoothed complexity of the simplex method is $O(\sigma^{-3/2} d^{13/4}\log^{7/4} n)$, improving the dependence on $1/\sigma$ compared to the previous bound of $O(\sigma^{-2} d^2\sqrt{\log n})$. We accomplish this through a new analysis of the \emph{shadow bound}, key to earlier analyses as well. Illustrating the power of our new method, we use our method to prove a nearly tight upper bound on the smoothed complexity of two-dimensional polygons. We also establish the first non-trivial lower bound on the smoothed complexity of the simplex method, proving that the \emph{shadow vertex simplex method} requires at least $\Omega \Big(\min \big(\sigma^{-1/2} d^{-1/2}\log^{-1/4} d,2^d \big) \Big)$ pivot steps with high probability. A key part of our analysis is a new variation on the extended formulation for the regular $2^k$-gon. We end with a numerical experiment that suggests this analysis could be further improved.

  • 2022-11-17 How to Fine-Tune Vision Models with SGD. Ananya Kumar, Ruoqi Shen, Sebastien Bubeck, and Suriya Gunasekar.

    Summary: SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.

  • 2022-11-09 A 4/3-Approximation Algorithm for Half-Integral Cycle Cut Instances of the TSP. Billy Jin, Nathan Klein, and David P. Williamson.

    Summary: A long-standing conjecture for the traveling salesman problem (TSP) states that the integrality gap of the standard linear programming relaxation of the TSP is at most 4/3. Despite significant efforts, the conjecture remains open. We consider the half-integral case, in which the LP has solution values in $\{0, 1/2, 1\}$. Such instances have been conjectured to be the most difficult instances for the overall four-thirds conjecture. Karlin, Klein, and Oveis Gharan, in a breakthrough result, were able to show that in the half-integral case, the integrality gap is at most 1.49993. This result led to the first significant progress on the overall conjecture in decades; the same authors showed the integrality gap is at most $1.5- 10^{-36}$ in the non-half-integral case. For the half-integral case, the current best-known ratio is 1.4983, a result by Gupta et al. With the improvements on the 3/2 bound remaining very incremental even in the half-integral case, we turn the question around and look for a large class of half-integral instances for which we can prove that the 4/3 conjecture is correct. The previous works on the half-integral case perform induction on a hierarchy of critical tight sets in the support graph of the LP solution, in which some of the sets correspond to "cycle cuts" and the others to "degree cuts". We show that if all the sets in the hierarchy correspond to cycle cuts, then we can find a distribution of tours whose expected cost is at most 4/3 times the value of the half-integral LP solution; sampling from the distribution gives us a randomized 4/3-approximation algorithm. We note that the known bad cases for the integrality gap have a gap of 4/3 and have a half-integral LP solution in which all the critical tight sets in the hierarchy are cycle cuts; thus our result is tight.

  • 2022-11-07 From approximate to exact integer programming. Daniel Dadush, Friedrich Eisenbrand, and Thomas Rothvoss.

    Summary: Approximate integer programming is the following: For a convex body $K \subseteq \mathbb{R}^n$, either determine whether $K \cap \mathbb{Z}^n$ is empty, or find an integer point in the convex body scaled by $2$ from its center of gravity $c$. Approximate integer programming can be solved in time $2^{O(n)}$ while the fastest known methods for exact integer programming run in time $2^{O(n)} \cdot n^n$. So far, there are no efficient methods for integer programming known that are based on approximate integer programming. Our main contribution are two such methods, each yielding novel complexity results. First, we show that an integer point $x^* \in (K \cap \mathbb{Z}^n)$ can be found in time $2^{O(n)}$, provided that the remainders of each component $x_i^* \mod{\ell}$ for some arbitrarily fixed $\ell \geq 5(n+1)$ of $x^*$ are given. The algorithm is based on a cutting-plane technique, iteratively halving the volume of the feasible set. The cutting planes are determined via approximate integer programming. Enumeration of the possible remainders gives a $2^{O(n)}n^n$ algorithm for general integer programming. This matches the current best bound of an algorithm by Dadush (2012) that is considerably more involved. Our algorithm also relies on a new asymmetric approximate Carath\'eodory theorem that might be of interest on its own. Our second method concerns integer programming problems in equation-standard form $Ax = b, 0 \leq x \leq u, \, x \in \mathbb{Z}^n$ . Such a problem can be reduced to the solution of $\prod_i O(\log u_i +1)$ approximate integer programming problems. This implies, for example that knapsack or subset-sum problems with polynomial variable range $0 \leq x_i \leq p(n)$ can be solved in time $(\log n)^{O(n)}$. For these problems, the best running time so far was $n^n \cdot 2^{O(n)}$.

  • 2022-11-03 Best-of-Both-Worlds Multiparty Quantum Computation with Publicly Verifiable Identifiable Abort. Kai-Min Chung, Mi-Ying Huang, Er-Cheng Tang, and Jiapeng Zhang.

    Summary: Alon et al. (CRYPTO 2021) introduced a multiparty quantum computation protocol that is secure with identifiable abort (MPQC-SWIA). However, their protocol allows only inside MPQC parties to know the identity of malicious players. This becomes problematic when two groups of people disagree and need a third party, like a jury, to verify who the malicious party is. This issue takes on heightened significance in the quantum setting, given that quantum states may exist in only a single copy. Thus, we emphasize the necessity of a protocol with publicly verifiable identifiable abort (PVIA), enabling outside observers with only classical computational power to agree on the identity of the malicious party in case of an abort. However, achieving MPQC with PVIA poses significant challenges due to the no-cloning theorem, and previous works proposed by Mahadev (STOC 2018) and Chung et al. (Eurocrypt 2022) for classical verification of quantum computation fall short. In this paper, we obtain the first MPQC-PVIA protocol assuming post-quantum oblivious transfer and a classical broadcast channel. The core component of our construction is a new authentication primitive called auditable quantum authentication (AQA) that identifies the malicious sender with overwhelming probability. Additionally, we provide the first MPQC protocol with best-of-both-worlds (BoBW) security, which guarantees output delivery with an honest majority and remains secure with abort even if the majority is dishonest. Our best-of-both-worlds MPQC protocol also satisfies PVIA upon abort.

  • 2022-10-29 The Vector Balancing Constant for Zonotopes. Laurel Heck, Victor Reis, and Thomas Rothvoss.

    Summary: The vector balancing constant $\mathrm{vb}(K,Q)$ of two symmetric convex bodies $K,Q$ is the minimum $r \geq 0$ so that any number of vectors from $K$ can be balanced into an $r$-scaling of $Q$. A question raised by Schechtman is whether for any zonotope $K \subseteq \mathbb{R}^d$ one has $\mathrm{vb}(K,K) \lesssim \sqrt{d}$. Intuitively, this asks whether a natural geometric generalization of Spencer's Theorem (for which $K = B^d_\infty$) holds. We prove that for any zonotope $K \subseteq \mathbb{R}^d$ one has $\mathrm{vb}(K,K) \lesssim \sqrt{d} \log \log \log d$. Our main technical contribution is a tight lower bound on the Gaussian measure of any section of a normalized zonotope, generalizing Vaaler's Theorem for cubes. We also prove that for two different normalized zonotopes $K$ and $Q$ one has $\mathrm{vb}(K,Q) \lesssim \sqrt{d \log d}$. All the bounds are constructive and the corresponding colorings can be computed in polynomial time.

  • 2022-10-27 Learning Single-Index Models with Shallow Neural Networks. Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song.

    Summary: Single-index models are a class of functions given by an unknown univariate ``link'' function applied to an unknown one-dimensional projection of the input. These models are particularly relevant in high dimension, when the data might present low-dimensional structure that learning algorithms should adapt to. While several statistical aspects of this model, such as the sample complexity of recovering the relevant (one-dimensional) subspace, are well-understood, they rely on tailored algorithms that exploit the specific structure of the target function. In this work, we introduce a natural class of shallow neural networks and study its ability to learn single-index models via gradient flow. More precisely, we consider shallow networks in which biases of the neurons are frozen at random initialization. We show that the corresponding optimization landscape is benign, which in turn leads to generalization guarantees that match the near-optimal sample complexity of dedicated semi-parametric methods.

  • 2022-10-27 A distribution testing oracle separation between QMA and QCMA. Anand Natarajan and Chinmay Nirkhe.

    Summary: It is a long-standing open question in quantum complexity theory whether the definition of $\textit{non-deterministic}$ quantum computation requires quantum witnesses $(\textsf{QMA})$ or if classical witnesses suffice $(\textsf{QCMA})$. We make progress on this question by constructing a randomized classical oracle separating the respective computational complexity classes. Previous separations [Aaronson-Kuperberg (CCC'07), Fefferman-Kimmel (MFCS'18)] required a quantum unitary oracle. The separating problem is deciding whether a distribution supported on regular un-directed graphs either consists of multiple connected components (yes instances) or consists of one expanding connected component (no instances) where the graph is given in an adjacency-list format by the oracle. Therefore, the oracle is a distribution over $n$-bit boolean functions.

  • 2022-10-22 Discrepancy Minimization in Input-Sparsity Time. Yichuan Deng, Zhao Song, and Omri Weinstein.

    Summary: A recent work of Larsen [Lar23] gave a faster combinatorial alternative to Bansal's SDP algorithm for finding a coloring $x\in\{-1,1\}^n$ that approximately minimizes the discrepancy $\mathrm{disc}(A,x) : = \| A x \|_{\infty}$ of a general real-valued $m\times n$ matrix $A$. Larsen's algorithm runs in $\widetilde{O}(mn^2)$ time compared to Bansal's $\widetilde{O}(mn^{4.5})$-time algorithm, at the price of a slightly weaker logarithmic approximation ratio in terms of the hereditary discrepancy of $A$ [Ban10]. In this work we present a combinatorial $\widetilde{O}(\mathrm{nnz}(A) + n^3)$ time algorithm with the same approximation guarantee as Larsen, which is optimal for tall matrices $m=\mathrm{poly}(n)$. Using a more intricate analysis and fast matrix-multiplication, we achieve $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$ time, which breaks cubic runtime for square matrices, and bypasses the barrier of linear-programming approaches [ES14] for which input-sparsity time is currently out of reach. Our algorithm relies on two main ideas: (i) A new sketching technique for finding a projection matrix with short $\ell_2$-basis using implicit leverage-score sampling; (ii) A data structure for faster implementation of the iterative Edge-Walk partial-coloring algorithm of Lovett-Meka, using an alternative analysis that enables ``lazy" batch-updates with low-rank corrections. Our result nearly closes the computational gap between real-valued and binary matrices (set-systems), for which input-sparsity time coloring was very recently obtained [JSS23].

  • 2022-10-21 Augmentation with Projection: Towards an Effective and Efficient Data Augmentation Paradigm for Distillation. Ziqi Wang, Yuexin Wu, Frederick Liu, Daogao Liu, Le Hou, Hongkun Yu, Jing Li, and Heng Ji.

    Summary: Knowledge distillation is one of the primary methods of transferring knowledge from large to small models. However, it requires massive task-specific data, which may not be plausible in many real-world applications. Data augmentation methods such as representation interpolation, token replacement, or augmentation with models are applied to tackle this problem. However, these data augmentation methods either potentially cause shifts in decision boundaries (representation interpolation), are not expressive enough (token replacement), or introduce too much computational overhead (augmentation with models). To this end, we propose AugPro (Augmentation with Projection), an effective and efficient data augmentation method for distillation. Our method builds on top of representation interpolation augmentation methods to maintain the diversity of expressions and converts the augmented data to tokens to avoid shifting decision boundaries. It uses simple operations that come with little computational overhead. The results on multiple GLUE tasks show that our methods can improve distillation performance by a large margin at a low time cost. Codes are available at https://github.com/google-research/google-research/tree/master/augpro.

  • 2022-10-15 A Nearly Optimal Size Coreset Algorithm with Nearly Linear Time. Yichuan Deng, Zhao Song, Yitan Wang, and Yuanyuan Yang.

    Summary: A coreset is a point set containing information about geometric properties of a larger point set. A series of previous works show that in many machine learning problems, especially in clustering problems, coreset could be very useful to build efficient algorithms. Two main measures of an coreset construction algorithm's performance are the running time of the algorithm and the size of the coreset output by the algorithm. In this paper we study the construction of coresets for the $(k,z)$-clustering problem, which is a generalization of $k$-means and $k$-median problem. By properly designing a sketching-based distance estimation data structure, we propose faster algorithms that construct coresets with matching size of the state-of-the-art results.

  • 2022-10-13 Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators. Yunbum Kook, Yin Tat Lee, Ruoqi Shen, and Santosh S. Vempala.

    Summary: We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions in the form of $e^{-\alpha^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert \alpha\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $\epsilon$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.

  • 2022-10-12 Quantum Depth in the Random Oracle Model. Atul Singh Arora, Andrea Coladangelo, Matthew Coudron, Alexandru Gheorghiu, Uttam Singh, and Hendrik Waldner.

    Summary: We give a comprehensive characterization of the computational power of shallow quantum circuits combined with classical computation. Specifically, for classes of search problems, we show that the following statements hold, relative to a random oracle: (a) $\mathsf{BPP}^{\mathsf{QNC}^{\mathsf{BPP}}} \neq \mathsf{BQP}$. This refutes Jozsa's conjecture [QIP 05] in the random oracle model. As a result, this gives the first instantiatable separation between the classes by replacing the oracle with a cryptographic hash function, yielding a resolution to one of Aaronson's ten semi-grand challenges in quantum computing. (b) $\mathsf{BPP}^{\mathsf{QNC}} \nsubseteq \mathsf{QNC}^{\mathsf{BPP}}$ and $\mathsf{QNC}^{\mathsf{BPP}} \nsubseteq \mathsf{BPP}^{\mathsf{QNC}}$. This shows that there is a subtle interplay between classical computation and shallow quantum computation. In fact, for the second separation, we establish that, for some problems, the ability to perform adaptive measurements in a single shallow quantum circuit, is more useful than the ability to perform polynomially many shallow quantum circuits without adaptive measurements. (c) There exists a 2-message proof of quantum depth protocol. Such a protocol allows a classical verifier to efficiently certify that a prover must be performing a computation of some minimum quantum depth. Our proof of quantum depth can be instantiated using the recent proof of quantumness construction by Yamakawa and Zhandry [STOC 22].

  • 2022-10-06 NLTS Hamiltonians from classical LTCs. Zhiyang He and Chinmay Nirkhe.

    Summary: We provide a completely self-contained construction of a family of NLTS Hamiltonians [Freedman and Hastings, 2014] based on ideas from [Anshu, Breuckmann, and Nirkhe, 2022], [Cross, He, Natarajan, Szegedy, and Zhu, 2022] and [Eldar and Harrow, 2017]. Crucially, it does not require optimal-parameter quantum LDPC codes and can be built from simple classical LTCs such as the repetition code on an expander graph. Furthermore, it removes the constant-rate requirement from the construction of Anshu, Breuckmann, and Nirkhe.

  • 2022-09-21 Chaining, Group Leverage Score Overestimates, and Fast Spectral Hypergraph Sparsification. Arun Jambulapati, Yang P. Liu, and Aaron Sidford.

    Summary: We present an algorithm that given any $n$-vertex, $m$-edge, rank $r$ hypergraph constructs a spectral sparsifier with $O(n \varepsilon^{-2} \log n \log r)$ hyperedges in nearly-linear $\widetilde{O}(mr)$ time. This improves in both size and efficiency over a line of work (Bansal-Svensson-Trevisan 2019, Kapralov-Krauthgamer-Tardos-Yoshida 2021) for which the previous best size was $O(\min\{n \varepsilon^{-4} \log^3 n,nr^3 \varepsilon^{-2} \log n\})$ and runtime was $\widetilde{O}(mr + n^{O(1)})$. Independent Result: In an independent work, Lee (Lee 2022) also shows how to compute a spectral hypergraph sparsifier with $O(n \varepsilon^{-2} \log n \log r)$ hyperedges.

  • 2022-09-15 Multicalibrated Regression for Downstream Fairness. Ira Globus-Harris, Varun Gupta, Christopher Jung, Michael Kearns, Jamie Morgenstern, and Aaron Roth.

    Summary: We show how to take a regression function $\hat{f}$ that is appropriately ``multicalibrated'' and efficiently post-process it into an approximately error minimizing classifier satisfying a large variety of fairness constraints. The post-processing requires no labeled data, and only a modest amount of unlabeled data and computation. The computational and sample complexity requirements of computing $\hat f$ are comparable to the requirements for solving a single fair learning task optimally, but it can in fact be used to solve many different downstream fairness-constrained learning problems efficiently. Our post-processing method easily handles intersecting groups, generalizing prior work on post-processing regression functions to satisfy fairness constraints that only applied to disjoint groups. Our work extends recent work showing that multicalibrated regression functions are ``omnipredictors'' (i.e. can be post-processed to optimally solve unconstrained ERM problems) to constrained optimization.

  • 2022-09-09 Spectral hypergraph sparsification via chaining. James R. Lee.

    Summary: In a hypergraph on $n$ vertices where $D$ is the maximum size of a hyperedge, there is a weighted hypergraph spectral $\varepsilon$-sparsifier with at most $O(\varepsilon^{-2} \log(D) \cdot n \log n)$ hyperedges. This improves over the bound of Kapralov, Krauthgamer, Tardos and Yoshida (2021) who achieve $O(\varepsilon^{-4} n (\log n)^3)$, as well as the bound $O(\varepsilon^{-2} D^3 n \log n)$ obtained by Bansal, Svensson, and Trevisan (2019). The same sparsification result was obtained independently by Jambulapati, Liu, and Sidford (2022).

  • 2022-08-29 Online Bidding Algorithms for Return-on-Spend Constrained Advertisers. Zhe Feng, Swati Padmanabhan, and Di Wang.

    Summary: Online advertising has recently grown into a highly competitive and complex multi-billion-dollar industry, with advertisers bidding for ad slots at large scales and high frequencies. This has resulted in a growing need for efficient "auto-bidding" algorithms that determine the bids for incoming queries to maximize advertisers' targets subject to their specified constraints. This work explores efficient online algorithms for a single value-maximizing advertiser under an increasingly popular constraint: Return-on-Spend (RoS). We quantify efficiency in terms of regret relative to the optimal algorithm, which knows all queries a priori. We contribute a simple online algorithm that achieves near-optimal regret in expectation while always respecting the specified RoS constraint when the input sequence of queries are i.i.d. samples from some distribution. We also integrate our results with the previous work of Balseiro, Lu, and Mirrokni [BLM20] to achieve near-optimal regret while respecting both RoS and fixed budget constraints. Our algorithm follows the primal-dual framework and uses online mirror descent (OMD) for the dual updates. However, we need to use a non-canonical setup of OMD, and therefore the classic low-regret guarantee of OMD, which is for the adversarial setting in online learning, no longer holds. Nonetheless, in our case and more generally where low-regret dynamics are applied in algorithm design, the gradients encountered by OMD can be far from adversarial but influenced by our algorithmic choices. We exploit this key insight to show our OMD setup achieves low regret in the realm of our algorithm.

  • 2022-08-24 A Slightly Improved Bound for the KLS Constant. Arun Jambulapati, Yin Tat Lee, and Santosh S. Vempala.

    Summary: We refine the recent breakthrough technique of Klartag and Lehec to obtain an improved polylogarithmic bound for the KLS constant.

  • 2022-08-09 Training Overparametrized Neural Networks in Sublinear Time. Yichuan Deng, Hang Hu, Zhao Song, Omri Weinstein, and Danyang Zhuo.

    Summary: The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient decent, stochastic gradient descent (SGD) has prohibitive convergence rate in non-convex settings, both in theory and practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with much faster convergence rate, albeit with higher cost-per-iteration. For a typical neural network with $m=\mathrm{poly}(n)$ parameters and input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time in the same overparametrized regime, where $\alpha \in (0.01,1)$ is some fixed constant. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of deep neural networks (DNNs).

  • 2022-08-09 An Improved Trickle-Down Theorem for Partite Complexes. Dorna Abdolazimi and Shayan Oveis Gharan.

    Summary: We prove a strengthening of the trickle down theorem for partite complexes. Given a $(d+1)$-partite $d$-dimensional simplicial complex, we show that if "on average" the links of faces of co-dimension 2 are $\frac{1-\delta}{d}$-(one-sided) spectral expanders, then the link of any face of co-dimension $k$ is an $O(\frac{1-\delta}{k\delta})$-(one-sided) spectral expander, for all $3\leq k\leq d+1$. For an application, using our theorem as a black-box, we show that links of faces of co-dimension $k$ in recent constructions of bounded degree high dimensional expanders have spectral expansion at most $O(1/k)$ fraction of the spectral expansion of the links of the worst faces of co-dimension $2$.

  • 2022-08-07 Decomposable Non-Smooth Convex Optimization with Nearly-Linear Gradient Oracle Complexity. Sally Dong, Haotian Jiang, Yin Tat Lee, Swati Padmanabhan, and Guanghao Ye.

    Summary: Many fundamental problems in machine learning can be formulated by the convex program \[ \min_{\theta\in R^d}\ \sum_{i=1}^{n}f_{i}(\theta), \] where each $f_i$ is a convex, Lipschitz function supported on a subset of $d_i$ coordinates of $\theta$. One common approach to this problem, exemplified by stochastic gradient descent, involves sampling one $f_i$ term at every iteration to make progress. This approach crucially relies on a notion of uniformity across the $f_i$'s, formally captured by their condition number. In this work, we give an algorithm that minimizes the above convex formulation to $\epsilon$-accuracy in $\widetilde{O}(\sum_{i=1}^n d_i \log (1 /\epsilon))$ gradient computations, with no assumptions on the condition number. The previous best algorithm independent of the condition number is the standard cutting plane method, which requires $O(nd \log (1/\epsilon))$ gradient computations. As a corollary, we improve upon the evaluation oracle complexity for decomposable submodular minimization by Axiotis et al. (ICML 2021). Our main technical contribution is an adaptive procedure to select an $f_i$ term at every iteration via a novel combination of cutting-plane and interior-point methods.

  • 2022-07-27 Searching for Regularity in Bounded Functions. Siddharth Iyer and Michael Whitmeyer.

    Summary: Given a function $f$ on $\mathbb{F}_2^n$, we study the following problem. What is the largest affine subspace $\mathcal{U}$ such that when restricted to $\mathcal{U}$, all the non-trivial Fourier coefficients of $f$ are very small? For the natural class of bounded Fourier degree $d$ functions $f:\mathbb{F}_2^n \to [-1,1]$, we show that there exists an affine subspace of dimension at least $ \tilde\Omega(n^{1/d!}k^{-2})$, wherein all of $f$'s nontrivial Fourier coefficients become smaller than $ 2^{-k}$. To complement this result, we show the existence of degree $d$ functions with coefficients larger than $2^{-d\log n}$ when restricted to any affine subspace of dimension larger than $\Omega(dn^{1/(d-1)})$. In addition, we give explicit examples of functions with analogous but weaker properties. Along the way, we provide multiple characterizations of the Fourier coefficients of functions restricted to subspaces of $\mathbb{F}_2^n$ that may be useful in other contexts. Finally, we highlight applications and connections of our results to parity kill number and affine dispersers.

  • 2022-07-19 Pandora Box Problem with Nonobligatory Inspection: Hardness and Approximation Scheme. Hu Fu, Jiawei Li, and Daogao Liu.

    Summary: Weitzman (1979) introduced the Pandora Box problem as a model for sequential search with inspection costs, and gave an elegant index-based policy that attains provably optimal expected payoff. In various scenarios, the searching agent may select an option without making a costly inspection. The variant of the Pandora box problem with non-obligatory inspection has attracted interest from both economics and algorithms researchers. Various simple algorithms have proved suboptimal, with the best known 0.8-approximation algorithm due to Guha et al. (2008). No hardness result for the problem was known. In this work, we show that it is NP-hard to compute an optimal policy for Pandora's problem with nonobligatory inspection. We also give a polynomial-time approximation scheme (PTAS) that computes policies with an expected payoff at least $(1 - \epsilon)$-fraction of the optimal, for arbitrarily small $\epsilon > 0$. On the side, we show the decision version of the problem to be in NP.

  • 2022-07-18 Private Convex Optimization in General Norms. Sivakanth Gopi, Yin Tat Lee, Daogao Liu, Ruoqi Shen, and Kevin Tian.

    Summary: We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+\mu r))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work of [Gopi, Lee, Liu '22] to non-Euclidean settings. We show that this mechanism satisfies Gaussian differential privacy and solves both DP-ERM (empirical risk minimization) and DP-SCO (stochastic convex optimization) by using localization tools from convex geometry. Our framework is the first to apply to private convex optimization in general normed spaces and directly recovers non-private SCO rates achieved by mirror descent as the privacy parameter $\epsilon \to \infty$. As applications, for Lipschitz optimization in $\ell_p$ norms for all $p \in (1, 2)$, we obtain the first optimal privacy-utility tradeoffs; for $p = 1$, we improve tradeoffs obtained by the recent works [Asi, Feldman, Koren, Talwar '21, Bassily, Guzman, Nandi '21] by at least a logarithmic factor. Our $\ell_p$ norm and Schatten-$p$ norm optimization frameworks are complemented with polynomial-time samplers whose query complexity we explicitly bound.

  • 2022-07-07 Approximate Carathéodory bounds via Discrepancy Theory. Victor Reis and Thomas Rothvoss.

    Summary: The approximate Carath\'eodory problem in general form is as follows: Given two symmetric convex bodies $P,Q \subseteq \mathbb{R}^m$, a parameter $k \in \mathbb{N}$ and $\mathbf{z} \in \textrm{conv}(X)$ with $X \subseteq P$, find $\mathbf{v}_1,\ldots,\mathbf{v}_k \in X$ so that $\|\mathbf{z} - \frac{1}{k}\sum_{i=1}^k \mathbf{v}_i\|_Q$ is minimized. Maurey showed that if both $P$ and $Q$ coincide with the $\| \cdot \|_p$-ball, then an error of $O(\sqrt{p/k})$ is possible. We prove a reduction to the vector balancing constant from discrepancy theory which for most cases can provide tight bounds for general $P$ and $Q$. For the case where $P$ and $Q$ are both $\| \cdot \|_p$-balls we prove an upper bound of $\sqrt{ \frac{\min\{ p, \log (\frac{2m}{k}) \}}{k}}$. Interestingly, this bound cannot be obtained taking independent random samples; instead we use the Lovett-Meka random walk. We also prove an extension to the more general case where $P$ and $Q$ are $\|\cdot \|_p$ and $\| \cdot \|_q$-balls with $2 \leq p \leq q \leq \infty$.

  • 2022-07-07 Individual Preference Stability for Clustering. Saba Ahmadi, Pranjal Awasthi, Samir Khuller, Matthäus Kleindessner, Jamie Morgenstern, Pattara Sukprasert, and Ali Vakilian.

    Summary: In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given data set allows for an IP-stable clustering in general is NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a polytime algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for a tree metric. We also consider relaxing the stability constraint, i.e., every data point should not be too far from its own cluster compared to any other cluster. For this case, we provide polytime algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real data sets.

  • 2022-07-01 When Does Differentially Private Learning Not Suffer in High Dimensions?. Xuechen Li, Daogao Liu, Tatsunori Hashimoto, Huseyin A. Inan, Janardhan Kulkarni, Yin Tat Lee, and Abhradeep Guha Thakurta.

    Summary: Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces is a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term \emph{restricted Lipschitz continuity} and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning. Code to reproduce our results can be found at \url{https://github.com/lxuechen/private-transformers/tree/main/examples/classification/spectral_analysis}.