
2024-02-24 POPSTAR: Lightweight Threshold Reporting with Reduced Leakage.
Summary: This paper proposes POPSTAR, a new lightweight protocol for the private computation of heavy hitters, also known as a private threshold reporting system. In such a protocol, the users provide input measurements, and a report server learns which measurements appear more than a prespecified threshold. POPSTAR follows the same architecture as STAR (Davidson et al., CCS 2022) by relying on a helper randomness server in addition to a main server computing the aggregate heavy hitter statistics. While STAR is extremely lightweight, it leaks a substantial amount of information, consisting of an entire histogram of the provided measurements (but only reveals the actual measurements that appear beyond the threshold). POPSTAR shows that this leakage can be reduced at a modest cost ($\sim$7$\times$ longer aggregation time). Our leakage is closer to that of Poplar (Boneh et al., S&P 2021), which relies however on distributed point functions and a different model which requires interactions of two non-colluding servers (with equal workloads) to compute the heavy hitters.

2024-02-21 Private Gradient Descent for Linear Regression: Tighter Error Bounds and Instance-Specific Uncertainty Estimation.
Summary: We provide an improved analysis of standard differentially private gradient descent for linear regression under the squared error loss. Under modest assumptions on the input, we characterize the distribution of the iterate at each time step. Our analysis leads to new results on the algorithm's accuracy: for a proper fixed choice of hyperparameters, the sample complexity depends only linearly on the dimension of the data. This matches the dimension dependence of the (non-private) ordinary least squares estimator as well as that of recent private algorithms that rely on sophisticated adaptive gradient-clipping schemes (Varshney et al., 2022; Liu et al., 2023). Our analysis of the iterates' distribution also allows us to construct confidence intervals for the empirical optimizer which adapt automatically to the variance of the algorithm on a particular data set. We validate our theorems through experiments on synthetic data.
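
As a concrete reference point, here is a minimal sketch of the standard DP-GD loop for squared-error linear regression that this line of work analyzes; the clipping threshold, noise scale, and step size below are illustrative placeholders, not the paper's tuned hyperparameters.

```python
import numpy as np

def dp_gd_linear_regression(X, y, steps=100, lr=0.1, clip=1.0, noise_mult=1.0, seed=0):
    """Differentially private gradient descent for squared-error linear regression.

    A minimal sketch: per-example gradients are clipped to `clip` in l2 norm and
    Gaussian noise calibrated to the clipping threshold is added each step, the
    standard DP-GD recipe (hyperparameter values here are illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        residuals = X @ theta - y                   # shape (n,)
        per_example_grads = residuals[:, None] * X  # gradient of 0.5*(x.theta - y)^2
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads / np.maximum(1.0, norms / clip)
        noise = rng.normal(0.0, noise_mult * clip, size=d)
        theta -= lr * (clipped.sum(axis=0) + noise) / n
    return theta
```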

2024-02-13 On black-box separations of quantum digital signatures from pseudorandom states.
Summary: It is well-known that digital signatures can be constructed from one-way functions in a black-box way. While one-way functions are essentially the minimal assumption in classical cryptography, this is not the case in the quantum setting. A variety of qualitatively weaker and inherently quantum assumptions (e.g. EFI pairs, one-way state generators, and pseudorandom states) are known to be sufficient for non-trivial quantum cryptography. While it is known that commitments, zero-knowledge proofs, and even multiparty computation can be constructed from these assumptions, it has remained an open question whether the same is true for quantum digital signature schemes (QDS). In this work, we show that there $\textit{does not}$ exist a black-box construction of a QDS scheme with classical signatures from pseudorandom states with linear, or greater, output length. Our result complements that of Morimae and Yamakawa (2022), who described a $\textit{one-time}$ secure QDS scheme with classical signatures, but left open the question of constructing a standard $\textit{multi-time}$ secure one.

2024-02-02 Large language models cannot replace human participants because they cannot portray identity groups.
Summary: Large language models (LLMs) are increasing in capability and popularity, propelling their application in new domains, including as replacements for human participants in computational social science, user testing, annotation tasks, and more. Traditionally, in all of these settings survey distributors are careful to find representative samples of the human population to ensure the validity of their results and understand potential demographic differences. This means in order to be a suitable replacement, LLMs will need to be able to capture the influence of positionality (i.e., relevance of social identities like gender and race). However, we show that there are two inherent limitations in the way current LLMs are trained that prevent this. We argue analytically for why LLMs are doomed to both misportray and flatten the representations of demographic groups, then empirically show this to be true on 4 LLMs through a series of human studies with 3200 participants across 16 demographic identities. We also discuss a third consideration about how identity prompts can essentialize identities. Throughout, we connect each of these limitations to a pernicious history that shows why each is harmful for marginalized demographic groups. Overall, we urge caution in use cases where LLMs are intended to replace human participants whose identities are relevant to the task at hand. At the same time, in cases where the goal is to supplement rather than replace (e.g., pilot studies), we provide empirically better inference-time techniques to reduce, but not remove, these harms.

2024-02-02 Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence.
Summary: Based on SGD, previous works have proposed many algorithms with improved convergence speed and generalization in stochastic optimization, such as SGDm, AdaGrad, Adam, etc. However, their convergence analysis under non-convex conditions is challenging. In this work, we propose a unified framework to address this issue. For any first-order method, we interpret the update direction $g_t$ as the sum of the stochastic subgradient $\nabla f_t(x_t)$ and an additional acceleration term $\frac{2\langle v_t, \nabla f_t(x_t) \rangle}{\|v_t\|_2^2} v_t$, so we can discuss the convergence by analyzing $\langle v_t, \nabla f_t(x_t) \rangle$. Through our framework, we have discovered two plug-and-play acceleration methods: \textbf{Reject Accelerating} and \textbf{Random Vector Accelerating}. We theoretically demonstrate that these two methods can directly lead to an improvement in convergence rate.
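
To make the decomposition concrete, here is a minimal sketch of the framework's update direction; the choice of $v_t$ (e.g., a momentum buffer) is method-specific and is an assumption here, as is the step-size handling.

```python
import numpy as np

def framework_update(grad, v, lr=0.01):
    """Update direction g_t = grad + (2 <v, grad> / ||v||_2^2) * v from the
    unified framework described above (sketch; how v_t is chosen is
    method-specific, e.g. a momentum buffer, and is an assumption here)."""
    coeff = 2.0 * np.dot(v, grad) / (np.dot(v, v) + 1e-12)  # guard against ||v|| = 0
    g = grad + coeff * v
    return -lr * g  # the step to add to the parameters
```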

2024-01-18 Layout Graphs, Random Walks and the $t$-wise Independence of SPN Block Ciphers.
Summary: We continue the study of $t$-wise independence of substitution-permutation networks (SPNs) initiated by the recent work of Liu, Tessaro, and Vaikuntanathan (CRYPTO 2021). Our key technical result shows that when the S-boxes are randomly and independently chosen and kept secret, an $r$-round SPN with input length $n = b \cdot k$ is $2^{-\Theta(n)}$-close to $t$-wise independent within $r = O(\min\{k, \log t\})$ rounds for any $t$ almost as large as $2^{b/2}$. Here, $b$ is the input length of the S-box and we assume that the underlying mixing achieves maximum branch number. We also analyze the special case of AES parameters (with random S-boxes), and show it is $2^{-128}$-close to pairwise independent in $7$ rounds. Central to our result is the analysis of a random walk on what we call the layout graph, a combinatorial abstraction that captures equality and inequality constraints among multiple SPN evaluations. We use our technical result to show concrete security bounds for SPNs with actual block cipher parameters and small-input S-boxes. (This is in contrast to the large body of results on ideal-model analyses of SPNs.) For example, for the censored-AES block cipher, namely AES with most of the mixing layers removed, we show that 192 rounds suffice to attain $2^{-128}$-closeness to pairwise independence. The prior such result for AES (Liu, Tessaro and Vaikuntanathan, CRYPTO 2021) required more than 9000 rounds.

2024-01-10 Quantum Time-Space Tradeoffs for Matrix Problems.
Summary: We consider the time and space required for quantum computers to solve a wide variety of problems involving matrices, many of which have only been analyzed classically in prior work. Our main results show that for a range of linear algebra problems, including matrix-vector product, matrix inversion, matrix multiplication and powering, existing classical time-space tradeoffs, several of which are tight for every space bound, also apply to quantum algorithms. For example, for almost all matrices $A$, including the discrete Fourier transform (DFT) matrix, we prove that quantum circuits with at most $T$ input queries and $S$ qubits of memory require $T=\Omega(n^2/S)$ to compute matrix-vector product $Ax$ for $x \in \{0,1\}^n$. We similarly prove that matrix multiplication for $n\times n$ binary matrices requires $T=\Omega(n^3 / \sqrt{S})$. Because many of our lower bounds match deterministic algorithms with the same time and space complexity, we show that quantum computers cannot provide any asymptotic advantage for these problems with any space bound. We obtain matching lower bounds for the stronger notion of quantum cumulative memory complexity, the sum of the space per layer of a circuit. We also consider Boolean (i.e. AND-OR) matrix multiplication and matrix-vector products, improving the previous quantum time-space tradeoff lower bounds for $n\times n$ Boolean matrix multiplication to $T=\Omega(n^{2.5}/S^{1/4})$ from $T=\Omega(n^{2.5}/S^{1/2})$. Our improved lower bound for Boolean matrix multiplication is based on a new coloring argument that extracts more from the strong direct product theorem used in prior work. Our tight lower bounds for linear algebra problems require adding a new bucketing method to the recording-query technique of Zhandry that lets us apply classical arguments to upper bound the success probability of quantum circuits.

2023-12-28 Variable Neighborhood Searching Rerandomization.
Summary: Rerandomization discards undesired treatment assignments to ensure covariate balance in randomized experiments. However, rerandomization based on acceptance-rejection sampling is computationally inefficient, especially when numerous independent assignments are required to perform randomization-based statistical inference. Existing acceleration methods are suboptimal and are not applicable in structured experiments, including stratified experiments and experiments with clusters. Based on metaheuristics in combinatorial optimization, we propose a novel variable neighborhood searching rerandomization (VNSRR) method to draw balanced assignments in various experiments efficiently. We derive the unbiasedness and a lower bound for the variance reduction of the treatment effect estimator under VNSRR. Simulation studies and a real data example indicate that our method maintains the appealing statistical properties of rerandomization and can sample thousands of treatment assignments within seconds, even in cases where existing methods require an hour to complete the task.
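
For intuition, here is a toy sketch of the generic variable-neighborhood-search idea applied to rerandomization: shake the current assignment by swapping treated/control units and accept swaps that reduce Mahalanobis imbalance. This is an illustrative reading of the approach, not the paper's exact VNSRR procedure or acceptance rule.

```python
import numpy as np

def mahalanobis_imbalance(X, z):
    """Mahalanobis distance between treated and control covariate means."""
    diff = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    return float(diff @ cov_inv @ diff)

def vns_rerandomize(X, n_treat, max_neighborhood=3, seed=0):
    """Variable-neighborhood-search rerandomization (illustrative sketch, not
    the paper's exact VNSRR): start from a random assignment and explore
    neighborhoods defined by swapping k treated/control pairs, accepting a
    swap whenever it reduces covariate imbalance."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, size=n_treat, replace=False)] = 1
    best = mahalanobis_imbalance(X, z)
    k = 1
    while k <= max_neighborhood:
        treated = rng.choice(np.flatnonzero(z == 1), size=k, replace=False)
        control = rng.choice(np.flatnonzero(z == 0), size=k, replace=False)
        z_new = z.copy()
        z_new[treated], z_new[control] = 0, 1
        score = mahalanobis_imbalance(X, z_new)
        if score < best:   # accept and restart from the smallest neighborhood
            z, best, k = z_new, score, 1
        else:              # otherwise move to a larger neighborhood
            k += 1
    return z, best
```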

2023-12-21 LERNA: Secure Single-Server Aggregation via Key-Homomorphic Masking.
Summary: This paper introduces LERNA, a new framework for single-server secure aggregation. Our protocols are tailored to the setting where multiple consecutive aggregation phases are performed with the same set of clients, a fraction of which can drop out in some of the phases. We rely on an initial secret sharing setup among the clients which is generated once-and-for-all, and reused in all following aggregation phases. Compared to prior works [Bonawitz et al. CCS’17, Bell et al. CCS’20], the reusable setup eliminates one round of communication between the server and clients per aggregation—i.e., we need two rounds for semi-honest security (instead of three), and three rounds (instead of four) in the malicious model. Our approach also significantly reduces the server’s computational costs by only requiring the reconstruction of a single secret-shared value (per aggregation). Prior work required reconstructing a secret-shared value for each client involved in the computation. We provide instantiations of LERNA based on both the Decisional Composite Residuosity (DCR) and (Ring) Learning with Rounding ((R)LWR) assumptions respectively and evaluate a version based on the latter assumption. In addition to savings in round complexity (which result in reduced latency), our experiments show that the server computational costs are reduced by two orders of magnitude in comparison to the state-of-the-art. In settings with a large number of clients, we also reduce the computational costs up to twenty-fold for most clients, while a small set of “heavy clients” is subject to a workload that is still smaller than that of prior work.

2023-12-19 Initializing Services in Interactive ML Systems for Diverse Users.
Summary: This paper studies ML systems that interactively learn from users across multiple subpopulations with heterogeneous data distributions. The primary objective is to provide specialized services for different user groups while also predicting user preferences. Once the users select a service based on how well the service anticipated their preference, the services subsequently adapt and refine themselves based on the user data they accumulate, resulting in an iterative, alternating minimization process between users and services (learning dynamics). Employing such tailored approaches has two main challenges: (i) Unknown user preferences: Typically, data on user preferences are unavailable without interaction, and uniform data collection across a large and diverse user base can be prohibitively expensive. (ii) Suboptimal local solutions: The total loss (sum of loss functions across all users and all services) landscape is not convex even if the individual losses on a single service are convex, making it likely for the learning dynamics to get stuck in local minima. The final outcome of the aforementioned learning dynamics is thus strongly influenced by the initial set of services offered to users, and is not guaranteed to be close to the globally optimal outcome. In this work, we propose a randomized algorithm to adaptively select very few users to collect preference data from, while simultaneously initializing a set of services. We prove that under mild assumptions on the loss functions, the expected total loss achieved by the algorithm right after initialization is within a factor of the globally optimal total loss with complete user preference data, and this factor scales only logarithmically in the number of services. Our theory is complemented by experiments on real as well as semi-synthetic datasets.

2023-12-13 Fair Active Learning in Low-Data Regimes.
Summary: In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize marginal accuracy gains of small amounts of labeled data. However, existing applications of active learning for fairness fail to deliver on this, typically requiring large labeled datasets, or failing to ensure the desired fairness tolerance is met on the population distribution. To address such limitations, we introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine. We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability. We evaluate our proposed approach using well-established real-world benchmark datasets and compare it against state-of-the-art methods, demonstrating its effectiveness in producing fair models, and improvement over existing methods.

2023-12-05 XOR Lemmas for Communication via Marginal Information.
Summary: We define the $\textit{marginal information}$ of a communication protocol, and use it to prove XOR lemmas for communication complexity. We show that if every $C$-bit protocol has bounded advantage for computing a Boolean function $f$, then every $\tilde \Omega(C \sqrt{n})$-bit protocol has advantage $\exp(-\Omega(n))$ for computing the $n$-fold XOR $f^{\oplus n}$. We prove exponentially small bounds in the average case setting, and near-optimal bounds for product distributions and for bounded-round protocols.

2023-11-29 Sparsifying generalized linear models.
Summary: We consider the sparsification of sums $F : \mathbb{R}^n \to \mathbb{R}$ where $F(x) = f_1(\langle a_1,x\rangle) + \cdots + f_m(\langle a_m,x\rangle)$ for vectors $a_1,\ldots,a_m \in \mathbb{R}^n$ and functions $f_1,\ldots,f_m : \mathbb{R} \to \mathbb{R}_+$. We show that $(1+\varepsilon)$-approximate sparsifiers of $F$ with support size $\frac{n}{\varepsilon^2} (\log \frac{n}{\varepsilon})^{O(1)}$ exist whenever the functions $f_1,\ldots,f_m$ are symmetric, monotone, and satisfy natural growth bounds. Additionally, we give efficient algorithms to compute such a sparsifier assuming each $f_i$ can be evaluated efficiently. Our results generalize the classic case of $\ell_p$ sparsification, where $f_i(z) = |z|^p$, for $p \in (0, 2]$, and give the first near-linear size sparsifiers in the well-studied setting of the Huber loss function and its generalizations, e.g., $f_i(z) = \min\{|z|^p, z^2\}$ for $0 < p \leq 2$. Our sparsification algorithm can be applied to give near-optimal reductions for optimizing a variety of generalized linear models including $\ell_p$ regression for $p \in (1, 2]$ to high accuracy, via solving $(\log n)^{O(1)}$ sparse regression instances with $m \le n(\log n)^{O(1)}$, plus runtime proportional to the number of nonzero entries in the vectors $a_1, \dots, a_m$.

2023-11-28 Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine.
Summary: Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.

2023-11-22 Positional Description Matters for Transformers Arithmetic.
Summary: Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in (direct, no scratchpad) 15-digit multiplication and essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4-digit multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii) extrapolation from 10 digits to testing on 12-digit numbers while usual training would have no extrapolation, and for (iii) almost perfect accuracy up to 5 digits while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).

2023-11-17 A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions.
Summary: We design algorithms for minimizing $\max_{i\in[n]} f_i(x)$ over a $d$-dimensional Euclidean or simplex domain. When each $f_i$ is $1$-Lipschitz and $1$-smooth, our method computes an $\epsilon$-approximate solution using $\widetilde{O}(n \epsilon^{-1/3} + \epsilon^{-2})$ gradient and function evaluations, and $\widetilde{O}(n \epsilon^{-4/3})$ additional runtime. For large $n$, our evaluation complexity is optimal up to polylogarithmic factors. In the special case where each $f_i$ is linear, which corresponds to finding a near-optimal primal strategy in a matrix game, our method finds an $\epsilon$-approximate solution in runtime $\widetilde{O}(n (d/\epsilon)^{2/3} + nd + d\epsilon^{-2})$. For $n>d$ and $\epsilon=1/\sqrt{n}$ this improves over all existing first-order methods. When additionally $d = \omega(n^{8/11})$ our runtime also improves over all known interior point methods. Our algorithm combines three novel primitives: (1) A dynamic data structure which enables efficient stochastic gradient estimation in small $\ell_2$ or $\ell_1$ balls. (2) A mirror descent algorithm tailored to our data structure implementing an oracle which minimizes the objective over these balls. (3) A simple ball oracle acceleration framework suitable for non-Euclidean geometry.

2023-11-16 Ghost Value Augmentation for $k$-ECSS and $k$-ECSM.
Summary: We give a polytime algorithm for the $k$-edge-connected spanning subgraph ($k$-ECSS) problem that returns a solution of cost no greater than the cheapest $(k+10)$-ECSS on the same graph. Our approach enhances the iterative relaxation framework with a new ingredient, which we call ghost values, that allows for high sparsity in intermediate problems. Our guarantees improve upon the best-known approximation factor of $2$ for $k$-ECSS whenever the optimal value of $(k+10)$-ECSS is close to that of $k$-ECSS. This is a property that holds for the closely related problem $k$-edge-connected spanning multi-subgraph ($k$-ECSM), which is identical to $k$-ECSS except edges can be selected multiple times at the same cost. As a consequence, we obtain a $\left(1+O\left(\frac{1}{k}\right)\right)$-approximation algorithm for $k$-ECSM, which resolves a conjecture of Pritchard and improves upon a recent $\left(1+O\left(\frac{1}{\sqrt{k}}\right)\right)$-approximation algorithm of Karlin, Klein, Oveis Gharan, and Zhang. Moreover, we present a matching lower bound for $k$-ECSM, showing that our approximation ratio is tight up to the constant factor in $O\left(\frac{1}{k}\right)$, unless $P=NP$.

2023-11-15 From Trees to Polynomials and Back Again: New Capacity Bounds with Applications to TSP.
Summary: We give simply exponential lower bounds on the probabilities of a given strongly Rayleigh distribution, depending only on its expectation. This resolves a weak version of a problem left open by Karlin-Klein-Oveis Gharan in their recent breakthrough work on metric TSP, and this resolution leads to a minor improvement of their approximation factor for metric TSP. Our results also allow for a more streamlined analysis of the algorithm. To achieve these new bounds, we build upon the work of Gurvits-Leake on the use of the productization technique for bounding the capacity of a real stable polynomial. This technique allows one to reduce certain inequalities for real stable polynomials to products of affine linear forms, which have an underlying matrix structure. In this paper, we push this technique further by characterizing the worst-case polynomials via bipartitioned forests. This rigid combinatorial structure yields a clean induction argument, which implies our stronger bounds. In general, we believe the results of this paper will lead to further improvement and simplification of the analysis of various combinatorial and probabilistic bounds and algorithms.

2023-11-13 How to Use Quantum Indistinguishability Obfuscation.
Summary: Quantum copy protection, introduced by Aaronson, enables giving out a quantum program description that cannot be meaningfully duplicated. Despite over a decade of study, copy protection is only known to be possible for a very limited class of programs. As our first contribution, we show how to achieve "best-possible" copy protection for all programs. We do this by introducing quantum state indistinguishability obfuscation (qsiO), a notion of obfuscation for quantum descriptions of classical programs. We show that applying qsiO to a program immediately achieves best-possible copy protection. Our second contribution is to show that, assuming injective one-way functions exist, qsiO is concrete copy protection for a large family of puncturable programs, significantly expanding the class of copy-protectable programs. A key tool in our proof is a new variant of unclonable encryption (UE) that we call coupled unclonable encryption (cUE). While constructing UE in the standard model remains an important open problem, we are able to build cUE from one-way functions. If we additionally assume the existence of UE, then we can further expand the class of puncturable programs for which qsiO is copy protection. Finally, we construct qsiO relative to an efficient quantum oracle.

2023-11-07 Computing Approximate $\ell_p$ Sensitivities.
Summary: Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating $\ell_p$ sensitivities, which we show is equivalent to approximate $\ell_p$ regression, are known for only the $\ell_2$ setting, in which they are termed leverage scores. In this work, we provide efficient algorithms for approximating $\ell_p$ sensitivities and related summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute an $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i.e. the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant factor approximation to the total sensitivity at the cost of roughly $O(\sqrt{d})$ sensitivity computations. Furthermore, we estimate the maximum $\ell_1$ sensitivity, up to a $\sqrt{d}$ factor, using $O(d)$ sensitivity computations. We generalize all these results to $\ell_p$ norms for $p > 1$. Lastly, we experimentally show that for a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have low intrinsic effective dimensionality.
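
For reference, the standard definitions of the $\ell_p$ sensitivities and the total sensitivity that the summary refers to, stated for an $n \times d$ matrix $A$ with rows $a_1,\dots,a_n$ (notation assumed, following the usual sensitivity-sampling literature rather than this paper's exact conventions):

```latex
\sigma_i^{(p)}(A) = \max_{x \,:\, Ax \neq 0} \frac{|\langle a_i, x\rangle|^p}{\|Ax\|_p^p},
\qquad
\mathfrak{S}^{(p)}(A) = \sum_{i=1}^{n} \sigma_i^{(p)}(A).
```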

2023-11-07 User-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal Rates.
Summary: We study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires superpolynomial runtime [Ghazi et al. (2023)] or requires the number of users to grow polynomially with the dimensionality of the problem with additional strict assumptions [Bassily et al. (2023)]. We develop new algorithms for user-level DP-SCO that obtain optimal rates for both convex and strongly convex functions in polynomial time and require the number of users to grow only logarithmically in the dimension. Moreover, our algorithms are the first to obtain optimal rates for non-smooth functions in polynomial time. These algorithms are based on multiple-pass DP-SGD, combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.

2023-11-03 A Lower Bound for the Max Entropy Algorithm for TSP.
Summary: One of the most famous conjectures in combinatorial optimization is the four-thirds conjecture, which states that the integrality gap of the subtour LP relaxation of the TSP is equal to $\frac43$. For 40 years, the best known upper bound was 1.5, due to Wolsey (1980). Recently, Karlin, Klein, and Oveis Gharan (2022) showed that the max entropy algorithm for the TSP gives an improved bound of $1.5 - 10^{-36}$. In this paper, we show that the approximation ratio of the max entropy algorithm is at least 1.375, even for graphic TSP. Thus the max entropy algorithm does not appear to be the algorithm that will ultimately resolve the four-thirds conjecture in the affirmative, should that be possible.

2023-10-27 Minimax Optimal Submodular Optimization with Bandit Feedback.
Summary: We consider maximizing a monotonic, submodular set function $f: 2^{[n]} \rightarrow [0,1]$ under stochastic bandit feedback. Specifically, $f$ is unknown to the learner but at each time $t=1,\dots,T$ the learner chooses a set $S_t \subset [n]$ with $|S_t| \leq k$ and receives reward $f(S_t) + \eta_t$ where $\eta_t$ is mean-zero sub-Gaussian noise. The objective is to minimize the learner's regret over $T$ times with respect to a $(1-e^{-1})$-approximation of the maximum $f(S_*)$ with $|S_*| = k$, obtained through greedy maximization of $f$. To date, the best regret bound in the literature scales as $k n^{1/3} T^{2/3}$. And by trivially treating every set as a unique arm one deduces that $\sqrt{ {n \choose k} T }$ is also achievable. In this work, we establish the first minimax lower bound for this setting that scales like $\mathcal{O}(\min_{i \le k}(i n^{1/3}T^{2/3} + \sqrt{n^{k-i}T}))$. Moreover, we propose an algorithm that is capable of matching the lower bound regret.

2023-10-27 Structured Semidefinite Programming for Recovering Structured Preconditioners.
Summary: We develop a general framework for finding approximately optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems, including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star,\epsilon^{-1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $\Omega(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $\Omega(d^{\omega})$ where $\omega > 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery.

2023-10-25 Detecting Pretraining Data from Large Language Models.
Summary: Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios, copyrighted book detection, contaminated downstream example detection and privacy auditing of machine unlearning, and find it a consistently effective solution.
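
The Min-K% Prob score is simple enough to state in a few lines. A sketch following the abstract's description; the decision threshold and the choice of $k$ are assumptions to be set on validation data:

```python
import numpy as np

def min_k_percent_prob(token_logprobs, k=20.0):
    """Min-K% Prob score (sketch following the abstract): average the
    log-probabilities of the k% lowest-probability tokens of the candidate
    text under the LLM. A higher score suggests the text was more likely seen
    in pretraining; the membership threshold is chosen on a validation split
    (an assumption, not specified here)."""
    logprobs = np.sort(np.asarray(token_logprobs))          # ascending order
    n_bottom = max(1, int(len(logprobs) * k / 100.0))
    return float(logprobs[:n_bottom].mean())
```

In use, one would obtain `token_logprobs` from a single forward pass of the target model over the candidate text, so the method needs no reference model and no extra training.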

2023-10-25 Fast Algorithms for Separable Linear Programs.
Summary: In numerical linear algebra, considerable effort has been devoted to obtaining faster algorithms for linear systems whose underlying matrices exhibit structural properties. A prominent success story is the method of generalized nested dissection [Lipton-Rose-Tarjan '79] for separable matrices. On the other hand, the majority of recent developments in the design of efficient linear program (LP) solvers do not leverage the ideas underlying these faster linear system solvers nor consider the separable structure of the constraint matrix. We give a faster algorithm for separable linear programs. Specifically, we consider LPs of the form $\min_{\mathbf{A}\mathbf{x}=\mathbf{b}, \mathbf{l}\leq\mathbf{x}\leq\mathbf{u}} \mathbf{c}^\top\mathbf{x}$, where the graphical support of the constraint matrix $\mathbf{A} \in \mathbb{R}^{n\times m}$ is $O(n^\alpha)$-separable. These include flow problems on planar graphs and low-treewidth matrices, among others. We present an $\tilde{O}((m+m^{1/2 + 2\alpha}) \log(1/\epsilon))$ time algorithm for these LPs, where $\epsilon$ is the relative accuracy of the solution. Our new solver has two important implications: for the $k$-multicommodity flow problem on planar graphs, we obtain an algorithm running in $\tilde{O}(k^{5/2} m^{3/2})$ time in the high-accuracy regime; and when the support of $\mathbf{A}$ is $O(n^\alpha)$-separable with $\alpha \leq 1/4$, our algorithm runs in $\tilde{O}(m)$ time, which is nearly optimal. The latter significantly improves upon the natural approach of combining interior point methods and nested dissection, whose time complexity is lower bounded by $\Omega(\sqrt{m}(m+m^{\alpha\omega}))=\Omega(m^{3/2})$, where $\omega$ is the matrix multiplication constant. Lastly, in the setting of low-treewidth LPs, we recover the results of [DLY, STOC 21] and [GS, 22] with significantly simpler data structure machinery.

2023-10-19 Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights.
Summary: In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layerwise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.

2023-10-18 Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention.
Summary: Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of the softmax function. Conversely, linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation when compared to the traditional softmax attention mechanism. In this paper, we bridge the gap in our theoretical understanding of the reasons behind the practical performance gap between softmax and linear attention. By conducting a comprehensive comparative analysis of these two attention mechanisms, we shed light on the underlying reasons why softmax attention outperforms linear attention in most scenarios.

2023-10-15 FiLM: Fill-in Language Models for Any-Order Generation.
Summary: Language models have become the backbone of today's AI systems. However, their predominant left-to-right generation limits the use of bidirectional context, which is essential for tasks that involve filling text in the middle. We propose the Fill-in Language Model (FiLM), a new language modeling approach that allows for flexible generation at any position without adhering to a specific generation order. Its training extends the masked language modeling objective by adopting varying mask probabilities sampled from the Beta distribution to enhance the generative capabilities of FiLM. During inference, FiLM can seamlessly insert missing phrases, sentences, or paragraphs, ensuring that the outputs are fluent and are coherent with the surrounding context. In both automatic and human evaluations, FiLM outperforms existing infilling methods that rely on left-to-right language models trained on rearranged text segments. FiLM is easy to implement and can be either trained from scratch or fine-tuned from a left-to-right language model. Notably, as the model size grows, FiLM's perplexity approaches that of strong left-to-right language models of similar sizes, indicating FiLM's scalability and potential as a large language model.
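
The training-time corruption step is easy to sketch. The Beta parameters and mask-token id below are illustrative assumptions, not the paper's values:

```python
import numpy as np

MASK_ID = 0  # placeholder mask-token id (an assumption)

def film_mask_batch(token_ids, alpha=2.5, beta=2.5, seed=0):
    """FiLM-style training masking (sketch): draw one mask probability per
    sequence from a Beta distribution, then mask each token independently
    with that probability. The model is trained to predict the original
    tokens at the masked positions."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)                       # shape (batch, seq)
    p = rng.beta(alpha, beta, size=(token_ids.shape[0], 1)) # per-sequence rate
    mask = rng.random(token_ids.shape) < p
    corrupted = np.where(mask, MASK_ID, token_ids)
    return corrupted, mask
```

Varying the mask rate per sequence, rather than fixing it as in BERT-style training, is what exposes the model to every fill-in ratio it may face at inference time.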

2023-10-12 Stronger Coreset Bounds for Kernel Density Estimators via Chaining.
Summary: We apply the discrepancy method and a chaining approach to give improved bounds on the coreset complexity of a wide class of kernel functions. Our results give randomized polynomial time algorithms to produce coresets of size $O\big(\frac{\sqrt{d}}{\varepsilon}\sqrt{\log\log \frac{1}{\varepsilon}}\big)$ for the Gaussian and Laplacian kernels in the case that the data set is uniformly bounded, an improvement that was not possible with previous techniques. We also obtain coresets of size $O\big(\frac{1}{\varepsilon}\sqrt{\log\log \frac{1}{\varepsilon}}\big)$ for the Laplacian kernel for $d$ constant. Finally, we give the best known bounds of $O\big(\frac{\sqrt{d}}{\varepsilon}\sqrt{\log(2\max\{1,\alpha\})}\big)$ on the coreset complexity of the exponential, Hellinger, and JS Kernels, where $1/\alpha$ is the bandwidth parameter of the kernel.

2023-10-12 On the Rational Degree of Boolean Functions and Applications.
Summary: We study a natural complexity measure of Boolean functions known as the (exact) rational degree. For total functions $f$, it is conjectured that $\mathrm{rdeg}(f)$ is polynomially related to $\mathrm{deg}(f)$, where $\mathrm{deg}(f)$ is the Fourier degree. Towards this conjecture, we show that symmetric functions have rational degree at least $\mathrm{deg}(f)/2$ and monotone functions have rational degree at least $\sqrt{\mathrm{deg}(f)}$. We observe that both of these lower bounds are tight. In addition, we show that all read-once depth-$d$ Boolean formulae have rational degree at least $\Omega(\mathrm{deg}(f)^{1/d})$. Furthermore, we show that almost every Boolean function on $n$ variables has rational degree at least $n/2 - O(\sqrt{n})$. In contrast to total functions, we exhibit partial functions that witness unbounded separations between rational and approximate degree, in both directions. As a consequence, we show that for quantum computers, postselection and bounded error are incomparable resources in the black-box model.

2023-10-03 Learning quantum Hamiltonians at any temperature in polynomial time.
Summary: We study the problem of learning a local quantum Hamiltonian $H$ given copies of its Gibbs state $\rho = e^{-\beta H}/\textrm{tr}(e^{-\beta H})$ at a known inverse temperature $\beta>0$. Anshu, Arunachalam, Kuwahara, and Soleimanifar (arXiv:2004.07266) gave an algorithm to learn a Hamiltonian on $n$ qubits to precision $\epsilon$ with only polynomially many copies of the Gibbs state, but which takes exponential time. Obtaining a computationally efficient algorithm has been a major open problem [Alhambra'22 (arXiv:2204.08349)], [Anshu, Arunachalam'22 (arXiv:2204.08349)], with prior work only resolving this in the limited cases of high temperature [Haah, Kothari, Tang'21 (arXiv:2108.04842)] or commuting terms [Anshu, Arunachalam, Kuwahara, Soleimanifar'21]. We fully resolve this problem, giving a polynomial time algorithm for learning $H$ to precision $\epsilon$ from polynomially many copies of the Gibbs state at any constant $\beta > 0$. Our main technical contribution is a new flat polynomial approximation to the exponential function, and a translation between multivariate scalar polynomials and nested commutators. This enables us to formulate Hamiltonian learning as a polynomial system. We then show that solving a low-degree sum-of-squares relaxation of this polynomial system suffices to accurately learn the Hamiltonian.

2023-09-11 Textbooks Are All You Need II: phi-1.5 technical report.
Summary: We continue the investigation into the power of smaller Transformer-based language models as initiated by \textbf{TinyStories}, a 10 million parameter model that can produce coherent English, and the follow-up work on \textbf{phi-1}, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate ``textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the ``Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named \textbf{phi-1.5}, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, \textbf{phi-1.5} exhibits many of the traits of much larger LLMs, both good, such as the ability to ``think step by step" or perform some rudimentary in-context learning, and bad, including hallucinations and the potential for toxic and biased generations; encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source \textbf{phi-1.5} to promote further research on these urgent topics.

2023-08-28 Faster Min-Cost Flow on Bounded Treewidth Graphs.
Summary: We present a $\widetilde{O}(m\sqrt{\tau}+n\tau)$ time algorithm for finding a minimum-cost flow in graphs with $n$ vertices and $m$ edges, given a tree decomposition of width $\tau$ and polynomially bounded integer costs and capacities. This improves upon the current best algorithms for general linear programs bounded by treewidth which run in $\widetilde{O}(m \tau^{(\omega+1)/2})$ time by [Dong-Lee-Ye, 21] and [Gu-Song, 22], where $\omega \approx 2.37$ is the matrix multiplication exponent. Our approach leverages recent advances in structured linear program solvers and robust interior point methods. As a corollary, for any graph $G$ with $n$ vertices, $m$ edges, and treewidth $\tau$, we obtain a $\widetilde{O}(\tau^3 \cdot m)$ time algorithm to compute a tree decomposition of $G$ with width $O(\tau \cdot \log n)$.

2023-08-21 Clustered Linear Contextual Bandits with Knapsacks.
Summary: In this work, we study clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models. The arms are divided into clusters, with the cluster memberships being unknown to the algorithm. Pulling an arm in a time period results in a reward and in consumption of each one of multiple resources, and the algorithm terminates once the total consumption of any resource exceeds its constraint. Thus, maximizing the total reward requires learning not only models of the reward and the resource consumption, but also the cluster memberships. We provide an algorithm that achieves regret sublinear in the number of time periods, without requiring access to all of the arms. In particular, we show that it suffices to perform clustering only once, on a randomly selected subset of the arms. To achieve this result, we provide a sophisticated combination of techniques from the econometrics literature and the literature on bandits with constraints.

2023-08-18 Do you know what q-means?
Summary: Clustering is one of the most important tools for analysis of large datasets, and perhaps the most popular clustering algorithm is Lloyd's iteration for $k$-means. This iteration takes $N$ vectors $v_1,\dots,v_N\in\mathbb{R}^d$ and outputs $k$ centroids $c_1,\dots,c_k\in\mathbb{R}^d$; these partition the vectors into clusters based on which centroid is closest to a particular vector. We present an overall improved version of the "$q$-means" algorithm, the quantum algorithm originally proposed by Kerenidis, Landman, Luongo, and Prakash (2019) which performs $\varepsilon$-$k$-means, an approximate version of $k$-means clustering. This algorithm does not rely on the quantum linear algebra primitives of prior work, instead only using its QRAM to prepare and measure simple states based on the current iteration's clusters. The time complexity is $O\big(\frac{k^{2}}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ and maintains the polylogarithmic dependence on $N$ while improving the dependence on most of the other parameters. We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^{2}}{\varepsilon^2}(kd + \log(Nd))\big)$ time. Notably, this classical algorithm matches the polylogarithmic dependence on $N$ attained by the quantum algorithms.
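
For reference, the classical step that $q$-means approximates, namely one Lloyd's iteration:

```python
import numpy as np

def lloyd_iteration(V, C):
    """One Lloyd's iteration for k-means: assign each of the N vectors in V
    (shape (N, d)) to its nearest of the k centroids in C (shape (k, d)),
    then replace each centroid by the mean of its cluster. Empty clusters
    keep their previous centroid."""
    dists = np.linalg.norm(V[:, None, :] - C[None, :, :], axis=2)  # (N, k)
    labels = dists.argmin(axis=1)
    new_C = np.vstack([
        V[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
        for j in range(C.shape[0])
    ])
    return new_C, labels
```

The quantum and dequantized algorithms in the paper replace the exact assignment and averaging above with approximate versions, which is where the $\varepsilon$ in $\varepsilon$-$k$-means enters.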

2023-08-16 Convergence of Two-Layer Regression with Nonlinear Units.
Summary: Large language models (LLMs), such as ChatGPT and GPT-4, have shown outstanding performance in many tasks of human life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are the key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we derive a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove Lipschitz continuity and the PSDness of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of distance to the optimal solution. Lastly, we relax the Lipschitz condition and prove convergence in the sense of loss value.

2023-08-11 A Better-Than-1.6-Approximation for Prize-Collecting TSP.
Summary: Prize-Collecting TSP is a variant of the traveling salesperson problem where one may drop vertices from the tour at the cost of vertex-dependent penalties. The quality of a solution is then measured by adding the length of the tour and the sum of all penalties of vertices that are not visited. We present a polynomial-time approximation algorithm with an approximation guarantee slightly below $1.6$, where the guarantee is with respect to the natural linear programming relaxation of the problem. This improves upon the previous best-known approximation ratio of $1.774$. Our approach is based on a known decomposition for solutions of this linear relaxation into rooted trees. Our algorithm takes a tree from this decomposition and then performs a pruning step before doing parity correction on the remainder. Using a simple analysis, we bound the approximation guarantee of the proposed algorithm by $(1+\sqrt{5})/2 \approx 1.618$, the golden ratio. With some additional technical care we further improve it to $1.599$.

2023-08-02 Optimal Online Discrepancy Minimization.
Summary: We prove that there exists an online algorithm that for any sequence of vectors $v_1,\ldots,v_T \in \mathbb{R}^n$ with $\|v_i\|_2 \leq 1$, arriving one at a time, decides random signs $x_1,\ldots,x_T \in \{-1,1\}$ so that for every $t \le T$, the prefix sum $\sum_{i=1}^t x_iv_i$ is $10$-subgaussian. This improves over the work of Alweiss, Liu and Sawhney, who kept prefix sums $O(\sqrt{\log (nT)})$-subgaussian, and gives an $O(\sqrt{\log T})$ bound on the discrepancy $\max_{t \in [T]} \|\sum_{i=1}^t x_i v_i\|_\infty$. Our proof combines a generalization of Banaszczyk's prefix balancing result to trees with a cloning argument to find distributions rather than single colorings. We also show a matching $\Omega(\sqrt{\log T})$ strategy for an oblivious adversary.
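
For context, here is a sketch of the earlier self-balancing random walk of Alweiss, Liu and Sawhney that this result improves upon: each sign is biased against the current prefix sum. The bias scale `c` below is an illustrative constant standing in for the theoretical polylogarithmic choice.

```python
import numpy as np

def self_balancing_walk(vectors, c=30.0, seed=0):
    """Self-balancing random walk (sketch of the Alweiss-Liu-Sawhney approach,
    not this paper's algorithm): for each arriving vector, choose the sign +1
    with probability (1 - <w, v>/c)/2, where w is the current prefix sum, so
    the prefix sum is pushed back toward the origin."""
    rng = np.random.default_rng(seed)
    w = np.zeros(vectors.shape[1])
    signs = []
    for v in vectors:
        corr = np.clip(np.dot(w, v) / c, -1.0, 1.0)  # keep the probability valid
        x = 1 if rng.random() < 0.5 * (1.0 - corr) else -1
        w += x * v
        signs.append(x)
    return np.array(signs), w
```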

2023-07-30 Polytopes with Bounded Integral Slack Matrices Have Sub-Exponential Extension Complexity.
Summary: We show that any bounded integral function $f : A \times B \mapsto \{0,1, \dots, \Delta\}$ with rank $r$ has deterministic communication complexity $\Delta^{O(\Delta)} \cdot \sqrt{r} \cdot \log^2 r$, where the rank of $f$ is defined to be the rank of the $A \times B$ matrix whose entries are the function values. As a corollary, we show that any $n$dimensional polytope that admits a slack matrix with entries from $\{0,1,\dots,\Delta\}$ has extension complexity at most $\exp(\Delta^{O(\Delta)} \cdot \sqrt{n} \cdot \log^2 n)$.

2023-07-17 Zeroth Order Algorithm for Softmax Attention Optimization.
Summary: Large language models (LLMs) have brought about significant transformations in human society. Among the crucial computations in LLMs, the softmax unit holds great importance. It helps the model generate a probability distribution over potential subsequent words or phrases, given a series of input words. Using this distribution, the model selects the most probable next word or phrase, based on the assigned probabilities. The softmax unit assumes a vital function in LLM training as it facilitates learning from data through the adjustment of neural network weights and biases. As LLMs grow in size, computing the gradient becomes expensive. However, zeroth-order methods can approximately compute the gradient using only forward passes. In this paper, we present a zeroth-order algorithm specifically tailored for softmax optimization. We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale LLMs. By leveraging the zeroth-order method, our work contributes to the advancement of optimization techniques in the context of complex language models.
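
For concreteness, here is a standard two-point zeroth-order gradient estimator of the kind alluded to above, as a generic sketch (not necessarily the paper's exact scheme); the smoothing radius and sample count are illustrative.

```python
import numpy as np

def two_point_gradient_estimate(f, x, num_samples=10, mu=1e-3, seed=0):
    """Standard two-point zeroth-order gradient estimator: probe f along
    random Gaussian directions using only forward evaluations and average
    the resulting finite differences. Uses 2*num_samples calls to f and no
    backpropagation."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / num_samples
```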

2023-07-07 Scalable Membership Inference Attacks via Quantile Regression.
Summary: Membership inference attacks are designed to determine, using black-box access to trained models, whether a particular example was used in training or not. Membership inference can be formalized as a hypothesis testing problem. The most effective existing attacks estimate the distribution of some test statistic (usually the model's confidence on the true label) on points that were (and were not) used in training by training many \emph{shadow models}, i.e. models of the same architecture as the model being attacked, trained on a random subsample of data. While effective, these attacks are extremely computationally expensive, especially when the model under attack is large. We introduce a new class of attacks based on performing quantile regression on the distribution of confidence scores induced by the model under attack on points that are not used in training. We show that our method is competitive with state-of-the-art shadow model attacks, while requiring substantially less compute because our attack requires training only a single model. Moreover, unlike shadow model attacks, our proposed attack does not require any knowledge of the architecture of the model under attack and is therefore truly ``black-box". We show the efficacy of this approach in an extensive series of experiments on various datasets and model architectures.
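
A minimal sketch of the attack as described: fit a conditional quantile of confidence scores on known non-members, then flag points whose observed confidence exceeds the predicted quantile. The feature representation and quantile level below are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_quantile_attack(nonmember_features, nonmember_confidences, q=0.95):
    """Quantile-regression membership attack (sketch following the abstract):
    train a single regressor predicting the q-th quantile of the target
    model's confidence on points known NOT to be training members. No shadow
    models and no knowledge of the target architecture are needed."""
    model = GradientBoostingRegressor(loss="quantile", alpha=q)
    model.fit(nonmember_features, nonmember_confidences)
    return model

def predict_membership(model, features, confidences):
    """Flag a point as a likely member if its observed confidence exceeds the
    predicted q-th quantile for a non-member with those features."""
    return np.asarray(confidences) > model.predict(features)
```

The quantile level `q` directly controls the attack's false-positive rate on non-members, which is what makes the hypothesis-testing framing convenient.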

2023-06-20 Textbooks Are All You Need.
Summary: We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of ``textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

2023-06-13 Efficient Algorithm for Solving Hyperbolic Programs.
Summary: Hyperbolic polynomials are a class of real-rooted polynomials with a wide range of applications in theoretical computer science. Each hyperbolic polynomial also induces a hyperbolic cone that is of particular interest in optimization due to its generality: by choosing the polynomial properly, one can easily recover classic optimization problems such as linear programming and semidefinite programming. In this work, we develop efficient algorithms for hyperbolic programming, the problem in which one wants to minimize a linear objective under a system of linear constraints, with the solution required to lie in the hyperbolic cone induced by the hyperbolic polynomial. Our algorithm is an instance of an interior point method (IPM) that, instead of following the central path, follows the central swath, which is a generalization of the central path. To implement the IPM efficiently, we utilize a relaxation of the hyperbolic program to a quadratic program, coupled with the first four moments of the hyperbolic eigenvalues, which are crucial for updating the optimization direction. We further show that, given an evaluation oracle for the polynomial, our algorithm requires only $O(n^2d^{2.5})$ oracle calls, where $n$ is the number of variables and $d$ is the degree of the polynomial, plus $O((n+m)^3 d^{0.5})$ extra arithmetic operations, where $m$ is the number of constraints.

2023-06-02 An Empirical Study on Challenging Math Problem Solving with GPT-4.
Summary: Employing Large Language Models (LLMs) to address mathematical problems is an intriguing research endeavor, considering the abundance of math problems expressed in natural language across numerous science and engineering fields. While several prior works have investigated solving elementary mathematics using LLMs, this work explores the frontier of using GPT-4 for solving more complex and challenging math problems. We evaluate various ways of using GPT-4. Some of them are adapted from existing work, and one is MathChat, a conversational problem-solving framework newly proposed in this work. We perform the evaluation on difficult high school competition problems from the MATH dataset, which shows the advantage of the proposed conversational approach.

20230601 Faster Robust Tensor Power Method for Arbitrary Order.
Summary: Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data. The \emph{tensor power method} (TPM) is one of the widely used techniques in the decomposition of tensors. This paper presents a novel tensor power method for decomposing tensors of arbitrary order, which overcomes limitations of existing approaches that are often restricted to lower-order (less than $3$) tensors or require strong assumptions about the underlying data structure. We apply a sketching method and achieve a running time of $\widetilde{O}(n^{p-1})$ for a $p$-th order tensor of dimension $n$. We provide a detailed analysis for tensors of any order $p$, which was not given in previous works.
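As a reference point, here is a minimal sketch of the classical (unsketched) power iteration for a symmetric third-order tensor; the paper's contribution is a faster, sketched variant for arbitrary order $p$, which this sketch does not implement.

```python
import numpy as np

def tensor_power_iteration(T, iters=100, seed=0):
    """Recover one component of a symmetric 3rd-order tensor via power iteration.

    Classical TPM update: u <- T(I, u, u) / ||T(I, u, u)||.
    """
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    for _ in range(iters):
        v = np.einsum("ijk,j,k->i", T, u, u)  # contract T with u twice
        u = v / np.linalg.norm(v)
    lam = np.einsum("ijk,i,j,k->", T, u, u, u)  # Rayleigh-quotient analog
    return lam, u

# Usage: the rank-1 symmetric tensor T = 2 * a (x) a (x) a is recovered exactly.
a = np.array([3.0, 4.0]) / 5.0
T = 2.0 * np.einsum("i,j,k->ijk", a, a, a)
lam, u = tensor_power_iteration(T)
assert abs(lam - 2.0) < 1e-8
```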

20230531 Doubly Constrained Fair Clustering.
Summary: The remarkable attention which fair clustering has received in the last few years has resulted in a significant number of different notions of fairness. Despite the fact that these notions are well-justified, they are often motivated and studied in a disjoint manner, where one fairness desideratum is considered exclusively in isolation from the others. This leaves the understanding of the relations between different fairness notions as an important open problem in fair clustering. In this paper, we take the first step in this direction. Specifically, we consider the two most prominent demographic representation fairness notions in clustering: (1) Group Fairness (GF), where the different demographic groups are supposed to have close to population-level representation in each cluster, and (2) Diversity in Center Selection (DS), where the selected centers are supposed to have close to population-level representation of each group. We show that given a constant approximation algorithm for one constraint (GF or DS only) we can obtain a constant approximation solution that satisfies both constraints simultaneously. Interestingly, we prove that any given solution that satisfies the GF constraint can always be post-processed, at a bounded degradation to the clustering cost, to additionally satisfy the DS constraint, while the reverse is not true. Furthermore, we show that both GF and DS are incompatible (having an empty feasibility set in the worst case) with a collection of other distance-based fairness notions. Finally, we carry out experiments to validate our theoretical findings.

20230525 Learning across Data Owners with Joint Differential Privacy.
Summary: In this paper, we study the setting in which data owners train machine learning models collaboratively under a privacy notion called joint differential privacy [Kearns et al., 2018]. In this setting, the model trained for each data owner $j$ uses $j$'s data without privacy consideration and other owners' data with differential privacy guarantees. This setting was initiated in [Jain et al., 2021] with a focus on linear regression. In this paper, we study this setting for stochastic convex optimization (SCO). We present an algorithm that is a variant of DP-SGD [Song et al., 2013; Abadi et al., 2016] and provide theoretical bounds on its population loss. We compare our algorithm to several baselines and discuss the parameter setups for which our algorithm is preferable. We also empirically study joint differential privacy in the multi-class classification problem over two public datasets. Our empirical findings are well connected to the insights from our theoretical results.

20230515 Sparsifying sums of norms.
Summary: For any norms $N_1,\ldots,N_m$ on $\mathbb{R}^n$ and $N(x) := N_1(x)+\cdots+N_m(x)$, we show there is a sparsified norm $\tilde{N}(x) = w_1 N_1(x) + \cdots + w_m N_m(x)$ such that $|N(x) - \tilde{N}(x)| \leq \epsilon N(x)$ for all $x \in \mathbb{R}^n$, where $w_1,\ldots,w_m$ are nonnegative weights, of which only $O(\epsilon^{-2} n \log(n/\epsilon) (\log n)^{2.5} )$ are nonzero. Additionally, if $N$ is $\mathrm{poly}(n)$-equivalent to the Euclidean norm on $\mathbb{R}^n$, then such weights can be found with high probability in time $O(m (\log n)^{O(1)} + \mathrm{poly}(n)) \cdot T$, where $T$ is the time required to evaluate a norm $N_i$. This immediately yields analogous statements for sparsifying sums of symmetric submodular functions. More generally, we show how to sparsify sums of $p$-th powers of norms when the sum is $p$-uniformly smooth.

20230515 Linear-Sized Sparsifiers via Near-Linear Time Discrepancy Theory.
Summary: Discrepancy theory provides powerful tools for producing higher-quality objects which "beat the union bound" in fundamental settings throughout combinatorics and computer science. However, this quality has often come at the price of more expensive algorithms. We introduce a new framework for bridging this gap, by allowing for the efficient implementation of discrepancy-theoretic primitives. Our framework repeatedly solves regularized optimization problems to low accuracy to approximate the partial coloring method of [Rot17], and simplifies and generalizes recent work of [JSS23] on fast algorithms for Spencer's theorem. In particular, our framework only requires that the discrepancy body of interest has exponentially large Gaussian measure and is expressible as a sublevel set of a symmetric, convex function. We combine this framework with new tools for proving Gaussian measure lower bounds to give improved algorithms for a variety of sparsification and coloring problems. As a first application, we use our framework to obtain an $\widetilde{O}(m \cdot \epsilon^{-3.5})$ time algorithm for constructing an $\epsilon$-approximate spectral sparsifier of an $m$-edge graph, matching the sparsity of [BSS14] up to constant factors and improving upon the $\widetilde{O}(m \cdot \epsilon^{-6.5})$ runtime of [LeeS17]. We further give a state-of-the-art algorithm for constructing graph ultrasparsifiers and an almost-linear time algorithm for constructing linear-sized degree-preserving sparsifiers via discrepancy theory; in the latter case, such sparsifiers were not known to exist previously. We generalize these results to their analogs in sparsifying isotropic sums of positive semidefinite matrices. Finally, to demonstrate the versatility of our technique, we obtain a nearly-input-sparsity time constructive algorithm for Spencer's theorem (where we recover a recent result of [JSS23]).

20230505 On Optimization and Counting of Non-Broken Bases of Matroids.
Summary: Given a matroid $M=(E,{\cal I})$ and a total ordering over the elements $E$, a broken circuit is a circuit with its smallest element removed, and an NBC independent set is an independent set in ${\cal I}$ containing no broken circuit. The NBC independent sets of any matroid $M$ define a simplicial complex called the broken circuit complex, which has been the subject of intense study in combinatorics. Recently, Adiprasito, Huh and Katz showed that the face numbers of any broken circuit complex form a log-concave sequence, proving a longstanding conjecture of Rota. We study counting and optimization problems on NBC bases of a generic matroid. We find several fundamental differences with the independent set complex: for example, we show that it is NP-hard to find the max-weight NBC base of a matroid, and that the convex hull of the NBC bases of a matroid can have edges of arbitrarily large length. We also give evidence that the natural down-up walk on the space of NBC bases of a matroid may not mix rapidly, by showing that for some family of matroids it is NP-hard to count the number of NBC bases after certain conditionings.

20230504 Automatic Prompt Optimization with "Gradient Descent" and Beam Search.
Summary: Large Language Models (LLMs) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand-written with onerous trial-and-error effort. We propose a simple and non-parametric solution to this problem, Automatic Prompt Optimization (APO), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that Automatic Prompt Optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.
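A minimal sketch of the loop described above, assuming a hypothetical text-completion API `llm(prompt) -> str` and a scoring function; the meta-prompt wording is illustrative, not the paper's exact templates, and the top-k selection below is a simple stand-in for the bandit procedure.

```python
import random

def apo_step(prompt, minibatch, llm, beam_width=4):
    """One 'gradient descent' step: critique the prompt, then edit against it."""
    errors = [(x, y) for x, y in minibatch if llm(f"{prompt}\nInput: {x}") != y]
    if not errors:
        return [prompt]
    # Natural-language "gradient": a critique of the current prompt.
    gradient = llm(
        f"Prompt: {prompt}\nThe prompt failed on: {errors[:3]}\n"
        "Describe what is wrong with the prompt."
    )
    # "Propagate" the gradient: edit the prompt in the opposite semantic direction.
    return [
        llm(f"Prompt: {prompt}\nCritique: {gradient}\nWrite an improved prompt.")
        for _ in range(beam_width)
    ]

def apo(prompt, data, llm, score, rounds=5, beam_width=4):
    beam = [prompt]
    for _ in range(rounds):
        minibatch = random.sample(data, min(16, len(data)))
        candidates = [p2 for p in beam for p2 in apo_step(p, minibatch, llm, beam_width)]
        # Beam search / bandit selection stand-in: keep the top-scoring prompts.
        beam = sorted(set(candidates + beam), key=score, reverse=True)[:beam_width]
    return beam[0]
```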

20230504 A Hyperbolic Extension of Kadison-Singer Type Results.
Summary: In 2013, Marcus, Spielman, and Srivastava resolved the famous Kadison-Singer conjecture. It states that for $n$ independent random vectors $v_1,\cdots, v_n$ that have expected squared norm bounded by $\epsilon$ and are in isotropic position in expectation, there is a positive probability that the determinant polynomial $\det(xI - \sum_{i=1}^n v_iv_i^\top)$ has roots bounded by $(1 + \sqrt{\epsilon})^2$. An interpretation of the Kadison-Singer theorem is that we can always find a partition of the vectors $v_1,\cdots,v_n$ into two sets with a low discrepancy in terms of the spectral norm (in other words, relying on the determinant polynomial). In this paper, we provide two results for a broader class of polynomials, the hyperbolic polynomials. Furthermore, our results are in two generalized settings: $\bullet$ The first one shows that the Kadison-Singer result holds under a weaker assumption, namely that the vectors have a bounded sum of hyperbolic norms. $\bullet$ The second one relaxes the Kadison-Singer result's distribution assumption to Strongly Rayleigh distributions. To the best of our knowledge, the previous results only support determinant polynomials [Anari and Oveis Gharan'14, Kyng, Luh and Song'20], and it is unclear whether they can be generalized to a broader class of polynomials. In addition, we also provide a sub-exponential time algorithm for the constructions underlying our results.

20230420 Attention Scheme Inspired Softmax Regression.
Summary: Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a distribution over possible next words or phrases, given a sequence of input words; this distribution is then used to select the most likely next word or phrase, based on the probabilities assigned by the model. The softmax unit plays a crucial role in training LLMs, as it allows the model to learn from the data by adjusting the weights and biases of the neural network. In the area of convex optimization, for example in the central path method for solving linear programming, the softmax function has also been used as a crucial tool for controlling the progress and stability of the potential function [Cohen, Lee and Song STOC 2019, Brand SODA 2020]. In this work, inspired by the softmax unit, we define a softmax regression problem. Formally speaking, given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to use a greedy-type algorithm to solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2. \end{align*} In a certain sense, our provable convergence result provides theoretical support for why greedy algorithms can be used to train the softmax function in practice.
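A minimal numerical sketch of the objective above, using plain gradient descent as a simple stand-in for the paper's greedy-type algorithm (the analytic gradient follows from the softmax Jacobian $\mathrm{diag}(p) - pp^\top$):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

def softmax_regression(A, b, lr=0.5, iters=2000):
    """Gradient descent on f(x) = || softmax(Ax) - b ||_2^2."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        p = softmax(A @ x)
        # Chain rule: grad = 2 A^T (diag(p) - p p^T) (p - b).
        grad = 2.0 * A.T @ ((np.diag(p) - np.outer(p, p)) @ (p - b))
        x -= lr * grad
    return x

# Usage: b is itself a softmax of A x*, so the optimal value is ~0.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_star = rng.standard_normal(5)
b = softmax(A @ x_star)
x_hat = softmax_regression(A, b)
print(np.linalg.norm(softmax(A @ x_hat) - b))  # small residual
```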

20230416 Thin trees for laminar families.
Summary: In the laminar-constrained spanning tree problem, the goal is to find a minimum-cost spanning tree which respects upper bounds on the number of times each cut in a given laminar family is crossed. This generalizes the well-studied degree-bounded spanning tree problem, as well as a previously studied setting where a chain of cuts is given. We give the first constant-factor approximation algorithm; in particular we show how to obtain a multiplicative violation of the crossing bounds of less than 22 while losing less than a factor of 5 in terms of cost. Our result compares to the natural LP relaxation. As a consequence, our results show that given a $k$-edge-connected graph and a laminar family $\mathcal{L} \subseteq 2^V$ of cuts, there exists a spanning tree which contains only an $O(1/k)$ fraction of the edges across every cut in $\mathcal{L}$. This can be viewed as progress towards the Thin Tree Conjecture, which (in a strong form) states that this guarantee can be obtained for all cuts simultaneously.

20230413 Solving Tensor Low Cycle Rank Approximation.
Summary: Large language models have become ubiquitous in modern life, finding applications in various domains such as natural language processing, language translation, and speech recognition. Recently, a breakthrough work [Zhao, Panigrahi, Ge, and Arora Arxiv 2023] explains the attention model via probabilistic context-free grammars (PCFG). One of the central computational tasks for computing probabilities in a PCFG can be formulated as a particular tensor low-rank approximation problem, which we call tensor cycle rank. Given an $n \times n \times n$ third-order tensor $A$, we say that $A$ has cycle rank $k$ if there exist three matrices $U, V$, and $W$ of size $n \times k^2$ such that \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1) } \end{align*} for all $a \in [n], b \in [n], c \in [n]$. The classical tensor rank, Tucker rank, and train rank have been well studied in [Song, Woodruff, Zhong SODA 2019]. In this paper, we generalize the previous ``rotation and sketch'' technique from page 186 of [Song, Woodruff, Zhong SODA 2019] and show an input-sparsity-time algorithm for cycle rank.
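To unpack the indexing, a small sketch that materializes a cycle-rank-$k$ tensor directly from the definition above (reading $\otimes$ as the product of the three scalar entries, and using 0-indexed column index $i + kj$ for the pair $(i, j)$):

```python
import numpy as np

def cycle_rank_tensor(U, V, W, n, k):
    """Materialize A[a,b,c] = sum_{i,j,l} U[a, i+k*j] * V[b, j+k*l] * W[c, l+k*i]
    (0-indexed version of the cycle-rank definition)."""
    A = np.zeros((n, n, n))
    for i in range(k):
        for j in range(k):
            for l in range(k):
                A += np.einsum(
                    "a,b,c->abc",
                    U[:, i + k * j],
                    V[:, j + k * l],
                    W[:, l + k * i],
                )
    return A

n, k = 6, 2
rng = np.random.default_rng(0)
U, V, W = (rng.standard_normal((n, k * k)) for _ in range(3))
A = cycle_rank_tensor(U, V, W, n, k)
print(A.shape)  # (6, 6, 6): a tensor of cycle rank at most k = 2
```

Note the cyclic pattern in the three column indices, which is what distinguishes cycle rank from the Tucker and train decompositions.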

20230410 Randomized and Deterministic Attention Sparsification Algorithms for Overparameterized Feature Dimension.
Summary: Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interest in theory. Recently the static computation and dynamic maintenance of the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both the algorithmic and the hardness perspective. In this work, we consider the sparsification of the attention problem. We make one simplification, namely that the logit matrix is symmetric. Let $n$ denote the length of the sentence and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0,0.1)$; we then aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m\ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r). \end{align*} We provide two results for this problem. $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega} )$ time, succeeds with probability $1-\delta$, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of nonzero entries in $X$, and $\omega$ denotes the exponent of matrix multiplication, currently $\omega \approx 2.373$. $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denotes the $i$-th column of the matrix $X$. Our main findings have the following implication for applied LLM tasks: for any very large feature dimension, we can reduce it to a size nearly linear in the length of the sentence.
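A small numerical illustration of the guarantee, with a JL-style Gaussian sketch standing in for the paper's algorithm (the actual construction is not shown here; this only demonstrates that some $Y$ with $m \ll d$ can preserve the normalized attention matrix entrywise):

```python
import numpy as np

def attention(X):
    """Row-normalized exp(X X^T), i.e., D(X)^{-1} exp(X X^T)."""
    L = np.exp(X @ X.T)
    return L / L.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, m = 32, 4096, 512
X = rng.standard_normal((n, d)) * 0.05 / np.sqrt(d)  # keeps ||X X^T||_inf small
Y = X @ rng.standard_normal((d, m)) / np.sqrt(m)     # Y Y^T approximates X X^T
err = np.abs(attention(Y) - attention(X)).max()
print(err)  # small entrywise error, in the spirit of the bound above
```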

20230407 Convex Minimization with Integer Minima in $\widetilde O(n^4)$ Time.
Summary: Given a convex function $f$ on $\mathbb{R}^n$ with an integer minimizer, we show how to find an exact minimizer of $f$ using $O(n^2 \log n)$ calls to a separation oracle and $O(n^4 \log n)$ time. The previous best polynomial time algorithm for this problem, given in [Jiang, SODA 2021, JACM 2022], achieves $O(n^2\log\log n/\log n)$ oracle complexity. However, the overall runtime of Jiang's algorithm is at least $\widetilde{\Omega}(n^8)$, due to expensive subroutines such as the Lenstra-Lenstra-Lov\'asz (LLL) algorithm [Lenstra, Lenstra, Lov\'asz, Math. Ann. 1982] and the random-walk-based cutting plane method [Bertsimas, Vempala, JACM 2004]. Our significant speedup is obtained by a nontrivial combination of a faster version of the LLL algorithm due to [Neumaier, Stehl\'e, ISSAC 2016] that gives similar guarantees, the volumetric center cutting plane method (CPM) by [Vaidya, FOCS 1989], and its fast implementation given in [Jiang, Lee, Song, Wong, STOC 2020]. For the special case of submodular function minimization (SFM), our result implies a strongly polynomial time algorithm for this problem using $O(n^3 \log n)$ calls to an evaluation oracle and $O(n^4 \log n)$ additional arithmetic operations. Both the oracle complexity and the number of arithmetic operations of our more general algorithm are better than those of the previous best-known algorithms for this specific problem, given in [Lee, Sidford, Wong, FOCS 2015] and [Dadush, V\'egh, Zambelli, SODA 2018, MOR 2021].

20230327 Revisiting Area Convexity: Faster Box-Simplex Games and Spectrahedral Generalizations.
Summary: We investigate different aspects of area convexity [Sherman '17], a mysterious tool introduced to tackle optimization problems under the challenging $\ell_\infty$ geometry. We develop a deeper understanding of its relationship with more conventional analyses of extragradient methods [Nemirovski '04, Nesterov '07]. We also give improved solvers for the subproblems required by variants of the [Sherman '17] algorithm, designed through the lens of relative smoothness [Bauschke-Bolte-Teboulle '17, Lu-Freund-Nesterov '18]. Leveraging these new tools, we give a state-of-the-art first-order algorithm for solving box-simplex games (a primal-dual formulation of $\ell_\infty$ regression) in a $d \times n$ matrix with bounded rows, using $O(\log d \cdot \epsilon^{-1})$ matrix-vector queries. As a consequence, we obtain improved complexities for approximate maximum flow, optimal transport, min-mean-cycle, and other basic combinatorial optimization problems. We also develop a near-linear time algorithm for a matrix generalization of box-simplex games, capturing a family of problems closely related to semidefinite programs recently used as subroutines in robust statistics and numerical linear algebra.

20230326 The Subspace Flatness Conjecture and Faster Integer Programming.
Summary: In a seminal paper, Kannan and Lov\'asz (1988) considered a quantity $\mu_{KL}(\Lambda,K)$ which denotes the best volume-based lower bound on the covering radius $\mu(\Lambda,K)$ of a convex body $K$ with respect to a lattice $\Lambda$. Kannan and Lov\'asz proved that $\mu(\Lambda,K) \leq n \cdot \mu_{KL}(\Lambda,K)$, and the Subspace Flatness Conjecture by Dadush (2012) claims that a $O(\log(2n))$ factor suffices, which would match the lower bound from the work of Kannan and Lov\'asz. We settle this conjecture up to a constant in the exponent by proving that $\mu(\Lambda,K) \leq O(\log^{3}(2n)) \cdot \mu_{KL} (\Lambda,K)$. Our proof is based on the Reverse Minkowski Theorem due to Regev and Stephens-Davidowitz (2017). Following the work of Dadush (2012, 2019), we obtain a $(\log(2n))^{O(n)}$-time randomized algorithm to solve integer programs in $n$ variables. Another implication of our main result is a near-optimal flatness constant of $O(n \log^{3}(2n))$.

20230322 Sparks of Artificial General Intelligence: Early experiments with GPT-4.
Summary: Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

20230314 Testing Causality for High Dimensional Data.
Summary: Determining causal relationships between high-dimensional observations is among the most important tasks in scientific discovery. In this paper, we revisit the \emph{linear trace method}, a technique proposed in~\citep{janzing2009telling,zscheischler2011testing} to infer the causal direction between two random variables of high dimension. We strengthen the existing results significantly by providing an improved tail analysis, in addition to extending the results to nonlinear trace functionals with sharper confidence bounds under certain distributional assumptions. We obtain our results by interpreting the trace estimator in the causal regime as a function over random orthogonal matrices, to which the concentration of Lipschitz functions over such spaces can be applied. We additionally propose a novel ridge-regularized variant of the estimator in \cite{zscheischler2011testing}, and give provable bounds relating the ridge-estimated terms to their ground-truth counterparts. We support our theoretical results with encouraging experiments on synthetic datasets, most prominently in the high-dimension, low-sample-size regime.
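A minimal numerical sketch of the idea behind the linear trace method: in the causal direction $Y = AX$, a "generic" $A$ is independent of the input covariance, so a normalized trace ratio concentrates near 1, while the backward direction typically deviates. The normalization below is one standard form of the trace condition from~\citep{janzing2009telling}, not this paper's refined estimator.

```python
import numpy as np

def trace_ratio(A, Sigma):
    """rho(A, Sigma) = n * tr(A Sigma A^T) / (tr(A A^T) * tr(Sigma)).
    Close to 1 when A is 'generic' relative to Sigma (the causal direction)."""
    n = Sigma.shape[0]
    return n * np.trace(A @ Sigma @ A.T) / (np.trace(A @ A.T) * np.trace(Sigma))

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) / np.sqrt(n)          # generic mixing matrix
Sigma_X = np.diag(rng.uniform(0.1, 10.0, size=n))     # anisotropic cause
Sigma_Y = A @ Sigma_X @ A.T                           # effect covariance

rho_forward = trace_ratio(A, Sigma_X)                 # ~1 in the causal direction
rho_backward = trace_ratio(np.linalg.inv(A), Sigma_Y) # systematically far from 1
print(rho_forward, rho_backward)
```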

20230313 An Improved Sample Complexity for Rank-1 Matrix Sensing.
Summary: Matrix sensing is a problem in signal processing and machine learning that involves recovering a low-rank matrix from a set of linear measurements. The goal is to reconstruct the original matrix as accurately as possible, given only a set of linear measurements obtained by sensing the matrix [Jain, Netrapalli and Sanghavi, 2013]. In this work, we focus on a particular direction of matrix sensing called rank-$1$ matrix sensing [Zhong, Jain and Dhillon, 2015]. We present an improvement over the original algorithm in [Zhong, Jain and Dhillon, 2015], based on a novel analysis and sketching technique that enables faster convergence rates and better accuracy in recovering low-rank matrices. The algorithm focuses on developing a theoretical understanding of the matrix sensing problem and establishing its advantages over previous methods. The proposed sketching technique allows for efficiently extracting relevant information from the linear measurements, making the algorithm computationally efficient and scalable. Our novel matrix sensing algorithm improves the former result of [Zhong, Jain and Dhillon, 2015] in two senses: $\bullet$ We improve the sample complexity from $\widetilde{O}(\epsilon^{-2} dk^2)$ to $\widetilde{O}(\epsilon^{-2} (d+k^2))$. $\bullet$ We improve the running time from $\widetilde{O}(md^2 k^2)$ to $\widetilde{O}(m d^2 k)$. The proposed algorithm has theoretical guarantees and is analyzed to provide insights into the underlying structure of low-rank matrices and the nature of the linear measurements used in the recovery process. It advances the theoretical understanding of matrix sensing and provides a new approach for solving this important problem.

20230308 Streaming Kernel PCA Algorithm With Small Space.
Summary: Principal Component Analysis (PCA) is a widely used technique in machine learning, data analysis and signal processing. With the increase in the size and complexity of datasets, it has become important to develop low-space algorithms for PCA. Streaming PCA has gained significant attention in recent years, as it can handle large datasets efficiently. The kernel method, which is commonly used in learning algorithms such as Support Vector Machines (SVMs), has also been applied in PCA algorithms. We propose a streaming algorithm for Kernel PCA problems based on the traditional scheme by Oja. Our algorithm addresses the challenge of reducing the memory usage of PCA while maintaining its accuracy. We analyze the performance of our algorithm by studying the conditions under which it succeeds. Specifically, we show that, when the spectral ratio $R := \lambda_1/\lambda_2$ of the target covariance matrix is lower bounded by $C \cdot \log n\cdot \log d$, the streaming PCA problem can be solved with $O(d)$ space cost. Our proposed algorithm has several advantages over existing methods. First, it is a streaming algorithm that can handle large datasets efficiently. Second, it employs the kernel method, which allows it to capture complex nonlinear relationships among data points. Third, it has low space usage, making it suitable for applications where memory is limited.
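For reference, a minimal sketch of Oja's classical streaming PCA update, the $O(d)$-space scheme the proposed kernel algorithm builds on (the kernelization itself is not shown):

```python
import numpy as np

def oja_streaming_pca(stream, d, lr=0.01):
    """Oja's rule: maintain a single O(d)-space iterate estimating the top
    eigenvector of the covariance matrix of the stream."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for x in stream:
        w += lr * x * (x @ w)      # stochastic power-iteration step
        w /= np.linalg.norm(w)     # project back to the unit sphere
    return w

# Usage: data with one dominant direction v (large spectral ratio); Oja recovers +/- v.
rng = np.random.default_rng(1)
d = 50
v = np.eye(d)[0]
stream = (3.0 * rng.standard_normal() * v + 0.1 * rng.standard_normal(d)
          for _ in range(20000))
w = oja_streaming_pca(stream, d)
print(abs(w @ v))  # close to 1
```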

20230307 Complete Log Concavity of Coverage-Like Functions.
Summary: We introduce an expressive subclass of nonnegative almost submodular set functions, called strongly 2-coverage functions, which include coverage and (sums of) matroid rank functions, and prove that the homogenization of the generating polynomial of any such function is completely log-concave, taking a step towards characterizing the coefficients of (homogeneous) completely log-concave polynomials. As a consequence we obtain that the "level sets" of any such function form an ultra-log-concave sequence.

20230302 An Improved Classical Singular Value Transformation for Quantum Machine Learning.
Summary: We study quantum speedups in quantum machine learning (QML) by analyzing the quantum singular value transformation (QSVT) framework. QSVT, introduced by [GSLW, STOC'19, arXiv:1806.01838], unifies all major types of quantum speedup; in particular, a wide variety of QML proposals are applications of QSVT on low-rank classical data. We challenge these proposals by providing a classical algorithm that matches the performance of QSVT in this regime up to a small polynomial overhead. We show that, given a matrix $A \in \mathbb{C}^{m\times n}$, a vector $b \in \mathbb{C}^{n}$, a bounded degree-$d$ polynomial $p$, and linear-time preprocessing, we can output a description of a vector $v$ such that $\|v - p(A) b\| \leq \varepsilon\|b\|$ in $\widetilde{\mathcal{O}}(d^{11} \|A\|_{\mathrm{F}}^4 / (\varepsilon^2 \|A\|^4 ))$ time. This improves upon the best known classical algorithm [CGLLTW, STOC'20, arXiv:1910.06151], which requires $\widetilde{\mathcal{O}}(d^{22} \|A\|_{\mathrm{F}}^6 /(\varepsilon^6 \|A\|^6 ) )$ time, and narrows the gap with QSVT, which, after linear-time preprocessing to load input into a quantum-accessible memory, can estimate the magnitude of an entry of $p(A)b$ to $\varepsilon\|b\|$ error in $\widetilde{\mathcal{O}}(d\|A\|_{\mathrm{F}}/(\varepsilon \|A\|))$ time. Our key insight is to combine the Clenshaw recurrence, an iterative method for computing matrix polynomials, with sketching techniques to simulate QSVT classically. We introduce several new classical techniques in this work, including (a) a non-oblivious matrix sketch for approximately preserving bilinear forms, (b) a new stability analysis for the Clenshaw recurrence, and (c) a new technique to bound arithmetic progressions of the coefficients appearing in the Chebyshev series expansion of bounded functions, each of which may be of independent interest.
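For concreteness, a minimal sketch of the Clenshaw recurrence for applying a Chebyshev-series matrix polynomial to a vector; the paper's algorithm combines this recurrence with sketching, which is not shown here.

```python
import numpy as np

def clenshaw_matrix_poly(coeffs, A, b):
    """Compute sum_k coeffs[k] * T_k(A) @ b via the Clenshaw recurrence,
    where T_k are Chebyshev polynomials (assumes ||A|| <= 1)."""
    u_next = np.zeros_like(b)   # plays the role of u_{k+2}
    u = np.zeros_like(b)        # plays the role of u_{k+1}
    for c in reversed(coeffs[1:]):
        # u_k = c_k * b + 2 A u_{k+1} - u_{k+2}
        u, u_next = c * b + 2.0 * (A @ u) - u_next, u
    return coeffs[0] * b + A @ u - u_next

# Usage: with the coefficients of T_2 only, result equals T_2(A) b = (2A^2 - I) b.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag(rng.uniform(-1, 1, 6)) @ Q.T   # symmetric with ||A|| <= 1
b = rng.standard_normal(6)
out = clenshaw_matrix_poly([0.0, 0.0, 1.0], A, b)
print(np.linalg.norm(out - (2.0 * A @ A @ b - b)))  # ~0
```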

20230228 A CS guide to the quantum singular value transformation.
Summary: We present a simplified exposition of some pieces of [Gily\'en, Su, Low, and Wiebe, STOC'19, arXiv:1806.01838], which introduced a quantum singular value transformation (QSVT) framework for applying polynomial functions to block-encoded matrices. The QSVT framework has garnered substantial recent interest from the quantum algorithms community, as it was demonstrated by [GSLW19] to encapsulate many existing algorithms naturally phrased as an application of a matrix function. First, we posit that the lifting of quantum signal processing (QSP) to QSVT is better viewed not through Jordan's lemma (as was suggested by [GSLW19]) but as an application of the cosine-sine decomposition, which can be thought of as a more explicit and stronger version of Jordan's lemma. Second, we demonstrate that the constructions of bounded polynomial approximations given in [GSLW19], which use a variety of ad hoc approaches drawing from Fourier analysis, Chebyshev series, and Taylor series, can be unified under the framework of truncation of Chebyshev series, and indeed, can in large part be matched via a bounded variant of a standard metatheorem from [Trefethen, 2013]. We hope this work is of use to the community as a companion guide for understanding and applying the powerful framework of [GSLW19].

20230227 Query-optimal estimation of unitary channels in diamond distance.
Summary: We consider process tomography for unitary quantum channels. Given access to an unknown unitary channel acting on a $\textsf{d}$-dimensional qudit, we aim to output a classical description of a unitary that is $\varepsilon$-close to the unknown unitary in diamond norm. We design an algorithm achieving error $\varepsilon$ using $O(\textsf{d}^2/\varepsilon)$ applications of the unknown channel and only one qudit. This improves over prior results, which use $O(\textsf{d}^3/\varepsilon^2)$ [via standard process tomography] or $O(\textsf{d}^{2.5}/\varepsilon)$ [Yang, Renner, and Chiribella, PRL 2020] applications. To show this result, we introduce a simple technique to "bootstrap" an algorithm that can produce constant-error estimates to one that can produce $\varepsilon$-error estimates with Heisenberg scaling. Finally, we prove a complementary lower bound showing that estimation requires $\Omega(\textsf{d}^2/\varepsilon)$ applications, even with access to the inverse or controlled versions of the unknown unitary. This shows that our algorithm has both optimal query complexity and optimal space complexity.

20230224 Quantum trapdoor functions from classical one-way functions.
Summary: We formalize and study the notion of a quantum trapdoor function. This is an efficiently computable unitary that takes as input a "public" quantum state and a classical string $x$, and outputs a quantum state. This map is such that (i) it is hard to invert, in the sense that it is hard to recover $x$ given the output state (and many copies of the public state), and (ii) there is a classical trapdoor that allows efficient inversion. We show that a quantum trapdoor function can be constructed from any quantum-secure one-way function. A direct consequence of this result is that, assuming just the existence of quantum-secure one-way functions, there exists a public-key encryption scheme with a (pure) quantum public key.

20230221 $k$NN-Adapter: Efficient Domain Adaptation for Black-Box Language Models.
Summary: Fine-tuning a language model on a new domain is standard practice for domain adaptation. However, it can be infeasible when it comes to modern large-scale language models such as GPT-3, which can only be accessed through APIs, making it difficult to access the internal parameters of the model. In this paper, we propose $k$NN-Adapter, a method to effectively adapt these black-box large language models (LLMs) to a new domain. The $k$NN-Adapter builds on top of the retrieval-augmented language model, and adaptively learns to interpolate the output of the language model with retrieval results from a datastore consisting of the target domain data. Our experiments on four different domains demonstrate that $k$NN-Adapter significantly improves perplexity, and works particularly well in settings with limited access to LLMs. Additionally, we show that $k$NN-Adapter is more effective than fine-tuning when the amount of training data is limited. We also release a dataset to encourage further study.
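A minimal sketch of the interpolation at the heart of such retrieval-augmented adaptation ($k$NN-LM style, with a fixed weight $\lambda$); the adaptive learning of the interpolation weights, which is $k$NN-Adapter's actual contribution, is not shown:

```python
import numpy as np

def knn_lm_next_token(p_lm, keys, next_tokens, context_vec, k=8, lam=0.5, temp=1.0):
    """Interpolate the LM's next-token distribution with a kNN distribution
    built from the k nearest stored (context, next-token) pairs."""
    dists = np.linalg.norm(keys - context_vec, axis=1)
    nn = np.argsort(dists)[:k]
    w = np.exp(-dists[nn] / temp)   # softmax over negative neighbor distances
    w /= w.sum()
    p_knn = np.zeros_like(p_lm)
    for weight, tok in zip(w, next_tokens[nn]):
        p_knn[tok] += weight
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy usage: vocabulary of 5 tokens, datastore of 100 (context, next-token) pairs.
rng = np.random.default_rng(0)
vocab, dim = 5, 16
keys = rng.standard_normal((100, dim))
next_tokens = rng.integers(0, vocab, 100)
p_lm = np.full(vocab, 1.0 / vocab)
p = knn_lm_next_token(p_lm, keys, next_tokens, rng.standard_normal(dim))
print(p, p.sum())  # a valid probability distribution
```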

20230220 Private (Stochastic) Non-Convex Optimization Revisited: Second-Order Stationary Points and Excess Risks.
Summary: We consider the problem of minimizing a non-convex objective while preserving the privacy of the examples in the training data. Building upon the previous variance-reduced algorithm SpiderBoost, we introduce a new framework that utilizes two different kinds of gradient oracles. The first kind of oracle can estimate the gradient at one point, and the second kind, less precise and more cost-effective, can estimate the gradient difference between two points. SpiderBoost uses the first kind periodically, once every few steps, while our framework proposes using the first oracle whenever the total drift has become large, relying on the second oracle otherwise. This new framework ensures the gradient estimations remain accurate all the time, resulting in improved rates for finding second-order stationary points. Moreover, we address the more challenging task of finding the global minima of a non-convex objective using the exponential mechanism. Our findings indicate that the regularized exponential mechanism can closely match previous empirical and population risk bounds, without requiring smoothness assumptions for algorithms with polynomial running time. Furthermore, by disregarding running time considerations, we show that the exponential mechanism can achieve a good population risk bound and provide a nearly matching lower bound.

20230213 Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal Sampler.
Summary: The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time for our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof of concept of its utility as a tool for designing samplers, and outline directions for future exploration.

20230121 From Pseudorandomness to Multi-Group Fairness and Back.
Summary: We identify and explore connections between the recent literature on multi-group fairness for prediction algorithms and the pseudorandomness notions of leakage-resilience and graph regularity. We frame our investigation using new, statistical-distance-based variants of multicalibration that are closely related to the concept of outcome indistinguishability. Adopting this perspective leads us naturally not only to our graph-theoretic results, but also to new, more efficient algorithms for multicalibration in certain parameter regimes and a novel proof of a hardcore lemma for real-valued functions.

20230113 Cumulative Memory Lower Bounds for Randomized and Quantum Computation.
Summary: Cumulative memory, the sum of space used per step over the duration of a computation, is a fine-grained measure of time-space complexity that was introduced to analyze cryptographic applications like password hashing. It is a more accurate cost measure for algorithms that have infrequent spikes in memory usage and are run in environments such as cloud computing that allow dynamic allocation and deallocation of resources during execution, or when many instances of an algorithm are interleaved in parallel. We prove the first lower bounds on cumulative memory complexity for both sequential classical computation and quantum circuits. Moreover, we develop general paradigms for bounding cumulative memory complexity inspired by the standard paradigms for proving time-space tradeoff lower bounds, which can only lower bound the maximum space used during an execution. The resulting lower bounds on cumulative memory that we obtain are just as strong as the best time-space tradeoff lower bounds, which are very often known to be tight. Although previous results for pebbling and random oracle models have yielded time-space tradeoff lower bounds larger than the cumulative memory complexity, our results show that in general computational models such separations cannot follow from known lower bound techniques and are not true for many functions. Among many possible applications of our general methods, we show that any classical sorting algorithm with success probability at least $1/\text{poly}(n)$ requires cumulative memory $\tilde \Omega(n^2)$, any classical matrix multiplication algorithm requires cumulative memory $\Omega(n^6/T)$, any quantum sorting circuit requires cumulative memory $\Omega(n^3/T)$, and any quantum circuit that finds $k$ disjoint collisions in a random function requires cumulative memory $\Omega(k^3n/T^2)$.

20230110 Elastic Cash.
Summary: Elastic Cash is a new decentralized mechanism for regulating the money supply. The mechanism operates by modifying the supply so that an interest rate determined by a public market is kept approximately fixed. It can be incorporated into the conventional monetary system to improve the elasticity of the US Dollar, and it can be used to design new elastic cryptocurrencies that remain decentralized.

20230101 ReSQueing Parallel and Private Stochastic Convex Optimization.
Summary: We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For an SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $\epsilon_{\text{opt}}$ with $d^{1/3}\epsilon_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}\epsilon_{\text{opt}}^{-2/3} + \epsilon_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $\epsilon_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, BFGT20, AFKT21, KLL21] established that if $n \gtrsim d \epsilon_{\text{dp}}^{-2}$, $(\epsilon_{\text{dp}}, \delta)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gtrsim d^2 \epsilon_{\text{dp}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime.
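A minimal sketch of the reweighting idea behind a ReSQue-style estimator, as we read it: the gradient of the Gaussian convolution $\hat f = f * \gamma_\sigma$ at a point $x$ near a shared center $\bar{x}$ can be estimated by sampling around $\bar{x}$ and importance-reweighting by the Gaussian density ratio. The names and structure below are illustrative, not the paper's exact estimator.

```python
import numpy as np

def resque_gradient(grad_f, x, x_bar, sigma, n_samples=1000, seed=0):
    """Unbiased estimate of grad (f * N(0, sigma^2 I))(x), using samples drawn
    around a shared center x_bar and reweighted by the Gaussian density ratio."""
    rng = np.random.default_rng(seed)
    d = x.size
    y = x_bar + sigma * rng.standard_normal((n_samples, d))
    # Density ratio N(y; x, sigma^2 I) / N(y; x_bar, sigma^2 I).
    log_ratio = (np.sum((y - x_bar) ** 2, axis=1)
                 - np.sum((y - x) ** 2, axis=1)) / (2.0 * sigma ** 2)
    w = np.exp(log_ratio)
    return (w[:, None] * np.apply_along_axis(grad_f, 1, y)).mean(axis=0)

# Usage: f(y) = ||y||^2 / 2 has grad f(y) = y, and the smoothed gradient is also x.
d = 3
x = np.ones(d)
g = resque_gradient(lambda y: y, x, x_bar=np.full(d, 1.1), sigma=0.5)
print(g)  # approximately x = [1, 1, 1]
```

The point of sharing the center $\bar{x}$ is that many queries at nearby points can reuse the same samples, which is what enables the parallel and private speedups described above.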

20221214 Learning threshold neurons via the "edge of stability".
Summary: Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a nonzero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.

20221213 A (Slightly) Improved Deterministic Approximation Algorithm for Metric TSP.
Summary: We show that the max entropy algorithm can be derandomized (with respect to a particular objective function) to give a deterministic $3/2-\epsilon$ approximation algorithm for metric TSP for some $\epsilon > 10^{-36}$. To obtain our result, we apply the method of conditional expectation to an objective function constructed in prior work which was used to certify that the expected cost of the algorithm is at most $3/2-\epsilon$ times the cost of an optimal solution to the subtour elimination LP. The proof in this work involves showing that the expected value of this objective function can be computed in polynomial time (at all stages of the algorithm's execution).

20221203 Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping.
Summary: Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175-billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $\epsilon=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
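A minimal numpy sketch of per-layer clipping in a DP-SGD-style step: each example's gradient is clipped separately within each layer, then summed and noised per layer. The thresholds and noise calibration here are illustrative, not the paper's tuned or adaptive values.

```python
import numpy as np

def per_layer_clip_and_noise(per_example_grads, thresholds, noise_multiplier, rng):
    """per_example_grads: dict layer -> array of shape (batch, *param_shape).
    Returns a dict of averaged, per-layer clipped and noised gradients."""
    noisy = {}
    for layer, g in per_example_grads.items():
        C = thresholds[layer]
        flat = g.reshape(g.shape[0], -1)
        norms = np.linalg.norm(flat, axis=1, keepdims=True)
        clipped = flat * np.minimum(1.0, C / np.maximum(norms, 1e-12))
        total = clipped.sum(axis=0)
        total += noise_multiplier * C * rng.standard_normal(total.shape)
        noisy[layer] = (total / g.shape[0]).reshape(g.shape[1:])
    return noisy

# Usage: two "layers", a batch of 4 per-example gradients each.
rng = np.random.default_rng(0)
grads = {"embed": rng.standard_normal((4, 8, 3)), "head": rng.standard_normal((4, 5))}
step = per_layer_clip_and_noise(grads, {"embed": 1.0, "head": 0.5}, 1.1, rng)
print({k: v.shape for k, v in step.items()})
```

Because each layer's clip only needs that layer's gradient, it can run as soon as backpropagation reaches the layer, which is the source of the speed and memory savings described above.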

20221130 On Disperser/Lifting Properties of the Index and Inner-Product Functions.
Summary: Query-to-communication lifting theorems, which connect the query complexity of a Boolean function to the communication complexity of an associated `lifted' function obtained by composing the function with many copies of another function known as a gadget, have been instrumental in resolving many open questions in computational complexity. Several important complexity questions could be resolved if we could make substantial improvements in the input size required for lifting with the Index function, from its current near-linear size down to polylogarithmic in the number of inputs $N$ of the original function or, ideally, constant. The near-linear size bound was shown by Lovett, Meka, Mertz, Pitassi and Zhang using a recent breakthrough improvement on the Sunflower Lemma to show that a certain graph associated with an Index function of near-linear size is a disperser. They also stated a conjecture about the Index function that is essential for further improvements in the size required for lifting with Index using current techniques. In this paper we prove the following: 1) The conjecture of Lovett et al. is false when the size of the Index gadget is $\log N - \omega(1)$. 2) Also, the Inner-Product function, which satisfies the disperser property at size $O(\log N)$, does not have this property when its size is $\log N - \omega(1)$. 3) Nonetheless, using Index gadgets of size at least 4, we prove a lifting theorem for a restricted class of communication protocols in which one of the players is limited to sending parities of its inputs. 4) Using the ideas from this lifting theorem, we derive a strong lifting theorem from decision tree size to parity decision tree size. We use this to derive a general lifting theorem in proof complexity from tree-resolution size to tree-like $\mathrm{Res}(\oplus)$ refutation size, which yields many new exponential lower bounds on such proofs.

20221127 Dynamic Kernel Sparsifiers.
Summary: A geometric graph associated with a set of points $P= \{x_1, x_2, \cdots, x_n \} \subset \mathbb{R}^d$ and a fixed kernel function $\mathsf{K}:\mathbb{R}^d\times \mathbb{R}^d\to\mathbb{R}_{\geq 0}$ is a complete graph on $P$ such that the weight of edge $(x_i, x_j)$ is $\mathsf{K}(x_i, x_j)$. We present a fully dynamic data structure that maintains a spectral sparsifier of a geometric graph under updates that change the locations of points in $P$ one at a time. The update time of our data structure is $n^{o(1)}$ with high probability, and the initialization time is $n^{1+o(1)}$. Under certain assumptions, we can provide a fully dynamic spectral sparsifier that is robust to an adaptive adversary. We further show that, for the Laplacian matrices of these geometric graphs, it is possible to maintain random sketches for the results of matrix-vector multiplication and inverse-matrix-vector multiplication in $n^{o(1)}$ time, under updates that change the locations of points in $P$ or change the query vector by a sparse difference.

20221121 Upper and Lower Bounds on the Smoothed Complexity of the Simplex Method.
Summary: The simplex method for linear programming is known to be highly efficient in practice, and understanding its performance from a theoretical perspective is an active research topic. The framework of smoothed analysis, first introduced by Spielman and Teng (JACM '04) for this purpose, defines the smoothed complexity of solving a linear program with $d$ variables and $n$ constraints as the expected running time when Gaussian noise of variance $\sigma^2$ is added to the LP data. We prove that the smoothed complexity of the simplex method is $O(\sigma^{-3/2} d^{13/4}\log^{7/4} n)$, improving the dependence on $1/\sigma$ compared to the previous bound of $O(\sigma^{-2} d^2\sqrt{\log n})$. We accomplish this through a new analysis of the \emph{shadow bound}, key to earlier analyses as well. Illustrating the power of our new method, we use it to prove a nearly tight upper bound on the smoothed complexity of two-dimensional polygons. We also establish the first nontrivial lower bound on the smoothed complexity of the simplex method, proving that the \emph{shadow vertex simplex method} requires at least $\Omega \Big(\min \big(\sigma^{-1/2} d^{1/2}\log^{1/4} d,2^d \big) \Big)$ pivot steps with high probability. A key part of our analysis is a new variation on the extended formulation for the regular $2^k$-gon. We end with a numerical experiment that suggests this analysis could be further improved.

20221117 How to Fine-Tune Vision Models with SGD.
Summary: SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living17, Waterbirds, and DomainNet.
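A minimal PyTorch-style sketch of the fix described above; matching parameters by the substring "embed" is an assumption about the model's parameter naming, not a universal API.

```python
import torch

def freeze_embeddings_and_build_sgd(model, lr=1e-3, momentum=0.9):
    """Freeze first-layer ("embedding") parameters, then fine-tune the rest with SGD.

    Name-based matching ("embed" in the parameter name) is a heuristic for this
    sketch; real models may need the exact module picked out by hand.
    """
    for name, param in model.named_parameters():
        if "embed" in name:
            param.requires_grad = False  # <1% of parameters in ViT/ConvNeXt-style models
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=momentum)
```

Since the frozen embedding needs no optimizer state and SGD keeps at most one momentum buffer per parameter, this recovers the memory savings over AdamW cited above.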

20221109 A 4/3-Approximation Algorithm for Half-Integral Cycle Cut Instances of the TSP.
Summary: A longstanding conjecture for the traveling salesman problem (TSP) states that the integrality gap of the standard linear programming relaxation of the TSP is at most 4/3. Despite significant efforts, the conjecture remains open. We consider the half-integral case, in which the LP has solution values in $\{0, 1/2, 1\}$. Such instances have been conjectured to be the most difficult instances for the overall four-thirds conjecture. Karlin, Klein, and Oveis Gharan, in a breakthrough result, were able to show that in the half-integral case, the integrality gap is at most 1.49993. This result led to the first significant progress on the overall conjecture in decades; the same authors showed the integrality gap is at most $1.5 - 10^{-36}$ in the non-half-integral case. For the half-integral case, the current best-known ratio is 1.4983, a result by Gupta et al. With the improvements on the 3/2 bound remaining very incremental even in the half-integral case, we turn the question around and look for a large class of half-integral instances for which we can prove that the 4/3 conjecture is correct. The previous works on the half-integral case perform induction on a hierarchy of critical tight sets in the support graph of the LP solution, in which some of the sets correspond to "cycle cuts" and the others to "degree cuts". We show that if all the sets in the hierarchy correspond to cycle cuts, then we can find a distribution of tours whose expected cost is at most 4/3 times the value of the half-integral LP solution; sampling from the distribution gives us a randomized 4/3-approximation algorithm. We note that the known bad cases for the integrality gap have a gap of 4/3 and have a half-integral LP solution in which all the critical tight sets in the hierarchy are cycle cuts; thus our result is tight.

20221107 From approximate to exact integer programming.
Summary: Approximate integer programming is the following: For a convex body $K \subseteq \mathbb{R}^n$, either determine whether $K \cap \mathbb{Z}^n$ is empty, or find an integer point in the convex body scaled by $2$ from its center of gravity $c$. Approximate integer programming can be solved in time $2^{O(n)}$, while the fastest known methods for exact integer programming run in time $2^{O(n)} \cdot n^n$. So far, there are no efficient methods for integer programming known that are based on approximate integer programming. Our main contributions are two such methods, each yielding novel complexity results. First, we show that an integer point $x^* \in (K \cap \mathbb{Z}^n)$ can be found in time $2^{O(n)}$, provided that the remainders $x_i^* \bmod \ell$ of each component of $x^*$ are given, for some arbitrarily fixed $\ell \geq 5(n+1)$. The algorithm is based on a cutting-plane technique, iteratively halving the volume of the feasible set. The cutting planes are determined via approximate integer programming. Enumeration of the possible remainders gives a $2^{O(n)}n^n$ algorithm for general integer programming, which matches the current best bound of an algorithm by Dadush (2012) that is considerably more involved. Our algorithm also relies on a new asymmetric approximate Carath\'eodory theorem that might be of interest on its own. Our second method concerns integer programming problems in equation standard form $Ax = b, 0 \leq x \leq u, \, x \in \mathbb{Z}^n$. Such a problem can be reduced to the solution of $\prod_i O(\log u_i +1)$ approximate integer programming problems. This implies, for example, that knapsack or subset-sum problems with polynomial variable range $0 \leq x_i \leq p(n)$ can be solved in time $(\log n)^{O(n)}$. For these problems, the best running time so far was $n^n \cdot 2^{O(n)}$.

20221103 Best-of-Both-Worlds Multiparty Quantum Computation with Publicly Verifiable Identifiable Abort.
Summary: Alon et al. (CRYPTO 2021) introduced a multiparty quantum computation protocol that is secure with identifiable abort (MPQC-SWIA). However, their protocol allows only parties inside the MPQC to learn the identity of malicious players. This becomes problematic when two groups of people disagree and need a third party, like a jury, to verify who the malicious party is. This issue takes on heightened significance in the quantum setting, given that quantum states may exist in only a single copy. Thus, we emphasize the necessity of a protocol with publicly verifiable identifiable abort (PVIA), enabling outside observers with only classical computational power to agree on the identity of the malicious party in case of an abort. However, achieving MPQC with PVIA poses significant challenges due to the no-cloning theorem, and previous works proposed by Mahadev (STOC 2018) and Chung et al. (Eurocrypt 2022) for classical verification of quantum computation fall short. In this paper, we obtain the first MPQC-PVIA protocol assuming post-quantum oblivious transfer and a classical broadcast channel. The core component of our construction is a new authentication primitive called auditable quantum authentication (AQA) that identifies the malicious sender with overwhelming probability. Additionally, we provide the first MPQC protocol with best-of-both-worlds (BoBW) security, which guarantees output delivery with an honest majority and remains secure with abort even if the majority is dishonest. Our best-of-both-worlds MPQC protocol also satisfies PVIA upon abort.

20221029 The Vector Balancing Constant for Zonotopes.
Summary: The vector balancing constant $\mathrm{vb}(K,Q)$ of two symmetric convex bodies $K,Q$ is the minimum $r \geq 0$ so that any number of vectors from $K$ can be balanced into an $r$-scaling of $Q$. A question raised by Schechtman is whether for any zonotope $K \subseteq \mathbb{R}^d$ one has $\mathrm{vb}(K,K) \lesssim \sqrt{d}$. Intuitively, this asks whether a natural geometric generalization of Spencer's Theorem (for which $K = B^d_\infty$) holds. We prove that for any zonotope $K \subseteq \mathbb{R}^d$ one has $\mathrm{vb}(K,K) \lesssim \sqrt{d} \log \log \log d$. Our main technical contribution is a tight lower bound on the Gaussian measure of any section of a normalized zonotope, generalizing Vaaler's Theorem for cubes. We also prove that for two different normalized zonotopes $K$ and $Q$ one has $\mathrm{vb}(K,Q) \lesssim \sqrt{d \log d}$. All the bounds are constructive and the corresponding colorings can be computed in polynomial time.

20221027 Learning Single-Index Models with Shallow Neural Networks.
Summary: Single-index models are a class of functions given by an unknown univariate ``link'' function applied to an unknown one-dimensional projection of the input. These models are particularly relevant in high dimension, when the data might present low-dimensional structure that learning algorithms should adapt to. While several statistical aspects of this model, such as the sample complexity of recovering the relevant (one-dimensional) subspace, are well-understood, they rely on tailored algorithms that exploit the specific structure of the target function. In this work, we introduce a natural class of shallow neural networks and study its ability to learn single-index models via gradient flow. More precisely, we consider shallow networks in which the biases of the neurons are frozen at random initialization. We show that the corresponding optimization landscape is benign, which in turn leads to generalization guarantees that match the near-optimal sample complexity of dedicated semiparametric methods.

20221027 A distribution testing oracle separation between QMA and QCMA.
Summary: It is a longstanding open question in quantum complexity theory whether the definition of $\textit{nondeterministic}$ quantum computation requires quantum witnesses $(\textsf{QMA})$ or if classical witnesses suffice $(\textsf{QCMA})$. We make progress on this question by constructing a randomized classical oracle separating the respective computational complexity classes. Previous separations [Aaronson-Kuperberg (CCC'07), Fefferman-Kimmel (MFCS'18)] required a quantum unitary oracle. The separating problem is deciding whether a distribution supported on regular undirected graphs either consists of multiple connected components (yes instances) or consists of one expanding connected component (no instances), where the graph is given in an adjacency-list format by the oracle. Therefore, the oracle is a distribution over $n$-bit Boolean functions.

20221022 Discrepancy Minimization in Input-Sparsity Time.
Summary: A recent work of Larsen [Lar23] gave a faster combinatorial alternative to Bansal's SDP algorithm for finding a coloring $x\in\{-1,1\}^n$ that approximately minimizes the discrepancy $\mathrm{disc}(A,x) := \|Ax\|_{\infty}$ of a general real-valued $m\times n$ matrix $A$. Larsen's algorithm runs in $\widetilde{O}(mn^2)$ time compared to Bansal's $\widetilde{O}(mn^{4.5})$-time algorithm, at the price of a slightly weaker logarithmic approximation ratio in terms of the hereditary discrepancy of $A$ [Ban10]. In this work we present a combinatorial $\widetilde{O}(\mathrm{nnz}(A) + n^3)$ time algorithm with the same approximation guarantee as Larsen, which is optimal for tall matrices $m=\mathrm{poly}(n)$. Using a more intricate analysis and fast matrix multiplication, we achieve $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$ time, which breaks cubic runtime for square matrices, and bypasses the barrier of linear-programming approaches [ES14] for which input-sparsity time is currently out of reach. Our algorithm relies on two main ideas: (i) a new sketching technique for finding a projection matrix with a short $\ell_2$-basis using implicit leverage-score sampling; (ii) a data structure for faster implementation of the iterative Edge-Walk partial-coloring algorithm of Lovett-Meka, using an alternative analysis that enables ``lazy'' batch updates with low-rank corrections. Our result nearly closes the computational gap between real-valued and binary matrices (set systems), for which input-sparsity time coloring was very recently obtained [JSS23].
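
To make the objective concrete, here is a small sketch of the quantity being minimized and the naive random-coloring baseline; the paper's actual algorithm (leverage-score sketching plus a lazy Edge-Walk) is far more involved and is not reproduced here.

```python
# Illustrative sketch of the objective only: disc(A, x) = ||A x||_inf, with a
# random-coloring baseline. The paper's algorithm (sketching + lazy Edge-Walk)
# is substantially more sophisticated than this.
import numpy as np

rng = np.random.default_rng(1)
m, n = 400, 100
A = rng.uniform(-1, 1, size=(m, n))          # general real-valued matrix

def disc(A, x):
    return np.abs(A @ x).max()

# Random coloring attains O(sqrt(n log m)) for entries bounded by 1, which is
# typically far weaker than hereditary-discrepancy-based guarantees.
best = min((rng.choice([-1.0, 1.0], n) for _ in range(50)),
           key=lambda x: disc(A, x))
print("random-coloring discrepancy:", disc(A, best))
```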

20221021 Augmentation with Projection: Towards an Effective and Efficient Data Augmentation Paradigm for Distillation.
Summary: Knowledge distillation is one of the primary methods of transferring knowledge from large to small models. However, it requires massive task-specific data, which may not be available in many real-world applications. Data augmentation methods such as representation interpolation, token replacement, or augmentation with models are applied to tackle this problem. However, these data augmentation methods either potentially cause shifts in decision boundaries (representation interpolation), are not expressive enough (token replacement), or introduce too much computational overhead (augmentation with models). To this end, we propose AugPro (Augmentation with Projection), an effective and efficient data augmentation method for distillation. Our method builds on top of representation interpolation augmentation methods to maintain the diversity of expressions and converts the augmented data to tokens to avoid shifting decision boundaries. It uses simple operations that come with little computational overhead. The results on multiple GLUE tasks show that our method can improve distillation performance by a large margin at a low time cost. Code is available at https://github.com/google-research/google-research/tree/master/augpro.
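
A hedged sketch of the two steps the summary describes, with a toy embedding table (the mixing law and distance choice are illustrative assumptions, not the paper's exact recipe): interpolate the token embeddings of two examples, then project each interpolated vector back to the nearest vocabulary token so the augmented example is again discrete.

```python
# Sketch of interpolation + projection: (i) mix token embeddings of two
# examples, (ii) snap each mixed vector to its nearest token in the vocabulary.
# Details here are assumptions for illustration, not the paper's recipe.
import numpy as np

rng = np.random.default_rng(2)
vocab_size, dim, seq_len = 1000, 64, 16
E = rng.standard_normal((vocab_size, dim))   # toy embedding table

def augment_with_projection(ids_a, ids_b, lam=0.7):
    mixed = lam * E[ids_a] + (1 - lam) * E[ids_b]        # (seq_len, dim)
    # Nearest-neighbor projection back onto the embedding table.
    d2 = ((mixed[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                             # token ids again

a = rng.integers(0, vocab_size, seq_len)
b = rng.integers(0, vocab_size, seq_len)
print(augment_with_projection(a, b)[:8])
```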

20221015 A Nearly Optimal Size Coreset Algorithm with Nearly Linear Time.
Summary: A coreset is a point set containing information about the geometric properties of a larger point set. A series of previous works shows that in many machine learning problems, especially clustering problems, coresets can be very useful for building efficient algorithms. The two main measures of a coreset construction algorithm's performance are the running time of the algorithm and the size of the coreset it outputs. In this paper we study the construction of coresets for the $(k,z)$-clustering problem, which is a generalization of the $k$-means and $k$-median problems. By properly designing a sketching-based distance-estimation data structure, we propose faster algorithms that construct coresets whose size matches the state-of-the-art results.
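
The following is a generic sensitivity-sampling sketch for the $z=2$ ($k$-means) case, illustrating what a coreset construction outputs: points sampled proportionally to a crude overestimate of their cost contribution and reweighted by inverse probability. The paper's contribution, a sketching-based distance-estimation structure that accelerates this pipeline, is not implemented here.

```python
# Generic sensitivity-sampling sketch for a k-means (z = 2) coreset. The
# paper's speedup comes from a sketching-based distance-estimation data
# structure, which is NOT implemented in this toy version.
import numpy as np

rng = np.random.default_rng(3)
n, d, k, coreset_size = 5000, 10, 5, 200
X = rng.standard_normal((n, d)) + rng.integers(0, 5, (n, 1))

centers = X[rng.choice(n, k, replace=False)]            # crude center set
cost = ((X[:, None, :] - centers[None]) ** 2).sum(-1).min(1)  # dist^2 to nearest

# Crude sensitivity overestimate: cost share plus a uniform term.
s = cost / cost.sum() + 1.0 / n
p = s / s.sum()
idx = rng.choice(n, coreset_size, replace=True, p=p)
weights = 1.0 / (coreset_size * p[idx])                 # unbiased reweighting

print("coreset shape:", X[idx].shape, "total weight ~ n:", weights.sum().round())
```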

20221013 Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators.
Summary: We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo for sampling from distributions of the form $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions of the form $e^{-\alpha^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly used integrators is independent of $\left\Vert \alpha\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $\epsilon$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which show that RHMC with IMM can efficiently sample ill-conditioned, non-smooth and constrained distributions in very high dimension in practice.
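
As an illustration of the integrator, here is one implicit-midpoint step solved by fixed-point iteration on a simple Euclidean Hamiltonian; the Riemannian setting (a metric induced by a barrier of the polytope) is omitted, so this is only a sketch of the numerical scheme.

```python
# One implicit-midpoint (IMM) step solved by fixed-point iteration, shown on
# the simple Euclidean Hamiltonian H(x, p) = f(x) + ||p||^2 / 2. The paper's
# Riemannian setting (barrier-induced metric) is omitted for brevity.
import numpy as np

def implicit_midpoint_step(x, p, grad_f, h, iters=50):
    z = np.concatenate([x, p])
    def F(z):                          # Hamiltonian vector field (dx/dt, dp/dt)
        xh, ph = np.split(z, 2)
        return np.concatenate([ph, -grad_f(xh)])
    z_new = z + h * F(z)               # explicit Euler as the initial guess
    for _ in range(iters):             # solve z' = z + h * F((z + z') / 2)
        z_new = z + h * F((z + z_new) / 2.0)
    return np.split(z_new, 2)

grad_f = lambda x: x                   # f(x) = ||x||^2 / 2 (Gaussian target)
x, p = np.ones(3), np.zeros(3)
H0 = x @ x / 2 + p @ p / 2
for _ in range(100):
    x, p = implicit_midpoint_step(x, p, grad_f, h=0.1)
# IMM preserves quadratic invariants, so the energy drift is near zero.
print("energy drift:", abs(x @ x / 2 + p @ p / 2 - H0))
```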

20221012 Quantum Depth in the Random Oracle Model.
Summary: We give a comprehensive characterization of the computational power of shallow quantum circuits combined with classical computation. Specifically, for classes of search problems, we show that the following statements hold, relative to a random oracle: (a) $\mathsf{BPP}^{\mathsf{QNC}^{\mathsf{BPP}}} \neq \mathsf{BQP}$. This refutes Jozsa's conjecture [QIP 05] in the random oracle model. As a result, this gives the first instantiatable separation between the classes by replacing the oracle with a cryptographic hash function, yielding a resolution to one of Aaronson's ten semi-grand challenges in quantum computing. (b) $\mathsf{BPP}^{\mathsf{QNC}} \nsubseteq \mathsf{QNC}^{\mathsf{BPP}}$ and $\mathsf{QNC}^{\mathsf{BPP}} \nsubseteq \mathsf{BPP}^{\mathsf{QNC}}$. This shows that there is a subtle interplay between classical computation and shallow quantum computation. In fact, for the second separation, we establish that, for some problems, the ability to perform adaptive measurements in a single shallow quantum circuit is more useful than the ability to perform polynomially many shallow quantum circuits without adaptive measurements. (c) There exists a 2-message proof of quantum depth protocol. Such a protocol allows a classical verifier to efficiently certify that a prover must be performing a computation of some minimum quantum depth. Our proof of quantum depth can be instantiated using the recent proof of quantumness construction by Yamakawa and Zhandry [STOC 22].

20221006 NLTS Hamiltonians from classical LTCs.
Summary: We provide a completely self-contained construction of a family of NLTS Hamiltonians [Freedman and Hastings, 2014] based on ideas from [Anshu, Breuckmann, and Nirkhe, 2022], [Cross, He, Natarajan, Szegedy, and Zhu, 2022] and [Eldar and Harrow, 2017]. Crucially, it does not require optimal-parameter quantum LDPC codes and can be built from simple classical LTCs such as the repetition code on an expander graph. Furthermore, it removes the constant-rate requirement from the construction of Anshu, Breuckmann, and Nirkhe.

20220921 Chaining, Group Leverage Score Overestimates, and Fast Spectral Hypergraph Sparsification.
Summary: We present an algorithm that, given any $n$-vertex, $m$-edge, rank-$r$ hypergraph, constructs a spectral sparsifier with $O(n \varepsilon^{-2} \log n \log r)$ hyperedges in nearly-linear $\widetilde{O}(mr)$ time. This improves in both size and efficiency over a line of work (Bansal-Svensson-Trevisan 2019, Kapralov-Krauthgamer-Tardos-Yoshida 2021) for which the previous best size was $O(\min\{n \varepsilon^{-4} \log^3 n, nr^3 \varepsilon^{-2} \log n\})$ and runtime was $\widetilde{O}(mr + n^{O(1)})$. Independent result: in an independent work, Lee (Lee 2022) also shows how to compute a spectral hypergraph sparsifier with $O(n \varepsilon^{-2} \log n \log r)$ hyperedges.

20220915 Multicalibrated Regression for Downstream Fairness.
Summary: We show how to take a regression function $\hat{f}$ that is appropriately ``multicalibrated'' and efficiently post-process it into an approximately error-minimizing classifier satisfying a large variety of fairness constraints. The post-processing requires no labeled data, and only a modest amount of unlabeled data and computation. The computational and sample complexity requirements of computing $\hat f$ are comparable to the requirements for solving a single fair learning task optimally, but it can in fact be used to solve many different downstream fairness-constrained learning problems efficiently. Our post-processing method easily handles intersecting groups, generalizing prior work on post-processing regression functions to satisfy fairness constraints that applied only to disjoint groups. Our work extends recent work showing that multicalibrated regression functions are ``omnipredictors'' (i.e., can be post-processed to optimally solve unconstrained ERM problems) to constrained optimization.
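
For flavor only, a toy example of label-free post-processing: choosing per-group thresholds on a score function from unlabeled data so that positive rates match across two disjoint groups. The paper's method handles far more general constraints and intersecting groups via multicalibration; none of that machinery appears in this sketch.

```python
# Toy illustration of label-free post-processing: per-group thresholds picked
# from unlabeled data so that positive rates match (a statistical-parity-style
# constraint). This only conveys the flavor; the paper's multicalibration-based
# method handles general constraints and *intersecting* groups.
import numpy as np

rng = np.random.default_rng(5)
n = 10000
group = rng.integers(0, 2, n)                 # two disjoint groups for the toy
f_hat = np.clip(rng.beta(2, 5, n) + 0.1 * group, 0, 1)   # pretend regressor

target_rate = 0.3                             # desired positive rate per group
thresholds = {g: np.quantile(f_hat[group == g], 1 - target_rate) for g in (0, 1)}
y_pred = np.array([f_hat[i] >= thresholds[group[i]] for i in range(n)])

for g in (0, 1):
    print(f"group {g}: positive rate = {y_pred[group == g].mean():.3f}")
```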

20220909 Spectral hypergraph sparsification via chaining.
Summary: In a hypergraph on $n$ vertices where $D$ is the maximum size of a hyperedge, there is a weighted hypergraph spectral $\varepsilon$-sparsifier with at most $O(\varepsilon^{-2} \log(D) \cdot n \log n)$ hyperedges. This improves over the bound of Kapralov, Krauthgamer, Tardos and Yoshida (2021), who achieve $O(\varepsilon^{-4} n (\log n)^3)$, as well as the bound $O(\varepsilon^{-2} D^3 n \log n)$ obtained by Bansal, Svensson, and Trevisan (2019). The same sparsification result was obtained independently by Jambulapati, Liu, and Sidford (2022).

20220829 Online Bidding Algorithms for Return-on-Spend Constrained Advertisers.
Summary: Online advertising has recently grown into a highly competitive and complex multibillion-dollar industry, with advertisers bidding for ad slots at large scales and high frequencies. This has resulted in a growing need for efficient "autobidding" algorithms that determine the bids for incoming queries to maximize advertisers' targets subject to their specified constraints. This work explores efficient online algorithms for a single value-maximizing advertiser under an increasingly popular constraint: Return-on-Spend (RoS). We quantify efficiency in terms of regret relative to the optimal algorithm, which knows all queries a priori. We contribute a simple online algorithm that achieves near-optimal regret in expectation while always respecting the specified RoS constraint when the input sequence of queries consists of i.i.d. samples from some distribution. We also integrate our results with the previous work of Balseiro, Lu, and Mirrokni [BLM20] to achieve near-optimal regret while respecting both RoS and fixed budget constraints. Our algorithm follows the primal-dual framework and uses online mirror descent (OMD) for the dual updates. However, we need to use a non-canonical setup of OMD, and therefore the classic low-regret guarantee of OMD, which is for the adversarial setting in online learning, no longer holds. Nonetheless, in our case and more generally where low-regret dynamics are applied in algorithm design, the gradients encountered by OMD can be far from adversarial and are instead influenced by our algorithmic choices. We exploit this key insight to show that our OMD setup achieves low regret in the realm of our algorithm.
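
A heavily hedged sketch of the primal-dual pattern described above, with an assumed second-price auction, bidding rule, step size, and truncation (none taken from the paper): the dual variable tracks the running RoS violation and is updated by exponentiated gradient, an instance of OMD with the entropy regularizer.

```python
# Hedged sketch of a primal-dual RoS bidder: the dual lam is updated by
# exponentiated gradient (OMD with entropy), and the bid scales the value by a
# function of lam. The bidding rule, step size and clipping are illustrative
# assumptions, NOT the paper's algorithm.
import numpy as np

rng = np.random.default_rng(6)
T_ros, eta, lam = 1.0, 0.05, 1.0          # target value-per-spend, step, dual
total_value = total_spend = 0.0

for t in range(10000):
    v = rng.uniform(0, 1)                 # advertiser's value for the query
    d = rng.uniform(0, 1)                 # highest competing bid (second price)
    bid = (1 + lam) * v / (lam * T_ros)   # maximizes (1+lam)*v - lam*T*payment
    if bid >= d:                          # win and pay the second price d
        total_value += v
        total_spend += d
        g = T_ros * d - v                 # > 0 means the RoS constraint tightened
        lam = np.clip(lam * np.exp(eta * g), 1e-3, 1e3)

print("realized RoS:", total_value / max(total_spend, 1e-9), "target:", T_ros)
```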

20220824 A Slightly Improved Bound for the KLS Constant.
Summary: We refine the recent breakthrough technique of Klartag and Lehec to obtain an improved polylogarithmic bound for the KLS constant.

20220809 Training Overparametrized Neural Networks in Sublinear Time.
Summary: The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, stochastic gradient descent (SGD) has a prohibitive convergence rate in non-convex settings, both in theory and practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with a much faster convergence rate, albeit with higher cost-per-iteration. For a typical neural network with $m=\mathrm{poly}(n)$ parameters and an input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time in the same overparametrized regime, where $\alpha \in (0.01,1)$ is some fixed constant. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view will have further applications in the design and analysis of deep neural networks (DNNs).

20220809 An Improved Trickle-Down Theorem for Partite Complexes.
Summary: We prove a strengthening of the trickle-down theorem for partite complexes. Given a $(d+1)$-partite $d$-dimensional simplicial complex, we show that if "on average" the links of faces of codimension 2 are $\frac{1-\delta}{d}$-(one-sided) spectral expanders, then the link of any face of codimension $k$ is an $O(\frac{1-\delta}{k\delta})$-(one-sided) spectral expander, for all $3\leq k\leq d+1$. As an application, using our theorem as a black box, we show that links of faces of codimension $k$ in recent constructions of bounded-degree high-dimensional expanders have spectral expansion at most an $O(1/k)$ fraction of the spectral expansion of the links of the worst faces of codimension $2$.

20220807 Decomposable Non-Smooth Convex Optimization with Nearly-Linear Gradient Oracle Complexity.
Summary: Many fundamental problems in machine learning can be formulated by the convex program \[ \min_{\theta\in \mathbb{R}^d}\ \sum_{i=1}^{n}f_{i}(\theta), \] where each $f_i$ is a convex, Lipschitz function supported on a subset of $d_i$ coordinates of $\theta$. One common approach to this problem, exemplified by stochastic gradient descent, involves sampling one $f_i$ term at every iteration to make progress. This approach crucially relies on a notion of uniformity across the $f_i$'s, formally captured by their condition number. In this work, we give an algorithm that minimizes the above convex formulation to $\epsilon$-accuracy in $\widetilde{O}(\sum_{i=1}^n d_i \log (1/\epsilon))$ gradient computations, with no assumptions on the condition number. The previous best algorithm independent of the condition number is the standard cutting plane method, which requires $O(nd \log (1/\epsilon))$ gradient computations. As a corollary, we improve upon the evaluation oracle complexity for decomposable submodular minimization by Axiotis et al. (ICML 2021). Our main technical contribution is an adaptive procedure to select an $f_i$ term at every iteration via a novel combination of cutting-plane and interior-point methods.
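
To fix ideas, here is a sketch of the decomposable structure and of the SGD-style baseline the summary contrasts with, on a toy objective where each $f_i$ is an $\ell_1$ distance supported on three coordinates (all choices below are illustrative, and the paper's cutting-plane/interior-point method is not reproduced).

```python
# Sketch of the decomposable structure and the sample-one-term baseline: each
# f_i is Lipschitz and touches only a coordinate subset S_i. The paper's
# cutting-plane / interior-point algorithm is far more involved than this.
import numpy as np

rng = np.random.default_rng(7)
n, d = 50, 30
supports = [rng.choice(d, size=3, replace=False) for _ in range(n)]   # S_i
anchors = [rng.standard_normal(3) for _ in range(n)]

# f_i(theta) = || theta[S_i] - a_i ||_1 : convex, 1-Lipschitz, sparse support.
def subgrad_i(theta, i):
    g = np.zeros(d)
    g[supports[i]] = np.sign(theta[supports[i]] - anchors[i])
    return g

theta = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                                  # sample one f_i term
    theta -= (1.0 / np.sqrt(t)) * subgrad_i(theta, i)    # subgradient step

obj = sum(np.abs(theta[s] - a).sum() for s, a in zip(supports, anchors))
print("objective after SGD baseline:", round(obj, 3))
```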

20220727 Searching for Regularity in Bounded Functions.
Summary: Given a function $f$ on $\mathbb{F}_2^n$, we study the following problem. What is the largest affine subspace $\mathcal{U}$ such that when restricted to $\mathcal{U}$, all the nontrivial Fourier coefficients of $f$ are very small? For the natural class of bounded Fourier degree $d$ functions $f:\mathbb{F}_2^n \to [-1,1]$, we show that there exists an affine subspace of dimension at least $\tilde\Omega(n^{1/d!}k^{-2})$, wherein all of $f$'s nontrivial Fourier coefficients become smaller than $2^{-k}$. To complement this result, we show the existence of degree-$d$ functions with coefficients larger than $2^{-d\log n}$ when restricted to any affine subspace of dimension larger than $\Omega(dn^{1/(d-1)})$. In addition, we give explicit examples of functions with analogous but weaker properties. Along the way, we provide multiple characterizations of the Fourier coefficients of functions restricted to subspaces of $\mathbb{F}_2^n$ that may be useful in other contexts. Finally, we highlight applications and connections of our results to parity kill number and affine dispersers.
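
A brute-force sketch of the central object, under toy parameters: the Fourier coefficients of $f$ restricted to an affine subspace $\mathcal{U} = \{Mz + c\}$, computed directly from the definition (the matrix $M$ and function $f$ below are arbitrary illustrations).

```python
# Brute-force computation of the Fourier coefficients of f restricted to an
# affine subspace U = { M z + c : z in F_2^k }. Small-n and illustrative only;
# the paper's results are structural, not algorithmic.
import numpy as np
from itertools import product

n, k = 6, 3
rng = np.random.default_rng(8)
f = lambda x: (-1) ** ((x[0] & x[1]) ^ x[2])      # toy +-1-valued function

M = rng.integers(0, 2, (n, k))                    # columns (assumed independent)
c = rng.integers(0, 2, n)                         # affine shift

def restricted_fourier(alpha):
    # hat{g}(alpha) = 2^{-k} * sum_z f(Mz + c) * (-1)^{<alpha, z>}
    total = 0.0
    for z in product((0, 1), repeat=k):
        z = np.array(z)
        x = (M @ z + c) % 2
        total += f(x) * (-1) ** int(alpha @ z % 2)
    return total / 2 ** k

for alpha in product((0, 1), repeat=k):
    print(alpha, round(restricted_fourier(np.array(alpha)), 3))
```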

20220719 Pandora Box Problem with Nonobligatory Inspection: Hardness and Approximation Scheme.
Summary: Weitzman (1979) introduced the Pandora box problem as a model for sequential search with inspection costs, and gave an elegant index-based policy that attains provably optimal expected payoff. In various scenarios, the searching agent may select an option without making a costly inspection. The variant of the Pandora box problem with non-obligatory inspection has attracted interest from both economics and algorithms researchers. Various simple algorithms have proved suboptimal, with the best known 0.8-approximation algorithm due to Guha et al. (2008). No hardness result for the problem was known. In this work, we show that it is NP-hard to compute an optimal policy for Pandora's problem with non-obligatory inspection. We also give a polynomial-time approximation scheme (PTAS) that computes policies with expected payoff at least a $(1-\epsilon)$-fraction of the optimal, for arbitrarily small $\epsilon > 0$. On the side, we show the decision version of the problem to be in NP.
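
For context, a sketch of Weitzman's classic index policy for the obligatory-inspection version: each box's reservation value $\sigma_i$ solves $\mathbb{E}[(X_i-\sigma_i)^+] = c_i$, boxes are opened in decreasing $\sigma_i$ order, and search stops once the best observed value beats every remaining index. In the non-obligatory variant studied here, this policy is no longer optimal, which is what makes the problem hard.

```python
# Weitzman's index policy for the *obligatory* inspection version, with
# reservation values found by bisection on Monte Carlo estimates. In the
# non-obligatory variant this policy is no longer optimal.
import numpy as np

rng = np.random.default_rng(9)

def reservation_value(samples, cost, lo=-10.0, hi=10.0):
    # Bisection on sigma for E[(X - sigma)^+] = cost (estimated from samples).
    for _ in range(60):
        mid = (lo + hi) / 2
        if np.maximum(samples - mid, 0).mean() > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

boxes = [(rng.normal(mu, 1, 100000), c)              # (value samples, cost)
         for mu, c in [(0.0, 0.1), (0.5, 0.4), (1.0, 0.8)]]
sigmas = [reservation_value(s, c) for s, c in boxes]
order = np.argsort(sigmas)[::-1]                     # decreasing index order

best, paid = -np.inf, 0.0
for i in order:
    if best >= sigmas[i]:                            # stopping rule
        break
    paid += boxes[i][1]                              # pay the inspection cost
    best = max(best, rng.choice(boxes[i][0]))        # inspect: observe a value
print("selected value:", round(best, 3), "inspection cost paid:", round(paid, 2))
```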

20220718 Private Convex Optimization in General Norms.
Summary: We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+\mu r))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work of [Gopi, Lee, Liu '22] to non-Euclidean settings. We show that this mechanism satisfies Gaussian differential privacy and solves both DP-ERM (empirical risk minimization) and DP-SCO (stochastic convex optimization) by using localization tools from convex geometry. Our framework is the first to apply to private convex optimization in general normed spaces and directly recovers non-private SCO rates achieved by mirror descent as the privacy parameter $\epsilon \to \infty$. As applications, for Lipschitz optimization in $\ell_p$ norms for all $p \in (1, 2)$, we obtain the first optimal privacy-utility tradeoffs; for $p = 1$, we improve tradeoffs obtained by the recent works [Asi, Feldman, Koren, Talwar '21, Bassily, Guzman, Nandi '21] by at least a logarithmic factor. Our $\ell_p$ norm and Schatten-$p$ norm optimization frameworks are complemented with polynomial-time samplers whose query complexity we explicitly bound.
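
A minimal sketch of the mechanism's sampling step in the Euclidean special case $r(\theta) = \|\theta\|_2^2/2$, using plain random-walk Metropolis on a toy Lipschitz loss; the paper's samplers for general norms are specialized and are not reproduced here.

```python
# Minimal sketch of the regularized exponential mechanism: draw theta from a
# density proportional to exp(-k * (F(theta) + mu * r(theta))) via random-walk
# Metropolis. Here r = ||theta||^2 / 2 (Euclidean case only); the paper covers
# general norms with specialized samplers.
import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal((200, 5))
y = X @ np.ones(5) + rng.standard_normal(200)

F = lambda th: np.abs(X @ th - y).mean()          # Lipschitz empirical loss
r = lambda th: 0.5 * th @ th                      # strongly convex regularizer
k, mu = 50.0, 0.1
log_density = lambda th: -k * (F(th) + mu * r(th))

theta = np.zeros(5)
for _ in range(20000):                            # random-walk Metropolis
    prop = theta + 0.05 * rng.standard_normal(5)
    if np.log(rng.uniform()) < log_density(prop) - log_density(theta):
        theta = prop
print("sampled theta:", theta.round(2), "loss:", round(F(theta), 3))
```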

20220707 Approximate Carathéodory bounds via Discrepancy Theory.
Summary: The approximate Carath\'eodory problem in general form is as follows: Given two symmetric convex bodies $P,Q \subseteq \mathbb{R}^m$, a parameter $k \in \mathbb{N}$ and $\mathbf{z} \in \textrm{conv}(X)$ with $X \subseteq P$, find $\mathbf{v}_1,\ldots,\mathbf{v}_k \in X$ so that $\|\mathbf{z} - \frac{1}{k}\sum_{i=1}^k \mathbf{v}_i\|_Q$ is minimized. Maurey showed that if both $P$ and $Q$ coincide with the $\|\cdot\|_p$-ball, then an error of $O(\sqrt{p/k})$ is possible. We prove a reduction to the vector balancing constant from discrepancy theory which for most cases can provide tight bounds for general $P$ and $Q$. For the case where $P$ and $Q$ are both $\|\cdot\|_p$-balls we prove an upper bound of $\sqrt{\frac{\min\{p, \log(\frac{2m}{k})\}}{k}}$. Interestingly, this bound cannot be obtained by taking independent random samples; instead we use the Lovett-Meka random walk. We also prove an extension to the more general case where $P$ and $Q$ are $\|\cdot\|_p$- and $\|\cdot\|_q$-balls with $2 \leq p \leq q \leq \infty$.
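
For concreteness, a sketch of Maurey's sampling baseline: draw $k$ indices i.i.d. from the convex-combination weights and average. As the summary notes, independent sampling cannot achieve the improved bound, which requires the Lovett-Meka walk instead.

```python
# Maurey's baseline: given z = sum_i w_i x_i (a convex combination), draw k
# indices from w and average. This attains the O(sqrt(p/k))-type error but, as
# the abstract notes, NOT the improved bound, which needs the Lovett-Meka walk
# rather than independent sampling.
import numpy as np

rng = np.random.default_rng(11)
m, N, k, q = 200, 50, 25, 4
X = rng.uniform(-1, 1, (N, m))                    # points inside the l_inf ball
w = rng.dirichlet(np.ones(N))                     # convex-combination weights
z = w @ X                                         # target point in conv(X)

idx = rng.choice(N, size=k, p=w)                  # Maurey: i.i.d. sample from w
approx = X[idx].mean(axis=0)
err = np.linalg.norm(z - approx, ord=q)           # error in the l_q norm
print(f"||z - (1/k) sum v_i||_{q} = {err:.3f}")
```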

20220707 Individual Preference Stability for Clustering.
Summary: In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given data set allows for an IP-stable clustering in general is NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a polynomial-time algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for a tree metric. We also consider relaxing the stability constraint, i.e., every data point should not be too far from its own cluster compared to any other cluster. For this case, we provide polynomial-time algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real data sets.
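
The definition is straightforward to operationalize; here is a direct checker (brute force, Euclidean metric assumed) that a given clustering is IP-stable.

```python
# Direct checker for IP-stability: every point must be, on average, at least
# as close to its own cluster as to any other cluster.
import numpy as np

def is_ip_stable(X, labels):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # pairwise distances
    for i in range(n):
        own = labels == labels[i]
        own[i] = False                                     # exclude point itself
        if not own.any():
            continue
        d_own = D[i, own].mean()
        for c in set(labels) - {labels[i]}:
            if d_own > D[i, labels == c].mean():
                return False                               # i prefers cluster c
    return True

rng = np.random.default_rng(12)
X = np.concatenate([rng.normal(0, 1, (30, 2)), rng.normal(8, 1, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
print(is_ip_stable(X, labels))                             # well separated: True
```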

20220701 When Does Differentially Private Learning Not Suffer in High Dimensions?
Summary: Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility tradeoffs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces are a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term \emph{restricted Lipschitz continuity} and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to the conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning. Code to reproduce our results can be found at \url{https://github.com/lxuechen/private-transformers/tree/main/examples/classification/spectral_analysis}.
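
A sketch of the empirical diagnostic described above, on synthetic data: stack observed gradients, compute their top principal components, and measure how much of a fresh gradient's norm those components capture.

```python
# Diagnostic sketch: stack gradients seen during training, take the top
# principal components, and check what fraction of a new gradient's norm they
# capture. Data here is synthetic with a planted low-dimensional subspace.
import numpy as np

rng = np.random.default_rng(13)
d, steps, rank = 1000, 300, 10
basis, _ = np.linalg.qr(rng.standard_normal((d, rank)))    # hidden subspace
G = (rng.standard_normal((steps, rank)) @ basis.T          # structured part
     + 0.05 * rng.standard_normal((steps, d)))             # small isotropic noise

# Top principal components of the stacked gradient matrix.
_, _, Vt = np.linalg.svd(G, full_matrices=False)
P = Vt[:rank]                                              # (rank, d) directions

g_new = rng.standard_normal(rank) @ basis.T + 0.05 * rng.standard_normal(d)
captured = np.linalg.norm(P @ g_new) / np.linalg.norm(g_new)
print(f"fraction of gradient norm in top {rank} components: {captured:.3f}")
```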