
2024-07-10 Is ML-Based Cryptanalysis Inherently Limited? Simulating Cryptographic Adversaries via Gradient-Based Methods.
Summary: Given the recent progress in machine learning (ML), the cryptography community has started exploring the applicability of ML methods to the design of new cryptanalytic approaches. While current empirical results show promise, the extent to which such methods may outperform classical cryptanalytic approaches is still somewhat unclear. In this work, we initiate exploration of the theory of ML-based cryptanalytic techniques, in particular providing new results towards understanding whether they are fundamentally limited compared to traditional approaches. Whereas most classic cryptanalysis crucially relies on directly processing individual samples (e.g., plaintext-ciphertext pairs), modern ML methods thus far only interact with samples via gradient-based computations that average a loss function over all samples. It is, therefore, conceivable that such gradient-based methods are inherently weaker than classical approaches. We introduce a unifying framework for capturing both ``sample-based'' adversaries that are provided with direct access to individual samples and ``gradient-based'' ones that are restricted to issuing gradient-based queries that are averaged over all given samples via a loss function. Within our framework, we establish a general feasibility result showing that any sample-based adversary can be simulated by a seemingly weaker gradient-based one. Moreover, the simulation exhibits a nearly optimal overhead in terms of the gradient-based simulator's running time. Finally, we extend and refine our simulation technique to construct a gradient-based simulator that is fully parallelizable (crucial for avoiding an undesirable overhead for parallelizable cryptanalytic tasks), which is then used to construct a gradient-based simulator that executes the particular and highly useful gradient-descent method.
Taken together, although the extent to which ML methods may outperform classical cryptanalytic approaches is still somewhat unclear, our results indicate that such gradient-based methods are not inherently limited by their seemingly restricted access to the provided samples.

2024-07-08 MUSE: Machine Unlearning Six-Way Evaluation for Language Models.
Summary: Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these data points (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approximate unlearning algorithms. The evaluation of the efficacy of these algorithms has traditionally been narrow in scope, failing to precisely quantify the success and practicality of the algorithm from the perspectives of both the model deployers and the data owners. We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3) no privacy leakage, (4) utility preservation on data not intended for removal, (5) scalability with respect to the size of removal requests, and (6) sustainability over sequential unlearning requests. Using these criteria, we benchmark how effectively eight popular unlearning algorithms on 7B-parameter LMs can unlearn Harry Potter books and news articles. Our results demonstrate that most algorithms can prevent verbatim memorization and knowledge memorization to varying degrees, but only one algorithm does not lead to severe privacy leakage. Furthermore, existing algorithms fail to meet deployers' expectations because they often degrade general model utility and also cannot sustainably accommodate successive unlearning requests or large-scale content removal. Our findings identify key issues with the practicality of existing unlearning algorithms on language models, and we release our benchmark to facilitate further evaluations: musebench.github.io

2024-07-01 An XOR Lemma for Deterministic Communication Complexity.
Summary: We prove a lower bound on the communication complexity of computing the $n$-fold XOR of an arbitrary function $f$, in terms of the communication complexity and rank of $f$. We prove that $D(f^{\oplus n}) \geq n \cdot \Big(\frac{\Omega(D(f))}{\log \mathsf{rk}(f)} - \log \mathsf{rk}(f)\Big)$, where $D(f)$ and $D(f^{\oplus n})$ denote the deterministic communication complexities, and $\mathsf{rk}(f)$ is the rank of $f$. Our methods involve a new way to use information theory to reason about deterministic communication complexity.

2024-06-22 Fair Clustering: Critique, Caveats, and Future Directions.
Summary: Clustering is a fundamental problem in machine learning and operations research. Given that fairness considerations have become of paramount importance in algorithm design, fairness in clustering has received significant attention from the research community. The literature on fair clustering has resulted in a collection of interesting fairness notions and elaborate algorithms. In this paper, we take a critical view of fair clustering, identifying a collection of ignored issues such as the lack of a clear utility characterization and the difficulty in accounting for the downstream effects of a fair clustering algorithm in machine learning settings. In some cases, we demonstrate examples where the application of a fair clustering algorithm can have significant negative impacts on social welfare. We end by identifying a collection of steps that would lead towards more impactful research in fair clustering.

2024-06-20 Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning.
Summary: Large language models (LLMs) have emerged as powerful tools for tackling complex tasks across diverse domains, but they also raise privacy concerns when fine-tuned on sensitive data due to potential memorization. While differential privacy (DP) offers a promising solution by ensuring models are 'almost indistinguishable' with or without any particular privacy unit, current evaluations on LLMs mostly treat each example (text record) as the privacy unit. This leads to uneven user privacy guarantees when contributions per user vary. We therefore study user-level DP, motivated by applications where it is necessary to ensure uniform privacy protection across users. We present a systematic evaluation of user-level DP for LLM fine-tuning on natural language generation tasks. Focusing on two mechanisms for achieving user-level DP guarantees, Group Privacy and User-wise DP-SGD, we investigate design choices like data selection strategies and parameter tuning for the best privacy-utility trade-off.
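The difference between example-level and user-level DP is easiest to see in code. Below is a minimal sketch (our own illustration, not the paper's implementation) of the user-wise DP-SGD idea: gradients are first averaged within each user, then clipped per user, so the noise is calibrated to a user's entire contribution rather than to a single example. The function name and all parameter choices here are hypothetical.

```python
import numpy as np

def userwise_dpsgd_step(theta, user_batches, grad_fn,
                        clip_norm=1.0, noise_mult=1.0, lr=0.1, rng=None):
    """One step of user-wise DP-SGD (sketch): each *user's* averaged
    gradient is clipped to clip_norm, so the privacy unit is the user,
    not the individual example."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = []
    for examples in user_batches:  # one entry per sampled user
        g = np.mean([grad_fn(theta, ex) for ex in examples], axis=0)
        g = g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        clipped.append(g)
    # Gaussian noise scaled to the per-user clipping norm
    noise = rng.normal(0.0, noise_mult * clip_norm, size=theta.shape)
    g_priv = (np.sum(clipped, axis=0) + noise) / len(user_batches)
    return theta - lr * g_priv
```

A user who contributes many examples still only moves the noisy sum by at most `clip_norm`, which is exactly what uniform per-user protection requires.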

2024-06-18 First-Order Methods for Linearly Constrained Bilevel Optimization.
Summary: Algorithms for bilevel optimization often encounter Hessian computations, which are prohibitive in high dimensions. While recent works offer first-order methods for unconstrained bilevel problems, the constrained setting remains relatively underexplored. We present first-order linearly constrained optimization methods with finite-time hypergradient stationarity guarantees. For linear equality constraints, we attain $\epsilon$-stationarity in $\widetilde{O}(\epsilon^{-2})$ gradient oracle calls, which is nearly optimal. For linear inequality constraints, we attain $(\delta,\epsilon)$-Goldstein stationarity in $\widetilde{O}(d\,\delta^{-1} \epsilon^{-3})$ gradient oracle calls, where $d$ is the upper-level dimension. Finally, for the linear inequality setting we obtain dimension-free rates of $\widetilde{O}(\delta^{-1} \epsilon^{-4})$ oracle complexity under the additional assumption of oracle access to the optimal dual variable. Along the way, we develop new nonsmooth nonconvex optimization methods with inexact oracles. We verify these guarantees with preliminary numerical experiments.

2024-06-11 Closing the Computational-Query Depth Gap in Parallel Stochastic Convex Optimization.
Summary: We develop a new parallel algorithm for minimizing Lipschitz, convex functions with a stochastic subgradient oracle. The total number of queries made and the query depth, i.e., the number of parallel rounds of queries, match the prior state-of-the-art, [CJJLLST23], while improving upon the computational depth by a polynomial factor for sufficiently small accuracy. When combined with previous state-of-the-art methods, our result closes a gap between the best-known query depth and the best-known computational depth of parallel algorithms. Our method starts with the ball acceleration framework of previous parallel methods, i.e., [CJJJLST20, ACJJS21], which reduces the problem to minimizing a regularized Gaussian convolution of the function constrained to Euclidean balls. By developing and leveraging new stability properties of the Hessian of this induced function, we depart from prior parallel algorithms and reduce these ball-constrained optimization problems to stochastic unconstrained quadratic minimization problems. Although we are unable to prove concentration of the asymmetric matrices that we use to approximate this Hessian, we nevertheless develop an efficient parallel method for solving these quadratics. Interestingly, our algorithms can be improved using fast matrix multiplication and use nearly linear work if the matrix multiplication exponent is 2.

2024-06-05 Private Online Learning via Lazy Algorithms.
Summary: We study the problem of private online learning, specifically, online prediction from experts (OPE) and online convex optimization (OCO). We propose a new transformation that turns lazy online learning algorithms into private algorithms, and apply it to differentially private OPE and OCO using existing lazy algorithms for these problems. Our final algorithms obtain regret bounds that significantly improve the regret in the high-privacy regime $\varepsilon \ll 1$, obtaining $\sqrt{T \log d} + T^{1/3} \log(d)/\varepsilon^{2/3}$ for DP-OPE and $\sqrt{T} + T^{1/3} \sqrt{d}/\varepsilon^{2/3}$ for DP-OCO. We also complement our results with a lower bound for DP-OPE, showing that these rates are optimal for a natural family of low-switching private algorithms.

2024-06-04 Private Stochastic Convex Optimization with Heavy Tails: Near-Optimality from Simple Reductions.
Summary: We study the problem of differentially private stochastic convex optimization (DP-SCO) with heavy-tailed gradients, where we assume a $k^{\text{th}}$-moment bound on the Lipschitz constants of sample functions rather than a uniform bound. We propose a new reduction-based approach that enables us to obtain the first optimal rates (up to logarithmic factors) in the heavy-tailed setting, achieving error $G_2 \cdot \frac 1 {\sqrt n} + G_k \cdot (\frac{\sqrt d}{n\epsilon})^{1 - \frac 1 k}$ under $(\epsilon, \delta)$-approximate differential privacy, up to a mild $\textup{polylog}(\frac{1}{\delta})$ factor, where $G_2^2$ and $G_k^k$ are the $2^{\text{nd}}$ and $k^{\text{th}}$ moment bounds on sample Lipschitz constants, nearly matching a lower bound of [Lowy and Razaviyayn 2023]. We further give a suite of private algorithms in the heavy-tailed setting which improve upon our basic result under additional assumptions, including an optimal algorithm under a known-Lipschitz-constant assumption, a near-linear time algorithm for smooth functions, and an optimal linear-time algorithm for smooth generalized linear models.
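The clip-then-add-noise idea that underlies private estimation with heavy-tailed data can be illustrated on the simplest possible case, a scalar mean. This is our own toy sketch of the standard Gaussian mechanism with clipping, not the paper's reduction; all names and constants below are ours.

```python
import numpy as np

def private_clipped_mean(samples, clip, epsilon, delta, rng=None):
    """Private mean of possibly heavy-tailed scalars (sketch):
    clip each sample to [-clip, clip], average, then add Gaussian
    noise calibrated to the clipped sensitivity 2*clip/n."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(samples, dtype=float), -clip, clip)
    n = len(x)
    sensitivity = 2.0 * clip / n  # replacing one sample moves the mean by <= this
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return x.mean() + rng.normal(0.0, sigma)
```

The tension the paper studies is visible even here: a larger `clip` reduces bias on heavy-tailed inputs but forces more noise, and moment bounds govern the best trade-off.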

2024-05-30 Reconstruction Attacks on Machine Unlearning: Simple Models are Vulnerable.
Summary: Machine unlearning is motivated by the desire for data autonomy: a person can request to have their data's influence removed from deployed models, and those models should be updated as if they had been retrained without the person's data. We show that, counterintuitively, these updates expose individuals to high-accuracy reconstruction attacks which allow the attacker to recover their data in its entirety, even when the original models are so simple that privacy risk might not otherwise have been a concern. We show how to mount a near-perfect attack on the deleted data point from linear regression models. We then generalize our attack to other loss functions and architectures, and empirically demonstrate the effectiveness of our attacks across a wide range of datasets (capturing both tabular and image data). Our work highlights that privacy risk is significant even for extremely simple model classes when individuals can request deletion of their data from the model.
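For ordinary least squares, the normal equations make such a reconstruction concrete. The sketch below is our own toy illustration, not the paper's attack or threat model: it assumes the attacker knows the pre-deletion parameters and all retained rows, and that the first feature is a constant bias term, which pins down the scale of the deleted example.

```python
import numpy as np

def reconstruct_deleted(theta, X_keep, y_keep):
    """Recover a deleted (x, y) pair from an OLS model, given the
    pre-deletion parameters `theta` and the retained data.
    Normal equations: (A' + x x^T) theta = b' + x*y, hence
    v := A' theta - b' = x * (y - x^T theta); the bias coordinate
    x[0] = 1 pins down the unknown scale."""
    v = X_keep.T @ X_keep @ theta - X_keep.T @ y_keep
    s = v[0]            # equals y - x^T theta, since x[0] == 1
    x_rec = v / s
    y_rec = s + x_rec @ theta
    return x_rec, y_rec
```

Even this simplified setting shows that an exact "retrain without me" update can leak the deleted row in full, which is the phenomenon the paper develops into attacks under weaker attacker knowledge.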

2024-05-26 A General Framework for Lattice-Based ABE Using Evasive Inner-Product Functional Encryption.
Summary: We present a general framework for constructing attribute-based encryption (ABE) schemes for an arbitrary function class based on lattices from two ingredients: i) a noisy linear secret sharing scheme for the class, and ii) a new type of inner-product functional encryption (IPFE) scheme, termed *evasive* IPFE, which we introduce in this work. We propose lattice-based evasive IPFE schemes and establish their security under simple conditions based on variants of the evasive learning with errors (LWE) assumption recently proposed by Wee [EUROCRYPT ’22] and Tsabary [CRYPTO ’22]. Our general framework is modular and conceptually simple, reducing the task of constructing ABE to that of constructing noisy linear secret sharing schemes, a more lightweight primitive. The versatility of our framework is demonstrated by three new ABE schemes based on variants of the evasive LWE assumption. First, we obtain two ciphertext-policy ABE schemes for all polynomial-size circuits with a predetermined depth bound. One of these schemes has *succinct* ciphertexts and secret keys, of size polynomial in the depth bound rather than the circuit size. This eliminates the need for tensor LWE, another new assumption, from the previous state-of-the-art construction by Wee [EUROCRYPT ’22]. Second, we develop ciphertext-policy and key-policy ABE schemes for deterministic finite automata (DFA) and logspace Turing machines ($\mathsf{L}$). They are the first lattice-based public-key ABE schemes supporting uniform models of computation. Previous lattice-based schemes for uniform computation were limited to the secret-key setting or offered only weaker security against bounded collusion. Lastly, the new primitive of evasive IPFE serves as the lattice-based counterpart of pairing-based IPFE, enabling the application of techniques developed in pairing-based ABE constructions to lattice-based constructions. We believe it is of independent interest and may find other applications.

2024-05-19 Information-Theoretic Multi-Server PIR with Global Preprocessing.
Summary: We propose a new unified framework to construct multi-server, information-theoretic Private Information Retrieval (PIR) schemes that leverage global preprocessing to achieve sublinear computation per query. Despite a couple of earlier attempts, our understanding of PIR schemes in the global preprocessing model remains limited, and so far, we only know a few sparse points in the broad design space. Our framework not only unifies earlier results in this space but also leads to several new results. First, we can improve the server space of the state-of-the-art scheme by a polynomial factor. Second, we can broaden the parameter space of known results, allowing a smooth trade-off between bandwidth and computation. Third, while earlier schemes achieve better per-server bandwidth and computation as we add more servers, their server space actually grows with the number of servers. We offer a new scalable family of schemes where the per-server bandwidth, computation, and space all decrease as we add more servers. This scalable family of schemes also implies the so-called ``doubly efficient'' PIR scheme with any super-constant number of servers, achieving $n^{1+o(1)}$ server space and preprocessing cost, and $n^{o(1)}$ bandwidth and computation per query.

2024-05-13 Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp.
Summary: As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups, such as LGBTQ+ people, older women, and younger men, are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.

2024-05-10 A computational test of quantum contextuality, and even simpler proofs of quantumness.
Summary: Bell nonlocality is a fundamental feature of quantum mechanics whereby measurements performed on "spatially separated" quantum systems can exhibit correlations that cannot be understood as revealing predetermined values. This is a special case of the more general phenomenon of "quantum contextuality", which says that such correlations can occur even when the measurements are not necessarily on separate quantum systems, but are merely "compatible" (i.e. commuting). Crucially, while any nonlocal game yields an experiment that demonstrates quantum advantage by leveraging the "spatial separation" of two or more devices (and in fact several such demonstrations have been conducted successfully in recent years), the same is not true for quantum contextuality: finding the contextuality analogue of such an experiment is arguably one of the central open questions in the foundations of quantum mechanics. In this work, we show that an arbitrary contextuality game can be compiled into an operational "test of contextuality" involving a single quantum device, by only making the assumption that the device is computationally bounded. Our work is inspired by the recent work of Kalai et al. (STOC '23) that converts any nonlocal game into a classical test of quantum advantage with a single device. The central idea in their work is to use cryptography to enforce spatial separation within subsystems of a single quantum device. Our work can be seen as using cryptography to enforce "temporal separation", i.e. to restrict communication between sequential measurements. Beyond contextuality, we employ our ideas to design a "proof of quantumness" that, to the best of our knowledge, is arguably even simpler than the ones proposed in the literature so far.

2024-04-30 Structure learning of Hamiltonians from real-time evolution.
Summary: We initiate the study of Hamiltonian structure learning from real-time evolution: given the ability to apply $e^{-\mathrm{i} Ht}$ for an unknown local Hamiltonian $H = \sum_{a = 1}^m \lambda_a E_a$ on $n$ qubits, the goal is to recover $H$. This problem is already well-studied under the assumption that the interaction terms, $E_a$, are given, and only the interaction strengths, $\lambda_a$, are unknown. But is it possible to learn a local Hamiltonian without prior knowledge of its interaction structure? We present a new, general approach to Hamiltonian learning that not only solves the challenging structure learning variant, but also resolves other open questions in the area, all while achieving the gold standard of Heisenberg-limited scaling. In particular, our algorithm recovers the Hamiltonian to $\varepsilon$ error with an evolution time scaling with $1/\varepsilon$, and has the following appealing properties: (1) it does not need to know the Hamiltonian terms; (2) it works beyond the short-range setting, extending to any Hamiltonian $H$ where the sum of terms interacting with a qubit has bounded norm; (3) it evolves according to $H$ in constant time $t$ increments, thus achieving constant time resolution. To our knowledge, no prior algorithm with Heisenberg-limited scaling existed with even one of these properties. As an application, we can also learn Hamiltonians exhibiting power-law decay up to accuracy $\varepsilon$ with total evolution time beating the standard limit of $1/\varepsilon^2$.

2024-04-22 Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.
Summary: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.

2024-04-16 On approximability of the Permanent of PSD matrices.
Summary: We study the complexity of approximating the permanent of a positive semidefinite matrix $A\in \mathbb{C}^{n\times n}$. 1. We design a new approximation algorithm for $\mathrm{per}(A)$ with approximation ratio $e^{(0.9999 + \gamma)n}$, exponentially improving upon the current best bound of $e^{(1+\gamma-o(1))n}$ [AGOS17, YP22]. Here, $\gamma \approx 0.577$ is Euler's constant. 2. We prove that it is NP-hard to approximate $\mathrm{per}(A)$ within a factor $e^{(\gamma-\epsilon)n}$ for any $\epsilon>0$. This is the first exponential hardness of approximation for this problem. Along the way, we prove optimal hardness of approximation results for the $\|\cdot\|_{2\to q}$ ``norm'' problem of a matrix for all $1 < q < 2$.

2024-04-04 The power of a single Haar random state: constructing and separating quantum pseudorandomness.
Summary: In this work, we focus on the following question: what are the cryptographic implications of having access to an oracle that provides a single Haar random quantum state? We show, perhaps surprisingly, that such an oracle is sufficient to construct quantum pseudorandomness. Pseudorandom states (PRS) are a family of states for which it is hard to distinguish between polynomially many copies of either a state sampled uniformly from the family or a Haar random state. A weaker notion, called single-copy pseudorandom states (1PRS), satisfies this property with respect to a single copy. Our main result is that 1PRS (as well as bit commitments) exist relative to an oracle that provides a single Haar random state. We build on this result to show the existence of an oracle relative to which 1PRS exist, but PRS do not. This provides one of the first black-box separations between different forms of quantum pseudorandomness.

2024-04-04 Cryptographic Hardness of Score Estimation.
Summary: We show that $L^2$-accurate score estimation, in the absence of strong assumptions on the data distribution, is computationally hard even when sample complexity is polynomial in the relevant problem parameters. Our reduction builds on the result of Chen et al. (ICLR 2023), who showed that the problem of generating samples from an unknown data distribution reduces to $L^2$-accurate score estimation. Our hard-to-estimate distributions are the "Gaussian pancakes" distributions, originally due to Diakonikolas et al. (FOCS 2017), which have been shown to be computationally indistinguishable from the standard Gaussian under widely believed hardness assumptions from lattice-based cryptography (Bruna et al., STOC 2021; Gupte et al., FOCS 2022).

2024-03-28 Improving the Bit Complexity of Communication for Distributed Convex Optimization.
Summary: We consider the communication complexity of some fundamental convex optimization problems in the point-to-point (coordinator) and blackboard communication models. We strengthen known bounds for approximately solving linear regression, $p$-norm regression (for $1\leq p\leq 2$), linear programming, minimizing the sum of finitely many convex nonsmooth functions with varying supports, and low-rank approximation; for a number of these fundamental problems our bounds are nearly optimal, as proven by our lower bounds. Among our techniques, we use the notion of block leverage scores, which have been relatively unexplored in this context, as well as dropping all but the ``middle'' bits in Richardson-style algorithms. We also introduce a new communication problem for accurately approximating inner products and establish a lower bound using the spherical Radon transform. Our lower bound can be used to show the first separation of linear programming and linear systems in the distributed model when the number of constraints is polynomial, addressing an open question in prior work.

2024-03-25 High-Temperature Gibbs States are Unentangled and Efficiently Preparable.
Summary: We show that thermal states of local Hamiltonians are separable above a constant temperature. Specifically, for a local Hamiltonian $H$ on a graph with degree $\mathfrak{d}$, its Gibbs state at inverse temperature $\beta$, denoted by $\rho = e^{-\beta H}/ \textrm{tr}(e^{-\beta H})$, is a classical distribution over product states for all $\beta < 1/(c\mathfrak{d})$, where $c$ is a constant. This sudden death of thermal entanglement upends conventional wisdom about the presence of short-range quantum correlations in Gibbs states. Moreover, we show that we can efficiently sample from the distribution over product states. In particular, for any $\beta < 1/( c \mathfrak{d}^3)$, we can prepare a state $\epsilon$-close to $\rho$ in trace distance with a depth-one quantum circuit and $\textrm{poly}(n) \log(1/\epsilon)$ classical overhead. A priori, the task of preparing a Gibbs state is a natural candidate for achieving super-polynomial quantum speedups, but our results rule out this possibility above a fixed constant temperature.
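The high-temperature phenomenon can be checked numerically on a tiny system. For two qubits, positivity under partial transpose (PPT) is equivalent to separability, so the sketch below (our own toy example with a Hamiltonian of our choosing, not the paper's construction) builds a Gibbs state and tests its partial transpose at high and low temperature.

```python
import numpy as np

def gibbs_state(H, beta):
    """rho = exp(-beta*H) / tr(exp(-beta*H)) via eigendecomposition."""
    w, U = np.linalg.eigh(H)
    p = np.exp(-beta * (w - w.min()))  # shift spectrum for numerical safety
    return (U * (p / p.sum())) @ U.conj().T

def partial_transpose_2q(rho):
    """Partial transpose on the second qubit of a 4x4 density matrix."""
    return rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)

# A toy two-qubit Hamiltonian: ZZ coupling plus transverse fields.
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
I2 = np.eye(2)
H = np.kron(Z, Z) + np.kron(X, I2) + np.kron(I2, X)
```

At small $\beta$ the Gibbs state is close to maximally mixed and its partial transpose stays positive (hence separable for two qubits), while at large $\beta$ it approaches an entangled ground state and PPT fails.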

2024-03-19 The status of the quantum PCP conjecture (games version).
Summary: In classical complexity theory, the two definitions of probabilistically checkable proofs (the constraint satisfaction version and the nonlocal games version) are equal in computational power. In the quantum setting, the situation is far less clear. The result MIP* = RE of Ji et al. (arXiv:2001.04383) and refinements by Natarajan and Zhang (arXiv:2302.04322) show that multiprover interactive proof systems with polylogarithmically long messages can solve any decision problem in RE, including undecidable problems like the halting problem. These results show that any connection between the "constraint satisfaction" or "Hamiltonian" quantum PCP conjecture and nonlocal games must involve restricting the players in the game to be computationally efficient. This note contains two main results: (1) we give a "quantum games PCP for AM" in the form of a new construction of a succinct MIP* protocol with efficient provers for the canonical AM-complete problem, and (2) we explain an error in the energy amplification procedure of Natarajan and Vidick (arXiv:1710.03062) which invalidates their claim to have constructed a quantum games PCP for a QMA-complete problem. In surveying the obstacles remaining towards a quantum games PCP for QMA, we highlight the importance and challenge of understanding gap amplification for Hamiltonians even when locality is replaced by much weaker constraints, such as bounds on the "Pauli spectrum" of the Hamiltonian. We hope these questions will motivate progress towards new "baby versions" of the Hamiltonian quantum PCP conjecture.

2024-03-06 Black-Box $k$-to-$1$-PCA Reductions: Theory and Applications.
Summary: The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have black-box access to via samples. Motivated by these settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly understood. Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work in sample complexity by a $\mathsf{poly}(k)$ factor.
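The deflation framework the paper analyzes is simple to state in code: call a 1-PCA oracle, project the matrix onto the orthogonal complement of the recovered vector, and recurse. Below is a minimal sketch (our own illustration; the oracle here is plain power iteration, and all parameter choices are ours, not the paper's).

```python
import numpy as np

def one_pca(M, iters=500, rng=None):
    """Black-box 1-PCA oracle (sketch): power iteration returning an
    approximate top eigenvector of a symmetric PSD matrix M."""
    rng = np.random.default_rng(0) if rng is None else rng
    v = rng.normal(size=M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

def kpca_deflation(M, k):
    """k-PCA by deflation: recover a top direction, project it out
    of M, and recurse k times."""
    vs = []
    for _ in range(k):
        v = one_pca(M)
        vs.append(v)
        P = np.eye(M.shape[0]) - np.outer(v, v)
        M = P @ M @ P  # deflate: remove the recovered direction
    return np.array(vs)
```

The subtlety the paper studies is how errors in each approximate oracle call compound across the $k$ recursive deflation steps, under the ePCA and cPCA approximation notions.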

2024-03-04 Differentially Private Synthetic Data via Foundation Model APIs 2: Text.
Summary: Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP fine-tuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP fine-tuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AIsecure/augpe

20240221 Private Gradient Descent for Linear Regression: Tighter Error Bounds and Instance-Specific Uncertainty Estimation.
Summary: We provide an improved analysis of standard differentially private gradient descent for linear regression under the squared error loss. Under modest assumptions on the input, we characterize the distribution of the iterate at each time step. Our analysis leads to new results on the algorithm's accuracy: for a proper fixed choice of hyperparameters, the sample complexity depends only linearly on the dimension of the data. This matches the dimension dependence of the (non-private) ordinary least squares estimator as well as that of recent private algorithms that rely on sophisticated adaptive gradient-clipping schemes (Varshney et al., 2022; Liu et al., 2023). Our analysis of the iterates' distribution also allows us to construct confidence intervals for the empirical optimizer which adapt automatically to the variance of the algorithm on a particular data set. We validate our theorems through experiments on synthetic data.
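The algorithm under analysis is standard and short; a minimal sketch of DP gradient descent for squared-error regression, with per-example gradient clipping and Gaussian noise. The hyperparameters below are illustrative defaults, not the paper's tuned choices, and no privacy accounting is performed:

```python
import numpy as np

def dp_gd_linreg(X, y, eta=0.2, clip=1.0, sigma=0.1, steps=300, seed=0):
    # Sketch of noisy clipped gradient descent for least squares:
    # per-example gradients are clipped to L2 norm `clip`, summed,
    # and perturbed with Gaussian noise scaled to the clip norm.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        resid = X @ theta - y                       # (n,)
        grads = resid[:, None] * X                  # per-example gradients (n, d)
        norms = np.linalg.norm(grads, axis=1)
        grads *= np.minimum(1.0, clip / np.maximum(norms, 1e-12))[:, None]
        noisy = grads.sum(axis=0) + sigma * clip * rng.standard_normal(d)
        theta -= eta * noisy / n                    # one Gaussian-mechanism step
    return theta
```

The paper's point is that this vanilla scheme, properly analyzed, already matches the dimension dependence of more elaborate adaptive-clipping methods.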

20240220 Testing Calibration in Nearly-Linear Time.
Summary: In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given $n$ draws from a distribution $\mathcal{D}$ on (predictions, binary outcomes), our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated, and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly structured graph, and design an exact dynamic programming-based solver for it which runs in time $O(n\log^2(n))$, and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring $\Omega(n^\omega)$ time, where $\omega > 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem improving upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.

20240213 On black-box separations of quantum digital signatures from pseudorandom states.
Summary: It is well-known that digital signatures can be constructed from one-way functions in a black-box way. While one-way functions are essentially the minimal assumption in classical cryptography, this is not the case in the quantum setting. A variety of qualitatively weaker and inherently quantum assumptions (e.g., EFI pairs, one-way state generators, and pseudorandom states) are known to be sufficient for non-trivial quantum cryptography. While it is known that commitments, zero-knowledge proofs, and even multiparty computation can be constructed from these assumptions, it has remained an open question whether the same is true for quantum digital signature schemes (QDS). In this work, we show that there $\textit{does not}$ exist a black-box construction of a QDS scheme with classical signatures from pseudorandom states with linear, or greater, output length. Our result complements that of Morimae and Yamakawa (2022), who described a $\textit{one-time}$ secure QDS scheme with classical signatures, but left open the question of constructing a standard $\textit{multi-time}$ secure one.

20240202 Large language models cannot replace human participants because they cannot portray identity groups.
Summary: Large language models (LLMs) are increasing in capability and popularity, propelling their application in new domains, including as replacements for human participants in computational social science, user testing, annotation tasks, and more. Traditionally, in all of these settings survey distributors are careful to find representative samples of the human population to ensure the validity of their results and understand potential demographic differences. This means that in order to be a suitable replacement, LLMs will need to be able to capture the influence of positionality (i.e., relevance of social identities like gender and race). However, we show that there are two inherent limitations in the way current LLMs are trained that prevent this. We argue analytically for why LLMs are doomed to both misportray and flatten the representations of demographic groups, then empirically show this to be true on 4 LLMs through a series of human studies with 3200 participants across 16 demographic identities. We also discuss a third consideration about how identity prompts can essentialize identities. Throughout, we connect each of these limitations to a pernicious history that shows why each is harmful for marginalized demographic groups. Overall, we urge caution in use cases where LLMs are intended to replace human participants whose identities are relevant to the task at hand. At the same time, in cases where the goal is to supplement rather than replace (e.g., pilot studies), we provide empirically better inference-time techniques to reduce, but not remove, these harms.

20240202 Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence.
Summary: Based on SGD, previous works have proposed many algorithms that improve convergence speed and generalization in stochastic optimization, such as SGDm, AdaGrad, and Adam. However, their convergence analysis under non-convex conditions is challenging. In this work, we propose a unified framework to address this issue. For any first-order method, we interpret the update direction $g_t$ as the sum of the stochastic subgradient $\nabla f_t(x_t)$ and an additional acceleration term $\frac{2\langle v_t, \nabla f_t(x_t) \rangle}{\|v_t\|_2^2} v_t$, so we can analyze convergence by studying $\langle v_t, \nabla f_t(x_t) \rangle$. Through our framework, we discover two plug-and-play acceleration methods: \textbf{Reject Accelerating} and \textbf{Random Vector Accelerating}. We theoretically demonstrate that these two methods can directly lead to an improvement in the convergence rate.
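The decomposition in the summary is easy to state concretely. This sketch just forms the update direction from a gradient and an auxiliary vector $v_t$; which vector plays the role of $v_t$ (a momentum buffer, a random vector, etc.) depends on the method being analyzed:

```python
import numpy as np

def update_direction(grad, v):
    # g_t = grad + (2 <v, grad> / ||v||_2^2) * v: the framework's
    # decomposition of a first-order method's update direction into
    # the stochastic (sub)gradient plus an acceleration term along v.
    return grad + 2.0 * (v @ grad) / (v @ v) * v
```

The inner product the analysis tracks, $\langle v_t, \nabla f_t(x_t) \rangle$, is exactly `v @ grad` here; its sign and magnitude determine whether the extra term helps or hurts the descent step.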

20240110 Quantum Time-Space Tradeoffs for Matrix Problems.
Summary: We consider the time and space required for quantum computers to solve a wide variety of problems involving matrices, many of which have only been analyzed classically in prior work. Our main results show that for a range of linear algebra problems (including matrix-vector product, matrix inversion, matrix multiplication, and powering), existing classical time-space tradeoffs, several of which are tight for every space bound, also apply to quantum algorithms. For example, for almost all matrices $A$, including the discrete Fourier transform (DFT) matrix, we prove that quantum circuits with at most $T$ input queries and $S$ qubits of memory require $T=\Omega(n^2/S)$ to compute the matrix-vector product $Ax$ for $x \in \{0,1\}^n$. We similarly prove that matrix multiplication for $n\times n$ binary matrices requires $T=\Omega(n^3 / \sqrt{S})$. Because many of our lower bounds match deterministic algorithms with the same time and space complexity, we show that quantum computers cannot provide any asymptotic advantage for these problems with any space bound. We obtain matching lower bounds for the stronger notion of quantum cumulative memory complexity (the sum of the space per layer of a circuit). We also consider Boolean (i.e., AND-OR) matrix multiplication and matrix-vector products, improving the previous quantum time-space tradeoff lower bounds for $n\times n$ Boolean matrix multiplication to $T=\Omega(n^{2.5}/S^{1/4})$ from $T=\Omega(n^{2.5}/S^{1/2})$. Our improved lower bound for Boolean matrix multiplication is based on a new coloring argument that extracts more from the strong direct product theorem used in prior work. Our tight lower bounds for linear algebra problems require adding a new bucketing method to the recording-query technique of Zhandry that lets us apply classical arguments to upper bound the success probability of quantum circuits.

20231228 Variable Neighborhood Searching Rerandomization.
Summary: Rerandomization discards undesired treatment assignments to ensure covariate balance in randomized experiments. However, rerandomization based on acceptance-rejection sampling is computationally inefficient, especially when numerous independent assignments are required to perform randomization-based statistical inference. Existing acceleration methods are suboptimal and are not applicable in structured experiments, including stratified experiments and experiments with clusters. Based on metaheuristics in combinatorial optimization, we propose a novel variable neighborhood searching rerandomization (VNSRR) method to draw balanced assignments in various experiments efficiently. We derive the unbiasedness and a lower bound for the variance reduction of the treatment effect estimator under VNSRR. Simulation studies and a real data example indicate that our method maintains the appealing statistical properties of rerandomization and can sample thousands of treatment assignments within seconds, even in cases where existing methods require an hour to complete the task.
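For contrast with VNSRR, the baseline acceptance-rejection scheme the summary calls computationally inefficient is a short loop; the Mahalanobis balance criterion and threshold below are the standard illustrative choices, not the paper's specific setup:

```python
import numpy as np

def accept_reject_rerandomize(X, n_treat, threshold, rng):
    # Baseline rerandomization: redraw a completely random assignment
    # until the Mahalanobis distance between treatment and control
    # covariate means falls below `threshold`. Acceptance probability
    # shrinks as the threshold tightens, which is what makes this slow.
    n = X.shape[0]
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    while True:
        treat = np.zeros(n, dtype=bool)
        treat[rng.choice(n, size=n_treat, replace=False)] = True
        diff = X[treat].mean(axis=0) - X[~treat].mean(axis=0)
        M = n_treat * (n - n_treat) / n * diff @ Sinv @ diff
        if M < threshold:
            return treat
```

Drawing thousands of independent accepted assignments this way multiplies the rejection cost by the number of draws, which is the bottleneck VNSRR's neighborhood search avoids.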

20231219 Initializing Services in Interactive ML Systems for Diverse Users.
Summary: This paper studies ML systems that interactively learn from users across multiple subpopulations with heterogeneous data distributions. The primary objective is to provide specialized services for different user groups while also predicting user preferences. Once the users select a service based on how well the service anticipated their preference, the services subsequently adapt and refine themselves based on the user data they accumulate, resulting in an iterative, alternating minimization process between users and services (learning dynamics). Employing such tailored approaches has two main challenges: (i) Unknown user preferences: Typically, data on user preferences are unavailable without interaction, and uniform data collection across a large and diverse user base can be prohibitively expensive. (ii) Suboptimal Local Solutions: The total loss (sum of loss functions across all users and all services) landscape is not convex even if the individual losses on a single service are convex, making it likely for the learning dynamics to get stuck in local minima. The final outcome of the aforementioned learning dynamics is thus strongly influenced by the initial set of services offered to users, and is not guaranteed to be close to the globally optimal outcome. In this work, we propose a randomized algorithm to adaptively select very few users to collect preference data from, while simultaneously initializing a set of services. We prove that under mild assumptions on the loss functions, the expected total loss achieved by the algorithm right after initialization is within a factor of the globally optimal total loss with complete user preference data, and this factor scales only logarithmically in the number of services. Our theory is complemented by experiments on real as well as semisynthetic datasets.

20231213 Fair Active Learning in Low-Data Regimes.
Summary: In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize marginal accuracy gains of small amounts of labeled data. However, existing applications of active learning for fairness fail to deliver on this, typically requiring large labeled datasets, or failing to ensure the desired fairness tolerance is met on the population distribution. To address such limitations, we introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine. We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability. We evaluate our proposed approach using well-established real-world benchmark datasets and compare it against state-of-the-art methods, demonstrating its effectiveness in producing fair models, and improvement over existing methods.

20231205 XOR Lemmas for Communication via Marginal Information.
Summary: We define the $\textit{marginal information}$ of a communication protocol, and use it to prove XOR lemmas for communication complexity. We show that if every $C$-bit protocol has bounded advantage for computing a Boolean function $f$, then every $\tilde \Omega(C \sqrt{n})$-bit protocol has advantage $\exp(-\Omega(n))$ for computing the $n$-fold XOR $f^{\oplus n}$. We prove exponentially small bounds in the average-case setting, and near-optimal bounds for product distributions and for bounded-round protocols.

20231129 Sparsifying generalized linear models.
Summary: We consider the sparsification of sums $F : \mathbb{R}^n \to \mathbb{R}$ where $F(x) = f_1(\langle a_1,x\rangle) + \cdots + f_m(\langle a_m,x\rangle)$ for vectors $a_1,\ldots,a_m \in \mathbb{R}^n$ and functions $f_1,\ldots,f_m : \mathbb{R} \to \mathbb{R}_+$. We show that $(1+\varepsilon)$-approximate sparsifiers of $F$ with support size $\frac{n}{\varepsilon^2} (\log \frac{n}{\varepsilon})^{O(1)}$ exist whenever the functions $f_1,\ldots,f_m$ are symmetric, monotone, and satisfy natural growth bounds. Additionally, we give efficient algorithms to compute such a sparsifier assuming each $f_i$ can be evaluated efficiently. Our results generalize the classic case of $\ell_p$ sparsification, where $f_i(z) = |z|^p$, for $p \in (0, 2]$, and give the first near-linear size sparsifiers in the well-studied setting of the Huber loss function and its generalizations, e.g., $f_i(z) = \min\{|z|^p, z^2\}$ for $0 < p \leq 2$. Our sparsification algorithm can be applied to give near-optimal reductions for optimizing a variety of generalized linear models including $\ell_p$ regression for $p \in (1, 2]$ to high accuracy, via solving $(\log n)^{O(1)}$ sparse regression instances with $m \le n(\log n)^{O(1)}$, plus runtime proportional to the number of nonzero entries in the vectors $a_1, \dots, a_m$.
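The objects being sparsified are concrete; the sketch below just evaluates such a sum $F$ for the Huber-type losses $f_i(z) = \min\{|z|^p, z^2\}$ named in the summary. This only illustrates the function class; the sparsification algorithm itself is the paper's contribution and is not reproduced here:

```python
import numpy as np

def glm_sum(A, x, p=1.0):
    # F(x) = sum_i f_i(<a_i, x>) with Huber-type f_i(z) = min(|z|^p, z^2);
    # the rows of A are the vectors a_1, ..., a_m. For p = 2 this reduces
    # to ordinary least squares; smaller p damps large residuals.
    z = np.abs(A @ x)
    return float(np.minimum(z ** p, z ** 2).sum())
```

A sparsifier replaces the $m$ rows with a reweighted subset while preserving $F(x)$ up to $1 \pm \varepsilon$ for all $x$ simultaneously.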

20231128 Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine.
Summary: Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.

20231122 Positional Description Matters for Transformers Arithmetic.
Summary: Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities, which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in a natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in direct (no scratchpad) 15-digit multiplication, essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4-digit multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii), extrapolation from training on 10 digits to testing on 12-digit numbers, while usual training would show no extrapolation; and for (iii), almost perfect accuracy up to 5 digits, while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).
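As one illustration of "modifying the representation of the arithmetic task": a common trick from this literature (shown here for concreteness; not necessarily the paper's exact scheme) is to write numbers least-significant digit first, so each output digit depends only on positions the model has already seen:

```python
def reversed_digit_example(a, b):
    # Format an addition training sample with operands and answer
    # written least-significant digit first, so the carry propagates
    # left-to-right in the token order the model generates.
    rev = lambda n: str(n)[::-1]
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"
```

For example, `123 + 45 = 168` becomes the string `"321+54=861"`, aligning each answer digit with the operand digits it actually depends on.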

20231117 A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions.
Summary: We design algorithms for minimizing $\max_{i\in[n]} f_i(x)$ over a $d$-dimensional Euclidean or simplex domain. When each $f_i$ is $1$-Lipschitz and $1$-smooth, our method computes an $\epsilon$-approximate solution using $\widetilde{O}(n \epsilon^{-1/3} + \epsilon^{-2})$ gradient and function evaluations, and $\widetilde{O}(n \epsilon^{-4/3})$ additional runtime. For large $n$, our evaluation complexity is optimal up to polylogarithmic factors. In the special case where each $f_i$ is linear (which corresponds to finding a near-optimal primal strategy in a matrix game) our method finds an $\epsilon$-approximate solution in runtime $\widetilde{O}(n (d/\epsilon)^{2/3} + nd + d\epsilon^{-2})$. For $n>d$ and $\epsilon=1/\sqrt{n}$ this improves over all existing first-order methods. When additionally $d = \omega(n^{8/11})$ our runtime also improves over all known interior point methods. Our algorithm combines three novel primitives: (1) A dynamic data structure which enables efficient stochastic gradient estimation in small $\ell_2$ or $\ell_1$ balls. (2) A mirror descent algorithm tailored to our data structure implementing an oracle which minimizes the objective over these balls. (3) A simple ball oracle acceleration framework suitable for non-Euclidean geometry.

20231116 Ghost Value Augmentation for $k$-Edge-Connectivity.
Summary: We give a poly-time algorithm for the $k$-edge-connected spanning subgraph ($k$-ECSS) problem that returns a solution of cost no greater than the cheapest $(k+10)$-ECSS on the same graph. Our approach enhances the iterative relaxation framework with a new ingredient, which we call ghost values, that allows for high sparsity in intermediate problems. Our guarantees improve upon the best-known approximation factor of $2$ for $k$-ECSS whenever the optimal value of $(k+10)$-ECSS is close to that of $k$-ECSS. This is a property that holds for the closely related problem $k$-edge-connected spanning multi-subgraph ($k$-ECSM), which is identical to $k$-ECSS except that edges can be selected multiple times at the same cost. As a consequence, we obtain a $\left(1+O\left(\frac{1}{k}\right)\right)$-approximation algorithm for $k$-ECSM, which resolves a conjecture of Pritchard and improves upon a recent $\left(1+O\left(\frac{1}{\sqrt{k}}\right)\right)$-approximation algorithm of Karlin, Klein, Oveis Gharan, and Zhang. Moreover, we present a matching lower bound for $k$-ECSM, showing that our approximation ratio is tight up to the constant factor in $O\left(\frac{1}{k}\right)$, unless $P=NP$.

20231115 From Trees to Polynomials and Back Again: New Capacity Bounds with Applications to TSP.
Summary: We give simply exponential lower bounds on the probabilities of a given strongly Rayleigh distribution, depending only on its expectation. This resolves a weak version of a problem left open by Karlin-Klein-Oveis Gharan in their recent breakthrough work on metric TSP, and this resolution leads to a minor improvement of their approximation factor for metric TSP. Our results also allow for a more streamlined analysis of the algorithm. To achieve these new bounds, we build upon the work of Gurvits-Leake on the use of the productization technique for bounding the capacity of a real stable polynomial. This technique allows one to reduce certain inequalities for real stable polynomials to products of affine linear forms, which have an underlying matrix structure. In this paper, we push this technique further by characterizing the worst-case polynomials via bipartitioned forests. This rigid combinatorial structure yields a clean induction argument, which implies our stronger bounds. In general, we believe the results of this paper will lead to further improvement and simplification of the analysis of various combinatorial and probabilistic bounds and algorithms.

20231113 How to Use Quantum Indistinguishability Obfuscation.
Summary: Quantum copy protection, introduced by Aaronson, enables giving out a quantum program description that cannot be meaningfully duplicated. Despite over a decade of study, copy protection is only known to be possible for a very limited class of programs. As our first contribution, we show how to achieve "best-possible" copy protection for all programs. We do this by introducing quantum state indistinguishability obfuscation (qsiO), a notion of obfuscation for quantum descriptions of classical programs. We show that applying qsiO to a program immediately achieves best-possible copy protection. Our second contribution is to show that, assuming injective one-way functions exist, qsiO is concrete copy protection for a large family of puncturable programs, significantly expanding the class of copy-protectable programs. A key tool in our proof is a new variant of unclonable encryption (UE) that we call coupled unclonable encryption (cUE). While constructing UE in the standard model remains an important open problem, we are able to build cUE from one-way functions. If we additionally assume the existence of UE, then we can further expand the class of puncturable programs for which qsiO is copy protection. Finally, we construct qsiO relative to an efficient quantum oracle.

20231113 Forall-exist statements in pseudo-polynomial time.
Summary: Given a convex set $Q \subseteq \mathbb{R}^m$ and an integer matrix $W \in \mathbb{Z}^{m \times n}$, we consider statements of the form $\forall b \in Q \cap \mathbb{Z}^m$ $\exists x \in \mathbb{Z}^n$ s.t. $Wx \leq b$. Such statements can be verified in polynomial time with the algorithm of Kannan and its improvements if $n$ is fixed and $Q$ is a polyhedron. The running time of the best-known algorithms is doubly exponential in~$n$. In this paper, we provide a pseudo-polynomial-time algorithm if $m$ is fixed. Its running time is $(m \Delta)^{O(m^2)}$, where $\Delta = \|W\|_\infty$. Furthermore, it applies to general convex sets $Q$.

20231107 Computing Approximate $\ell_p$ Sensitivities.
Summary: Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating $\ell_p$ sensitivities, which we show is equivalent to approximate $\ell_p$ regression, are known for only the $\ell_2$ setting, in which they are termed leverage scores. In this work, we provide efficient algorithms for approximating $\ell_p$ sensitivities and related summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute an $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i.e., the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant factor approximation to the total sensitivity at the cost of roughly $O(\sqrt{d})$ sensitivity computations. Furthermore, we estimate the maximum $\ell_1$ sensitivity, up to a $\sqrt{d}$ factor, using $O(d)$ sensitivity computations. We generalize all these results to $\ell_p$ norms for $p > 1$. Lastly, we experimentally show that for a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have low intrinsic effective dimensionality.
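In the $\ell_2$ case the sensitivities admit the familiar closed form noted in the summary (leverage scores). A direct, non-fast computation makes the definition concrete:

```python
import numpy as np

def leverage_scores(A):
    # l2 sensitivities = leverage scores: tau_i = a_i^T (A^T A)^+ a_i,
    # where a_i is the i-th row of A. Each tau_i lies in [0, 1] and
    # measures how much row i can dominate ||Ax||_2^2 for some x.
    G = np.linalg.pinv(A.T @ A)
    return np.einsum('ij,jk,ik->i', A, G, A)
```

A useful identity for checking implementations: the total $\ell_2$ sensitivity (the sum of leverage scores) equals the rank of $A$, mirroring the "total sensitivity" quantity the paper estimates for general $p$.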

20231107 User-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal Rates.
Summary: We study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires super-polynomial runtime [Ghazi et al. (2023)] or requires the number of users to grow polynomially with the dimensionality of the problem with additional strict assumptions [Bassily et al. (2023)]. We develop new algorithms for user-level DP-SCO that obtain optimal rates for both convex and strongly convex functions in polynomial time and require the number of users to grow only logarithmically in the dimension. Moreover, our algorithms are the first to obtain optimal rates for non-smooth functions in polynomial time. These algorithms are based on multiple-pass DP-SGD, combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.

20231103 A Lower Bound for the Max Entropy Algorithm for TSP.
Summary: One of the most famous conjectures in combinatorial optimization is the four-thirds conjecture, which states that the integrality gap of the subtour LP relaxation of the TSP is equal to $\frac43$. For 40 years, the best known upper bound was 1.5, due to Wolsey (1980). Recently, Karlin, Klein, and Oveis Gharan (2022) showed that the max entropy algorithm for the TSP gives an improved bound of $1.5 - 10^{-36}$. In this paper, we show that the approximation ratio of the max entropy algorithm is at least 1.375, even for graphic TSP. Thus the max entropy algorithm does not appear to be the algorithm that will ultimately resolve the four-thirds conjecture in the affirmative, should that be possible.

20231027 Minimax Optimal Submodular Optimization with Bandit Feedback.
Summary: We consider maximizing a monotonic, submodular set function $f: 2^{[n]} \rightarrow [0,1]$ under stochastic bandit feedback. Specifically, $f$ is unknown to the learner, but at each time $t=1,\dots,T$ the learner chooses a set $S_t \subset [n]$ with $|S_t| \leq k$ and receives reward $f(S_t) + \eta_t$ where $\eta_t$ is mean-zero sub-Gaussian noise. The objective is to minimize the learner's regret over $T$ times with respect to a $(1-e^{-1})$-approximation of the maximum $f(S_*)$ with $|S_*| = k$, obtained through greedy maximization of $f$. To date, the best regret bound in the literature scales as $k n^{1/3} T^{2/3}$, and by trivially treating every set as a unique arm one deduces that $\sqrt{ {n \choose k} T }$ is also achievable. In this work, we establish the first minimax lower bound for this setting, which scales like $\Omega(\min_{i \le k}(i n^{1/3} T^{2/3} + \sqrt{n^{k-i} T}))$. Moreover, we propose an algorithm that is capable of matching the lower bound regret.
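The benchmark the regret is measured against is the greedy solution; on a coverage function (a canonical monotone submodular example) greedy is a minimal sketch. The bandit algorithm in the paper must instead estimate these marginal gains from noisy rewards:

```python
def greedy_max_cover(sets, k):
    # Greedy maximization of the monotone submodular coverage function
    # f(S) = |union of chosen sets|: repeatedly add the set with the
    # largest marginal gain. This achieves the (1 - 1/e) benchmark S_*.
    chosen, covered = [], set()
    for _ in range(k):
        best = max(range(len(sets)),
                   key=lambda i: len(sets[i] - covered) if i not in chosen else -1)
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

With full access to $f$ this loop costs $nk$ evaluations; the difficulty in the bandit setting is that each evaluation is replaced by a noisy sample, which is what drives the $T^{2/3}$-type terms in the bounds above.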

20231027 Structured Semidefinite Programming for Recovering Structured Preconditioners.
Summary: We develop a general framework for finding approximately optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems, including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star,\epsilon^{-1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $\Omega(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $\Omega(d^{\omega})$ where $\omega > 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery.
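For orientation, the simplest instance of diagonal preconditioning is Jacobi scaling $\mathbf{D}^{-1/2}\mathbf{K}\mathbf{D}^{-1/2}$ with $\mathbf{D} = \mathrm{diag}(\mathbf{K})$; the paper's algorithm instead searches over all diagonal rescalings for the one attaining the optimal condition number $\kappa^\star$:

```python
import numpy as np

def jacobi_rescale(K):
    # D^{-1/2} K D^{-1/2} with D = diag(K): produces a unit diagonal
    # and often a much smaller condition number than K itself, though
    # not in general the optimal diagonal rescaling the paper computes.
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```

Comparing the condition number before and after such a rescaling (e.g. via `np.linalg.cond`) is a quick way to see the gap that an optimal diagonal preconditioner closes.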

20231025 Detecting Pretraining Data from Large Language Models.
Summary: Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios, copyrighted book detection, contaminated downstream example detection and privacy auditing of machine unlearning, and find it a consistently effective solution.
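The detection score itself is a one-liner once per-token log-probabilities are in hand (obtaining those requires querying the model; the decision threshold is calibrated on the benchmark):

```python
import numpy as np

def min_k_prob(token_logprobs, k=0.2):
    # Min-K% Prob: average log-probability of the k% lowest-probability
    # tokens in the text. Seen text tends to lack extreme low-probability
    # outliers, so higher (less negative) scores suggest membership.
    lp = np.sort(np.asarray(token_logprobs, dtype=float))
    m = max(1, int(len(lp) * k))
    return float(lp[:m].mean())
```

Because the score depends only on the model's own token probabilities, no reference model or knowledge of the pretraining corpus is needed, which is the method's main practical appeal.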

20231025 Fast Algorithms for Separable Linear Programs.
Summary: In numerical linear algebra, considerable effort has been devoted to obtaining faster algorithms for linear systems whose underlying matrices exhibit structural properties. A prominent success story is the method of generalized nested dissection~[Lipton-Rose-Tarjan'79] for separable matrices. On the other hand, the majority of recent developments in the design of efficient linear program (LP) solvers do not leverage the ideas underlying these faster linear system solvers nor consider the separable structure of the constraint matrix. We give a faster algorithm for separable linear programs. Specifically, we consider LPs of the form $\min_{\mathbf{A}\mathbf{x}=\mathbf{b}, \mathbf{l}\leq\mathbf{x}\leq\mathbf{u}} \mathbf{c}^\top\mathbf{x}$, where the graphical support of the constraint matrix $\mathbf{A} \in \mathbb{R}^{n\times m}$ is $O(n^\alpha)$-separable. These include flow problems on planar graphs and low-treewidth matrices, among others. We present an $\tilde{O}((m+m^{1/2 + 2\alpha}) \log(1/\epsilon))$ time algorithm for these LPs, where $\epsilon$ is the relative accuracy of the solution. Our new solver has two important implications: for the $k$-multicommodity flow problem on planar graphs, we obtain an algorithm running in $\tilde{O}(k^{5/2} m^{3/2})$ time in the high accuracy regime; and when the support of $\mathbf{A}$ is $O(n^\alpha)$-separable with $\alpha \leq 1/4$, our algorithm runs in $\tilde{O}(m)$ time, which is nearly optimal. The latter significantly improves upon the natural approach of combining interior point methods and nested dissection, whose time complexity is lower bounded by $\Omega(\sqrt{m}(m+m^{\alpha\omega}))=\Omega(m^{3/2})$, where $\omega$ is the matrix multiplication constant. Lastly, in the setting of low-treewidth LPs, we recover the results of [DLY,STOC21] and [GS,22] with significantly simpler data structure machinery.

20231019 Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights.
Summary: In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layerwise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.

20231018 Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention.
Summary: Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of the softmax function. Conversely, linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation when compared to the traditional softmax attention mechanism. In this paper, we bridge the gap in our theoretical understanding of the reasons behind the practical performance gap between softmax and linear attention. By conducting a comprehensive comparative analysis of these two attention mechanisms, we shed light on the underlying reasons for why softmax attention outperforms linear attention in most scenarios.

20231015 FiLM: Fill-in Language Models for Any-Order Generation.
Summary: Language models have become the backbone of today's AI systems. However, their predominant left-to-right generation limits the use of bidirectional context, which is essential for tasks that involve filling text in the middle. We propose the Fill-in Language Model (FiLM), a new language modeling approach that allows for flexible generation at any position without adhering to a specific generation order. Its training extends the masked language modeling objective by adopting varying mask probabilities sampled from the Beta distribution to enhance the generative capabilities of FiLM. During inference, FiLM can seamlessly insert missing phrases, sentences, or paragraphs, ensuring that the outputs are fluent and coherent with the surrounding context. In both automatic and human evaluations, FiLM outperforms existing infilling methods that rely on left-to-right language models trained on rearranged text segments. FiLM is easy to implement and can be either trained from scratch or fine-tuned from a left-to-right language model. Notably, as the model size grows, FiLM's perplexity approaches that of strong left-to-right language models of similar sizes, indicating FiLM's scalability and potential as a large language model.
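The training-time corruption step described above, masking a variable fraction of tokens with the mask rate drawn from a Beta distribution, can be sketched as follows. The Beta parameters, the mask-token id, and the function name are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0  # hypothetical mask-token id

def film_style_mask(token_ids, a=2.0, b=2.0):
    """Corrupt a sequence for FiLM-style training: draw a fresh mask
    probability p ~ Beta(a, b) for the sequence, then hide each token
    independently with probability p. The model is trained to predict
    the tokens at masked positions."""
    token_ids = np.asarray(token_ids)
    p = rng.beta(a, b)                       # per-sequence mask rate
    mask = rng.random(len(token_ids)) < p    # which positions to hide
    corrupted = np.where(mask, MASK_ID, token_ids)
    return corrupted, mask
```

Varying the mask rate across sequences (rather than fixing it, as in standard masked language modeling) is what exposes the model to everything from light infilling to near-total generation.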

20231012 Stronger Coreset Bounds for Kernel Density Estimators via Chaining.
Summary: We apply the discrepancy method and a chaining approach to give improved bounds on the coreset complexity of a wide class of kernel functions. Our results give randomized polynomial time algorithms to produce coresets of size $O\big(\frac{\sqrt{d}}{\varepsilon}\sqrt{\log\log \frac{1}{\varepsilon}}\big)$ for the Gaussian and Laplacian kernels in the case that the data set is uniformly bounded, an improvement that was not possible with previous techniques. We also obtain coresets of size $O\big(\frac{1}{\varepsilon}\sqrt{\log\log \frac{1}{\varepsilon}}\big)$ for the Laplacian kernel for $d$ constant. Finally, we give the best known bounds of $O\big(\frac{\sqrt{d}}{\varepsilon}\sqrt{\log(2\max\{1,\alpha\})}\big)$ on the coreset complexity of the exponential, Hellinger, and JS Kernels, where $1/\alpha$ is the bandwidth parameter of the kernel.

20231012 On the Rational Degree of Boolean Functions and Applications.
Summary: We study a natural complexity measure of Boolean functions known as the (exact) rational degree. For total functions $f$, it is conjectured that $\mathrm{rdeg}(f)$ is polynomially related to $\mathrm{deg}(f)$, where $\mathrm{deg}(f)$ is the Fourier degree. Towards this conjecture, we show that symmetric functions have rational degree at least $\mathrm{deg}(f)/2$ and monotone functions have rational degree at least $\sqrt{\mathrm{deg}(f)}$. We observe that both of these lower bounds are tight. In addition, we show that all read-once depth-$d$ Boolean formulae have rational degree at least $\Omega(\mathrm{deg}(f)^{1/d})$. Furthermore, we show that almost every Boolean function on $n$ variables has rational degree at least $n/2 - O(\sqrt{n})$. In contrast to total functions, we exhibit partial functions that witness unbounded separations between rational and approximate degree, in both directions. As a consequence, we show that for quantum computers, postselection and bounded-error are incomparable resources in the black-box model.

20231003 Learning quantum Hamiltonians at any temperature in polynomial time.
Summary: We study the problem of learning a local quantum Hamiltonian $H$ given copies of its Gibbs state $\rho = e^{\beta H}/\textrm{tr}(e^{\beta H})$ at a known inverse temperature $\beta>0$. Anshu, Arunachalam, Kuwahara, and Soleimanifar (arXiv:2004.07266) gave an algorithm to learn a Hamiltonian on $n$ qubits to precision $\epsilon$ with only polynomially many copies of the Gibbs state, but which takes exponential time. Obtaining a computationally efficient algorithm has been a major open problem [Alhambra'22 (arXiv:2204.08349)], [Anshu, Arunachalam'22 (arXiv:2204.08349)], with prior work only resolving this in the limited cases of high temperature [Haah, Kothari, Tang'21 (arXiv:2108.04842)] or commuting terms [Anshu, Arunachalam, Kuwahara, Soleimanifar'21]. We fully resolve this problem, giving a polynomial time algorithm for learning $H$ to precision $\epsilon$ from polynomially many copies of the Gibbs state at any constant $\beta > 0$. Our main technical contribution is a new flat polynomial approximation to the exponential function, and a translation between multivariate scalar polynomials and nested commutators. This enables us to formulate Hamiltonian learning as a polynomial system. We then show that solving a lowdegree sumofsquares relaxation of this polynomial system suffices to accurately learn the Hamiltonian.

20230911 Textbooks Are All You Need II: phi-1.5 technical report.
Summary: We continue the investigation into the power of smaller Transformer-based language models as initiated by \textbf{TinyStories}, a 10 million parameter model that can produce coherent English, and the follow-up work on \textbf{phi-1}, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate ``textbook quality'' data as a way to enhance the learning process compared to traditional web data. We follow the ``Textbooks Are All You Need'' approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named \textbf{phi-1.5}, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, \textbf{phi-1.5} exhibits many of the traits of much larger LLMs, both good, such as the ability to ``think step by step'' or perform some rudimentary in-context learning, and bad, including hallucinations and the potential for toxic and biased generations; encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source \textbf{phi-1.5} to promote further research on these urgent topics.

20230828 Faster Min-Cost Flow and Approximate Tree Decomposition on Bounded Treewidth Graphs.
Summary: We present an algorithm for min-cost flow in graphs with $n$ vertices and $m$ edges, given a tree decomposition of width $\tau$ and size $S$, and polynomially bounded, integral edge capacities and costs, running in $\widetilde{O}(m\sqrt{\tau} + S)$ time. This improves upon the previous fastest algorithm in this setting, achieved by the bounded-treewidth linear program solver of [Dong-Lee-Ye'21] and [Gu-Song'22], which runs in $\widetilde{O}(m \tau^{(\omega+1)/2})$ time, where $\omega \approx 2.37$ is the matrix multiplication exponent. Our approach leverages recent advances in structured linear program solvers and robust interior point methods (IPM). For general graphs where treewidth is trivially bounded by $n$, the algorithm runs in $\widetilde{O}(m \sqrt n)$ time, which is the best-known result without using the Lee-Sidford barrier or $\ell_1$ IPM, demonstrating the surprising power of robust interior point methods. As a corollary, we obtain an $\widetilde{O}(\operatorname{tw}^3 \cdot m)$ time algorithm to compute a tree decomposition of width $O(\operatorname{tw}\cdot \log(n))$, given a graph with $m$ edges.

20230821 Clustered Linear Contextual Bandits with Knapsacks.
Summary: In this work, we study clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models. The arms are divided into clusters, with the cluster memberships being unknown to the algorithm. Pulling an arm in a time period yields a reward and a consumption of each of multiple resources, and the algorithm terminates once the total consumption of any resource exceeds its constraint. Thus, maximizing the total reward requires learning not only models for the reward and the resource consumption, but also the cluster memberships. We provide an algorithm that achieves regret sublinear in the number of time periods, without requiring access to all of the arms. In particular, we show that it suffices to perform clustering only once on a randomly selected subset of the arms. To achieve this result, we provide a sophisticated combination of techniques from the econometrics literature and from the literature on bandits with constraints.

20230818 Do you know what q-means?.
Summary: Clustering is one of the most important tools for analysis of large datasets, and perhaps the most popular clustering algorithm is Lloyd's iteration for $k$-means. This iteration takes $N$ vectors $v_1,\dots,v_N\in\mathbb{R}^d$ and outputs $k$ centroids $c_1,\dots,c_k\in\mathbb{R}^d$; these partition the vectors into clusters based on which centroid is closest to a particular vector. We present an overall improved version of the "$q$-means" algorithm, the quantum algorithm originally proposed by Kerenidis, Landman, Luongo, and Prakash (2019) which performs $\varepsilon$-$k$-means, an approximate version of $k$-means clustering. This algorithm does not rely on the quantum linear algebra primitives of prior work, instead only using its QRAM to prepare and measure simple states based on the current iteration's clusters. The time complexity is $O\big(\frac{k^{2}}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ and maintains the polylogarithmic dependence on $N$ while improving the dependence on most of the other parameters. We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^{2}}{\varepsilon^2}(kd + \log(Nd))\big)$ time. Notably, this classical algorithm matches the polylogarithmic dependence on $N$ attained by the quantum algorithms.
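For reference, one step of the classical Lloyd iteration that both the quantum and dequantized algorithms approximate can be written in a few lines of NumPy; this is a plain sketch with none of the QRAM or sampling machinery:

```python
import numpy as np

def lloyd_step(vectors, centroids):
    """One Lloyd iteration: assign each vector to its nearest centroid,
    then move each centroid to the mean of its assigned cluster."""
    # squared distances from every vector to every centroid, shape (N, k)
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    new_centroids = np.vstack([
        vectors[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return new_centroids, labels
```

Iterating `lloyd_step` until the centroids stop moving gives the standard $k$-means algorithm; the $\varepsilon$-$k$-means relaxation tolerates small errors in each assignment and centroid update.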

20230816 Convergence of Two-Layer Regression with Nonlinear Units.
Summary: Large language models (LLMs), such as ChatGPT and GPT-4, have shown outstanding performance in many tasks of everyday life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we calculate a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuity and the positive semidefiniteness of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of distance to the optimal solution. Finally, we relax the Lipschitz condition and prove convergence in the sense of loss value.

20230811 A Better-Than-1.6-Approximation for Prize-Collecting TSP.
Summary: Prize-Collecting TSP is a variant of the traveling salesperson problem where one may drop vertices from the tour at the cost of vertex-dependent penalties. The quality of a solution is then measured by adding the length of the tour and the sum of all penalties of vertices that are not visited. We present a polynomial-time approximation algorithm with an approximation guarantee slightly below $1.6$, where the guarantee is with respect to the natural linear programming relaxation of the problem. This improves upon the previous best-known approximation ratio of $1.774$. Our approach is based on a known decomposition for solutions of this linear relaxation into rooted trees. Our algorithm takes a tree from this decomposition and then performs a pruning step before doing parity correction on the remainder. Using a simple analysis, we bound the approximation guarantee of the proposed algorithm by $(1+\sqrt{5})/2 \approx 1.618$, the golden ratio. With some additional technical care we further improve it to $1.599$.

20230802 Optimal Online Discrepancy Minimization.
Summary: We prove that there exists an online algorithm that for any sequence of vectors $v_1,\ldots,v_T \in \mathbb{R}^n$ with $\|v_i\|_2 \leq 1$, arriving one at a time, decides random signs $x_1,\ldots,x_T \in \{-1,1\}$ so that for every $t \le T$, the prefix sum $\sum_{i=1}^t x_iv_i$ is $10$-subgaussian. This improves over the work of Alweiss, Liu and Sawhney, who kept prefix sums $O(\sqrt{\log (nT)})$-subgaussian, and gives an $O(\sqrt{\log T})$ bound on the discrepancy $\max_{t \le T} \|\sum_{i=1}^t x_i v_i\|_\infty$. Our proof combines a generalization of Banaszczyk's prefix balancing result to trees with a cloning argument to find distributions rather than single colorings. We also show a matching $\Omega(\sqrt{\log T})$ strategy for an oblivious adversary.
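As a point of contrast, the naive online balancer just picks each sign greedily to shrink the running prefix sum. This is NOT the paper's subgaussian-maintaining algorithm (which is randomized and far subtler), only a deterministic baseline sketch of the problem setup:

```python
import numpy as np

def greedy_signs(vectors):
    """Online vector balancing, greedy baseline: on each arriving vector v,
    choose the sign that does not increase the inner product of v with the
    current prefix sum. Keeps the l2 norm of the prefix sum bounded, but
    gives no subgaussian guarantee."""
    s = np.zeros(vectors.shape[1])
    signs = []
    for v in vectors:
        x = -1 if s @ v > 0 else 1   # oppose the current prefix sum
        signs.append(x)
        s = s + x * v
    return signs, s
```

The paper's point is that a randomized strategy can keep every prefix sum $10$-subgaussian, a much stronger distributional guarantee than any deterministic greedy rule provides.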

20230730 Polytopes with Bounded Integral Slack Matrices Have Sub-Exponential Extension Complexity.
Summary: We show that any bounded integral function $f : A \times B \to \{0,1, \dots, \Delta\}$ with rank $r$ has deterministic communication complexity $\Delta^{O(\Delta)} \cdot \sqrt{r} \cdot \log r$, where the rank of $f$ is defined to be the rank of the $A \times B$ matrix whose entries are the function values. As a corollary, we show that any $n$dimensional polytope that admits a slack matrix with entries from $\{0,1,\dots,\Delta\}$ has extension complexity at most $\exp(\Delta^{O(\Delta)} \cdot \sqrt{n} \cdot \log n)$.

20230717 Zeroth Order Algorithm for Softmax Attention Optimization.
Summary: Large language models (LLMs) have brought about significant transformations in human society. Among the crucial computations in LLMs, the softmax unit holds great importance. It helps the model generate a probability distribution over potential subsequent words or phrases, given a series of input words. By utilizing this distribution, the model selects the most probable next word or phrase, based on the assigned probabilities. The softmax unit plays a vital role in LLM training, as it facilitates learning from data through the adjustment of neural network weights and biases. As LLMs grow in size, computing the gradient becomes expensive. However, zeroth-order methods can approximately compute the gradient with only forward passes. In this paper, we present a zeroth-order algorithm specifically tailored for softmax optimization. We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale LLMs. By leveraging the zeroth-order method, our work contributes to the advancement of optimization techniques in the context of complex language models.
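The core primitive, estimating a gradient from forward passes alone, can be sketched with the standard two-point estimator; this generic estimator is an illustration of the zeroth-order idea, not the paper's exact scheme:

```python
import numpy as np

def zeroth_order_grad(f, x, mu=1e-4, n_samples=20, rng=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Only forward evaluations of f are used (no backpropagation):
    average (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over random
    Gaussian directions u, which is an unbiased estimate of the
    gradient of a smoothed version of f."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples
```

Each estimate costs two forward passes per sample; the variance decreases as the number of sampled directions grows.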

20230707 Scalable Membership Inference Attacks via Quantile Regression.
Summary: Membership inference attacks are designed to determine, using black box access to trained models, whether a particular example was used in training or not. Membership inference can be formalized as a hypothesis testing problem. The most effective existing attacks estimate the distribution of some test statistic (usually the model's confidence on the true label) on points that were (and were not) used in training by training many \emph{shadow models}, i.e., models of the same architecture as the model being attacked, trained on a random subsample of data. While effective, these attacks are extremely computationally expensive, especially when the model under attack is large. We introduce a new class of attacks based on performing quantile regression on the distribution of confidence scores induced by the model under attack on points that are not used in training. We show that our method is competitive with state-of-the-art shadow model attacks, while requiring substantially less compute because our attack requires training only a single model. Moreover, unlike shadow model attacks, our proposed attack does not require any knowledge of the architecture of the model under attack and is therefore truly ``black-box''. We show the efficacy of this approach in an extensive series of experiments on various datasets and model architectures.
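A stripped-down version of the quantile idea, without the per-example regression, simply thresholds confidence scores at a quantile of scores on points known not to be in training; this simplification is ours, for illustration only (the full method regresses the quantile on per-example features):

```python
import numpy as np

def quantile_threshold_attack(nonmember_scores, target_scores, alpha=0.05):
    """Flag an example as a training member if its confidence exceeds the
    (1 - alpha) quantile of confidences on known non-members, giving a
    false-positive rate of roughly alpha. Covariate-free sketch of the
    quantile-based membership test."""
    tau = np.quantile(nonmember_scores, 1 - alpha)
    return target_scores > tau  # True = predicted member
```

Because the calibration set contains only non-members, no shadow model, and no knowledge of the target architecture, is needed.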

20230620 Textbooks Are All You Need.
Summary: We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of ``textbook quality'' data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our fine-tuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

20230613 Efficient Algorithm for Solving Hyperbolic Programs.
Summary: Hyperbolic polynomials are a class of real-rooted polynomials with a wide range of applications in theoretical computer science. Each hyperbolic polynomial also induces a hyperbolic cone that is of particular interest in optimization due to its generality: by choosing the polynomial properly, one can easily recover classic optimization problems such as linear programming and semidefinite programming. In this work, we develop efficient algorithms for hyperbolic programming, the problem in which one wants to minimize a linear objective under a system of linear constraints, with the solution required to lie in the hyperbolic cone induced by the hyperbolic polynomial. Our algorithm is an instance of the interior point method (IPM) that, instead of following the central path, follows the central swath, which is a generalization of the central path. To implement the IPM efficiently, we utilize a relaxation of the hyperbolic program to a quadratic program, coupled with the first four moments of the hyperbolic eigenvalues, which are crucial to updating the optimization direction. We further show that, given an evaluation oracle for the polynomial, our algorithm requires only $O(n^2d^{2.5})$ oracle calls, where $n$ is the number of variables and $d$ is the degree of the polynomial, plus an extra $O((n+m)^3 d^{0.5})$ arithmetic operations, where $m$ is the number of constraints.

20230602 MathChat: Converse to Tackle Challenging Math Problems with LLM Agents.
Summary: Employing Large Language Models (LLMs) to address mathematical problems is an intriguing research endeavor, considering the abundance of math problems expressed in natural language across numerous science and engineering fields. LLMs, with their generalized ability, are used as a foundation model to build AI agents for different tasks. In this paper, we study the effectiveness of utilizing LLM agents to solve math problems through conversations. We propose MathChat, a conversational problem-solving framework designed for math problems. MathChat consists of an LLM agent and a user proxy agent, which is responsible for tool execution and additional guidance. This synergy facilitates a collaborative problem-solving process, where the agents engage in a dialogue to solve the problems. We perform evaluation on difficult high school competition problems from the MATH dataset. Utilizing Python, we show that MathChat can further improve previous tool-using prompting methods by 6%.

20230601 Faster Robust Tensor Power Method for Arbitrary Order.
Summary: Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data. The \emph{tensor power method} (TPM) is one of the widely-used techniques for decomposing tensors. This paper presents a novel tensor power method for decomposing tensors of arbitrary order, which overcomes limitations of existing approaches that are often restricted to lower-order (less than $3$) tensors or require strong assumptions about the underlying data structure. We apply a sketching method and achieve a running time of $\widetilde{O}(n^{p-1})$ for an order-$p$, dimension-$n$ tensor. We provide a detailed analysis for tensors of any order $p$, which was not given in previous works.
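The basic (unaccelerated) tensor power iteration for a symmetric order-3 tensor looks as follows; the sketching speedup that yields the improved running time is not shown here:

```python
import numpy as np

def tensor_power_iteration(T, iters=50, rng=None):
    """Power iteration for a symmetric order-3 tensor T: repeatedly
    contract T with the current vector in two modes and renormalize,
    recovering the dominant rank-1 component (eigenvalue, eigenvector)."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(iters):
        u = np.einsum('ijk,j,k->i', T, u, u)  # T(I, u, u)
        u /= np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # T(u, u, u)
    return lam, u
```

Deflating (subtracting $\lambda\, u^{\otimes 3}$) and repeating recovers further components; each contraction costs $O(n^3)$ for order 3, which is the cost the paper's sketching technique reduces.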

20230531 Doubly Constrained Fair Clustering.
Summary: The remarkable attention which fair clustering has received in the last few years has resulted in a significant number of different notions of fairness. Despite the fact that these notions are well-justified, they are often motivated and studied in a disjoint manner, where one fairness desideratum is considered exclusively in isolation from the others. This leaves the understanding of the relations between different fairness notions as an important open problem in fair clustering. In this paper, we take the first step in this direction. Specifically, we consider the two most prominent demographic representation fairness notions in clustering: (1) Group Fairness (GF), where the different demographic groups are supposed to have close to population-level representation in each cluster, and (2) Diversity in Center Selection (DS), where the selected centers are supposed to have close to population-level representation of each group. We show that given a constant approximation algorithm for one constraint (GF or DS only), we can obtain a constant approximation solution that satisfies both constraints simultaneously. Interestingly, we prove that any given solution that satisfies the GF constraint can always be post-processed, at a bounded degradation to the clustering cost, to additionally satisfy the DS constraint, while the reverse is not true. Furthermore, we show that both GF and DS are incompatible (having an empty feasibility set in the worst case) with a collection of other distance-based fairness notions. Finally, we carry out experiments to validate our theoretical findings.

20230525 Learning across Data Owners with Joint Differential Privacy.
Summary: In this paper, we study the setting in which data owners train machine learning models collaboratively under a privacy notion called joint differential privacy [Kearns et al., 2018]. In this setting, the model trained for each data owner $j$ uses $j$'s data without privacy consideration and other owners' data with differential privacy guarantees. This setting was initiated in [Jain et al., 2021] with a focus on linear regression. In this paper, we study this setting for stochastic convex optimization (SCO). We present an algorithm that is a variant of DP-SGD [Song et al., 2013; Abadi et al., 2016], and we provide theoretical bounds on its population loss. We compare our algorithm to several baselines and discuss for which parameter setups our algorithm is preferable. We also empirically study joint differential privacy in the multi-class classification problem over two public datasets. Our empirical findings are well-connected to the insights from our theoretical results.
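For context, one step of the underlying DP-SGD primitive (per-example gradient clipping followed by noisy averaging) can be sketched as below. This is generic DP-SGD in the style of Song et al. and Abadi et al., not the paper's joint-DP variant, and the parameter defaults are illustrative:

```python
import numpy as np

def dpsgd_step(params, per_example_grads, lr=0.1, clip=1.0,
               noise_mult=1.0, rng=None):
    """One DP-SGD update: clip each per-example gradient to l2 norm
    `clip`, sum, add Gaussian noise scaled by noise_mult * clip,
    average, and take a gradient step."""
    rng = rng or np.random.default_rng(0)
    n = len(per_example_grads)
    clipped = [g * min(1.0, clip / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noisy_mean = (np.sum(clipped, axis=0)
                  + noise_mult * clip * rng.standard_normal(params.shape)) / n
    return params - lr * noisy_mean
```

Clipping bounds each example's influence on the update, which is what lets the Gaussian noise translate into a formal differential privacy guarantee.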

20230515 Sparsifying sums of norms.
Summary: For any norms $N_1,\ldots,N_m$ on $\mathbb{R}^n$ and $N(x) := N_1(x)+\cdots+N_m(x)$, we show there is a sparsified norm $\tilde{N}(x) = w_1 N_1(x) + \cdots + w_m N_m(x)$ such that $|N(x) - \tilde{N}(x)| \leq \epsilon N(x)$ for all $x \in \mathbb{R}^n$, where $w_1,\ldots,w_m$ are nonnegative weights, of which only $O(\epsilon^{-2} n \log(n/\epsilon) (\log n)^{2.5} )$ are nonzero. Additionally, if $N$ is $\mathrm{poly}(n)$-equivalent to the Euclidean norm on $\mathbb{R}^n$, then such weights can be found with high probability in time $O(m (\log n)^{O(1)} + \mathrm{poly}(n)) \cdot T$, where $T$ is the time required to evaluate a norm $N_i$. This immediately yields analogous statements for sparsifying sums of symmetric submodular functions. More generally, we show how to sparsify sums of $p$th powers of norms when the sum is $p$-uniformly smooth.

20230515 Linear-Sized Sparsifiers via Near-Linear Time Discrepancy Theory.
Summary: Discrepancy theory provides powerful tools for producing higher-quality objects which "beat the union bound" in fundamental settings throughout combinatorics and computer science. However, this quality has often come at the price of more expensive algorithms. We introduce a new framework for bridging this gap, by allowing for the efficient implementation of discrepancy-theoretic primitives. Our framework repeatedly solves regularized optimization problems to low accuracy to approximate the partial coloring method of [Rot17], and simplifies and generalizes recent work of [JSS23] on fast algorithms for Spencer's theorem. In particular, our framework only requires that the discrepancy body of interest has exponentially large Gaussian measure and is expressible as a sublevel set of a symmetric, convex function. We combine this framework with new tools for proving Gaussian measure lower bounds to give improved algorithms for a variety of sparsification and coloring problems. As a first application, we use our framework to obtain an $\widetilde{O}(m \cdot \epsilon^{-3.5})$ time algorithm for constructing an $\epsilon$-approximate spectral sparsifier of an $m$-edge graph, matching the sparsity of [BSS14] up to constant factors and improving upon the $\widetilde{O}(m \cdot \epsilon^{-6.5})$ runtime of [LeeS17]. We further give a state-of-the-art algorithm for constructing graph ultrasparsifiers and an almost-linear time algorithm for constructing linear-sized degree-preserving sparsifiers via discrepancy theory; in the latter case, such sparsifiers were not known to exist previously. We generalize these results to their analogs in sparsifying isotropic sums of positive semidefinite matrices. Finally, to demonstrate the versatility of our technique, we obtain a nearly-input-sparsity time constructive algorithm for Spencer's theorem (where we recover a recent result of [JSS23]).

20230505 On Optimization and Counting of Non-Broken Bases of Matroids.
Summary: Given a matroid $M=(E,{\cal I})$, and a total ordering over the elements $E$, a broken circuit is a circuit where the smallest element is removed, and an NBC independent set is an independent set in ${\cal I}$ with no broken circuit. The set of NBC independent sets of any matroid $M$ defines a simplicial complex called the broken circuit complex, which has been the subject of intense study in combinatorics. Recently, Adiprasito, Huh and Katz showed that the face numbers of any broken circuit complex form a log-concave sequence, proving a long-standing conjecture of Rota. We study counting and optimization problems on NBC bases of a generic matroid. We find several fundamental differences with the independent set complex: for example, we show that it is NP-hard to find the max-weight NBC base of a matroid, and that the convex hull of NBC bases of a matroid has edges of arbitrarily large length. We also give evidence that the natural down-up walk on the space of NBC bases of a matroid may not mix rapidly, by showing that for some family of matroids it is NP-hard to count the number of NBC bases after certain conditionings.

20230504 Automatic Prompt Optimization with "Gradient Descent" and Beam Search.
Summary: Large Language Models (LLMs) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand-written with onerous trial-and-error effort. We propose a simple and non-parametric solution to this problem, Automatic Prompt Optimization (APO), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that Automatic Prompt Optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.

20230504 A Hyperbolic Extension of Kadison-Singer Type Results.
Summary: In 2013, Marcus, Spielman, and Srivastava resolved the famous Kadison-Singer conjecture. It states that for $n$ independent random vectors $v_1,\cdots, v_n$ that have expected squared norm bounded by $\epsilon$ and are in the isotropic position in expectation, there is a positive probability that the determinant polynomial $\det(xI - \sum_{i=1}^n v_iv_i^\top)$ has roots bounded by $(1 + \sqrt{\epsilon})^2$. An interpretation of the Kadison-Singer theorem is that we can always find a partition of the vectors $v_1,\cdots,v_n$ into two sets with a low discrepancy in terms of the spectral norm (in other words, relying on the determinant polynomial). In this paper, we provide two results for a broader class of polynomials, the hyperbolic polynomials. Furthermore, our results are in two generalized settings: $\bullet$ The first one shows that the Kadison-Singer result holds under the weaker assumption that the vectors have a bounded sum of hyperbolic norms. $\bullet$ The second one relaxes the Kadison-Singer result's distribution assumption to Strongly Rayleigh distributions. To the best of our knowledge, the previous results only support determinant polynomials [Anari and Oveis Gharan'14, Kyng, Luh and Song'20]. It is unclear whether they can be generalized to a broader class of polynomials. In addition, we also provide a sub-exponential time algorithm for constructing our results.

20230420 Attention Scheme Inspired Softmax Regression.
Summary: Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a distribution over possible next words or phrases, given a sequence of input words. This distribution is then used to select the most likely next word or phrase, based on the probabilities assigned by the model. The softmax unit plays a crucial role in training LLMs, as it allows the model to learn from the data by adjusting the weights and biases of the neural network. In the area of convex optimization, e.g., in the central path method for solving linear programming, the softmax function has served as a crucial tool for controlling the progress and stability of the potential function [Cohen, Lee and Song STOC 2019, Brand SODA 2020]. In this work, inspired by the softmax unit, we define a softmax regression problem. Formally speaking, given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to use a greedy-type algorithm to solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2. \end{align*} In a certain sense, our provable convergence result provides theoretical support for why we can use greedy algorithms to train softmax functions in practice.
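The objective is easy to probe numerically. A minimal sketch (an illustration with plain gradient descent and the analytic softmax Jacobian, not the paper's greedy algorithm) on a realizable instance:

```python
import numpy as np

# Minimize || <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2^2 by gradient descent.
rng = np.random.default_rng(0)
n, d = 6, 3
A = rng.standard_normal((n, d))
z = np.exp(A @ rng.standard_normal(d))
b = z / z.sum()                     # realizable target: b is a softmax output

def f(x):
    u = np.exp(A @ x)
    return np.sum((u / u.sum() - b) ** 2)

def grad(x):
    u = np.exp(A @ x)
    p = u / u.sum()
    J = (np.diag(p) - np.outer(p, p)) @ A    # Jacobian of softmax(Ax) in x
    return 2 * J.T @ (p - b)

x = np.zeros(d)
for _ in range(5000):
    x -= 0.1 * grad(x)
print(f(x))
```

On this tiny realizable instance the iterates drive the objective to (numerically) zero, consistent with the convergence phenomenon the paper analyzes for greedy-type methods.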

20230416 Thin trees for laminar families.
Summary: In the laminar-constrained spanning tree problem, the goal is to find a minimum-cost spanning tree which respects upper bounds on the number of times each cut in a given laminar family is crossed. This generalizes the well-studied degree-bounded spanning tree problem, as well as a previously studied setting where a chain of cuts is given. We give the first constant-factor approximation algorithm; in particular we show how to obtain a multiplicative violation of the crossing bounds of less than 22 while losing less than a factor of 5 in terms of cost. Our result compares to the natural LP relaxation. As a consequence, our results show that given a $k$-edge-connected graph and a laminar family $\mathcal{L} \subseteq 2^V$ of cuts, there exists a spanning tree which contains only an $O(1/k)$ fraction of the edges across every cut in $\mathcal{L}$. This can be viewed as progress towards the Thin Tree Conjecture, which (in a strong form) states that this guarantee can be obtained for all cuts simultaneously.

20230413 Solving Tensor Low Cycle Rank Approximation.
Summary: Large language models have become ubiquitous in modern life, finding applications in various domains such as natural language processing, language translation, and speech recognition. Recently, a breakthrough work [Zhao, Panigrahi, Ge, and Arora Arxiv 2023] explains the attention model from probabilistic context-free grammar (PCFG). One of the central computational tasks in computing probabilities in a PCFG is a particular tensor low-rank approximation problem, which we call tensor cycle rank. Given an $n \times n \times n$ third order tensor $A$, we say that $A$ has cycle rank $k$ if there exist three $n \times k^2$ size matrices $U, V$, and $W$ such that \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1) } \end{align*} for all $a \in [n], b \in [n], c \in [n]$. The tensor classical rank, Tucker rank and train rank have been well studied in [Song, Woodruff, Zhong SODA 2019]. In this paper, we generalize the previous ``rotation and sketch'' technique on page 186 of [Song, Woodruff, Zhong SODA 2019] and show an input sparsity time algorithm for cycle rank.
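The definition can be sanity-checked numerically. The sketch below (an illustration; 0-indexed, with $\otimes$ read as scalar multiplication of the factor entries) materializes a cycle rank-$k$ tensor from random factors and verifies one entry against the defining sum:

```python
import numpy as np

# Build A_{a,b,c} = sum_{i,j,l} U_{a,i+kj} V_{b,j+kl} W_{c,l+ki} (0-indexed).
rng = np.random.default_rng(1)
n, k = 4, 2
U, V, W = (rng.standard_normal((n, k * k)) for _ in range(3))

A = np.zeros((n, n, n))
for i in range(k):
    for j in range(k):
        for l in range(k):
            A += np.einsum('a,b,c->abc',
                           U[:, i + k * j], V[:, j + k * l], W[:, l + k * i])

# Direct entrywise evaluation of one entry as a sanity check.
a, b, c = 1, 2, 3
direct = sum(U[a, i + k * j] * V[b, j + k * l] * W[c, l + k * i]
             for i in range(k) for j in range(k) for l in range(k))
print(abs(A[a, b, c] - direct))
```

Note the cyclic index pattern $(i,j) \to (j,l) \to (l,i)$ across the three factors, which is what distinguishes cycle rank from the classical CP rank.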

20230410 Randomized and Deterministic Attention Sparsification Algorithms for Overparameterized Feature Dimension.
Summary: Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interest in theory. Recently the static computation and dynamic maintenance of the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both the algorithmic perspective and the hardness perspective. In this work, we consider the sparsification of the attention problem. We make one simplification: the logit matrix is symmetric. Let $n$ denote the sentence length and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0,0.1)$; then we aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m\ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r). \end{align*} We provide two results for this problem. $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega} ) $ time, succeeds with probability $1-\delta$, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of nonzero entries in $X$. We use $\omega$ to denote the exponent of matrix multiplication; currently $\omega \approx 2.373$. $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denotes the $i$th column of matrix $X$. Our main findings have the following implication for applied LLM tasks: for any super large feature dimension, we can reduce it down to a size nearly linear in the sentence length.
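For intuition on why such a $Y$ can exist (an illustrative sketch, not the paper's algorithm), even a plain Gaussian sketch of $X$ preserves the attention matrix entrywise on a small random instance:

```python
import numpy as np

# Replace X (n x d, d >> n) by Y = X G / sqrt(m) with Gaussian G, then
# compare the attention matrices D(.)^{-1} exp(. .^T) entrywise.
rng = np.random.default_rng(0)
n, d, m = 8, 2000, 400
X = 0.1 * rng.standard_normal((n, d)) / np.sqrt(d)   # small ||X X^T||_inf

def attention(Z):
    M = np.exp(Z @ Z.T)
    return M / M.sum(axis=1, keepdims=True)          # D(Z)^{-1} exp(Z Z^T)

G = rng.standard_normal((d, m))
Y = X @ G / np.sqrt(m)
err = np.max(np.abs(attention(Y) - attention(X)))
print(err)
```

The paper's algorithms achieve much sharper guarantees and running times than this naive sketch, but the experiment illustrates the target inequality.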

20230407 Convex Minimization with Integer Minima in $\widetilde O(n^4)$ Time.
Summary: Given a convex function $f$ on $\mathbb{R}^n$ with an integer minimizer, we show how to find an exact minimizer of $f$ using $O(n^2 \log n)$ calls to a separation oracle and $O(n^4 \log n)$ time. The previous best polynomial time algorithm for this problem given in [Jiang, SODA 2021, JACM 2022] achieves $O(n^2\log\log n/\log n)$ oracle complexity. However, the overall runtime of Jiang's algorithm is at least $\widetilde{\Omega}(n^8)$, due to expensive subroutines such as the Lenstra-Lenstra-Lov\'asz (LLL) algorithm [Lenstra, Lenstra, Lov\'asz, Math. Ann. 1982] and the random walk based cutting plane method [Bertsimas, Vempala, JACM 2004]. Our significant speedup is obtained by a nontrivial combination of a faster version of the LLL algorithm due to [Neumaier, Stehl\'e, ISSAC 2016] that gives similar guarantees, the volumetric center cutting plane method (CPM) by [Vaidya, FOCS 1989] and its fast implementation given in [Jiang, Lee, Song, Wong, STOC 2020]. For the special case of submodular function minimization (SFM), our result implies a strongly polynomial time algorithm for this problem using $O(n^3 \log n)$ calls to an evaluation oracle and $O(n^4 \log n)$ additional arithmetic operations. Both the oracle complexity and the number of arithmetic operations of our more general algorithm are better than those of the previous best-known algorithms for this specific problem, given in [Lee, Sidford, Wong, FOCS 2015] and [Dadush, V\'egh, Zambelli, SODA 2018, MOR 2021].

20230327 Revisiting Area Convexity: Faster Box-Simplex Games and Spectrahedral Generalizations.
Summary: We investigate different aspects of area convexity [Sherman '17], a mysterious tool introduced to tackle optimization problems under the challenging $\ell_\infty$ geometry. We develop a deeper understanding of its relationship with more conventional analyses of extragradient methods [Nemirovski '04, Nesterov '07]. We also give improved solvers for the subproblems required by variants of the [Sherman '17] algorithm, designed through the lens of relative smoothness [Bauschke-Bolte-Teboulle '17, Lu-Freund-Nesterov '18]. Leveraging these new tools, we give a state-of-the-art first-order algorithm for solving box-simplex games (a primal-dual formulation of $\ell_\infty$ regression) in a $d \times n$ matrix with bounded rows, using $O(\log d \cdot \epsilon^{-1})$ matrix-vector queries. As a consequence, we obtain improved complexities for approximate maximum flow, optimal transport, min-mean-cycle, and other basic combinatorial optimization problems. We also develop a near-linear time algorithm for a matrix generalization of box-simplex games, capturing a family of problems closely related to semidefinite programs recently used as subroutines in robust statistics and numerical linear algebra.

20230326 The Subspace Flatness Conjecture and Faster Integer Programming.
Summary: In a seminal paper, Kannan and Lov\'asz (1988) considered a quantity $\mu_{KL}(\Lambda,K)$ which denotes the best volume-based lower bound on the covering radius $\mu(\Lambda,K)$ of a convex body $K$ with respect to a lattice $\Lambda$. Kannan and Lov\'asz proved that $\mu(\Lambda,K) \leq n \cdot \mu_{KL}(\Lambda,K)$, and the Subspace Flatness Conjecture by Dadush (2012) claims that an $O(\log(2n))$ factor suffices, which would match the lower bound from the work of Kannan and Lov\'asz. We settle this conjecture up to a constant in the exponent by proving that $\mu(\Lambda,K) \leq O(\log^{3}(2n)) \cdot \mu_{KL} (\Lambda,K)$. Our proof is based on the Reverse Minkowski Theorem due to Regev and Stephens-Davidowitz (2017). Following the work of Dadush (2012, 2019), we obtain a $(\log(2n))^{O(n)}$-time randomized algorithm to solve integer programs in $n$ variables. Another implication of our main result is a near-optimal flatness constant of $O(n \log^{3}(2n))$.

20230322 Sparks of Artificial General Intelligence: Early experiments with GPT-4.
Summary: Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

20230314 Testing Causality for High Dimensional Data.
Summary: Determining causal relationships between high-dimensional observations is among the most important tasks in scientific discovery. In this paper, we revisit the \emph{linear trace method}, a technique proposed in~\citep{janzing2009telling,zscheischler2011testing} to infer the causal direction between two random variables of high dimension. We strengthen the existing results significantly by providing an improved tail analysis, in addition to extending the results to nonlinear trace functionals with sharper confidence bounds under certain distributional assumptions. We obtain our results by interpreting the trace estimator in the causal regime as a function over random orthogonal matrices, where the concentration of Lipschitz functions over such a space can be applied. We additionally propose a novel ridge-regularized variant of the estimator in \cite{zscheischler2011testing}, and give provable bounds relating the ridge-estimated terms to their ground-truth counterparts. We support our theoretical results with encouraging experiments on synthetic datasets, most prominently in the high-dimension, low-sample-size regime.
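The trace condition at the heart of the method can be seen on a toy example (an illustrative sketch with $\tau(M) = \mathrm{tr}(M)/d$, not the paper's estimator): for a mechanism $A$ chosen independently of $\Sigma_X$, $\tau(A \Sigma_X A^\top) \approx \tau(A A^\top)\,\tau(\Sigma_X)$, while a mechanism adapted to $\Sigma_X$ breaks the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
tau = lambda M: np.trace(M) / d

s = np.linspace(0.5, 2.0, d)
Sigma = np.diag(s ** 2)                  # covariance of the cause X

A_indep = rng.standard_normal((d, d))    # mechanism independent of Sigma
r_indep = tau(A_indep @ Sigma @ A_indep.T) / (
    tau(A_indep @ A_indep.T) * tau(Sigma))

A_adapted = np.diag(s)                   # mechanism correlated with Sigma
r_adapted = tau(A_adapted @ Sigma @ A_adapted.T) / (
    tau(A_adapted @ A_adapted.T) * tau(Sigma))

print(r_indep, r_adapted)
```

The independent ratio concentrates near 1 (the concentration-over-random-matrices phenomenon the paper sharpens), while the adapted ratio exceeds 1 by a Jensen-gap argument, which is what makes the ratio usable as a causal direction score.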

20230313 An Improved Sample Complexity for Rank-1 Matrix Sensing.
Summary: Matrix sensing is a problem in signal processing and machine learning that involves recovering a low-rank matrix from a set of linear measurements. The goal is to reconstruct the original matrix as accurately as possible, given only a set of linear measurements obtained by sensing the matrix [Jain, Netrapalli and Sanghavi, 2013]. In this work, we focus on a particular direction of matrix sensing, which is called rank-$1$ matrix sensing [Zhong, Jain and Dhillon, 2015]. We present an improvement over the original algorithm in [Zhong, Jain and Dhillon, 2015]. It is based on a novel analysis and sketching technique that enables faster convergence rates and better accuracy in recovering low-rank matrices. The algorithm focuses on developing a theoretical understanding of the matrix sensing problem and establishing its advantages over previous methods. The proposed sketching technique allows for efficiently extracting relevant information from the linear measurements, making the algorithm computationally efficient and scalable. Our novel matrix sensing algorithm improves the former result [Zhong, Jain and Dhillon, 2015] in two senses: $\bullet$ We improve the sample complexity from $\widetilde{O}(\epsilon^{-2} dk^2)$ to $\widetilde{O}(\epsilon^{-2} (d+k^2))$. $\bullet$ We improve the running time from $\widetilde{O}(md^2 k^2)$ to $\widetilde{O}(m d^2 k)$. The proposed algorithm has theoretical guarantees and is analyzed to provide insights into the underlying structure of low-rank matrices and the nature of the linear measurements used in the recovery process. It advances the theoretical understanding of matrix sensing and provides a new approach for solving this important problem.

20230308 Streaming Kernel PCA Algorithm With Small Space.
Summary: Principal Component Analysis (PCA) is a widely used technique in machine learning, data analysis and signal processing. With the increase in the size and complexity of datasets, it has become important to develop low-space usage algorithms for PCA. Streaming PCA has gained significant attention in recent years, as it can handle large datasets efficiently. The kernel method, which is commonly used in learning algorithms such as Support Vector Machines (SVMs), has also been applied in PCA algorithms. We propose a streaming algorithm for Kernel PCA problems based on the traditional scheme by Oja. Our algorithm addresses the challenge of reducing the memory usage of PCA while maintaining its accuracy. We analyze the performance of our algorithm by studying the conditions under which it succeeds. Specifically, we show that, when the spectral ratio $R := \lambda_1/\lambda_2$ of the target covariance matrix is lower bounded by $C \cdot \log n\cdot \log d$, the streaming PCA can be solved with $O(d)$ space cost. Our proposed algorithm has several advantages over existing methods. First, it is a streaming algorithm that can handle large datasets efficiently. Second, it employs the kernel method, which allows it to capture complex nonlinear relationships among data points. Third, it has a low-space usage, making it suitable for applications where memory is limited.
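Oja's classical streaming update, which the scheme above builds on, keeps only a single $d$-dimensional vector and so uses $O(d)$ space. A minimal sketch (an illustration on a synthetic distribution with spectral ratio $\lambda_1/\lambda_2 = 10$, not the paper's kernelized algorithm):

```python
import numpy as np

# Streaming top-eigenvector estimation: one O(d) vector, one pass.
rng = np.random.default_rng(0)
d, T = 20, 20000
top = np.zeros(d)
top[0] = 1.0                           # true top eigenvector e_1

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for t in range(1, T + 1):
    # Sample with covariance I + 9 e_1 e_1^T (lambda_1 = 10, lambda_2 = 1).
    x = rng.standard_normal(d) + 3.0 * rng.standard_normal() * top
    w += (1.0 / t) * (x @ w) * x       # Oja step with step size 1/t
    w /= np.linalg.norm(w)             # renormalize to the unit sphere

alignment = abs(w @ top)
print(alignment)
```

The kernel variant in the paper applies the same one-vector update implicitly in feature space; the large spectral ratio is what lets the single-vector iterate lock onto the top component.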

20230307 Complete Log Concavity of Coverage-Like Functions.
Summary: We introduce an expressive subclass of nonnegative almost submodular set functions, called strongly 2-coverage functions, which include coverage and (sums of) matroid rank functions, and prove that the homogenization of the generating polynomial of any such function is completely log-concave, taking a step towards characterizing the coefficients of (homogeneous) completely log-concave polynomials. As a consequence we obtain that the "level sets" of any such function form an ultra log-concave sequence.

20230302 An Improved Classical Singular Value Transformation for Quantum Machine Learning.
Summary: We study quantum speedups in quantum machine learning (QML) by analyzing the quantum singular value transformation (QSVT) framework. QSVT, introduced by [GSLW, STOC'19, arXiv:1806.01838], unifies all major types of quantum speedup; in particular, a wide variety of QML proposals are applications of QSVT on low-rank classical data. We challenge these proposals by providing a classical algorithm that matches the performance of QSVT in this regime up to a small polynomial overhead. We show that, given a matrix $A \in \mathbb{C}^{m\times n}$, a vector $b \in \mathbb{C}^{n}$, a bounded degree-$d$ polynomial $p$, and linear-time preprocessing, we can output a description of a vector $v$ such that $\|v - p(A) b\| \leq \varepsilon\|b\|$ in $\widetilde{\mathcal{O}}(d^{11} \|A\|_{\mathrm{F}}^4 / (\varepsilon^2 \|A\|^4 ))$ time. This improves upon the best known classical algorithm [CGLLTW, STOC'20, arXiv:1910.06151], which requires $\widetilde{\mathcal{O}}(d^{22} \|A\|_{\mathrm{F}}^6 /(\varepsilon^6 \|A\|^6 ) )$ time, and narrows the gap with QSVT, which, after linear-time preprocessing to load input into a quantum-accessible memory, can estimate the magnitude of an entry of $p(A)b$ to $\varepsilon\|b\|$ error in $\widetilde{\mathcal{O}}(d\|A\|_{\mathrm{F}}/(\varepsilon \|A\|))$ time. Our key insight is to combine the Clenshaw recurrence, an iterative method for computing matrix polynomials, with sketching techniques to simulate QSVT classically. We introduce several new classical techniques in this work, including (a) a non-oblivious matrix sketch for approximately preserving bilinear forms, (b) a new stability analysis for the Clenshaw recurrence, and (c) a new technique to bound arithmetic progressions of the coefficients appearing in the Chebyshev series expansion of bounded functions, each of which may be of independent interest.
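The Clenshaw recurrence mentioned above is easy to state in the scalar case: to evaluate $\sum_k c_k T_k(x)$, run the backward recurrence $b_k = c_k + 2x\,b_{k+1} - b_{k+2}$ and return $c_0 + x\,b_1 - b_2$. A minimal sketch (an illustration, checked against NumPy's Chebyshev evaluator):

```python
import numpy as np

def clenshaw(c, x):
    """Evaluate sum_k c[k] * T_k(x) via the Clenshaw recurrence."""
    b1, b2 = 0.0, 0.0
    for ck in reversed(c[1:]):
        b1, b2 = 2 * x * b1 - b2 + ck, b1
    return x * b1 - b2 + c[0]

c = [0.5, -0.25, 0.125, 0.0625]   # arbitrary Chebyshev coefficients
x = 0.3
val = clenshaw(c, x)
ref = np.polynomial.chebyshev.chebval(x, c)
print(val, ref)
```

The paper replaces the scalar $x$ with a block-encoded or sketched matrix, which is why a new stability analysis of this recurrence is needed there.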

20230228 A CS guide to the quantum singular value transformation.
Summary: We present a simplified exposition of some pieces of [Gily\'en, Su, Low, and Wiebe, STOC'19, arXiv:1806.01838], which introduced a quantum singular value transformation (QSVT) framework for applying polynomial functions to block-encoded matrices. The QSVT framework has garnered substantial recent interest from the quantum algorithms community, as it was demonstrated by [GSLW19] to encapsulate many existing algorithms naturally phrased as an application of a matrix function. First, we posit that the lifting of quantum signal processing (QSP) to QSVT is better viewed not through Jordan's lemma (as was suggested by [GSLW19]) but as an application of the cosine-sine decomposition, which can be thought of as a more explicit and stronger version of Jordan's lemma. Second, we demonstrate that the constructions of bounded polynomial approximations given in [GSLW19], which use a variety of ad hoc approaches drawing from Fourier analysis, Chebyshev series, and Taylor series, can be unified under the framework of truncation of Chebyshev series, and indeed, can in large part be matched via a bounded variant of a standard metatheorem from [Trefethen, 2013]. We hope this work finds use in the community as a companion guide for understanding and applying the powerful framework of [GSLW19].

20230227 Query-optimal estimation of unitary channels in diamond distance.
Summary: We consider process tomography for unitary quantum channels. Given access to an unknown unitary channel acting on a $\textsf{d}$dimensional qudit, we aim to output a classical description of a unitary that is $\varepsilon$close to the unknown unitary in diamond norm. We design an algorithm achieving error $\varepsilon$ using $O(\textsf{d}^2/\varepsilon)$ applications of the unknown channel and only one qudit. This improves over prior results, which use $O(\textsf{d}^3/\varepsilon^2)$ [via standard process tomography] or $O(\textsf{d}^{2.5}/\varepsilon)$ [Yang, Renner, and Chiribella, PRL 2020] applications. To show this result, we introduce a simple technique to "bootstrap" an algorithm that can produce constanterror estimates to one that can produce $\varepsilon$error estimates with the Heisenberg scaling. Finally, we prove a complementary lower bound showing that estimation requires $\Omega(\textsf{d}^2/\varepsilon)$ applications, even with access to the inverse or controlled versions of the unknown unitary. This shows that our algorithm has both optimal query complexity and optimal space complexity.

20230224 Quantum trapdoor functions from classical one-way functions.
Summary: We formalize and study the notion of a quantum trapdoor function. This is an efficiently computable unitary that takes as input a "public" quantum state and a classical string $x$, and outputs a quantum state. This map is such that (i) it is hard to invert, in the sense that it is hard to recover $x$ given the output state (and many copies of the public state), and (ii) there is a classical trapdoor that allows efficient inversion. We show that a quantum trapdoor function can be constructed from any quantum-secure one-way function. A direct consequence of this result is that, assuming just the existence of quantum-secure one-way functions, there exists a public-key encryption scheme with a (pure) quantum public key.

20230221 $k$NN-Adapter: Efficient Domain Adaptation for Black-Box Language Models.
Summary: Fine-tuning a language model on a new domain is standard practice for domain adaptation. However, it can be infeasible when it comes to modern large-scale language models such as GPT-3, which can only be accessed through APIs, making it difficult to access the internal parameters of the model. In this paper, we propose $k$NN-Adapter, a method to effectively adapt these black-box large language models (LLMs) to a new domain. The $k$NN-Adapter builds on top of the retrieval-augmented language model, and adaptively learns to interpolate the output of the language model with retrieval results from a datastore consisting of the target domain data. Our experiments on four different domains demonstrate that $k$NN-Adapter significantly improves perplexity, and works particularly well in settings with limited access to LLMs. Additionally, we show that $k$NN-Adapter is more effective than fine-tuning when the amount of training data is limited. We also release a dataset to encourage further study.
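The core interpolation behind $k$NN-augmented language models can be sketched in a few lines (an illustrative toy example with a hypothetical datastore; in the paper the interpolation weight is learned adaptively rather than fixed):

```python
import numpy as np

# Mix the LM's next-token distribution with a retrieval distribution:
# p = lam * p_knn + (1 - lam) * p_lm.
vocab = ["cell", "court", "protein", "verdict"]
p_lm = np.array([0.1, 0.4, 0.1, 0.4])          # generic LM, legal-leaning

# Hypothetical next tokens retrieved from a biomedical-domain datastore.
retrieved = ["cell", "protein", "cell", "protein", "cell"]
counts = np.array([retrieved.count(w) for w in vocab], dtype=float)
p_knn = counts / counts.sum()

lam = 0.7                                      # fixed here; learned in the paper
p = lam * p_knn + (1 - lam) * p_lm
print(vocab[int(np.argmax(p))])
```

The domain shift happens entirely through the datastore: the black-box LM's output distribution is reused as-is, which is what makes the approach viable for API-only models.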

20230220 Private (Stochastic) Non-Convex Optimization Revisited: Second-Order Stationary Points and Excess Risks.
Summary: We consider the problem of minimizing a non-convex objective while preserving the privacy of the examples in the training data. Building upon the previous variance-reduced algorithm SpiderBoost, we introduce a new framework that utilizes two different kinds of gradient oracles. The first kind of oracles can estimate the gradient of one point, and the second kind of oracles, less precise and more cost-effective, can estimate the gradient difference between two points. SpiderBoost uses the first kind periodically, once every few steps, while our framework proposes using the first oracle whenever the total drift has become large and relies on the second oracle otherwise. This new framework ensures the gradient estimations remain accurate all the time, resulting in improved rates for finding second-order stationary points. Moreover, we address a more challenging task of finding the global minima of a non-convex objective using the exponential mechanism. Our findings indicate that the regularized exponential mechanism can closely match previous empirical and population risk bounds, without requiring smoothness assumptions for algorithms with polynomial running time. Furthermore, by disregarding running time considerations, we show that the exponential mechanism can achieve a good population risk bound and provide a nearly matching lower bound.
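The two-oracle idea can be illustrated on a toy quadratic (a non-private sketch with a hypothetical drift threshold; the paper's algorithm adds privacy noise and works with stochastic estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = np.diag(np.linspace(1.0, 3.0, d))
b = rng.standard_normal(d)
grad = lambda x: A @ x - b              # "first kind": gradient at a point
grad_diff = lambda x, y: A @ (x - y)    # "second kind": gradient difference

x = np.zeros(d)
v = grad(x)                             # start from a precise estimate
drift, eta, resets = 0.0, 0.1, 0
for _ in range(500):
    x_new = x - eta * v
    drift += np.linalg.norm(x_new - x) ** 2
    if drift > 1.0:                     # drift threshold (hypothetical value)
        v = grad(x_new)                 # precise oracle resets the estimate
        drift, resets = 0.0, resets + 1
    else:
        v = v + grad_diff(x_new, x)     # cheap oracle keeps it up to date
    x = x_new

print(np.linalg.norm(grad(x)), resets)
```

On this deterministic quadratic the difference oracle is exact, so the estimate never drifts in value; the point of the drift trigger is that with noisy oracles the estimation error grows with the accumulated movement, and resetting only when the drift is large spends the expensive oracle sparingly.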

20230213 Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal Sampler.
Summary: The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time on our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.

20230121 From Pseudorandomness to Multi-Group Fairness and Back.
Summary: We identify and explore connections between the recent literature on multi-group fairness for prediction algorithms and the pseudorandomness notions of leakage-resilience and graph regularity. We frame our investigation using new, statistical distance-based variants of multicalibration that are closely related to the concept of outcome indistinguishability. Adopting this perspective leads us naturally not only to our graph theoretic results, but also to new, more efficient algorithms for multicalibration in certain parameter regimes and a novel proof of a hard-core lemma for real-valued functions.

20230113 Cumulative Memory Lower Bounds for Randomized and Quantum Computation.
Summary: Cumulative memory, the sum of space used per step over the duration of a computation, is a fine-grained measure of time-space complexity that was introduced to analyze cryptographic applications like password hashing. It is a more accurate cost measure for algorithms that have infrequent spikes in memory usage and are run in environments such as cloud computing that allow dynamic allocation and deallocation of resources during execution, or when multiple instances of an algorithm are interleaved in parallel. We prove the first lower bounds on cumulative memory complexity for both sequential classical computation and quantum circuits. Moreover, we develop general paradigms for bounding cumulative memory complexity inspired by the standard paradigms for proving time-space tradeoff lower bounds, which can only lower bound the maximum space used during an execution. The resulting lower bounds on cumulative memory that we obtain are just as strong as the best time-space tradeoff lower bounds, which are very often known to be tight. Although previous results for pebbling and random oracle models have yielded time-space tradeoff lower bounds larger than the cumulative memory complexity, our results show that in general computational models such separations cannot follow from known lower bound techniques and are not true for many functions. Among many possible applications of our general methods, we show that any classical sorting algorithm with success probability at least $1/\text{poly}(n)$ requires cumulative memory $\tilde \Omega(n^2)$, any classical matrix multiplication algorithm requires cumulative memory $\Omega(n^6/T)$, any quantum sorting circuit requires cumulative memory $\Omega(n^3/T)$, and any quantum circuit that finds $k$ disjoint collisions in a random function requires cumulative memory $\Omega(k^3n/T^2)$.

20230110 Elastic Cash.
Summary: Elastic Cash is a new decentralized mechanism for regulating the money supply. The mechanism operates by modifying the supply so that an interest rate determined by a public market is kept approximately fixed. It can be incorporated into the conventional monetary system to improve the elasticity of the US Dollar, and it can be used to design new elastic cryptocurrencies that remain decentralized.

20230101 ReSQueing Parallel and Private Stochastic Convex Optimization.
Summary: We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For an SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $\epsilon_{\text{opt}}$ with $d^{1/3}\epsilon_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}\epsilon_{\text{opt}}^{-2/3} + \epsilon_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $\epsilon_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, BFGT20, AFKT21, KLL21] established that if $n \gtrsim d \epsilon_{\text{dp}}^{-2}$, $(\epsilon_{\text{dp}}, \delta)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gtrsim d^2 \epsilon_{\text{dp}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime.
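The object at the heart of ReSQue, the gradient of a Gaussian-convolved function, can be estimated from function values alone; a minimal Monte Carlo sketch (an illustration with a baseline subtraction for variance reduction, not the paper's reweighted estimator):

```python
import numpy as np

# grad (f * gamma_sigma)(x) = E_z[ f(x + sigma z) z ] / sigma,  z ~ N(0, I).
# For f(x) = ||x||^2 / 2 the smoothed gradient is exactly x, giving a check.
rng = np.random.default_rng(0)
d, sigma, N = 4, 0.1, 200_000
f = lambda x: 0.5 * np.sum(x ** 2, axis=-1)

x = np.array([1.0, -2.0, 0.5, 0.0])
Z = rng.standard_normal((N, d))
# Subtracting f(x) leaves the expectation unchanged (E[z] = 0) but sharply
# reduces the Monte Carlo variance.
est = ((f(x + sigma * Z) - f(x))[:, None] * Z).mean(axis=0) / sigma
print(np.max(np.abs(est - x)))
```

This zeroth-order identity is why convolving with a Gaussian makes the objective amenable to query-efficient parallel and private algorithms: gradient information about the smoothed function comes from reweighted function evaluations.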

20221214 Learning threshold neurons via the "edge of stability".
Summary: Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a nonzero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.
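The stability threshold underlying the "edge of stability" can already be seen on a one-dimensional quadratic (a textbook sketch, not the paper's two-layer model): gradient descent on $f(x) = \frac{\lambda}{2}x^2$ contracts iff $\eta\lambda < 2$ and diverges beyond it.

```python
def run_gd(lam, eta, steps=100, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= eta * lam * x   # gradient of f(x) = (lam/2) x^2 is lam * x
    return abs(x)

lam = 4.0
stable = run_gd(lam, eta=0.45)    # eta * lam = 1.8 < 2: contracts
unstable = run_gd(lam, eta=0.55)  # eta * lam = 2.2 > 2: diverges
print(stable, unstable)
```

In the non-convex setting studied by the paper the curvature $\lambda$ is not fixed: training hovers near the $2/\eta$ boundary, and it is this interplay, rather than the static quadratic picture above, that the analysis makes rigorous.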

20221213 A (Slightly) Improved Deterministic Approximation Algorithm for Metric TSP.
Summary: We show that the max entropy algorithm can be derandomized (with respect to a particular objective function) to give a deterministic $3/2-\epsilon$ approximation algorithm for metric TSP for some $\epsilon > 10^{-36}$. To obtain our result, we apply the method of conditional expectation to an objective function constructed in prior work which was used to certify that the expected cost of the algorithm is at most $3/2-\epsilon$ times the cost of an optimal solution to the subtour elimination LP. The proof in this work involves showing that the expected value of this objective function can be computed in polynomial time (at all stages of the algorithm's execution).
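The method of conditional expectations used here can be illustrated on a textbook case: derandomizing the random-cut argument for Max-Cut. Fixing vertices one at a time while never letting the conditional expected cut size drop guarantees a cut of at least half the edges. A hedged toy (the graph and tie-breaking rule are illustrative):

```python
def derandomized_cut(n, edges):
    """Assign each vertex to side 0 or 1 by the method of conditional expectations.
    An edge with an unassigned endpoint is cut with conditional probability 1/2, so it
    suffices to place each vertex on the side that cuts more already-placed neighbours;
    the conditional expectation never decreases, and the final cut has size >= len(edges)/2."""
    side = {}
    for v in range(n):
        gain = {0: 0, 1: 0}
        for (a, b) in edges:
            other = b if a == v else a if b == v else None
            if other is not None and other in side:
                gain[1 - side[other]] += 1  # placing v opposite `other` cuts this edge
        side[v] = 0 if gain[0] >= gain[1] else 1
    cut = sum(1 for (a, b) in edges if side[a] != side[b])
    return side, cut

# Triangle: 3 edges, so the guarantee is a cut of size >= 1.5, i.e. >= 2.
_, cut = derandomized_cut(3, [(0, 1), (1, 2), (0, 2)])
```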

20221203 Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping.
Summary: Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility tradeoff. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $\epsilon=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
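Per-layer clipping itself is easy to state: each layer's gradient is rescaled to have norm at most its own threshold before noise is added, rather than clipping the concatenated gradient once. A minimal stdlib sketch (the thresholds, noise multiplier, and fake "gradients" are illustrative assumptions, not the paper's setup):

```python
import math, random

def clip(vec, threshold):
    """Rescale vec so its L2 norm is at most threshold (flat clipping of one group)."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def per_layer_clip_and_noise(layer_grads, thresholds, noise_mult, rng):
    """Clip each layer's gradient separately to its own threshold, then add Gaussian
    noise calibrated to that threshold (the Gaussian mechanism, applied group-wise)."""
    out = []
    for g, c in zip(layer_grads, thresholds):
        clipped = clip(g, c)
        out.append([v + rng.gauss(0.0, noise_mult * c) for v in clipped])
    return out

rng = random.Random(0)
grads = [[3.0, 4.0], [0.1, 0.2, 0.2]]          # toy per-layer gradients
clipped = [clip(g, 1.0) for g in grads]        # each now has norm <= 1
noisy = per_layer_clip_and_noise(grads, [1.0, 1.0], noise_mult=0.5, rng=rng)
```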

20221130 On Disperser/Lifting Properties of the Index and Inner-Product Functions.
Summary: Query-to-communication lifting theorems, which connect the query complexity of a Boolean function to the communication complexity of an associated `lifted' function obtained by composing the function with many copies of another function known as a gadget, have been instrumental in resolving many open questions in computational complexity. Several important complexity questions could be resolved if we could make substantial improvements in the input size required for lifting with the Index function, from its current near-linear size down to polylogarithmic in the number of inputs $N$ of the original function or, ideally, constant. The near-linear size bound was shown by Lovett, Meka, Mertz, Pitassi and Zhang using a recent breakthrough improvement on the Sunflower Lemma to show that a certain graph associated with the Index function of near-linear size is a disperser. They also stated a conjecture about the Index function that is essential for further improvements in the size required for lifting with Index using current techniques. In this paper we prove the following: 1) The conjecture of Lovett et al. is false when the size of the Index gadget is $\log N-\omega(1)$. 2) Also, the Inner-Product function, which satisfies the disperser property at size $O(\log N)$, does not have this property when its size is $\log N-\omega(1)$. 3) Nonetheless, using Index gadgets of size at least 4, we prove a lifting theorem for a restricted class of communication protocols in which one of the players is limited to sending parities of its inputs. 4) Using the ideas from this lifting theorem, we derive a strong lifting theorem from decision tree size to parity decision tree size. We use this to derive a general lifting theorem in proof complexity from tree-resolution size to tree-like $Res(\oplus)$ refutation size, which yields many new exponential lower bounds on such proofs.
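For concreteness, the Index gadget at the heart of such lifting theorems: one player holds a pointer, the other an $m$-bit string, and the gadget outputs the pointed-to bit; a lifted function composes $f$ with one copy of the gadget per input bit of $f$. A small sketch (the choice of $f$ and the sizes are illustrative):

```python
def index_gadget(pointer, string):
    """IND_m: Alice holds a pointer into [m], Bob an m-bit string; output the pointed-to bit."""
    return string[pointer]

def lift(f, pointers, strings):
    """Compose an N-bit function f with N independent copies of the Index gadget."""
    return f([index_gadget(p, s) for p, s in zip(pointers, strings)])

xor = lambda bits: sum(bits) % 2
# Two copies of IND_4: the pointers select string_1[1] = 1 and string_2[3] = 0, so XOR = 1.
val = lift(xor, [1, 3], [[0, 1, 0, 0], [1, 1, 1, 0]])
```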

20221127 Dynamic Kernel Sparsifiers.
Summary: A geometric graph associated with a set of points $P= \{x_1, x_2, \cdots, x_n \} \subset \mathbb{R}^d$ and a fixed kernel function $\mathsf{K}:\mathbb{R}^d\times \mathbb{R}^d\to\mathbb{R}_{\geq 0}$ is a complete graph on $P$ such that the weight of edge $(x_i, x_j)$ is $\mathsf{K}(x_i, x_j)$. We present a fully-dynamic data structure that maintains a spectral sparsifier of a geometric graph under updates that change the locations of points in $P$ one at a time. The update time of our data structure is $n^{o(1)}$ with high probability, and the initialization time is $n^{1+o(1)}$. Under certain assumptions, we can provide a fully dynamic spectral sparsifier with robustness to an adaptive adversary. We further show that, for the Laplacian matrices of these geometric graphs, it is possible to maintain random sketches for the results of matrix-vector multiplication and inverse-matrix-vector multiplication in $n^{o(1)}$ time, under updates that change the locations of points in $P$ or change the query vector by a sparse difference.
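The object being sparsified can be spelled out directly: the geometric graph's Laplacian $L = D - W$ built from a kernel. Below, the Gaussian kernel is an illustrative choice and the points are made up; this is the dense object itself, not the paper's dynamic sparsifier:

```python
import math

def kernel_laplacian(points, K):
    """Laplacian L = D - W of the complete geometric graph with W[i][j] = K(x_i, x_j)."""
    n = len(points)
    W = [[K(points[i], points[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    # Diagonal entry is the weighted degree sum(W[i]); off-diagonal entries are -W[i][j].
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

gaussian = lambda x, y: math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))
L = kernel_laplacian([(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)], gaussian)
row_sums = [sum(row) for row in L]  # every Laplacian row sums to 0
```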

20221121 Upper and Lower Bounds on the Smoothed Complexity of the Simplex Method.
Summary: The simplex method for linear programming is known to be highly efficient in practice, and understanding its performance from a theoretical perspective is an active research topic. The framework of smoothed analysis, first introduced by Spielman and Teng (JACM '04) for this purpose, defines the smoothed complexity of solving a linear program with $d$ variables and $n$ constraints as the expected running time when Gaussian noise of variance $\sigma^2$ is added to the LP data. We prove that the smoothed complexity of the simplex method is $O(\sigma^{-3/2} d^{13/4}\log^{7/4} n)$, improving the dependence on $1/\sigma$ compared to the previous bound of $O(\sigma^{-2} d^2\sqrt{\log n})$. We accomplish this through a new analysis of the \emph{shadow bound}, key to earlier analyses as well. Illustrating the power of our new method, we use it to prove a nearly tight upper bound on the smoothed complexity of two-dimensional polygons. We also establish the first non-trivial lower bound on the smoothed complexity of the simplex method, proving that the \emph{shadow vertex simplex method} requires at least $\Omega \Big(\min \big(\sigma^{-1/2} d^{-1/2}\log^{-1/4} d,2^d \big) \Big)$ pivot steps with high probability. A key part of our analysis is a new variation on the extended formulation for the regular $2^k$-gon. We end with a numerical experiment that suggests this analysis could be further improved.

20221117 How to Fine-Tune Vision Models with SGD.
Summary: SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS Living-17, Waterbirds, and DomainNet.

20221109 A 4/3-Approximation Algorithm for Half-Integral Cycle Cut Instances of the TSP.
Summary: A longstanding conjecture for the traveling salesman problem (TSP) states that the integrality gap of the standard linear programming relaxation of the TSP is at most 4/3. Despite significant efforts, the conjecture remains open. We consider the half-integral case, in which the LP has solution values in $\{0, 1/2, 1\}$. Such instances have been conjectured to be the most difficult instances for the overall four-thirds conjecture. Karlin, Klein, and Oveis Gharan, in a breakthrough result, were able to show that in the half-integral case, the integrality gap is at most 1.49993. This result led to the first significant progress on the overall conjecture in decades; the same authors showed the integrality gap is at most $1.5 - 10^{-36}$ in the non-half-integral case. For the half-integral case, the current best-known ratio is 1.4983, a result by Gupta et al. With the improvements on the 3/2 bound remaining very incremental even in the half-integral case, we turn the question around and look for a large class of half-integral instances for which we can prove that the 4/3 conjecture is correct. The previous works on the half-integral case perform induction on a hierarchy of critical tight sets in the support graph of the LP solution, in which some of the sets correspond to "cycle cuts" and the others to "degree cuts". We show that if all the sets in the hierarchy correspond to cycle cuts, then we can find a distribution of tours whose expected cost is at most 4/3 times the value of the half-integral LP solution; sampling from the distribution gives us a randomized 4/3-approximation algorithm. We note that the known bad cases for the integrality gap have a gap of 4/3 and have a half-integral LP solution in which all the critical tight sets in the hierarchy are cycle cuts; thus our result is tight.

20221107 From approximate to exact integer programming.
Summary: Approximate integer programming is the following: For a convex body $K \subseteq \mathbb{R}^n$, either determine whether $K \cap \mathbb{Z}^n$ is empty, or find an integer point in the convex body scaled by $2$ from its center of gravity $c$. Approximate integer programming can be solved in time $2^{O(n)}$, while the fastest known methods for exact integer programming run in time $2^{O(n)} \cdot n^n$. So far, there are no efficient methods for integer programming known that are based on approximate integer programming. Our main contributions are two such methods, each yielding novel complexity results. First, we show that an integer point $x^* \in (K \cap \mathbb{Z}^n)$ can be found in time $2^{O(n)}$, provided that the remainders of each component $x_i^* \mod \ell$ for some arbitrarily fixed $\ell \geq 5(n+1)$ of $x^*$ are given. The algorithm is based on a cutting-plane technique, iteratively halving the volume of the feasible set. The cutting planes are determined via approximate integer programming. Enumeration of the possible remainders gives a $2^{O(n)}n^n$ algorithm for general integer programming. This matches the current best bound of an algorithm by Dadush (2012) that is considerably more involved. Our algorithm also relies on a new asymmetric approximate Carath\'eodory theorem that might be of interest on its own. Our second method concerns integer programming problems in equation standard form $Ax = b, 0 \leq x \leq u, \, x \in \mathbb{Z}^n$. Such a problem can be reduced to the solution of $\prod_i O(\log u_i +1)$ approximate integer programming problems. This implies, for example, that knapsack or subset-sum problems with polynomial variable range $0 \leq x_i \leq p(n)$ can be solved in time $(\log n)^{O(n)}$. For these problems, the best running time so far was $n^n \cdot 2^{O(n)}$.

20221103 Best-of-Both-Worlds Multiparty Quantum Computation with Publicly Verifiable Identifiable Abort.
Summary: Alon et al. (CRYPTO 2021) introduced a multiparty quantum computation protocol that is secure with identifiable abort (MPQC-SWIA). However, their protocol allows only inside MPQC parties to know the identity of malicious players. This becomes problematic when two groups of people disagree and need a third party, like a jury, to verify who the malicious party is. This issue takes on heightened significance in the quantum setting, given that quantum states may exist in only a single copy. Thus, we emphasize the necessity of a protocol with publicly verifiable identifiable abort (PVIA), enabling outside observers with only classical computational power to agree on the identity of the malicious party in case of an abort. However, achieving MPQC with PVIA poses significant challenges due to the no-cloning theorem, and previous works proposed by Mahadev (STOC 2018) and Chung et al. (Eurocrypt 2022) for classical verification of quantum computation fall short. In this paper, we obtain the first MPQC-PVIA protocol assuming post-quantum oblivious transfer and a classical broadcast channel. The core component of our construction is a new authentication primitive called auditable quantum authentication (AQA) that identifies the malicious sender with overwhelming probability. Additionally, we provide the first MPQC protocol with best-of-both-worlds (BoBW) security, which guarantees output delivery with an honest majority and remains secure with abort even if the majority is dishonest. Our best-of-both-worlds MPQC protocol also satisfies PVIA upon abort.

20221029 The Vector Balancing Constant for Zonotopes.
Summary: The vector balancing constant $\mathrm{vb}(K,Q)$ of two symmetric convex bodies $K,Q$ is the minimum $r \geq 0$ so that any number of vectors from $K$ can be balanced into an $r$-scaling of $Q$. A question raised by Schechtman is whether for any zonotope $K \subseteq \mathbb{R}^d$ one has $\mathrm{vb}(K,K) \lesssim \sqrt{d}$. Intuitively, this asks whether a natural geometric generalization of Spencer's Theorem (for which $K = B^d_\infty$) holds. We prove that for any zonotope $K \subseteq \mathbb{R}^d$ one has $\mathrm{vb}(K,K) \lesssim \sqrt{d} \log \log \log d$. Our main technical contribution is a tight lower bound on the Gaussian measure of any section of a normalized zonotope, generalizing Vaaler's Theorem for cubes. We also prove that for two different normalized zonotopes $K$ and $Q$ one has $\mathrm{vb}(K,Q) \lesssim \sqrt{d \log d}$. All the bounds are constructive and the corresponding colorings can be computed in polynomial time.
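A classical warm-up for vector balancing in the Euclidean case: choosing each sign greedily so that the cross term is non-positive gives $\|\sum_i \varepsilon_i v_i\|_2^2 \le \sum_i \|v_i\|_2^2$. A sketch under that elementary bound (the vectors are arbitrary illustrations; the paper's guarantees for zonotopes are much stronger):

```python
import random

def greedy_signs(vectors):
    """Pick signs eps_i in {-1,+1} greedily. Since ||S + e*v||^2 = ||S||^2 + 2e<S,v> + ||v||^2,
    choosing e with e*<S,v> <= 0 guarantees ||S_final||^2 <= sum_i ||v_i||^2."""
    d = len(vectors[0])
    S = [0.0] * d
    signs = []
    for v in vectors:
        e = -1.0 if sum(s * x for s, x in zip(S, v)) > 0 else 1.0
        S = [s + e * x for s, x in zip(S, v)]
        signs.append(e)
    return signs, S

rng = random.Random(1)
vecs = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(40)]
signs, S = greedy_signs(vecs)
balanced_sq = sum(x * x for x in S)
budget = sum(x * x for v in vecs for x in v)  # sum of squared norms
```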

20221027 Learning Single-Index Models with Shallow Neural Networks.
Summary: Single-index models are a class of functions given by an unknown univariate ``link'' function applied to an unknown one-dimensional projection of the input. These models are particularly relevant in high dimension, when the data might present low-dimensional structure that learning algorithms should adapt to. While several statistical aspects of this model, such as the sample complexity of recovering the relevant (one-dimensional) subspace, are well-understood, they rely on tailored algorithms that exploit the specific structure of the target function. In this work, we introduce a natural class of shallow neural networks and study its ability to learn single-index models via gradient flow. More precisely, we consider shallow networks in which biases of the neurons are frozen at random initialization. We show that the corresponding optimization landscape is benign, which in turn leads to generalization guarantees that match the near-optimal sample complexity of dedicated semiparametric methods.

20221027 A distribution testing oracle separation between QMA and QCMA.
Summary: It is a longstanding open question in quantum complexity theory whether the definition of $\textit{nondeterministic}$ quantum computation requires quantum witnesses $(\textsf{QMA})$ or if classical witnesses suffice $(\textsf{QCMA})$. We make progress on this question by constructing a randomized classical oracle separating the respective computational complexity classes. Previous separations [Aaronson-Kuperberg (CCC'07), Fefferman-Kimmel (MFCS'18)] required a quantum unitary oracle. The separating problem is deciding whether a distribution supported on regular undirected graphs either consists of multiple connected components (yes instances) or consists of one expanding connected component (no instances), where the graph is given in an adjacency-list format by the oracle. Therefore, the oracle is a distribution over $n$-bit boolean functions.

20221022 Discrepancy Minimization in Input-Sparsity Time.
Summary: A recent work of Larsen [Lar23] gave a faster combinatorial alternative to Bansal's SDP algorithm for finding a coloring $x\in\{-1,1\}^n$ that approximately minimizes the discrepancy $\mathrm{disc}(A,x) := \|A x\|_{\infty}$ of a general real-valued $m\times n$ matrix $A$. Larsen's algorithm runs in $\widetilde{O}(mn^2)$ time compared to Bansal's $\widetilde{O}(mn^{4.5})$-time algorithm, at the price of a slightly weaker logarithmic approximation ratio in terms of the hereditary discrepancy of $A$ [Ban10]. In this work we present a combinatorial $\widetilde{O}(\mathrm{nnz}(A) + n^3)$ time algorithm with the same approximation guarantee as Larsen, which is optimal for tall matrices $m=\mathrm{poly}(n)$. Using a more intricate analysis and fast matrix multiplication, we achieve $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$ time, which breaks cubic runtime for square matrices, and bypasses the barrier of linear-programming approaches [ES14] for which input-sparsity time is currently out of reach. Our algorithm relies on two main ideas: (i) A new sketching technique for finding a projection matrix with a short $\ell_2$-basis using implicit leverage-score sampling; (ii) A data structure for faster implementation of the iterative Edge-Walk partial-coloring algorithm of Lovett-Meka, using an alternative analysis that enables ``lazy'' batch updates with low-rank corrections. Our result nearly closes the computational gap between real-valued and binary matrices (set-systems), for which input-sparsity time coloring was very recently obtained [JSS23].
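The quantity being minimized is simple to state. A toy version: compute $\mathrm{disc}(A,x) = \|Ax\|_\infty$ and a naive greedy column-by-column coloring (illustrative only; nothing like the paper's Edge-Walk-based algorithm, and with no discrepancy guarantee):

```python
def disc(A, x):
    """Discrepancy ||A x||_inf of a coloring x in {-1,+1}^n."""
    return max(abs(sum(a * xi for a, xi in zip(row, x))) for row in A)

def greedy_coloring(A):
    """Color columns left to right, choosing the sign that keeps the running
    row sums smallest in L_inf norm (a naive baseline)."""
    m, n = len(A), len(A[0])
    partial = [0.0] * m
    x = []
    for j in range(n):
        col = [A[i][j] for i in range(m)]
        best = min((+1, -1), key=lambda s: max(abs(p + s * c) for p, c in zip(partial, col)))
        partial = [p + best * c for p, c in zip(partial, col)]
        x.append(best)
    return x

# Every row sum of this set-system is even under any +/-1 coloring, so disc >= 2;
# the greedy coloring attains 2 here.
A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
x = greedy_coloring(A)
d = disc(A, x)
```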

20221021 Augmentation with Projection: Towards an Effective and Efficient Data Augmentation Paradigm for Distillation.
Summary: Knowledge distillation is one of the primary methods of transferring knowledge from large to small models. However, it requires massive task-specific data, which may not be plausible in many real-world applications. Data augmentation methods such as representation interpolation, token replacement, or augmentation with models are applied to tackle this problem. However, these data augmentation methods either potentially cause shifts in decision boundaries (representation interpolation), are not expressive enough (token replacement), or introduce too much computational overhead (augmentation with models). To this end, we propose AugPro (Augmentation with Projection), an effective and efficient data augmentation method for distillation. Our method builds on top of representation interpolation augmentation methods to maintain the diversity of expressions and converts the augmented data to tokens to avoid shifting decision boundaries. It uses simple operations that come with little computational overhead. The results on multiple GLUE tasks show that our methods can improve distillation performance by a large margin at a low time cost. Codes are available at https://github.com/google-research/google-research/tree/master/augpro.

20221015 A Nearly Optimal Size Coreset Algorithm with Nearly Linear Time.
Summary: A coreset is a point set containing information about geometric properties of a larger point set. A series of previous works show that in many machine learning problems, especially in clustering problems, coresets can be very useful for building efficient algorithms. Two main measures of a coreset construction algorithm's performance are the running time of the algorithm and the size of the coreset it outputs. In this paper we study the construction of coresets for the $(k,z)$-clustering problem, which is a generalization of the $k$-means and $k$-median problems. By properly designing a sketching-based distance estimation data structure, we propose faster algorithms that construct coresets whose size matches the state-of-the-art results.

20221013 Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators.
Summary: We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions of the form $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions of the form $e^{-\alpha^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert \alpha\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $\epsilon$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.
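In the Euclidean case, the generalized Leapfrog integrator analyzed here reduces to the standard leapfrog step for Hamiltonian dynamics. A sketch on a 1-D standard Gaussian target, $f(x) = x^2/2$ (a Euclidean toy with illustrative step size, not the Riemannian setting of the paper), checking near-conservation of the Hamiltonian and exact time-reversibility:

```python
def leapfrog(x, p, grad_f, step, n_steps):
    """Standard leapfrog for H(x, p) = f(x) + p^2/2:
    half momentum step, full position step, half momentum step."""
    for _ in range(n_steps):
        p -= 0.5 * step * grad_f(x)
        x += step * p
        p -= 0.5 * step * grad_f(x)
    return x, p

f = lambda x: 0.5 * x * x        # negative log-density of N(0, 1)
grad_f = lambda x: x
H = lambda x, p: f(x) + 0.5 * p * p

x0, p0 = 1.0, 0.5
x1, p1 = leapfrog(x0, p0, grad_f, step=0.01, n_steps=100)
energy_drift = abs(H(x1, p1) - H(x0, p0))
# Reversibility: flip the momentum and integrate again to return to the start.
xb, pb = leapfrog(x1, -p1, grad_f, step=0.01, n_steps=100)
```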

20221012 Quantum Depth in the Random Oracle Model.
Summary: We give a comprehensive characterization of the computational power of shallow quantum circuits combined with classical computation. Specifically, for classes of search problems, we show that the following statements hold, relative to a random oracle: (a) $\mathsf{BPP}^{\mathsf{QNC}^{\mathsf{BPP}}} \neq \mathsf{BQP}$. This refutes Jozsa's conjecture [QIP 05] in the random oracle model. As a result, this gives the first instantiatable separation between the classes by replacing the oracle with a cryptographic hash function, yielding a resolution to one of Aaronson's ten semi-grand challenges in quantum computing. (b) $\mathsf{BPP}^{\mathsf{QNC}} \nsubseteq \mathsf{QNC}^{\mathsf{BPP}}$ and $\mathsf{QNC}^{\mathsf{BPP}} \nsubseteq \mathsf{BPP}^{\mathsf{QNC}}$. This shows that there is a subtle interplay between classical computation and shallow quantum computation. In fact, for the second separation, we establish that, for some problems, the ability to perform adaptive measurements in a single shallow quantum circuit is more useful than the ability to perform polynomially many shallow quantum circuits without adaptive measurements. (c) There exists a two-message proof of quantum depth protocol. Such a protocol allows a classical verifier to efficiently certify that a prover must be performing a computation of some minimum quantum depth. Our proof of quantum depth can be instantiated using the recent proof of quantumness construction by Yamakawa and Zhandry [STOC 22].

20221006 NLTS Hamiltonians from classical LTCs.
Summary: We provide a completely self-contained construction of a family of NLTS Hamiltonians [Freedman and Hastings, 2014] based on ideas from [Anshu, Breuckmann, and Nirkhe, 2022], [Cross, He, Natarajan, Szegedy, and Zhu, 2022] and [Eldar and Harrow, 2017]. Crucially, it does not require optimal-parameter quantum LDPC codes and can be built from simple classical LTCs such as the repetition code on an expander graph. Furthermore, it removes the constant-rate requirement from the construction of Anshu, Breuckmann, and Nirkhe.

20220921 Chaining, Group Leverage Score Overestimates, and Fast Spectral Hypergraph Sparsification.
Summary: We present an algorithm that, given any $n$-vertex, $m$-edge, rank $r$ hypergraph, constructs a spectral sparsifier with $O(n \varepsilon^{-2} \log n \log r)$ hyperedges in nearly-linear $\widetilde{O}(mr)$ time. This improves in both size and efficiency over a line of work (Bansal-Svensson-Trevisan 2019, Kapralov-Krauthgamer-Tardos-Yoshida 2021) for which the previous best size was $O(\min\{n \varepsilon^{-4} \log^3 n,\, nr^3 \varepsilon^{-2} \log n\})$ and runtime was $\widetilde{O}(mr + n^{O(1)})$. Independent Result: In an independent work, Lee (Lee 2022) also shows how to compute a spectral hypergraph sparsifier with $O(n \varepsilon^{-2} \log n \log r)$ hyperedges.

20220915 Multicalibrated Regression for Downstream Fairness.
Summary: We show how to take a regression function $\hat{f}$ that is appropriately ``multicalibrated'' and efficiently postprocess it into an approximately error minimizing classifier satisfying a large variety of fairness constraints. The postprocessing requires no labeled data, and only a modest amount of unlabeled data and computation. The computational and sample complexity requirements of computing $\hat f$ are comparable to the requirements for solving a single fair learning task optimally, but it can in fact be used to solve many different downstream fairnessconstrained learning problems efficiently. Our postprocessing method easily handles intersecting groups, generalizing prior work on postprocessing regression functions to satisfy fairness constraints that only applied to disjoint groups. Our work extends recent work showing that multicalibrated regression functions are ``omnipredictors'' (i.e. can be postprocessed to optimally solve unconstrained ERM problems) to constrained optimization.

20220909 Spectral hypergraph sparsification via chaining.
Summary: In a hypergraph on $n$ vertices where $D$ is the maximum size of a hyperedge, there is a weighted hypergraph spectral $\varepsilon$-sparsifier with at most $O(\varepsilon^{-2} \log(D) \cdot n \log n)$ hyperedges. This improves over the bound of Kapralov, Krauthgamer, Tardos and Yoshida (2021) who achieve $O(\varepsilon^{-4} n (\log n)^3)$, as well as the bound $O(\varepsilon^{-2} D^3 n \log n)$ obtained by Bansal, Svensson, and Trevisan (2019). The same sparsification result was obtained independently by Jambulapati, Liu, and Sidford (2022).

20220829 Online Bidding Algorithms for Return-on-Spend Constrained Advertisers.
Summary: Online advertising has recently grown into a highly competitive and complex multi-billion-dollar industry, with advertisers bidding for ad slots at large scales and high frequencies. This has resulted in a growing need for efficient "autobidding" algorithms that determine the bids for incoming queries to maximize advertisers' targets subject to their specified constraints. This work explores efficient online algorithms for a single value-maximizing advertiser under an increasingly popular constraint: Return-on-Spend (RoS). We quantify efficiency in terms of regret relative to the optimal algorithm, which knows all queries a priori. We contribute a simple online algorithm that achieves near-optimal regret in expectation while always respecting the specified RoS constraint when the input sequence of queries consists of i.i.d. samples from some distribution. We also integrate our results with the previous work of Balseiro, Lu, and Mirrokni [BLM20] to achieve near-optimal regret while respecting both RoS and fixed budget constraints. Our algorithm follows the primal-dual framework and uses online mirror descent (OMD) for the dual updates. However, we need to use a non-canonical setup of OMD, and therefore the classic low-regret guarantee of OMD, which is for the adversarial setting in online learning, no longer holds. Nonetheless, in our case and more generally where low-regret dynamics are applied in algorithm design, the gradients encountered by OMD can be far from adversarial but influenced by our algorithmic choices. We exploit this key insight to show our OMD setup achieves low regret in the realm of our algorithm.
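The primal-dual shape of such an algorithm can be sketched: bid $v/(1+\lambda)$ with a dual multiplier $\lambda \ge 0$ nudged up when spend outpaces acquired value. With second-price payments, every won query pays at most its value, so the RoS constraint (spend $\le$ value) holds on every prefix. The plain gradient-style dual update and the numbers below are illustrative assumptions, not the paper's OMD setup:

```python
def ros_pacing(values, prices, eta=0.1):
    """Bid value/(1+lam) in a second-price auction; win iff bid >= price, pay price.
    Since lam >= 0, any won query pays price <= bid <= value, so spend <= value always.
    lam is nudged up when a won query's price exceeds its value, down otherwise."""
    lam, total_value, total_spend = 0.0, 0.0, 0.0
    for v, p in zip(values, prices):
        bid = v / (1.0 + lam)
        if bid >= p:
            total_value += v
            total_spend += p
            lam = max(0.0, lam + eta * (p - v))  # dual step on the RoS constraint
    return total_value, total_spend, lam

values = [1.0, 0.8, 1.2, 0.5, 0.9, 1.1]
prices = [0.6, 0.9, 0.7, 0.2, 0.8, 0.4]
tv, ts, lam = ros_pacing(values, prices)
```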

20220824 A Slightly Improved Bound for the KLS Constant.
Summary: We refine the recent breakthrough technique of Klartag and Lehec to obtain an improved polylogarithmic bound for the KLS constant.

20220809 Training Overparametrized Neural Networks in Sublinear Time.
Summary: The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, stochastic gradient descent (SGD) has a prohibitive convergence rate in non-convex settings, both in theory and practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with much faster convergence rate, albeit with higher cost-per-iteration. For a typical neural network with $m=\mathrm{poly}(n)$ parameters and input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time in the same overparametrized regime, where $\alpha \in (0.01,1)$ is some fixed constant. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of deep neural networks (DNNs).

20220809 An Improved Trickle-Down Theorem for Partite Complexes.
Summary: We prove a strengthening of the trickle-down theorem for partite complexes. Given a $(d+1)$-partite $d$-dimensional simplicial complex, we show that if "on average" the links of faces of codimension 2 are $\frac{1-\delta}{d}$-(one-sided) spectral expanders, then the link of any face of codimension $k$ is an $O(\frac{1-\delta}{k\delta})$-(one-sided) spectral expander, for all $3\leq k\leq d+1$. As an application, using our theorem as a black box, we show that links of faces of codimension $k$ in recent constructions of bounded-degree high-dimensional expanders have spectral expansion at most an $O(1/k)$ fraction of the spectral expansion of the links of the worst faces of codimension $2$.

20220807 Decomposable Non-Smooth Convex Optimization with Nearly-Linear Gradient Oracle Complexity.
Summary: Many fundamental problems in machine learning can be formulated by the convex program \[ \min_{\theta\in \mathbb{R}^d}\ \sum_{i=1}^{n}f_{i}(\theta), \] where each $f_i$ is a convex, Lipschitz function supported on a subset of $d_i$ coordinates of $\theta$. One common approach to this problem, exemplified by stochastic gradient descent, involves sampling one $f_i$ term at every iteration to make progress. This approach crucially relies on a notion of uniformity across the $f_i$'s, formally captured by their condition number. In this work, we give an algorithm that minimizes the above convex formulation to $\epsilon$-accuracy in $\widetilde{O}(\sum_{i=1}^n d_i \log (1 /\epsilon))$ gradient computations, with no assumptions on the condition number. The previous best algorithm independent of the condition number is the standard cutting plane method, which requires $O(nd \log (1/\epsilon))$ gradient computations. As a corollary, we improve upon the evaluation oracle complexity for decomposable submodular minimization by Axiotis et al. (ICML 2021). Our main technical contribution is an adaptive procedure to select an $f_i$ term at every iteration via a novel combination of cutting-plane and interior-point methods.

20220727 Searching for Regularity in Bounded Functions.
Summary: Given a function $f$ on $\mathbb{F}_2^n$, we study the following problem. What is the largest affine subspace $\mathcal{U}$ such that when restricted to $\mathcal{U}$, all the nontrivial Fourier coefficients of $f$ are very small? For the natural class of bounded Fourier degree $d$ functions $f:\mathbb{F}_2^n \to [-1,1]$, we show that there exists an affine subspace of dimension at least $ \tilde\Omega(n^{1/d!}k^{-2})$, wherein all of $f$'s nontrivial Fourier coefficients become smaller than $ 2^{-k}$. To complement this result, we show the existence of degree $d$ functions with coefficients larger than $2^{-d\log n}$ when restricted to any affine subspace of dimension larger than $\Omega(dn^{1/(d-1)})$. In addition, we give explicit examples of functions with analogous but weaker properties. Along the way, we provide multiple characterizations of the Fourier coefficients of functions restricted to subspaces of $\mathbb{F}_2^n$ that may be useful in other contexts. Finally, we highlight applications and connections of our results to parity kill number and affine dispersers.

20220719 Pandora Box Problem with Non-obligatory Inspection: Hardness and Approximation Scheme.
Summary: Weitzman (1979) introduced the Pandora Box problem as a model for sequential search with inspection costs, and gave an elegant index-based policy that attains provably optimal expected payoff. In various scenarios, the searching agent may select an option without making a costly inspection. The variant of the Pandora Box problem with non-obligatory inspection has attracted interest from both economics and algorithms researchers. Various simple algorithms have proved suboptimal, with the best known being a 0.8-approximation algorithm due to Guha et al. (2008). No hardness result for the problem was known. In this work, we show that it is NP-hard to compute an optimal policy for Pandora's problem with non-obligatory inspection. We also give a polynomial-time approximation scheme (PTAS) that computes policies with expected payoff at least a $(1-\epsilon)$-fraction of the optimal, for arbitrarily small $\epsilon > 0$. On the side, we show the decision version of the problem to be in NP.
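Weitzman's index assigns each box a reservation value $\sigma_i$ solving $\mathbb{E}[(V_i - \sigma_i)^+] = c_i$; the optimal obligatory-inspection policy inspects boxes in decreasing order of $\sigma_i$. A minimal sketch (helper name and finite-support encoding are our assumptions) computes this index by bisection:

```python
def reservation_value(values, probs, cost, tol=1e-9):
    """Solve E[(V - sigma)^+] = cost for sigma by bisection.

    `values`/`probs` describe a finite prize distribution; the left-hand
    side is continuous and non-increasing in sigma, so bisection applies
    (assuming cost <= E[V], i.e. inspection can ever be worthwhile).
    """
    def excess(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))

    lo, hi = min(values) - 1.0, max(values)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Prize is 0 or 10 with equal probability, inspection costs 1:
# (10 - sigma)/2 = 1  =>  sigma = 8.
sigma = reservation_value([0.0, 10.0], [0.5, 0.5], 1.0)
print(sigma)
```

The hardness result in the summary concerns the non-obligatory variant, where this simple index ordering no longer yields an optimal policy.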

20220718 Private Convex Optimization in General Norms.
Summary: We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+\mu r))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work of [Gopi, Lee, Liu '22] to non-Euclidean settings. We show that this mechanism satisfies Gaussian differential privacy and solves both DP-ERM (empirical risk minimization) and DP-SCO (stochastic convex optimization) by using localization tools from convex geometry. Our framework is the first to apply to private convex optimization in general normed spaces and directly recovers non-private SCO rates achieved by mirror descent as the privacy parameter $\epsilon \to \infty$. As applications, for Lipschitz optimization in $\ell_p$ norms for all $p \in (1, 2)$, we obtain the first optimal privacy-utility tradeoffs; for $p = 1$, we improve tradeoffs obtained by the recent works [Asi, Feldman, Koren, Talwar '21; Bassily, Guzman, Nandi '21] by at least a logarithmic factor. Our $\ell_p$ norm and Schatten-$p$ norm optimization frameworks are complemented with polynomial-time samplers whose query complexity we explicitly bound.
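A one-dimensional toy illustration of the density $\propto \exp(-k(F+\mu r))$ (loss and regularizer below are hypothetical choices, and discretized grid sampling stands in for the paper's samplers; this sketch illustrates utility concentration only and makes no privacy claim):

```python
import math
import random

# Sample theta with density proportional to exp(-k * (F + mu*r)).
F = lambda t: (t - 1.0) ** 2   # toy "empirical loss"
r = lambda t: 0.5 * t * t      # strongly convex regularizer
k, mu = 50.0, 0.1

# Discretize [-3, 3] and weight each grid point by the unnormalized density.
grid = [-3.0 + 6.0 * i / 2000 for i in range(2001)]
weights = [math.exp(-k * (F(t) + mu * r(t))) for t in grid]

random.seed(0)
samples = random.choices(grid, weights=weights, k=2000)
mean = sum(samples) / len(samples)

# For large k the samples concentrate near argmin of F + mu*r,
# here 2/2.1 ~ 0.952.
print(mean)
```

As $k$ grows the mechanism behaves like a regularized empirical risk minimizer, which is the utility side of the privacy-utility tradeoff discussed above.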

20220707 Approximate Carathéodory bounds via Discrepancy Theory.
Summary: The approximate Carath\'eodory problem in general form is as follows: Given two symmetric convex bodies $P,Q \subseteq \mathbb{R}^m$, a parameter $k \in \mathbb{N}$ and $\mathbf{z} \in \textrm{conv}(X)$ with $X \subseteq P$, find $\mathbf{v}_1,\ldots,\mathbf{v}_k \in X$ so that $\|\mathbf{z} - \frac{1}{k}\sum_{i=1}^k \mathbf{v}_i\|_Q$ is minimized. Maurey showed that if both $P$ and $Q$ coincide with the $\|\cdot\|_p$-ball, then an error of $O(\sqrt{p/k})$ is possible. We prove a reduction to the vector balancing constant from discrepancy theory which for most cases can provide tight bounds for general $P$ and $Q$. For the case where $P$ and $Q$ are both $\|\cdot\|_p$-balls we prove an upper bound of $\sqrt{\frac{\min\{p, \log(\frac{2m}{k})\}}{k}}$. Interestingly, this bound cannot be obtained by taking independent random samples; instead we use the Lovett-Meka random walk. We also prove an extension to the more general case where $P$ and $Q$ are $\|\cdot\|_p$- and $\|\cdot\|_q$-balls with $2 \leq p \leq q \leq \infty$.
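The independent-sampling baseline the summary contrasts with is Maurey's empirical method: write $\mathbf{z} = \sum_i \lambda_i \mathbf{v}_i$, draw $k$ points i.i.d. with probabilities $\lambda_i$, and average. A minimal sketch on hypothetical data (Euclidean norm, so the expected error decays like $O(1/\sqrt{k})$):

```python
import math
import random

random.seed(0)
m, n = 5, 40

# Hypothetical point set X in the Euclidean unit ball, and a convex
# combination z = sum_i lam_i * X_i.
X = []
for _ in range(n):
    v = [random.uniform(-1, 1) for _ in range(m)]
    nrm = math.sqrt(sum(c * c for c in v))
    X.append([c / nrm for c in v])
lam = [random.random() for _ in range(n)]
s = sum(lam)
lam = [l / s for l in lam]
z = [sum(lam[i] * X[i][j] for i in range(n)) for j in range(m)]

# Maurey's empirical method: sample k points i.i.d. with weights lam
# and average them; the l2 error is O(1/sqrt(k)) in expectation.
k = 400
picks = random.choices(X, weights=lam, k=k)
avg = [sum(p[j] for p in picks) / k for j in range(m)]
err = math.sqrt(sum((z[j] - avg[j]) ** 2 for j in range(m)))
print(err)
```

The paper's point is that for the sharper $\log(2m/k)$-type bound, such i.i.d. sampling does not suffice and a correlated construction (the Lovett-Meka walk) is needed.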

20220707 Individual Preference Stability for Clustering.
Summary: In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given data set allows for an IP-stable clustering in general is NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a poly-time algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for a tree metric. We also consider relaxing the stability constraint, i.e., requiring only that every data point not be too far from its own cluster compared to any other cluster. For this case, we provide poly-time algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real data sets.
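Checking whether a given clustering is IP-stable is straightforward; a minimal sketch (function name, index-based encoding, and the treatment of singleton clusters as trivially satisfied are our assumptions):

```python
def is_ip_stable(points, clusters, dist=lambda a, b: abs(a - b)):
    """Each point must be, on average, at least as close to its own
    cluster as to every other cluster. `clusters` holds index lists."""
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            own = [q for q in cluster if q != p]
            if not own:          # singleton cluster: trivially satisfied
                continue
            own_avg = sum(dist(points[p], points[q]) for q in own) / len(own)
            for cj, other in enumerate(clusters):
                if cj == ci:
                    continue
                other_avg = sum(dist(points[p], points[q]) for q in other) / len(other)
                if own_avg > other_avg:
                    return False
    return True

# Two well-separated groups on the real line.
points = [0.0, 1.0, 10.0, 11.0]
good = [[0, 1], [2, 3]]   # natural clustering: IP-stable
bad = [[0, 2], [1, 3]]    # mixed clustering: point 0 prefers the other cluster
print(is_ip_stable(points, good), is_ip_stable(points, bad))
```

The hardness result concerns the existence question (does any IP-stable clustering exist?), not this per-clustering verification, which runs in polynomial time.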

20220701 When Does Differentially Private Learning Not Suffer in High Dimensions?.
Summary: Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility tradeoffs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces are a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term \emph{restricted Lipschitz continuity} and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning. Code to reproduce our results can be found at \url{https://github.com/lxuechen/privatetransformers/tree/main/examples/classification/spectral_analysis}.
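The "gradients controlled by a few principal components" observation can be sketched on synthetic data: stack per-example gradients, form their second-moment matrix, and measure how much of the total squared magnitude the top component explains. Everything below (the rank-one-plus-noise gradient model, power iteration as the eigensolver) is an illustrative assumption, not the paper's spectral-analysis code:

```python
import math
import random

random.seed(0)
d, n = 6, 80
u = [1.0 / math.sqrt(d)] * d  # shared principal direction (unit vector)

# Synthetic "gradients": mostly aligned with one direction u plus small
# noise, mimicking gradients dominated by a few principal components.
grads = []
for _ in range(n):
    s = random.gauss(0, 1)
    grads.append([s * u[j] + random.gauss(0, 0.05) for j in range(d)])

# Second-moment matrix C = (1/n) sum_i g_i g_i^T.
C = [[sum(g[a] * g[b] for g in grads) / n for b in range(d)] for a in range(d)]

# Top eigenvalue of C via power iteration.
v = [1.0] * d
for _ in range(200):
    w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
    nrm = math.sqrt(sum(c * c for c in w))
    v = [c / nrm for c in w]
top = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
explained = top / sum(C[a][a] for a in range(d))

print(explained)  # fraction of gradient energy in the top component
```

When this fraction is close to 1, gradients effectively live in a low-dimensional subspace, which is the regime where the restricted Lipschitz condition yields dimension-independent bounds.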