My research investigates the theoretical foundations of optimization and decision-making in uncertain environments. My work aims to overcome the intractability barriers prevalent in modern problems through two core directions:
Optimization under Uncertainty
Constrained RL: Designed the first polynomial-time approximation algorithms for general Constrained MDPs.
Stochastic Optimization: Developed the first adaptive approximation algorithms for correlated Pandora’s Box problems.
Online Scheduling: Established the state-of-the-art competitive ratio for the multilevel aggregation problem with deadlines.
Game-Theoretic MARL
Safe & Robust Equilibria: Designed the first polynomial-time algorithms for computing anytime-constrained, adversarial-defense, and uncertainty-robust equilibria in Markov Games.
Adversarial Attacks: Characterized optimal poisoning and misinformation attacks on MARL agents.
Incentivized Exploration: Designed the first constant-regret mechanisms to align myopic agents with social welfare goals.
Papers
Below, you can find each of my papers. Unless otherwise noted, author names are ordered by contribution.
We study the computational complexity of approximating general constrained Markov decision processes. Our primary contribution is the design of a polynomial-time (0, ε)-additive bicriteria approximation algorithm for finding optimal constrained policies across a broad class of recursively computable constraints, including almost-sure, chance, expectation, and their anytime variants. Matching lower bounds imply our approximation guarantees are optimal so long as P ≠ NP. The generality of our approach results in answers to several long-standing open complexity questions in the constrained reinforcement learning literature. Specifically, we are the first to prove polynomial-time approximability for the following settings: policies under chance constraints, deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes.
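For concreteness, the (0, ε)-additive bicriteria guarantee can be read as follows (my paraphrase and notation, not a verbatim statement from the paper): the returned policy gives up nothing in value relative to the best feasible policy, but may overshoot each constraint's budget by at most an additive ε,

\[
  V^{\pi}(s_0) \;\ge\; \max_{\pi' \text{ feasible}} V^{\pi'}(s_0)
  \qquad \text{and} \qquad
  C_i^{\pi}(s_0) \;\le\; B_i + \varepsilon \ \text{ for every constraint } i .
\]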
We extend anytime constraints to the Markov game setting and the corresponding solution concept of anytime-constrained equilibrium (ACE). Then, we present a comprehensive theory of anytime-constrained equilibria that includes (1) a computational characterization of feasible policies, (2) a fixed-parameter tractable algorithm for computing ACE, and (3) a polynomial-time algorithm for approximately computing ACE. Since computing a feasible policy is NP-hard even for two-player zero-sum games, our approximation guarantees are the best possible so long as P ≠ NP. We also develop the first theory of efficient computation for action-constrained Markov games, which may be of independent interest.
We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning (CRL) problems. Our approach combines three key ideas: (1) value-demand augmentation, (2) action-space approximate dynamic programming, and (3) time-space rounding. Our algorithm constitutes a fully polynomial-time approximation scheme (FPTAS) for any time-space recursive (TSR) cost criterion. A TSR criterion requires the cost of a policy to be computable recursively over both time and (state) space, which includes classical expectation, almost-sure, and anytime constraints. Our work answers three open questions spanning two long-standing lines of research: polynomial-time approximability is possible for 1) anytime-constrained policies, 2) almost-sure-constrained policies, and 3) deterministic expectation-constrained policies.
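As a rough illustration of the augmentation-plus-dynamic-programming idea, here is a minimal sketch of my own, not the paper's algorithm: it augments states with a raw integer cost budget instead of the paper's rounded value demands, and it handles only a simple cumulative-cost constraint.

```python
# Minimal sketch: backward induction over states augmented with a remaining
# integer cost budget, yielding a deterministic budget-feasible policy.
# This only conveys the flavor of augmentation + DP; the paper's FPTAS
# instead augments with rounded value demands and covers general TSR criteria.
from collections import defaultdict

def budget_augmented_dp(P, R, C, H, budget, states, actions):
    """P[s][a]: list of (prob, next_state); R[s][a]: reward; C[s][a]: integer cost."""
    V = defaultdict(float)   # V[(t, s, b)] = best value-to-go; defaults to 0 at the horizon
    pi = {}                  # pi[(t, s, b)] = chosen deterministic action
    for t in range(H - 1, -1, -1):
        for s in states:
            for b in range(budget + 1):
                best_a, best_v = None, float("-inf")
                for a in actions:
                    if C[s][a] > b:      # action would exceed the remaining budget
                        continue
                    v = R[s][a] + sum(p * V[(t + 1, s2, b - C[s][a])]
                                      for p, s2 in P[s][a])
                    if v > best_v:
                        best_a, best_v = a, v
                if best_a is None:
                    V[(t, s, b)] = float("-inf")   # no budget-feasible action here
                else:
                    V[(t, s, b)], pi[(t, s, b)] = best_v, best_a
    return V, pi
```

In the paper, the augmented quantity is a rounded value demand rather than a raw budget, which is what keeps the table polynomially sized; the sketch above only conveys the shape of the recursion.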
We study security threats to Markov games due to information asymmetry and misinformation. We consider an attacker player who can spread misinformation about its reward function to influence the robust victim player’s behavior. Given a fixed fake reward function, we derive the victim’s policy under worst-case rationality and present polynomial-time algorithms to compute the attacker’s optimal worst-case policy based on linear programming and backward induction. Then, we provide an efficient inception ("planting an idea in someone’s mind") attack algorithm to find the optimal fake reward function within a restricted set of reward functions with dominant strategies. Importantly, our methods exploit the universal assumption of rationality to compute attacks efficiently. Thus, our work exposes a security vulnerability arising from standard game assumptions under misinformation.
We study robust Markov games (RMG) with s-rectangular uncertainty. We show a general equivalence between computing a robust Nash equilibrium (RNE) of an s-rectangular RMG and computing a Nash equilibrium (NE) of an appropriately constructed regularized MG. The equivalence result yields a planning algorithm for solving s-rectangular RMGs, as well as provable robustness guarantees for policies computed using regularized methods. However, we show that even for just reward-uncertain two-player zero-sum matrix games, computing an RNE is PPAD-hard. Consequently, we derive a special uncertainty structure called efficient player-decomposability and show that RNE for two-player zero-sum RMGs in this class can be provably solved in polynomial time. This class includes commonly used uncertainty sets such as L_1 and L_∞ ball uncertainty sets.
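The robustness-to-regularization connection can be illustrated in a far simpler setting than s-rectangular RMGs (the identity below is standard Hölder duality, not a statement lifted from the paper): for a single decision maker facing an L_p reward-uncertainty ball of radius r around a nominal reward vector R, the worst-case value of a mixed strategy π is a regularized nominal value,

\[
  \min_{\|\delta\|_p \le r} \langle \pi, R + \delta \rangle
  \;=\; \langle \pi, R \rangle - r\,\|\pi\|_q ,
  \qquad \tfrac{1}{p} + \tfrac{1}{q} = 1 ,
\]

so optimizing against the uncertainty set amounts to optimizing a penalized (regularized) objective.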
We study the game modification problem, where a benevolent game designer or a malevolent adversary modifies the reward function of a zero-sum Markov game so that a target deterministic or stochastic policy profile becomes the unique Markov perfect Nash equilibrium and has a value within a target range, in a way that minimizes the modification cost. We characterize the set of policy profiles that can be installed as the unique equilibrium of a game and establish sufficient and necessary conditions for successful installation. We propose an efficient algorithm that solves a convex optimization problem with linear constraints and then performs random perturbation to obtain a modification plan with a near-optimal cost.
Following the increasing use of graphs to communicate statistical information in social media and news platforms, the occurrence of poorly designed, misleading graphs has also risen. Accordingly, previous research has identified common misleading visual features of such graphs. Our study extends this research by empirically comparing the deceptive impact of 14 distinct misleading graph types on viewers’ understanding of the depicted data. We investigated the deceptive nature of these misleading graph types to identify those with the greatest potential to mislead viewers. Our findings indicate that misleading graphs significantly decreased viewers’ accuracy in interpreting data. While certain misleading graphs (e.g., graphs with an inverted y-axis or manipulated time intervals) significantly impeded viewers’ accurate graph comprehension, others (e.g., graphs using pictorial bars or graphs with a compressed y-axis) had little misleading impact. By identifying the misleading graphs that most strongly affect viewers’ understanding of the depicted data, we suggest that these graphs should be the focus of educational interventions.
We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reduction from anytime-constrained cMDPs to unconstrained MDPs. Our reduction yields planning and learning algorithms that are time- and sample-efficient for tabular cMDPs so long as the precision of the costs is logarithmic in the size of the cMDP. However, we also show that computing non-trivial approximately optimal policies is NP-hard in general. To circumvent this bottleneck, we design provable approximation algorithms that efficiently compute or learn an arbitrarily accurate, approximately feasible policy with optimal value so long as the maximum supported cost is bounded by a polynomial in the cMDP or the absolute budget. Given our hardness results, our approximation guarantees are the best possible under worst-case analysis.
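Written out (notation mine), an anytime constraint with budget B over horizon H requires the running cost to stay within budget at every step, almost surely:

\[
  \Pr^{\pi}\!\left[\, \sum_{k=1}^{t} c_k \le B \ \text{ for all } t \in [H] \,\right] = 1 ,
\]

in contrast to an expectation constraint, which only bounds \(\mathbb{E}^{\pi}\big[\sum_{t=1}^{H} c_t\big]\).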
To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure that RL agents are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent’s interaction with the environment. We study the full class of online manipulation attacks, which includes (i) state attacks, (ii) observation attacks (a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show that the attacker’s problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim’s value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP, since it is not the true environment but a higher-level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, so such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.
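A minimal sketch of the meta-MDP viewpoint (all interfaces and names below are hypothetical placeholders, not taken from the paper): from the attacker's side, the victim's fixed policy and the true environment fuse into a single induced environment whose actions are the attacker's manipulations.

```python
# Sketch of the induced "meta-MDP": the attacker's environment wraps the true
# environment together with the victim's fixed policy, shown for an
# observation-style attack. All interfaces here are illustrative placeholders.
class MetaMDP:
    def __init__(self, env, victim_policy, perturb, attacker_reward):
        self.env = env                          # true env: reset() -> s, step(a) -> (s', r, done)
        self.victim_policy = victim_policy      # victim's (fixed) policy on perceived observations
        self.perturb = perturb                  # how an attack distorts the victim's observation
        self.attacker_reward = attacker_reward  # e.g. the negative of the victim's reward

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, attack):
        perceived = self.perturb(self.state, attack)    # 1. manipulate what the victim perceives
        victim_action = self.victim_policy(perceived)   # 2. victim acts on the manipulated view
        next_state, victim_reward, done = self.env.step(victim_action)
        self.state = next_state                         # 3. true environment transitions
        return next_state, self.attacker_reward(victim_reward), done
```

Any standard planning or RL method run on such a wrapper is, from the attacker's perspective, just solving an ordinary MDP, which is the intuition behind the polynomial-time planning and polynomial sample complexity claims.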
We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside it. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the unique Nash set and the set of plausible games induced by the data are polytopes in the Q-function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms.
We revisit the classic Pandora’s Box (PB) problem under correlated distributions on the box values. Recent work of [Shuchi Chawla et al., 2020] obtained constant-factor approximation algorithms for a restricted class of policies that visit boxes in a fixed order. In this work, we study the complexity of approximating the optimal policy, which may adaptively choose which box to visit next based on the values seen so far. Our main result establishes an approximation-preserving equivalence of PB to the well-studied Uniform Decision Tree (UDT) problem from stochastic optimization and a variant of the Min-Sum Set Cover (MSSC_f) problem. For distributions of support m, UDT admits a log m approximation, and while a constant-factor approximation in polynomial time is a long-standing open problem, constant-factor approximations are achievable in subexponential time [Ray Li et al., 2020]. Our main result implies that the same properties hold for PB and MSSC_f. We also study the case where the distribution over values is given more succinctly as a mixture of m product distributions. This problem is again related to a noisy variant of the Optimal Decision Tree problem, which is significantly more challenging. We give a constant-factor approximation that runs in time n^Õ(m²/ε²) when the mixture components on every box are either identical or separated in TV distance by ε.
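For orientation, the min-cost form of correlated Pandora's Box considered here can be written as follows (my notation; a sketch of the objective rather than a verbatim definition): an adaptive policy π opens a set S_π of boxes, paying each opening cost, and finally accepts the smallest value it has seen,

\[
  \min_{\pi}\ \mathbb{E}\!\left[\ \sum_{i \in S_\pi} c_i \;+\; \min_{i \in S_\pi} v_i \ \right],
\]

where adaptivity means the next box to open may depend on all values revealed so far.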
In offline multi-agent reinforcement learning (MARL), agents estimate policies from a given dataset. We study reward-poisoning attacks in this setting, where an exogenous attacker modifies the rewards in the dataset before the agents see it. The attacker wants to guide each agent into a nefarious target policy while minimizing the L_p norm of the reward modification. Unlike attacks on single-agent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate single-agent attacks. We show that the attack works on various MARL agents, including uncertainty-aware learners, and we exhibit linear programs to efficiently solve the attack problem. We also study the relationship between the structure of the datasets and the minimal attack cost. Our work paves the way for studying defense in offline MARL.
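To make the linear-programming flavor concrete, here is a toy, self-contained construction of my own (a single state, one agent's reward matrix, an explicit dominance margin) rather than the paper's actual MPDSE attack LP: it finds an L1-minimal reward modification that makes a target action strictly dominant.

```python
# Toy sketch (my construction, not the paper's LP): modify one agent's reward
# matrix in a single-state game so that `target` becomes strictly dominant,
# minimizing the L1 norm of the modification.
import numpy as np
from scipy.optimize import linprog

def dominant_action_attack(R, target, margin=0.1):
    """R[i, j]: agent's reward for playing i against opponent action j."""
    n, m = R.shape
    num = n * m
    idx = lambda i, j: i * m + j                      # flatten (i, j) -> variable index
    # Variables x = [Delta (num), t (num)], with t >= |Delta|; minimize sum(t).
    c = np.concatenate([np.zeros(num), np.ones(num)])
    A_ub, b_ub = [], []
    for k in range(num):                              # |Delta_k| <= t_k, as two inequalities
        row = np.zeros(2 * num); row[k], row[num + k] = 1.0, -1.0
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(2 * num); row[k], row[num + k] = -1.0, -1.0
        A_ub.append(row); b_ub.append(0.0)
    for j in range(m):                                # (R+Delta)[target, j] >= (R+Delta)[i, j] + margin
        for i in range(n):
            if i == target:
                continue
            row = np.zeros(2 * num)
            row[idx(i, j)], row[idx(target, j)] = 1.0, -1.0
            A_ub.append(row)
            b_ub.append(R[target, j] - R[i, j] - margin)
    bounds = [(None, None)] * num + [(0, None)] * num
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[:num].reshape(n, m)

# Example: action 1 currently dominates; find the cheapest change making action 0 dominant.
print(dominant_action_attack(np.array([[0.0, 0.0], [1.0, 1.0]]), target=0))
```

The actual attack in the paper works over Q functions of a Markov game and over datasets rather than a known reward matrix; the toy above only shows how dominance-with-margin constraints and an L1 objective fit into one LP.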
In the information economy, consumer-generated information greatly informs the decisions of future consumers. However, myopic consumers seek to maximize their own reward with no regard for the information they generate. By controlling the consumers’ access to information, a central planner can incentivize them to produce more valuable information for future consumers. The myopic multi-armed bandit problem is a simple model encapsulating these issues. In this paper, we construct a simple incentive-compatible mechanism achieving constant regret for the problem. We use the novel idea of completing phases to selectively reveal information towards the goal of maximizing social welfare. Moreover, we characterize the distributions for which an incentive-compatible mechanism can achieve the first-best outcome and show that our general mechanism achieves first-best in such settings.
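Here, constant regret has the usual bandit meaning (a standard definition, not language specific to this paper): the expected welfare loss relative to always playing the best arm is bounded by a constant independent of the horizon,

\[
  \mathrm{Regret}(T) \;=\; T\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right] \;\le\; C ,
\]

where μ* is the mean of the best arm and μ_{a_t} the mean of the arm chosen at time t.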
In this work, we develop a reward design framework for installing a desired behavior as a strict equilibrium across standard solution concepts: dominant strategy equilibrium, Nash equilibrium, correlated equilibrium, and coarse correlated equilibrium. We also extend our framework to capture the Markov-perfect equivalents of each solution concept. Central to our framework is a comprehensive mathematical characterization of strict installability, based on the desired solution concept and the behavior’s structure. These characterizations lead to efficient iterative algorithms, which we generalize to handle optimization objectives through linear programming. Finally, we explore how our results generalize to boundedly rational agents.
In this paper, we consider the multi-level aggregation problem with deadlines (MLAPD) previously studied by Bienkowski et al. (2015), Buchbinder et al. (2017), and Azar and Touitou (2019). This is an online problem where the algorithm services requests arriving over time and can save costs by aggregating similar requests. Costs are structured in the form of a rooted tree. This problem has applications to many important areas such as multicasting, sensor networks, and supply-chain management. In particular, the TCP-acknowledgment problem, joint-replenishment problem, and assembly problem are all special cases of the delay version of the problem. We present a D-competitive algorithm for MLAPD. This beats the 6(D+1)-competitive algorithm given in Buchbinder et al. (2017). Our approach illuminates key structural aspects of the problem and provides an algorithm that is simpler to implement than previous approaches. We also give improved competitive ratios for special cases of the problem.
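For reference, competitiveness here has its standard meaning (standard definition, stated in my notation): an online algorithm is c-competitive if, for every request sequence σ,

\[
  \mathrm{ALG}(\sigma) \;\le\; c \cdot \mathrm{OPT}(\sigma) + \alpha
\]

for some constant α independent of σ, where OPT(σ) is the optimal offline cost; in this line of work, D denotes the depth of the cost tree.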
Theses and Technical Reports
2025
PhD Thesis
Safe Multi-Agent Reinforcement Learning in Polynomial Time
This talk closely follows the narrative of my NeurIPS 2024 paper, "Deterministic Policies for Constrained Reinforcement Learning in Polynomial Time," but with additional intuition for the problem formulation and its relationship to classic combinatorial, stochastic, and online optimization problems.
Optimal Attack and Defense for Reinforcement Learning
Jeremy McMahan
UW-Madison’s Theory Seminar – February 2025
UW-Madison’s Computer Vision Roundtable – January 2024
This talk closely follows the narrative of my AAAI 2024 paper of the same title, but with several historical attack examples to build intuition for the problem formulation.