Mitigating Negative Side Effects via Environment Shaping
Sandhya Saisubramanian and Shlomo Zilberstein
College of Information and Computer Sciences, University of Massachusetts Amherst
arXiv:2102.07017v1 [cs.AI] 13 Feb 2021

Abstract

Agents operating in unstructured environments often produce negative side effects (NSE), which are difficult to identify at design time. While the agent can learn to mitigate the side effects from human feedback, such feedback is often expensive and the rate of learning is sensitive to the agent's state representation. We examine how humans can assist an agent, beyond providing feedback, and exploit their broader scope of knowledge to mitigate the impacts of NSE. We formulate this problem as a human-agent team with decoupled objectives. The agent optimizes its assigned task, during which its actions may produce NSE. The human shapes the environment through minor reconfiguration actions so as to mitigate the impacts of the agent's side effects, without affecting the agent's ability to complete its assigned task. We present an algorithm to solve this problem and analyze its theoretical properties. Through experiments with human subjects, we assess the willingness of users to perform minor environment modifications to mitigate the impacts of NSE. Empirical evaluation of our approach shows that the proposed framework can successfully mitigate NSE, without affecting the agent's ability to complete its assigned task.

[Figure 1: Example configurations for the boxpushing domain. Panels: (a) Initial Environment; (b) Protective sheet over rug. (a) denotes the initial setting in which the actor dirties the rug when pushing the box over it; (b) denotes a modification that avoids the NSE, without affecting the actor's policy.]

Introduction

Deployed AI agents often require complex design choices to support safe operation in the open world. During design and initial testing, the system designer typically ensures that the agent's model includes all the necessary details relevant to its assigned task. Inherently, many other details of the environment that are unrelated to this task may be ignored. Due to this model incompleteness, the deployed agent's actions may create negative side effects (NSE) (Amodei et al. 2016; Saisubramanian, Zilberstein, and Kamar 2020). For example, consider an agent that needs to push a box to a goal location as quickly as possible (Figure 1(a)). Its model includes accurate information essential to optimizing its assigned task, such as the reward for pushing the box. But details such as the impact of pushing the box over a rug may not be included in the model if the issue is overlooked at system design. Consequently, the agent may push a box over the rug, dirtying it as a side effect. Mitigating such NSE is critical to improve trust in deployed AI systems.

It is practically impossible to identify all such NSE during system design since agents are deployed in varied settings. Deployed agents often do not have any prior knowledge about NSE, and therefore they do not have the ability to minimize NSE. How can we leverage human assistance and the broader scope of human knowledge to mitigate negative side effects, when agents are unaware of the side effects and the associated penalties?
A common solution approach in the existing literature is to update the agent's model and policy by learning about NSE from feedback (Hadfield-Menell et al. 2017; Zhang, Durfee, and Singh 2018; Saisubramanian, Kamar, and Zilberstein 2020). The knowledge about the NSE can be encoded by constraints (Zhang, Durfee, and Singh 2018) or by updating the reward function (Hadfield-Menell et al. 2017; Saisubramanian, Kamar, and Zilberstein 2020). These approaches have three main drawbacks. First, the agent's state representation is assumed to have all the necessary features to learn to avoid NSE (Saisubramanian, Kamar, and Zilberstein 2020; Zhang, Durfee, and Singh 2018). In practice, the model may include only the features related to the agent's task. This may affect the agent's learning and adaptation since the NSE could potentially be non-Markovian with respect to this limited state representation. Second, extensive model revisions will likely require suspension of operation and exhaustive evaluation before the system can be redeployed. This is inefficient when the NSE is not safety-critical, especially for complex systems that have been carefully designed and tested for critical safety aspects. Third, updating the agent's policy may not mitigate NSE that are unavoidable. For example, in the boxpushing domain, NSE are unavoidable if the entire floor is covered with a rug.

The key insight of this paper is that agents often operate in environments that are configurable, which can be leveraged to mitigate NSE. We propose environment shaping for deployed agents: a process of applying modest modifications to the current environment to make it more agent-friendly and minimize the occurrence of NSE. Real world examples show that modest modifications can substantially improve agents' performance. For example, Amazon improved the performance of its warehouse robots by optimizing the warehouse layouts, covering skylights to reduce the glare, and repositioning the air conditioning units to blow out to the side (Simon 2019). Recent studies show the need for infrastructure modifications such as traffic signs with barcodes and redesigned roads to improve the reliability of autonomous vehicles (Agashe and Chapman 2019; Hendrickson, Biehler, and Mashayekh 2014; McFarland 2020). Simple modifications to the environment have also accelerated agent learning (Randløv 2000) and goal recognition (Keren, Gal, and Karpas 2014). These examples show that environment shaping can improve an agent's performance with respect to its assigned task. We target settings in which environment shaping can reduce NSE, without significantly degrading the performance of the agent in completing its assigned task.
Our formulation consists of an actor and a designer. The actor agent computes a policy that optimizes the completion of its assigned task and has no knowledge about NSE of its actions. The designer agent, typically a human, shapes the environment through minor modifications so as to mitigate the NSE of the actor's actions, without affecting the actor's ability to complete its task. The environment is described using a configuration file, such as a map, which is shared between the actor and the designer. Given an environment configuration, the actor computes and shares sample trajectories of its optimal policy with the designer. The designer measures the NSE associated with the actor's policy and modifies the environment by updating the configuration file, if necessary. We target settings in which the completion of the actor's task is prioritized over minimizing NSE. The sample trajectories and the environment configuration are the only knowledge shared between the actor and the designer.

The advantages of this decoupled approach to minimize NSE are: (1) robustness, since it can handle settings in which the actor is unaware of the NSE and does not have the necessary state representation to effectively learn about NSE; and (2) bounded performance, since by controlling the space of valid modifications, bounded performance of the actor with respect to its assigned task can be guaranteed.

Our primary contributions are: (1) introducing a novel framework for environment shaping to mitigate the undesirable side effects of agent actions; (2) presenting an algorithm for shaping and analyzing its properties; (3) providing the results of a human subjects study to assess the attitudes of users towards negative side effects and their willingness to shape the environment; and (4) empirical evaluation of the approach on two domains.
Actor-Designer Framework

The problem of mitigating NSE is formulated as a collaborative actor-designer framework with decoupled objectives. The problem setting is as follows. The actor agent operates based on a Markov decision process (MDP) Ma in an environment that is configurable and described by a configuration file E, such as a map. The agent's model Ma includes the necessary details relevant to its assigned task, which is its primary objective oP. A factored state representation is assumed. The actor computes a policy π that optimizes oP. Executing π may lead to NSE, unknown to the actor. The environment designer, typically the user, measures the impact of NSE associated with the actor's π and shapes the environment, if necessary. The actor and the environment designer share the configuration file of the environment, which is updated by the environment designer to reflect the modifications. Optimizing oP is prioritized over minimizing the NSE. Hence, shaping is performed in response to the actor's policy. In the rest of the paper, we refer to the environment designer simply as designer, not to be confused with the designer of the agent itself.

Each modification is a sequence of design actions. An example is {move(table, l1, l2), remove(rug)}, which moves the table from location l1 to l2 and removes the rug in the current setting. We consider the set of modifications to be finite since the environment is generally optimized for the user and the agent's primary task (Shah et al. 2019) and the user may not be willing to drastically modify it. Additionally, the set of modifications for an environment is included in the problem specification since it is typically controlled by the user and rooted in the NSE they want to mitigate.

We make the following assumptions about the nature of NSE and the modifications: (1) NSE are undesirable but not safety-critical, and their occurrence does not affect the actor's ability to complete its task; (2) the start and goal conditions of the actor are fixed and cannot be altered, so that the modifications do not alter the agent's task; and (3) modifications are applied tentatively for evaluation purposes and the environment is reset if the reconfiguration affects the actor's ability to complete its task or the actor's policy in the modified setting does not minimize the NSE.

Definition 1. An actor-designer framework to mitigate negative side effects (AD-NSE) is defined by ⟨E0, ℰ, Ma, Md, δA, δD⟩ with:
• E0 denoting the initial environment configuration;
• ℰ denoting a finite set of possible reconfigurations of E0;
• Ma = ⟨S, A, T, R, s0, sG⟩ is the actor's MDP with a discrete and finite state space S, discrete actions A, transition function T, reward function R, start state s0 ∈ S, and a goal state sG ∈ S;
• Md = ⟨Ω, Ψ, C, N⟩ is the model of the designer with
  – Ω denoting a finite set of valid modifications that are available for E0, including ∅ to indicate that no changes are made;
  – Ψ : ℰ × Ω → ℰ determines the resulting environment configuration after applying a modification ω ∈ Ω to the current configuration, and is denoted by Ψ(E, ω);
  – C : ℰ × Ω → ℝ is a cost function that specifies the cost of applying a modification to an environment, denoted by C(E, ω), with C(E, ∅) = 0, ∀E ∈ ℰ;
  – N = ⟨π, E, ζ⟩ is a model specifying the penalty for negative side effects in environment E ∈ ℰ for the actor's policy π, with ζ mapping states in π to E for severity estimation;
• δA ≥ 0 is the actor's slack, denoting the maximum allowed deviation from the optimal value of oP in E0, when recomputing its policy in a modified environment; and
• δD ≥ 0 indicates the designer's NSE tolerance threshold.
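For concreteness, the AD-NSE tuple of Definition 1 can be held in a small problem-specification object. The sketch below is illustrative and not taken from the paper: the Python type aliases, field names, and the callable signatures standing in for T, R, Ψ, C, and N are assumptions chosen only to mirror the definition.

from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

State = Tuple[int, ...]          # e.g., (x, y, b) in the boxpushing domain
Action = str
Env = str                        # an opaque configuration identifier (e.g., a map file)
Modification = Tuple[str, ...]   # e.g., ("remove", "rug"); ("noop",) stands for the empty modification

@dataclass
class ActorMDP:
    # Ma = <S, A, T, R, s0, sG>; transition maps (s, a) to a distribution over next states.
    states: Set[State]
    actions: Set[Action]
    transition: Callable[[State, Action], Dict[State, float]]
    reward: Callable[[State, Action], float]
    s0: State
    sG: State

@dataclass
class DesignerModel:
    # Md = <Omega, Psi, C, N>.
    modifications: List[Modification]                          # Omega, including the no-op
    apply: Callable[[Env, Modification], Env]                  # Psi(E, w)
    cost: Callable[[Env, Modification], float]                 # C(E, w)
    nse_penalty: Callable[[Dict[State, Action], Env], float]   # N: penalty of a policy in E

@dataclass
class ADNSE:
    # <E0, set of reconfigurations, Ma, Md, delta_A, delta_D>.
    E0: Env
    reconfigurations: Set[Env]
    actor: ActorMDP
    designer: DesignerModel
    delta_A: float = 0.0   # actor's slack
    delta_D: float = 0.0   # designer's NSE tolerance threshold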
Decoupled objectives. The actor's objective is to compute a policy π that maximizes its expected reward for oP, from the start state s0 in the current environment configuration E:

  max_{π∈Π} V_P^π(s0 | E),
  V_P^π(s) = R(s, π(s)) + Σ_{s′∈S} T(s, π(s), s′) V_P^π(s′),  ∀s ∈ S.

When the environment is modified, the actor recomputes its policy and may end up with a longer path to its goal. The slack δA denotes the maximum allowed deviation from the optimal expected reward in E0, V_P^*(s0 | E0), to facilitate minimizing the NSE via shaping. A policy π′ in a modified environment E′ satisfies the slack δA if

  V_P^*(s0 | E0) − V_P^{π′}(s0 | E′) ≤ δA.

Given the actor's π and environment configuration E, the designer first estimates its corresponding NSE and the associated penalty, denoted by N_π^E. The environment is modified if N_π^E > δD, where δD is the designer's tolerance threshold for NSE, assuming π is fixed. Given π and a set of valid modifications Ω, the designer selects a modification that maximizes its utility:

  max_{ω∈Ω} U_π(ω),
  U_π(ω) = [N_π^E − N_π^{Ψ(E,ω)}] − C(E, ω),    (1)

where the bracketed term is the reduction in NSE.

Typically the designer is the user of the system and their preferences and tolerance for NSE are captured by the utility of a modification. The designer's objective is to balance the tradeoff between minimizing the NSE and the cost of applying the modification. It is assumed that the cost of the modification is measured in the same units as the NSE penalty. The cost of a modification may be amortized over episodes of the actor performing the task in the environment.
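A minimal sketch of the designer's choice rule in Equation 1, assuming the DesignerModel interface sketched above (apply for Ψ, cost for C, nse_penalty for N) and representing the empty modification ∅ as a no-op; since U_π(∅) = 0, a modification must yield strictly positive utility to be selected.

def select_modification(designer, policy, E, candidates=None, noop=("noop",)):
    # Pick argmax_w U_pi(w) = (N_pi^E - N_pi^{Psi(E, w)}) - C(E, w)   -- Equation (1).
    base_nse = designer.nse_penalty(policy, E)
    best_w, best_u = noop, 0.0          # the no-op has zero utility by definition
    for w in (candidates if candidates is not None else designer.modifications):
        shaped = designer.apply(E, w)
        reduction = base_nse - designer.nse_penalty(policy, shaped)
        utility = reduction - designer.cost(E, w)
        if utility > best_u:
            best_w, best_u = w, utility
    return best_w, best_u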
The following properties are used to guarantee bounded performance of the actor when shaping the environment to minimize the impacts of NSE.

Definition 2. Shaping is admissible if it results in an environment configuration in which (1) the actor can complete its assigned task, given δA, and (2) the NSE does not increase, relative to E0.

Definition 3. Shaping is proper if (1) it is admissible and (2) it reduces the actor's NSE to be within δD.

[Figure 2: Actor-designer approach for mitigating NSE.]

Definition 4. A robust shaping results in an environment configuration E where all valid policies of the actor, given δA, produce NSE within δD. That is, N_π^E ≤ δD for all π such that V_P^*(s0 | E0) − V_P^π(s0 | E) ≤ δA.

Shared knowledge. The actor and the designer do not have details about each other's model. Since the objectives are decoupled, knowledge about the exact model parameters of the other agent is not required. Instead, the two key details that are necessary to achieve collaboration are shared: the configuration file describing the environment and the actor's policy. The shared configuration file, such as the map of the environment, allows the designer to effectively communicate the modifications to the actor. The actor's policy is required for the designer to shape the environment. Compact representation of the actor's policy is a practical challenge in large problems (Guestrin, Koller, and Parr 2001). Therefore, instead of sharing the complete policy, the actor provides a finite number of demonstrations D = {τ1, τ2, ..., τn} of its optimal policy for oP in the current environment configuration. Each demonstration τ is a trajectory from start to goal, following π. Using D, the designer can extract the actor's policy by associating states with actions observed in D, and measure its NSE. The designer does not have knowledge about the details of the agent's model but is assumed to be aware of the agent's objectives and its general capabilities. This makes it straightforward to construe the agent's trajectories. Naturally, increasing the number and diversity of sample trajectories that cover the actor's reachable states helps the designer improve the accuracy of estimating the actor's NSE and select an appropriate modification. If D covers every state reachable under π, then the designer can extract the actor's complete policy.
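The designer's side of the shaping phase starts by recovering a policy π̂ from the demonstrations D. A small sketch, under the assumption that each trajectory is a list of (state, action) pairs and that a per-step penalty function is available to the designer; both are representational assumptions, not interfaces prescribed by the paper.

from collections import Counter, defaultdict

def extract_policy(demos):
    # Associate each observed state with the action most frequently taken in it.
    counts = defaultdict(Counter)
    for trajectory in demos:
        for state, action in trajectory:
            counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def estimate_nse_penalty(demos, per_step_penalty):
    # Average NSE penalty of the observed trajectories (a simple estimator of N_pi^E).
    totals = [sum(per_step_penalty(s, a) for s, a in traj) for traj in demos]
    return sum(totals) / len(totals) if totals else 0.0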
Solution Approach

Our solution approach for solving AD-NSE, described in Algorithm 1, proceeds in two phases: a planning phase and a shaping phase. In the planning phase (Line 4), the actor computes its policy π for oP in the current environment configuration and generates a finite number of sample trajectories D, following π. The planning phase ends with disclosing D to the designer. The shaping phase (Lines 7-15) begins with the designer associating states with actions observed in D to extract a policy π̂, and estimating the corresponding NSE penalty, denoted by N_π̂^E. If N_π̂^E > δD, the designer applies a utility-maximizing modification and updates the configuration file. The planning and shaping phases alternate until the NSE is within δD or until all possible modifications have been tested. Figure 2 illustrates this approach.

Algorithm 1: Environment shaping to mitigate NSE
Require: ⟨E0, ℰ, Ma, Md, δA, δD⟩: AD-NSE
Require: d: number of sample trajectories
Require: b: budget for evaluating modifications
 1: E* ← E0                                  ▷ initialize best configuration
 2: E ← E0
 3: n ← ∞
 4: D ← Solve Ma for E0 and sample d trajectories
 5: Ω̄ ← Diverse_modifications(b, Ω, Md, E0)
 6: while |Ω̄| > 0 do
 7:     π̂ ← Extract policy from D            ▷ shaping phase: lines 7-15
 8:     if N_π̂^E < n then
 9:         n ← N_π̂^E
10:         E* ← E
11:     end if
12:     if N_π̂^E ≤ δD then break
13:     ω* ← argmax_{ω∈Ω̄} U_π̂(ω)
14:     if ω* = ∅ then break
15:     Ω̄ ← Ω̄ \ ω*
16:     D′ ← Solve Ma for Ψ(E0, ω*) and sample d trajectories
17:     if D′ ≠ {} then
18:         D ← D′
19:         E ← Ψ(E0, ω*)
20:     end if
21: end while
22: return E*

The actor returns D = {} when the modification affects its ability to reach the goal, given δA. Modifications are applied tentatively for evaluation and the environment is reset if the actor returns D = {} or if the reconfiguration does not minimize the NSE. Therefore, all modifications are applied to E0 and it suffices to test each ω without replacement, as the actor always calculates the corresponding optimal policy. When Ma is solved approximately, without bounded guarantees, it is non-trivial to verify whether δA is violated, but the designer selects a utility-maximizing ω corresponding to this policy. Algorithm 1 terminates when at least one of the following conditions is satisfied: (1) the NSE impact is within δD; (2) every ω has been tested; or (3) the utility-maximizing option is no modification, ω* = ∅, indicating that the cost exceeds the reduction in NSE or no modification can reduce the NSE further. When no modification reduces the NSE to be within δD, the configuration with the least NSE is returned.
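The sketch below mirrors the structure of Algorithm 1 using the helpers sketched earlier. Here `solve_and_sample` stands in for the actor's planning phase: solve Ma in the given configuration and return d sampled trajectories, or an empty list when the goal cannot be reached within the slack δA. That interface, and the no-op convention, are assumptions; the paper does not publish an implementation.

def shape_environment(problem, solve_and_sample, d=100):
    # Alternate the planning and shaping phases and return the configuration
    # with the lowest observed NSE penalty, as in Algorithm 1.
    E_best, E = problem.E0, problem.E0
    best_nse = float("inf")
    demos = solve_and_sample(problem.actor, problem.E0, d, problem.delta_A)
    # With a budget b < |Omega|, `remaining` would first be pruned to b diverse
    # modifications (Algorithm 2); here every modification is a candidate.
    remaining = list(problem.designer.modifications)
    while remaining and demos:
        policy = extract_policy(demos)
        nse = problem.designer.nse_penalty(policy, E)
        if nse < best_nse:
            best_nse, E_best = nse, E
        if nse <= problem.delta_D:
            break
        w, utility = select_modification(problem.designer, policy, E, candidates=remaining)
        if w == ("noop",) or utility <= 0.0:
            break
        remaining.remove(w)
        # Modifications are alternatives to E0, not cumulative edits (Line 16).
        shaped = problem.designer.apply(problem.E0, w)
        new_demos = solve_and_sample(problem.actor, shaped, d, problem.delta_A)
        if new_demos:   # empty means the actor cannot reach the goal within the slack
            demos, E = new_demos, shaped
    return E_best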
Though the designer calculates the utility for all modifications, using N, there is an implicit pruning in the shaping phase since only utility-maximizing modifications are evaluated (Line 16). However, when multiple modifications have the same cost and produce similar environment configurations, the algorithm will alternate multiple times between planning and shaping to evaluate all these modifications, which is |Ω| times in the worst case. To minimize the number of evaluations in settings with a large Ω consisting of multiple similar modifications, we present a greedy approach to identify and evaluate diverse modifications.

Selecting diverse modifications. Let 0 < b ≤ |Ω| denote the maximum number of modifications the designer is willing to evaluate. When b < |Ω|, it is beneficial to evaluate b diverse modifications. Algorithm 2 presents a greedy approach to select b diverse modifications. If two modifications result in similar environment configurations, the algorithm prunes the modification with the higher cost. The similarity threshold is controlled by ε. This process is repeated until b modifications are identified. Measures such as the Jaccard distance or embeddings may be used to estimate the similarity between two environment configurations.

Algorithm 2: Diverse_modifications(b, Ω, Md, E0)
1: Ω̄ ← Ω
2: if b ≥ |Ω| then return Ω
3: for each ω1, ω2 ∈ Ω̄ do
4:     if similarity(ω1, ω2) ≤ ε then
5:         Ω̄ ← Ω̄ \ argmax_ω {C(E0, ω1), C(E0, ω2)}
6:         if |Ω̄| = b then return Ω̄
7:     end if
8: end for
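A sketch of the greedy pruning in Algorithm 2. The paper leaves the similarity measure open (Jaccard distance or embeddings); here a configuration is summarized by a set of features via an assumed `featurize` helper, and ε is the similarity threshold.

import itertools

def jaccard_distance(a, b):
    # 1 - |a ∩ b| / |a ∪ b| for configurations represented as feature sets.
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def diverse_modifications(b, designer, E0, featurize, epsilon=0.1):
    # Whenever two modifications yield nearly identical configurations
    # (distance <= epsilon), drop the costlier one, until b candidates remain.
    candidates = list(designer.modifications)
    if b >= len(candidates):
        return candidates
    for w1, w2 in itertools.combinations(list(candidates), 2):
        if w1 not in candidates or w2 not in candidates:
            continue   # one of the pair was already pruned
        d = jaccard_distance(featurize(designer.apply(E0, w1)),
                             featurize(designer.apply(E0, w2)))
        if d <= epsilon:
            drop = w1 if designer.cost(E0, w1) >= designer.cost(E0, w2) else w2
            candidates.remove(drop)
            if len(candidates) == b:
                break
    return candidates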
Theoretical Properties

For the sake of clarity of the theoretical analysis, we assume that the designer shapes the environment based on the actor's exact π, without having to extract it from D, and with b = |Ω|. These assumptions allow us to examine the properties without approximation errors and sampling biases.

Proposition 1. Algorithm 1 produces admissible shaping.

Proof. Algorithm 1 ensures that a modification does not negatively impact the actor's ability to complete its task, given δA (Lines 16-19), and stores the configuration with the least NSE (Line 10). Therefore, Algorithm 1 is guaranteed to return an E* with NSE equal to or lower than that of the initial configuration E0. Hence, shaping using Algorithm 1 is admissible.

Shaping using Algorithm 1 is not guaranteed to be proper because (1) no ω ∈ Ω may be able to reduce the NSE below δD; or (2) the cost of such a modification may exceed the corresponding reduction in NSE. With each reconfiguration, the actor is required to recompute its policy. We now show that there exists a class of problems for which the actor's policy is unaffected by shaping.

Definition 5. A modification is policy-preserving if the actor's policy is unaltered by environment shaping.

This property induces policy invariance before and after shaping and is therefore considered to be backward-compatible. Additionally, the actor is guaranteed to complete its task with δA = 0 when a modification is policy-preserving with respect to the actor's optimal policy in E0. Figure 3 illustrates the dynamic Bayesian network for this class of problems, where rP denotes the reward associated with oP and rN denotes the penalty for NSE. Let F denote the set of features in the environment, with the actor's policy depending on FP ⊂ F and FD ⊂ F altered by the designer's modifications. Let f⃗ and f⃗′ be the feature values before and after environment shaping. Given the actor's π, a policy-preserving ω follows FD = F \ FP, ensuring f⃗P = f⃗′P.

[Figure 3: A dynamic Bayesian network description of a policy-preserving modification.]

Proposition 2. Given an environment configuration E and the actor's policy π, a modification ω ∈ Ω is guaranteed to be policy-preserving if FD = F \ FP.

Proof. We prove by contradiction. Let ω ∈ Ω be a modification consistent with FD ∩ FP = ∅ and F = FP ∪ FD. Let π and π′ denote the actor's policy before and after shaping with ω such that π ≠ π′. Since the actor always computes an optimal policy for oP (assuming a fixed strategy for tie-breaking), a difference in the policy indicates that at least one feature in FP is affected by ω, f⃗P ≠ f⃗′P. This is a contradiction since FD ∩ FP = ∅. Therefore f⃗P = f⃗′P and π = π′. Thus, ω ∈ Ω is policy-preserving when FD = F \ FP.

Corollary 1. A policy-preserving modification identified by Algorithm 1 guarantees admissible shaping.

Furthermore, a policy-preserving modification results in robust shaping when Pr(FD = f⃗D) = 0, ∀f⃗ = f⃗P ∪ f⃗D. The policy-preserving property is sensitive to the actor's state representation. However, many real-world problems exhibit this feature independence. In the boxpushing example, the actor's state representation may not include details about the rug since it is not relevant to the task. Moving the rug to a different part of the room that is not on the actor's path to the goal is an example of a policy-preserving and admissible shaping. Modifications such as removing the rug or covering it with a protective sheet are policy-preserving and robust shaping since there is no direct contact between the box and the rug, for all policies of the actor (Figure 1(b)).
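Proposition 2 amounts to a simple syntactic test: a modification cannot change the actor's policy if it only touches features the policy does not depend on. A sketch, assuming each modification is annotated with the set of features it alters (a representational assumption, and the feature names in the example are hypothetical):

def is_policy_preserving(modified_features, policy_features):
    # True when F_D ∩ F_P = ∅, the sufficient condition of Proposition 2.
    return modified_features.isdisjoint(policy_features)

# Example: if the boxpushing actor's policy depends only on its own position and
# whether it is pushing the box, covering the rug touches none of those features,
# while blocking cells the policy reasons about does.
assert is_policy_preserving({"rug_covered"}, {"x", "y", "pushing_box"})
assert not is_policy_preserving({"blocked_cells"}, {"x", "y", "pushing_box", "blocked_cells"})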
Relation to Game Theory. Our solution approach with decoupled objectives can be viewed as a game between the actor and the designer. The action profile or set of strategies for the actor is the policy space Π, with respect to oP and E, with payoffs defined by its reward function R. The action profile for the designer is the set of all modifications Ω, with payoffs defined by its utility function U. In each round of the game, the designer selects a modification that is a best response to the actor's policy π:

  b_D(π) = {ω ∈ Ω | ∀ω′ ∈ Ω, U_π(ω) ≥ U_π(ω′)}.    (2)

The actor's best response is its optimal policy for oP in the current environment configuration, given δA:

  b_A(E) = {π ∈ Π | ∀π′ ∈ Π, V_P^π(s0) ≥ V_P^{π′}(s0) ∧ V_P^{π0}(s0) − V_P^π(s0) ≤ δA},    (3)

where π0 denotes the optimal policy in E0 before initiating environment shaping. These best responses are pure strategies since there exists a deterministic optimal policy for a discrete MDP and a modification is selected deterministically. Proposition 3 shows that AD-NSE is an extensive-form game, which is useful to understand the link between two often distinct fields, decision theory and game theory. This also opens the possibility for future work on game-theoretic approaches for mitigating NSE.

Proposition 3. An AD-NSE with policy constraints induced by Equations 2 and 3 induces an equivalent extensive-form game with incomplete information, denoted by N.

Proof Sketch. Our solution approach alternates between the planning phase and the shaping phase. This induces an extensive-form game N with strategy profiles Ω for the designer and Π for the actor, given start state s0. However, each agent selects a best response based on the information available to it and its payoff matrix, and is unaware of the strategies and payoffs of the other agent. The designer is unaware of how its modifications may impact the actor's policy until the actor recomputes its policy, and the actor is unaware of the NSE of its actions. Hence this is an extensive-form game with incomplete information.

Using the Harsanyi transformation, N can be converted to a Bayesian extensive-form game, assuming the players are rational (Harsanyi 1967). This converts N to a game with complete but imperfect information as follows. Nature selects a type τi, actor or designer, for each player, which determines its available actions and its payoffs, parameterized by θA and θD for the actor and designer, respectively. Each player knows its type but does not know τ and θ of the other player. If the players maintain a probability distribution over the possible types of the other player and update it as the game evolves, then the players select a best response, given their strategies and beliefs. This satisfies the consistency and sequential rationality properties, and therefore there exists a perfect Bayesian equilibrium for N.
Extension to Multiple Actors

Our approach can be extended to settings with multiple actors and one designer, with slight modifications to Equation 1 and Algorithm 1. This models large environments such as infrastructure modifications to cities. The actors are assumed to be homogeneous in their capabilities, tasks, and models, but may have different start and goal states. For example, consider multiple self-driving cars that have the same capabilities, task, and model but are navigating between different locations in the environment. When optimizing travel time, the vehicles may travel at a high velocity through potholes, resulting in a bumpy ride for the passenger as a side effect. Driving fast through deep potholes may also damage the car. If the road is rarely used, the utility-maximizing design may be to reduce the speed limit for that segment. If the road segment with potholes is frequently used by multiple vehicles, filling the potholes reduces the NSE for all actors.

The designer's utility for an ω, given K actors, is

  U_π⃗(ω) = Σ_{i=1}^{K} [N_{πi}^E − N_{πi}^{Ψ(E,ω)}] − C(E, ω),

with π⃗ denoting the policies of the K actors. More complex ways of estimating the utility may be considered. The policy is recomputed for all the actors when the environment is modified. It is assumed that shaping does not introduce a multi-agent coordination problem that did not pre-exist. If a modification impacts the agents' coordination for oP, it will be reflected in the slack violation. Shaping for multiple actors requires preserving the slack guarantees of all actors. A policy-preserving modification induces policy invariance for all the actors in this setting.
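A sketch of the K-actor utility above, reusing the single-actor pieces; `policies` is a list of the extracted policies πi of the K actors (an assumed interface).

def multi_actor_utility(designer, policies, E, w):
    # U_{pi_vec}(w) = sum_i (N_{pi_i}^E - N_{pi_i}^{Psi(E, w)}) - C(E, w).
    shaped = designer.apply(E, w)
    reduction = sum(designer.nse_penalty(pi, E) - designer.nse_penalty(pi, shaped)
                    for pi in policies)
    return reduction - designer.cost(E, w)

def select_modification_multi(designer, policies, E, noop=("noop",)):
    # Pick the modification with the highest aggregate utility; keep the no-op if none helps.
    best_w, best_u = noop, 0.0
    for w in designer.modifications:
        u = multi_actor_utility(designer, policies, E, w)
        if u > best_u:
            best_w, best_u = w, u
    return best_w, best_u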
Experiments

We present two sets of experimental results. First, we report the results of our user study conducted specifically to validate two key assumptions of our work: (1) users are willing to engage in shaping and (2) users want to mitigate NSE even when they are not safety-critical. Second, we evaluate the effectiveness of shaping in mitigating NSE on two domains in simulation. The user study complements the experiments that evaluate our algorithmic contribution to solve this problem.

User Study

Setup. We conducted two IRB-approved surveys focusing on two domains similar to the settings in our simulation experiments. The first is a household robot with capabilities such as cleaning the floor and pushing boxes. The second is an autonomous driving domain. We surveyed the participants to understand their attitudes to NSE such as the household robot spraying water on the wall when cleaning the floor, the autonomous vehicle (AV) driving fast through potholes, which results in a bumpy ride for the users, and the AV slamming the brakes to halt at stop signs, which results in sudden jerks for the passengers. We recruited 500 participants on Amazon Mturk to complete a pre-survey questionnaire to assess their familiarity with AI systems and fluency in English. Based on the pre-survey responses, we invited 300 participants aged above 30 to complete the main survey, since they are less likely to game the system (Downs et al. 2010). Participants were required to select the option that best describes their attitude towards NSE occurrence. Responses that were incomplete or with a survey completion time of less than one minute were discarded, and 180 valid responses for each domain are analyzed.

Tolerance to NSE. We surveyed participants to understand their attitude towards NSE that are not safety-critical and whether such NSE affect the user's trust in the system. For the household robot setting, 76.66% of the participants indicated willingness to tolerate the NSE. For the driving domain, 87.77% were willing to tolerate milder NSE such as bumpiness when the AV drives fast through potholes, and 57.22% were willing to tolerate relatively severe NSE such as hard braking at a stop sign. Participants were also required to enter a tolerance score on a scale of 1 to 5, with 5 being the highest. Figure 4 shows the distribution of the user tolerance score for the two domains in our human subjects study. The means, along with the 95% confidence intervals, of the tolerance scores are 3.06±0.197 for the household robot domain and 3.17±0.166 for the AV domain. For the household robot domain, 66.11% voted a score of ≥3, and 75% voted a score of ≥3 for the AV domain. Furthermore, 52.22% of the respondents of the household robot survey voted that their trust in the system's abilities is unaffected by the NSE and 34.44% indicated that their trust may be affected if the NSE are not mitigated over time. Similarly for the AV driving domain, 50.55% voted that their trust is unaffected by the NSE and 42.78% voted that NSE may affect their trust if the NSE are not mitigated over time. These results suggest that (1) individual preferences and tolerance of NSE vary and depend on the severity of NSE; (2) users are generally willing to tolerate NSE that are not severe or safety-critical, but prefer to reduce them as much as possible; and (3) mitigating NSE is important to improve trust in AI systems.

[Figure 4: User tolerance score.]

Willingness to perform shaping. We surveyed participants to determine their willingness to perform environment shaping to mitigate the impacts of NSE. In the household robot domain, shaping involved adding a protective sheet on the surface. In the driving domain, shaping involved installing a pothole-detection sensor that detects potholes and limits the vehicle's speed. Our results show that 74.45% of the participants are willing to engage in shaping for the household robot domain and 92.5% for the AV domain. Among the respondents who were willing to install the sheet in the household robot domain, 64.39% are willing to purchase the sheet ($10) if it is not provided by the manufacturer. Similarly, 62.04% were willing to purchase the sensor for the AV, which costs $50. These results indicate that (1) users are generally willing to engage in environment shaping to mitigate the impacts of NSE; and (2) many users are willing to pay for environment shaping as long as it is generally affordable, relative to the cost of the AI system.

Evaluation in Simulation

We evaluate the effectiveness of shaping in two domains: boxpushing (Figure 1) and AV driving (Figure 5). The effectiveness of shaping is studied in settings with avoidable NSE, unavoidable NSE, and with multiple actors.
[Figure 5: AV environment shaping with multiple actors.]

Baselines. The performance of shaping with a budget is compared with the following baselines. First is the Initial approach, in which the actor's policy is naively executed and does not involve shaping or any form of learning to mitigate NSE. Second is shaping with exhaustive search to select a modification, b = |Ω|. Third is the Feedback approach, in which the agent performs a trajectory of its optimal policy for oP and the human approves or disapproves the observed actions based on the NSE occurrence (Saisubramanian, Kamar, and Zilberstein 2020). The agent then disables all the disapproved actions and recomputes a policy for execution. If the updated policy violates the slack, the actor ignores the feedback and executes its initial policy since oP is prioritized. We consider two variants of the feedback approach: with and without generalizing the gathered information to unseen situations. In Feedback w/ generalization, the actor uses the human feedback as training data to learn a predictive model of NSE occurrence. The actor disables the actions disapproved by the human and those predicted by the model, and recomputes a policy.

We use Jaccard distance to measure the similarity between environment configurations. A random forest classifier from the sklearn Python package is used for learning a predictive model. The actor's MDP is solved using value iteration. The algorithms were implemented in Python and tested on a computer with 16GB of RAM. We optimize action costs, which are the negation of rewards. Values averaged over 100 trials of planning and execution, along with the standard errors, are reported for the following domains.
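A sketch of the Feedback w/ generalization baseline, under the assumption that each feedback item is a (state, action, label) triple with label 1 when the human disapproved the action, and that an `encode` helper turns a state-action pair into a numeric feature vector; the actual feature encoding is not specified in the paper.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def learn_nse_predictor(feedback, encode):
    # Fit a random forest on the human's (dis)approvals of observed state-action pairs.
    X = np.array([encode(s, a) for (s, a, _) in feedback])
    y = np.array([label for (_, _, label) in feedback])   # 1 = disapproved (NSE observed)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    return model

def disabled_actions(model, encode, states, actions):
    # Actions predicted to cause NSE are disabled before the actor recomputes its policy.
    disabled = set()
    for s in states:
        for a in actions:
            if model.predict([encode(s, a)])[0] == 1:
                disabled.add((s, a))
    return disabled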
Boxpushing. We consider a boxpushing domain (Saisubramanian, Kamar, and Zilberstein 2020) in which the actor is required to minimize the expected time taken to push a box to the goal location (Figure 1). Each action costs one time unit. The actions succeed with probability 0.9 and may slide right with probability 0.1. Each state is represented as ⟨x, y, b⟩, where x, y denote the agent's location and b is a boolean variable indicating if the agent is pushing the box. Pushing the box over the rug or knocking over a vase on its way results in NSE, incurring a penalty of 5. The designers are the system users. We experiment with a grid of size 15×15.

Driving. Our second domain is based on simulated autonomous driving (Saisubramanian, Kamar, and Zilberstein 2020) in which the actor's objective is to minimize the expected cost of navigation from a start to a goal location, during which it may encounter some potholes. Each state is the agent's location, represented by ⟨x, y⟩. We consider a grid of size 25×15. The actor can move in all four directions at low and high speeds, with costs 2 and 1 respectively. Driving fast through shallow potholes results in a bumpy ride for the user, which is a mild NSE with a penalty of 2. Driving fast through a deep pothole may damage the car in addition to the unpleasant experience for the rider, and therefore it is a severe NSE with a penalty of 5. Infrastructure management authorities are the designers for this setting, and the state space is divided into four zones, similar to geographical divisions of cities for urban planning.

Available Modifications. We consider 24 modifications for the boxpushing domain, such as adding a protective sheet over the rug, moving the vase to corners of the room, removing the rug, and blocking access to the rug and vase area, among others. Removing the rug costs 0.4 per unit area covered by the rug, moving the vase costs 1, and all other modifications cost 0.2 per unit. Except for blocking access to the rug and vase area, all the other modifications are policy-preserving for the actor state representation described earlier. The modifications considered for the driving domain are: reduce the speed limit in zone i, 1 ≤ i ≤ 4; reduce speed limits in all four zones; fill all potholes; fill deep potholes in all zones; reduce the speed limit in zones with shallow potholes; and fill deep potholes. Reducing the speed costs 0.5 units per pothole in that zone and filling each pothole costs 2 units. Reducing the speed limit disables the 'move fast' action for the actor. Filling the potholes is a policy-preserving modification.
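Both domains are small gridworld MDPs, so the actor's planning step mentioned above is plain value iteration. A compact sketch over the ActorMDP interface assumed earlier; cost minimization is handled by using negated costs as rewards, as in the experiments.

def value_iteration(mdp, gamma=1.0, epsilon=1e-4, max_iters=10000):
    # Compute a value function and a greedy policy for the actor's MDP; the goal is absorbing.
    V = {s: 0.0 for s in mdp.states}
    for _ in range(max_iters):
        delta = 0.0
        for s in mdp.states:
            if s == mdp.sG:
                continue
            best = max(mdp.reward(s, a)
                       + gamma * sum(p * V[ns] for ns, p in mdp.transition(s, a).items())
                       for a in mdp.actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            break
    policy = {}
    for s in mdp.states:
        if s == mdp.sG:
            continue
        policy[s] = max(mdp.actions,
                        key=lambda a: mdp.reward(s, a)
                        + gamma * sum(p * V[ns] for ns, p in mdp.transition(s, a).items()))
    return V, policy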
Effectiveness of shaping. The effectiveness of shaping is evaluated in terms of the average NSE penalty incurred and the expected value of oP after shaping. Figure 6 plots the results with δA = 0, δD = 0, and b = 3, as the number of observed actor trajectories is increased. The results are plotted for the boxpushing domain with one actor in settings with avoidable and unavoidable NSE. The feedback budget is 500. Feedback without generalization does not minimize NSE and performs similar to the Initial baseline, because the actor has no knowledge about the undesirability of actions not included in the demonstration. As a result, when an action is disapproved, the actor may execute an alternate action that is equally bad or worse than the initial action in that state, in terms of NSE. Generalizing the feedback overcomes this drawback and mitigates the NSE relatively. With at most five trajectories, the designer is able to select a policy-preserving modification that avoids NSE. Shaping w/ budget b = 3 performs similar to shaping by evaluating all modifications. The trend in the relative performances of the different techniques is similar for both avoidable and unavoidable NSE. Table 6(f) shows that shaping w/ budget reduces the number of shaping evaluations by 50%.

[Figure 6: Results on the boxpushing domain with avoidable NSE (a-b) and unavoidable NSE (c-d). Panels: (a) average NSE penalty; (b) expected cost of oP; (c) average NSE penalty; (d) expected cost of oP; (e) legend for plots (a)-(d); (f) number of modifications evaluated when NSE are avoidable, reproduced below.]

Table 6(f): Number of modifications evaluated when NSE are avoidable.
# Trajectories   Shaping   Shaping w/ budget
2                1         1
5                2         1
10               2         1
20               2         1
50               6         3
100              6         3

Effect of slack on shaping. We vary δA and δD, and plot the resulting NSE penalty in Figure 7. We report results for the driving domain with a single actor and designer, and shaping based on 100 observed trajectories of the actor. We vary δA between 0-25% of V_P^*(s0 | E0) and δD between 0-25% of the NSE penalty of the actor's policy in E0. Figure 7 shows that increasing the slack helps reduce the NSE, as expected. In particular, when δD ≥ 15%, the NSE penalty is considerably reduced with δA = 15%. Overall, increasing δA is most effective in reducing the NSE. We also tested the effect of slack on the cost for oP. The results showed that the cost was predominantly affected by δA. We observed similar performance for all values of δD for a fixed δA and therefore do not include that plot. This is an expected behavior since oP is prioritized and shaping is performed in response to π.

[Figure 7: Effect of δA and δD on the average NSE penalty.]

Shaping with multiple actors. We measure the cumulative NSE penalty incurred by varying the number of actors in the environment from 10 to 300. Figure 8 shows the results for the driving domain, with δD = 0 and δA = 25% of the optimal cost for oP, for each actor. The start and goal locations were randomly generated for the actors. Shaping and feedback are based on 100 observed trajectories of each actor. Shaping w/ budget results are plotted with b = 4, which is 50% of |Ω|. The feedback budget for each actor is 500. Although the feedback approach is sometimes comparable to shaping when there are fewer actors, it requires overseeing each actor and providing individual feedback, which is impractical. Overall, as we increase the number of agents, a substantial reduction in NSE is observed with shaping. The time taken for shaping and shaping w/ budget are comparable up to 100 actors, beyond which we see considerable time savings from using shaping w/ budget. For 300 actors, the average time taken for shaping is 752.27 seconds and the time taken for shaping w/ budget is 490.953 seconds.

[Figure 8: Results with multiple actors.]
Discussion and Future Work

We present an actor-designer framework to mitigate the impacts of NSE by shaping the environment. The general idea of environment modification to influence the behaviors of the acting agents has been previously explored in other contexts (Zhang, Chen, and Parkes 2009), such as to accelerate agent learning (Randløv 2000), to quickly infer the goals of the actor (Keren, Gal, and Karpas 2014), and to maximize the agent's reward (Keren et al. 2017). We study how environment shaping can mitigate NSE. We also identify the conditions under which the actor's policy is unaffected by shaping and show that our approach guarantees bounded performance of the actor.

Our approach is well-suited for settings where the agent is well-trained to perform its assigned task and must perform it repeatedly, but NSE are discovered after deployment. Altering the agent's model after deployment requires the environment designer to be (or have the knowledge of) the designer of the AI system. Otherwise, the system will have to be suspended and returned to the manufacturer. Our framework provides a tool for the users to mitigate NSE without making any assumptions about the agent's model, its learning capabilities, the representation format for the agent to effectively use new information, or the designer's knowledge about the agent's model. We argue that in many contexts, it is much easier for users to reconfigure the environment than to accurately update the agent's model. Our algorithm guides the users in the shaping process.

In addition to the greedy approach described in Algorithm 2, we also tested a clustering-based approach to identify diverse modifications. In our experiments, the clustering approach performed comparably to the greedy approach in identifying diverse modifications but had a longer run time. It is likely that the benefits of the clustering approach would be more evident in settings with a much larger Ω, since it can quickly group similar modifications. In the future, we aim to experiment with a large set of modifications, say |Ω| > 100, and compare the performance of the clustering approach with that of Algorithm 2.

When the actor's model Ma has a large state space, existing algorithms for solving large MDPs may be leveraged. When Ma is solved approximately, without bounded guarantees, slack guarantees cannot be established. Extending our approach to multiple actors with different models is an interesting future direction.

We conducted experiments with human subjects to validate their willingness to perform environment shaping. In our user study, we focused on scenarios that are more accessible to participants from diverse backgrounds so that we can obtain reliable responses regarding their overall attitude to NSE and shaping. In the future, we aim to evaluate whether Algorithm 1 aligns with human judgment in selecting the best modification. Additionally, we will examine ways to automatically identify useful modifications in a large space of valid modifications.

Acknowledgments

Support for this work was provided in part by the Semiconductor Research Corporation under grant #2906.001.
References

Agashe, N.; and Chapman, S. 2019. Traffic Signs in the Evolving World of Autonomous Vehicles. Technical report, Avery Dennison.
Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565.
Downs, J. S.; Holbrook, M. B.; Sheng, S.; and Cranor, L. F. 2010. Are Your Participants Gaming the System? Screening Mechanical Turk Workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2399–2402.
Guestrin, C.; Koller, D.; and Parr, R. 2001. Max-norm Projections for Factored MDPs. In Proceedings of the 17th International Joint Conference on Artificial Intelligence.
Hadfield-Menell, D.; Milli, S.; Abbeel, P.; Russell, S. J.; and Dragan, A. 2017. Inverse Reward Design. In Advances in Neural Information Processing Systems, 6765–6774.
Harsanyi, J. C. 1967. Games with Incomplete Information Played by "Bayesian" Players, Part I. The Basic Model. Management Science 14(3): 159–182.
Hendrickson, C.; Biehler, A.; and Mashayekh, Y. 2014. Connected and Autonomous Vehicles 2040 Vision. Pennsylvania Department of Transportation.
Keren, S.; Gal, A.; and Karpas, E. 2014. Goal Recognition Design. In Proceedings of the 24th International Conference on Automated Planning and Scheduling.
Keren, S.; Pineda, L.; Gal, A.; Karpas, E.; and Zilberstein, S. 2017. Equi-Reward Utility Maximizing Design in Stochastic Environments. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 4353–4360.
McFarland, M. 2020. Michigan Plans to Redesign a Stretch of Road for Self-Driving Cars. URL https://www.cnn.com/2020/08/13/cars/michigan-self-driving-road/index.html.
Randløv, J. 2000. Shaping in Reinforcement Learning by Changing the Physics of the Problem. In Proceedings of the 17th International Conference on Machine Learning.
Saisubramanian, S.; Kamar, E.; and Zilberstein, S. 2020. A Multi-Objective Approach to Mitigate Negative Side Effects. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, 354–361.
Saisubramanian, S.; Zilberstein, S.; and Kamar, E. 2020. Avoiding Negative Side Effects due to Incomplete Knowledge of AI Systems. CoRR abs/2008.12146.
Shah, R.; Krasheninnikov, D.; Alexander, J.; Abbeel, P.; and Dragan, A. 2019. Preferences Implicit in the State of the World. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
Simon, M. 2019. Inside the Amazon Warehouse Where Humans and Machines Become One. URL https://www.wired.com/story/amazon-warehouse-robots/.
Zhang, H.; Chen, Y.; and Parkes, D. C. 2009. A General Approach to Environment Design With One Agent. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2002–2014.
Zhang, S.; Durfee, E. H.; and Singh, S. P. 2018. Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 4867–4873.