## Generalizing Plans to New Environments in Multiagent Relational MDPs


Carlos Guestrin, Daphne Koller (Stanford University)

## Multiagent Coordination Examples

- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control

Common challenges:

- Multiple, simultaneous decisions
- Exponentially large state and action spaces
- Limited observability
- Limited communication

## Real-time Strategy Game

- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen

(figure: peasant, footman, building)

## Scaling up by Generalization

- Exploit similarities between world elements
- Generalize plans:
  - from a set of worlds to a new, unseen world
  - avoid the need to replan
  - tackle larger problems
- Goals: formalize the notion of "similar" elements; compute generalizable plans

## Relational Models and MDPs

- Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, ...
- Relations: Collects, Builds, Trains, Attacks, ...
- Instances: Peasant1, Peasant2, Footman1, Enemy1, ...
- Value functions at the class level: objects of the same class make the same contribution to the value function
- Factored-MDP equivalents of PRMs [Koller & Pfeffer '98]

## Relational MDPs

(figure: class-level DBN fragment — Peasant P and Gold G with action A_P and next-state variables P', G', linked by the Collects relation)

- Class-level transition probabilities depend on an object's attributes, its action, and the attributes of related objects
- Class-level reward function
- An instantiation (a world) fixes the number of objects and the relations between them
- Together these yield a well-defined MDP

## Planning in a World

- Long-term planning by solving the MDP
- Number of states is exponential in the number of objects
- Number of actions is exponential as well
- Efficient approximation by exploiting structure!
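To make the relational structure concrete, here is a minimal Python sketch of the central idea: all instances of a class share one class-level transition model, and a world is just an instantiation with some number of objects and relations. The class, attribute, and action names follow the talk; the dynamics themselves are invented toy choices.

```python
class PeasantClass:
    """Class-level dynamics: every Peasant instance shares this model."""

    @staticmethod
    def next_state(carrying, gold_available, action):
        # P(P' | P, G, A_P): in this toy model a peasant ends up
        # carrying gold only if it chooses 'collect' and the related
        # Gold object is available (deterministic for simplicity).
        return action == "collect" and gold_available

# A "world" is an instantiation: a set of objects plus relations.
world = {"peasants": ["Peasant1", "Peasant2"], "gold_available": True}

# Each instance reuses the same class-level model; this sharing is
# what lets value functions generalize across worlds.
state = {p: False for p in world["peasants"]}
actions = {"Peasant1": "collect", "Peasant2": "idle"}
next_state = {
    p: PeasantClass.next_state(state[p], world["gold_available"], actions[p])
    for p in world["peasants"]
}
print(next_state)  # -> {'Peasant1': True, 'Peasant2': False}
```

Because the transition model lives on the class rather than on the instance, a value function learned for one world can be reused in a world with more peasants — the generalization step the talk develops next.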
- The RMDP world is a factored MDP

## Roadmap to Generalization

- Solve one world
- Compute a generalizable value function
- Tackle a new world

## World Is a Factored MDP

(figure: DBN over state variables P, G, F, H, E with actions A_P, A_F, reward R, and next-state variables P', G', F', E' — state, dynamics, decisions, and rewards are all factored; e.g. the footman's transition factors as P(F' | F, G, H, A_F))

## Long-term Utility = Value of MDP

[Manne '60]

- The value is computed by linear programming:
  - one variable V(x) for each state x
  - one constraint for each state x and action a
- Number of states and actions is exponential!

## Approximate Value Functions

Linear combination of restricted-domain functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99, '00] [Guestrin et al. '01]

- Each V_o depends on the state of one object and its related objects:
  - state of the footman
  - status of the barracks
- Must find V_o giving a good approximate value function

## Single LP Solution for Factored MDPs

[Schweitzer & Seidmann '85]

- One variable for each V_o, for each object: polynomially many LP variables
- One constraint for every state and action: exponentially many LP constraints
- But V_o and Q_o depend only on small sets of variables and actions
- Exploit structure as in variable elimination [Guestrin, Koller, Parr '01]

## Representing Exponentially Many Constraints

Exponentially many linear constraints = one nonlinear constraint

## Variable Elimination

(figure: graph over variables A, B, C, D)

Can use variable elimination to maximize over the state space [Bertele & Brioschi '72]:

    max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
      = max_{A,B,C} [ f1(A,B) + f2(A,C) + max_D [ f3(C,D) + f4(B,D) ] ]
      = max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ]

- Here we need only 23 instead of 63 sum operations
- As in Bayes nets, maximization is exponential in the tree-width

## Representing the Constraints

- Functions are factored, so use variable elimination to represent the constraints
- The number of constraints becomes exponentially smaller

## Roadmap to Generalization

- Solve one world
- Compute a generalizable value function
- Tackle a new world

## Generalization

- Sample a set of worlds
- Solve a linear program for these worlds: obtain class value functions
- When faced with a new problem: use the class value function; no re-planning needed

## Worlds and RMDPs

- Meta-level MDP
- Meta-level LP

## Class-level Value Functions

- Approximate solution to the meta-level MDP
- Linear approximation
- Value function defined at the class level
- All instances use the same local value function

## Class-level LP

- Constraints for each world are represented by the factored LP
- Number of worlds is exponential or infinite
- Sample worlds from the distribution P(·)

## Theorem

- Exponentially (infinitely) many worlds — do we need exponentially many samples? NO!
- With the sampled worlds, the value function is within ε, with probability at least 1−δ
- R_max is the maximum class reward
- Proof method related to [de Farias & Van Roy '02]

## LP with Sampled Worlds

- Solve the LP for the sampled worlds
- Use the factored LP for each world
- Obtain a class-level value function
- New world: instantiate the value function and act

## Learning Classes of Objects

- Which classes of objects have the same value function?
- Plan for sampled worlds individually
- Use the value functions as "training data"
- Find objects with similar values
- Include features of the world
- Used decision-tree regression in experiments

## Summary of Generalization Algorithm

- Model the domain as a Relational MDP
- Pick local object value functions V_o
- Learn classes by solving some instances
- Sample a set of worlds
- The factored LP computes the class-level value function

## A New World

- When faced with a new world, instantiate the class-level value function
- The Q function follows from it
- At each state, choose the action maximizing Q(x, a)
- Number of actions is exponential!
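The variable-elimination maximization that the talk uses both for representing the LP constraints and, later, for action selection can be sketched directly. A minimal example, assuming binary variables and made-up payoff tables for f1, ..., f4 (the function names match the derivation above):

```python
from itertools import product

# Made-up factored payoff tables over binary variables A, B, C, D.
f1 = {(a, b): 2 * a - b for a in (0, 1) for b in (0, 1)}
f2 = {(a, c): a + 3 * c for a in (0, 1) for c in (0, 1)}
f3 = {(c, d): c * d for c in (0, 1) for d in (0, 1)}
f4 = {(b, d): 4 * b - 2 * d for b in (0, 1) for d in (0, 1)}

# Brute force: enumerate all assignments (exponential in #variables).
brute = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
            for a, b, c, d in product((0, 1), repeat=4))

# Variable elimination: push the max over D inward, producing a new
# factor g1(B,C) = max_D [f3(C,D) + f4(B,D)], then maximize the rest.
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in (0, 1))
      for b in (0, 1) for c in (0, 1)}
ve = max(f1[a, b] + f2[a, c] + g1[b, c]
         for a, b, c in product((0, 1), repeat=3))

assert brute == ve
print(ve)  # -> 9
```

Eliminating D touches only the factors that mention D, which is why the cost grows with the induced (tree-) width of the graph rather than with the total number of variables.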
- But each Q_C depends only on a few objects!

## Local Q-Function Approximation

(figure: coordination graph over agents M1–M4; Q3 is associated with agent 3, which observes only X2 and X3)

    Q(A1, ..., A4, X1, ..., X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2)
                                  + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)

- Limited observability: agent i observes only the variables in Q_i
- Must choose a joint action maximizing Σ_i Q_i

## Maximizing Σ_i Q_i: Coordination Graph

(figure: coordination graph over agent actions A1–A4; e.g. "if A2 attacks and A3 defends, then A4 gets $10")

Use variable elimination for the maximization [Bertele & Brioschi '72]:

    max_{A1,...,A4} [ Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4) ]
      = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + max_{A4} [ Q3(A3,A4) + Q4(A2,A4) ] ]
      = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + g1(A2,A3) ]

- Limited communication suffices for the optimal action choice
- Communication bandwidth = induced width of the coordination graph

## Summary of Algorithm

- Model the domain as a Relational MDP
- The factored LP computes the class-level value function
- Reuse the class-level value function in the new world

## Experimental Results

- SysAdmin problem
- Network topologies: star, unidirectional ring, ring of rings (figure: server networks)

## Classes of Objects Discovered

- Learned 3 classes: server, intermediate, and leaf machines

## Results

[with Gearhart and Kanodia]

- 2 Peasants, Gold, Wood, Barracks, 2 Footmen, Enemy
- Reward for dead enemy
- About 1 million state/action pairs
- Solved with the factored LP
- Some factors are exponential
- Coordination graph for action selection

## Generalization

- 9 Peasants, Gold, Wood, Barracks, 3 Footmen, Enemy
- Reward for dead enemy
- About 3 trillion state/action pairs
- Instantiate the generalizable value function
- At run time, factors are polynomial
- Coordination graph for action selection

## The 3 Aspects of This Talk

- Scaling up collaborative multiagent planning: exploiting structure; generalization
- Factored representations and algorithms: Relational MDPs, the factored LP, coordination graphs
- Freecraft as a benchmark domain

## Conclusions

- RMDPs: a compact representation for a set of similar planning problems
- Solve a single instance with factored MDP algorithms
- Tackle sets of problems with class-level value functions:
  - efficient sampling of worlds
  - learned classes of value functions
- Generalization to new domains:
  - avoid replanning
  - solve larger, more complex MDPs
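To close, the coordination-graph action selection described in the talk can be sketched end to end: eliminate one agent's action, record its best response, then backtrack to recover the optimal joint action. The four pairwise Q-functions mirror the structure of the talk's example, but their payoffs (including the "A2 attacks, A3 defends" bonus) are invented for illustration.

```python
from itertools import product

ACTIONS = ("attack", "defend")

# Invented pairwise Q-functions on the coordination graph edges.
Q1 = {(a1, a2): 1.0 if a1 == a2 else 0.0 for a1 in ACTIONS for a2 in ACTIONS}
Q2 = {(a1, a3): 2.0 if a3 == "defend" else 0.0 for a1 in ACTIONS for a3 in ACTIONS}
Q3 = {(a3, a4): 1.5 if a3 != a4 else 0.0 for a3 in ACTIONS for a4 in ACTIONS}
Q4 = {(a2, a4): 10.0 if (a2, a4) == ("attack", "defend") else 0.0
      for a2 in ACTIONS for a4 in ACTIONS}

# Eliminate A4: g(A2,A3) = max_{A4} [Q3(A3,A4) + Q4(A2,A4)], remembering
# the maximizing choice so agent 4's action can be recovered later.
g, best_a4 = {}, {}
for a2, a3 in product(ACTIONS, repeat=2):
    vals = {a4: Q3[a3, a4] + Q4[a2, a4] for a4 in ACTIONS}
    best_a4[a2, a3] = max(vals, key=vals.get)
    g[a2, a3] = vals[best_a4[a2, a3]]

# Maximize the remaining, smaller problem over A1, A2, A3, then backtrack.
a1, a2, a3 = max(product(ACTIONS, repeat=3),
                 key=lambda t: Q1[t[0], t[1]] + Q2[t[0], t[2]] + g[t[1], t[2]])
a4 = best_a4[a2, a3]
print((a1, a2, a3, a4))  # -> ('attack', 'attack', 'defend', 'defend')
```

Each elimination step requires communication only between the agents sharing a factor, matching the talk's claim that the communication bandwidth equals the induced width of the coordination graph.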