Institutional Repository
Technical University of Crete


Least-squares policy iteration

Lagoudakis Michael, Parr Ronald

Simple Record


URI: http://purl.tuc.gr/dl/dias/23F40241-5991-47D2-A68C-3B6EB59A4567
Language: en
Size: 42
Title: Least-squares policy iteration
Creator: Lagoudakis Michael
Creator: Λαγουδακης Μιχαηλ
Creator: Parr Ronald
Description: Publication in a scientific journal
Abstract: We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference algorithms. Heretofore, LSTD has not had a straightforward application to control problems mainly because LSTD learns the state value function of a fixed policy which cannot be used for action selection and control without a model of the underlying process. Our new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework. LSPI is a model-free, off-policy method which can use efficiently (and reuse in each iteration) sample experiences collected in any manner. By separating the sample collection method, the choice of the linear approximation architecture, and the solution method, LSPI allows for focused attention on the distinct elements that contribute to practical reinforcement learning. LSPI is tested on the simple task of balancing an inverted pendulum and the harder task of balancing and riding a bicycle to a target location. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning (both with and without experience replay) using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants were rarely able to balance for more than a small fraction of the time needed to reach the target location.
Type: Peer-Reviewed Journal Publication
License: http://creativecommons.org/licenses/by/4.0/
Date: 2015-10-28
Date of Publication: 2003
Subject: Reinforcement Learning
Subject: Markov Decision Processes
Subject: Approximate Policy Iteration
Subject: Value-Function Approximation
Subject: Least-Squares Methods
Bibliographic Citation: M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," Journal of Machine Learning Research, vol. 4, pp. 1107-1149, 2003.
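
The abstract above describes LSPI as an LSTD-style evaluation step (learning the state-action value function of the current policy from a fixed batch of samples) nested inside a policy-iteration loop that reuses the same batch at every iteration. The following is a minimal Python sketch of that structure under stated assumptions, not the authors' reference implementation: the function names (`lstdq`, `lspi`), the sample format `(s, a, r, s_next, done)`, the ridge term `reg`, and the stopping tolerance are illustrative choices.

```python
import numpy as np

def lstdq(samples, phi, n_actions, gamma, w_pi, reg=1e-6):
    """One evaluation step: fit weights for the state-action value function of the
    greedy policy induced by w_pi, using the fixed batch of samples."""
    k = phi(*samples[0][:2]).shape[0]
    A = reg * np.eye(k)          # small ridge term for invertibility (an assumption)
    b = np.zeros(k)
    for s, a, r, s_next, done in samples:
        phi_sa = phi(s, a)
        if done:
            phi_next = np.zeros(k)
        else:
            # greedy action of the current policy at the successor state
            a_next = max(range(n_actions), key=lambda u: phi(s_next, u) @ w_pi)
            phi_next = phi(s_next, a_next)
        A += np.outer(phi_sa, phi_sa - gamma * phi_next)
        b += phi_sa * r
    return np.linalg.solve(A, b)

def lspi(samples, phi, n_actions, gamma=0.95, n_iters=20, tol=1e-4):
    """Approximate policy iteration: alternate evaluation and greedy improvement,
    reusing the same batch of samples in every iteration."""
    k = phi(*samples[0][:2]).shape[0]
    w = np.zeros(k)              # initial weights induce an arbitrary initial policy
    for _ in range(n_iters):
        w_new = lstdq(samples, phi, n_actions, gamma, w)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

With a one-hot feature map phi(s, a) over a small discrete problem this reduces to policy iteration on the sampled transitions; with the linear architectures discussed in the paper, the same batch of arbitrarily collected samples is reused in every iteration, which is the sample-efficiency point the abstract emphasizes.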
