Institutional Repository
Technical University of Crete

Deep reinforcement learning in the Flatland multi-agent environment

Ntaountakis Stavros

URI: http://purl.tuc.gr/dl/dias/CBCE2963-77DB-4CEE-81D4-190709E5A62B
Year: 2021
Type of Item: Diploma Work
Bibliographic Citation: Stavros Ntaountakis, "Deep reinforcement learning in the Flatland multi-agent environment", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2021. https://doi.org/10.26233/heallink.tuc.90660

Summary

Over the last few years, railway traffic networks have been growing in size and complexity due to ever-increasing transportation demands. As a result, railway companies such as the Swiss Federal Railways need to adapt constantly to these demands. Flatland is a simplified 2D grid simulation that mimics the dynamics of a railway network; it was developed as an open sandbox to accelerate academic research on the Vehicle Rescheduling Problem (VRSP) in the fields of Machine Learning and Operations Research.

Flatland exhibits many of the problems that commonly arise in multi-agent systems. The coexistence of multiple autonomous agents results in a non-stationary environment and a partially observable state space. At the same time, the rewards the agents receive are sparse and delayed, since a coordinated sequence of actions is usually required to yield positive rewards.

Under these considerations, in this thesis we implement and adapt various Deep Reinforcement Learning methods to the Flatland environment. We systematically compare and evaluate both value-based and policy-based methods on several metrics of performance and reliability, ensuring consistent and fair training conditions by training and evaluating every agent under a strictly defined setup. We implement the standard DQN method, as well as the Double DQN and Dueling Double DQN variants, and adapt them to the multi-agent setting (the key computations are sketched below). Additionally, we implement a modified PPO agent as well as a superior PPO variant attached to a replay buffer. Lastly, we propose SIL, an agent that combines PPO with Self-Imitation Learning and converges to a successful policy in most environment settings. SIL exhibits superior performance with respect to all other agents we implemented and tested.
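The value-based agents named in the abstract differ mainly in how the bootstrap target and the Q-value head are formed. Below is a minimal PyTorch sketch of those two ingredients, the Double DQN target and a dueling head; the `online_net`, `target_net`, and tensor arguments are hypothetical placeholders, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

def double_dqn_targets(online_net, target_net, rewards, next_obs, dones, gamma=0.99):
    # Double DQN (van Hasselt et al., 2016): the online network *selects*
    # the greedy next action while the target network *evaluates* it,
    # reducing the overestimation bias of vanilla DQN.
    with torch.no_grad():
        next_actions = online_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q

class DuelingHead(nn.Module):
    # Dueling architecture (Wang et al., 2016): separate state-value and
    # advantage streams, recombined as Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).
    def __init__(self, hidden_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)
        self.advantage = nn.Linear(hidden_dim, n_actions)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)
```

One plausible multi-agent adaptation, common in Flatland work, is parameter sharing: a single such network is trained on the pooled transitions of all agents. The abstract does not specify which adaptation the thesis uses.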
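On the policy-based side, PPO optimizes a clipped surrogate objective. The following is a compact sketch of that objective, again with hypothetical tensor arguments rather than the thesis's code.

```python
import torch

def ppo_clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    # Clipped surrogate objective (Schulman et al., 2017).
    # new_logp / old_logp: log-probabilities of the taken actions under the
    # current and behaviour policies; advantages: estimated advantages.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing the surrogate is equivalent to minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```

The abstract's "PPO agent attached to a replay buffer" implies reusing stored off-policy transitions within this on-policy objective; the probability ratio above is what makes limited reuse of slightly stale data tolerable, though the abstract does not detail the exact scheme.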
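The proposed SIL agent presumably builds on Self-Imitation Learning (Oh et al., 2018), in which transitions whose stored return exceeds the current value estimate are replayed and imitated. A minimal sketch of that loss under this assumption, with hypothetical arguments:

```python
import torch

def sil_loss(logp, values, returns, value_coef=0.01):
    # Self-Imitation Learning loss (Oh et al., 2018): only the positive part
    # of (R - V(s)) contributes, so the agent imitates its own past behaviour
    # only where it outperformed its current value estimate.
    advantage = torch.clamp(returns - values, min=0.0)  # (R - V(s))_+
    policy_loss = -(logp * advantage.detach()).mean()
    value_loss = 0.5 * (advantage ** 2).mean()
    return policy_loss + value_coef * value_loss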
