Project 5
Spring 2024
Due: T 4/30 @ 11:59 PM ET
10 points
Last one! \o/ For this project, you will implement policy iteration and Q learning, which will find control policies in grid worlds. I put starter code for this project on cs-class. You can retrieve it by typing
cs-class-1$ cp ~maloofm/cosc3450/p5.zip ./
The file format for grid worlds is similar to the format for p1's mazes. As an example, see Russell and Norvig's grid world depicted in Figure 17.1. The contents of the file rnGrid.lay for Russell and Norvig's grid world are:
%%%%%% % % % % -% % +% %%%%%%Percent signs (%) represent walls. A plus sign ($+$) indicates that the state has a reward of $+1.0$. A negative sign ($-$) indicates that the state has a reward of $-1.0$. The percent sign in the top-left corner is at the coordinate position $(0, 0)$. The percent sign at the bottom-right corner is at the coordinate position $(X-1, Y-1)$, where $X$ and $Y$ are grid world's sizes for the $x$ and $y$ dimensions, respectively.
In this representation of Russell and Norvig's grid world, the state $(4, 3)$ is a terminal state with a positive reward. If the agent is in state $(1, 1)$, then the action to move north probabilistically transitions the agent to state $(1, 2)$. If the agent is in state $(1, 1)$, then the action to move east probabilistically transitions the agent to state $(2, 1)$. There are five grid files in the zip file that I provided.
To implement GridWorld, derive GridAction from Action. Derive GridState from State. Derive GridWorld from World. GridWorld(String filename) should load a grid world from a file with the name stored in filename. Feel free to port any relevant code from p1 to p5.
Implement PolicyIteration using Russell and Norvig's algorithm for modified policy iteration. It will find optimal policies for a grid world with nondeterministic actions and deterministic rewards. The PolicyIteration.main that I provided implements the application logic for the class.
Implement Sutton and Barto's algorithm for Q-learning using an $\epsilon$-greedy strategy. You can find this algorithm in your lecture notes or on page 131 of Sutton and Barto's book Reinforcement Learning: An Introduction. The QLearner.main that I provided implements the application logic for the class.
Note that GridWorld implements methods that only PolicyIteration uses, that only QLearner uses, and that both use. For example, policy iteration uses the transition model, whereas Q learning does not have access to the model. Q learning involves taking random actions in the environment, whereas policy iteration takes no actions in the environment. This may be confusing, but I think this design decision is preferable to implementing two different worlds with a lot of overlap and duplicate code. I recommend implementing GridWorld completely before starting on PolicyIteration or QLearner. Once you have accomplished this, it should be easy to determine which methods of GridWorld are needed to implement PolicyIteration and QLearner.
In accordance with the class policies and Georgetown's Honor System, I certify that, with the exceptions of the course materials and those items noted below, I have neither given nor received any assistance on this project. Name NetID
When you are ready to submit your project for grading, put your source files, Makefile, and honor statement in a zip file named submit.zip. Upload the zip file to Autolab using the assignment p5. Make sure you remove all debugging output before submitting.
Copyright © 2024 Mark Maloof. All Rights Reserved. This material may not be published, broadcast, rewritten, or redistributed.