TensorFlow solution for the MountainCarContinuous-v0 problem.
The Problem
This was the first OpenAI Gym problem I tried to solve; it looked interesting and was easy to understand.
The problem places a car at the bottom of a valley, and our goal is to write a program that helps the car climb the hill. The car's engine is not powerful enough to do this on its own, so we need to drive the car back and forth until it gains enough momentum to reach the top. The less energy consumed, the higher the score.
The source code for the environment can be seen here.
The reward is -1 for each time step until the goal position (0.5) is reached. An episode stops when the car reaches the goal or after 200 steps.
The Solution
The first thing I did was explore the problem and see how the car moves. All I had to do was load the mountain-car environment and call env.action_space.sample() to get random actions. This moves the car back and forth with values drawn from the ranges defined in the environment's source code.
After 200 steps the episode stopped and the car, unsurprisingly, was unable to climb the mountain. More importantly, this lets us see the data the environment generates - the car's position and velocity - and its relation to the reward.
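A minimal sketch of this exploration step. It assumes the classic Gym API (where env.reset() returns only the observation and env.step() returns four values) and the MountainCar-v0 environment id, since the three discrete actions described in this write-up belong to that version of the problem:

```python
import gym

env = gym.make('MountainCar-v0')
observation = env.reset()

for step in range(200):
    env.render()
    action = env.action_space.sample()              # random action: 0, 1 or 2
    observation, reward, done, info = env.step(action)
    position, velocity = observation                # the two state variables
    print(step, position, velocity, reward)
    if done:
        break

env.close()
```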
Before implementing a neural network we need enough data from which the car can learn to move correctly. To do this I implemented the function model_data_preperation, which moves the car randomly using the three actions: 0 (move left), 1 (rest) and 2 (move right). Instead of setting the reward to -1 for every action, it is set to 1 whenever the car's position gets closer to the top of the mountain. Once an episode's score is greater than -198, its data is added to the list of accepted scores. The game is played 10,000 times this way, which gives us data describing which movements are beneficial and which are not.
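A sketch of how such a data-collection function might look. The shaped reward, the -198 threshold and the 10,000 games come from the description above; the function signature, the interpretation of "getting closer to the top" as an increase in position, and the classic Gym API are my assumptions, not the original code:

```python
def model_data_preperation(env, games=10000, steps=200, score_threshold=-198):
    """Collect (observation, action) pairs from episodes that score above the threshold."""
    training_data = []
    accepted_scores = []
    for _ in range(games):
        score = 0
        game_memory = []                    # (previous observation, action) pairs for this episode
        previous_observation = env.reset()
        for _ in range(steps):
            action = env.action_space.sample()          # random action: 0, 1 or 2
            observation, reward, done, info = env.step(action)
            game_memory.append([previous_observation, action])
            # Reward shaping: count the step as +1 when the car moved
            # closer to the top of the hill, otherwise keep the -1 penalty.
            if observation[0] > previous_observation[0]:
                reward = 1
            previous_observation = observation
            score += reward
            if done:
                break
        if score > score_threshold:
            accepted_scores.append(score)
            training_data.extend(game_memory)
    return training_data, accepted_scores
```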
Next I built a sequential model that learns from the generated data and trained it.
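A sketch of a sequential model trained on the collected (observation, action) pairs. The layer sizes, the number of epochs and the use of tf.keras are assumptions; only the fact that a sequential model is trained on the generated data comes from the text:

```python
import numpy as np
from tensorflow import keras

def build_and_train_model(training_data):
    # Inputs are the 2-dimensional observations (position, velocity);
    # targets are the actions, one-hot encoded over the 3 discrete choices.
    X = np.array([obs for obs, action in training_data])
    y = keras.utils.to_categorical([action for obs, action in training_data], num_classes=3)

    model = keras.Sequential([
        keras.layers.Dense(128, activation='relu', input_shape=(2,)),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(3, activation='softmax'),    # probability of each action
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X, y, epochs=5, verbose=1)
    return model
```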
The car is then made to move again, only this time the actions are chosen by the trained model and the reward is the original one, i.e. always -1 until the goal position is reached.
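A sketch of this final evaluation loop. Choosing the action as the argmax of the model's predicted probabilities is my assumption of how the trained model sets the actions:

```python
import numpy as np

def play_with_model(env, model, steps=200):
    score = 0
    observation = env.reset()
    for _ in range(steps):
        env.render()
        # Let the trained model pick the action with the highest predicted probability.
        action = int(np.argmax(model.predict(observation.reshape(1, -1), verbose=0)))
        observation, reward, done, info = env.step(action)
        score += reward                     # back to the original reward: -1 per step until the goal
        if done:
            break
    return score
```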