The most recent project is self-driving vehicles in Grand Theft Auto V. The GTA V environment was chosen specifically for its challenging nature: road conditions change, other drivers behave uniquely, and even the ambient time and weather change, much like the real world. Much of the GTA V integration work is open source, and I have also published quite a few tutorials on self-driving vehicles in GTA V.
If you would like to see the latest version of the AI, check out the Twitch self-driving car AI Stream.
I should stress: the goal of this project is *not* currently to make a law-abiding self-driving vehicle that we might put on real streets, but rather to use AI to solve a very challenging problem, such as evading police at high speed through a crowded city.
I am covering v0.05 and v0.06 together, since there was only one model change between them.
v0.05 added granular control via Xbox 360 controller emulation, and v0.06 added a visual speedometer for the AI to learn from (which is what required a new model).
Allowing the AI, Charles, to control throttle and steering more smoothly helps in many situations, such as sweeping turns or simply turning at intersections. Rather than reserving the network's output for single, all-or-nothing actions, Charles now scales his inputs according to the network's output for each action.
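As an illustration of this (not the project's exact code), here is a minimal sketch of turning per-action network outputs into analog controls through an emulated Xbox 360 controller, assuming the third-party vgamepad library; the prediction keys are hypothetical:

import vgamepad as vg

gamepad = vg.VX360Gamepad()

def apply_controls(prediction):
    """prediction: dict with 'left', 'right', 'throttle', 'brake' values in [0, 1]."""
    # Steering: combine left/right strengths into one axis in [-1, 1].
    steer = prediction['right'] - prediction['left']
    gamepad.left_joystick_float(x_value_float=steer, y_value_float=0.0)
    # Throttle and brake map to the triggers, each in [0, 1].
    gamepad.right_trigger_float(value_float=prediction['throttle'])
    gamepad.left_trigger_float(value_float=prediction['brake'])
    gamepad.update()  # push the new virtual controller state to the game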
Adding a visual speedometer to the game gives the AI the opportunity to learn how fast it is going. Judging from performance, the AI does appear to have learned a few things from the speedometer.
This is the same model from v0.03 (you can see it in the v0.03 report).
Training Data: ~1 million labeled 480x270 color game frames
Duration: Training took about 1 week.
Fitment: Training reached about 85% accuracy before out-of-sample performance began to decline relative to in-sample.
The video above comes from v0.05 (Xbox 360 emulation). This is a good example of the AI weaving in and out of traffic in a much more controlled manner. Prior to this model, the AI would have made maximum left and maximum right turns, making crashes much more likely.
Above is another example, one of quite a few instances where fine steering got us through a cluster of scenarios where full-lock turning would have had us crashing.
Along with more finely-tuned steering, Charles can now control his throttle granularly rather than having it either fully off or pedal to the metal.
An example of both throttle and steering control saving us:
Now bringing in the speedometer with v0.06:
Charles seems to handle turns even better with the additional information.
Issues:
Charles is far too dependent on the waypoints to drive. Most of the time, when entering a tunnel and losing the waypoints on the map (GTA V simulates "losing signal" on the map), Charles gets stuck in the tunnel. That said, he can sometimes be successful, for example:
Along with the above point, Charles still often runs directly into vehicles in front of him, giving me the impression he is focused too much on the map. That said, he does avoid cars and does things like modulating speed in most vehicles, keeping the speedometer needle at about 90 degrees in many cases.
Here's an example of Charles in a Ramp Buggy (a car that is basically just a ramp, so hitting other cars simply sends them flying):
The main goal for v0.03 is to raise the resolution the AI's model can see, and to give the AI an actual objective. Up until this point, the AI was trained, in general, to stay on the road and avoid obstacles, but it never had a set objective. If the AI got into a collision or bumped something and the course was changed, it would simply look for a road again and continue a random journey. Now the AI's task is to follow the map route to various waypoints, like a GPS guidance system.
The main objective here...is to give the AI an objective.
Hopefully continue to see improvements in the agent's driving, test the inclusion of waypoint-following in the training data, and increase the network's input resolution to 480x270.
At a resolution of 160x90, it's pretty challenging to read the game's map, and the intention is that the AI will still use only visual elements to make decisions. Here's an image comparing the game resolutions:
At first, I attempted to continue using the base version of Google's Inception V3 model, but this model was extremely slow to train once it reached the 20% area. The output layer has 9 options, so "random" is ~11.11%, since this data is balanced. At the beginning, things looked alright:
Various training steps along the way, colors are just due to different training times:
Unfortunately it continued to taper:
In terms of time:
Going nine hours without making any real progress on a Titan X Pascal GPU suggests to me that something is wrong. Of course, I knew the Inception model is perfectly capable of working with this data; it has been successful in the past, and the only change here is the resolution. I started by lowering the learning rate, which actually allowed me to get up to the 45% area in accuracy, but again, a plateau:
At this point, I decided to begin modifying Inception V3. Here's my modified version:
# Imports assume the TFLearn layer API used throughout this project.
import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d, avg_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression


def otherception3(width, height, frame_count, lr, output=9,
                  model_name='otherception.model', device='gpu', num='0'):
    with tf.device('/{}:{}'.format(device, num)):
        network = input_data(shape=[None, width, height, 3], name='input')

        # Stem: widened filters, strides, and pooling to absorb the 480x270 input.
        conv1_7_7 = conv_2d(network, 64, 28, strides=4, activation='relu', name='conv1_7_7_s2')
        pool1_3_3 = max_pool_2d(conv1_7_7, 9, strides=4)
        pool1_3_3 = local_response_normalization(pool1_3_3)
        conv2_3_3_reduce = conv_2d(pool1_3_3, 64, 1, activation='relu', name='conv2_3_3_reduce')
        conv2_3_3 = conv_2d(conv2_3_3_reduce, 192, 12, activation='relu', name='conv2_3_3')
        conv2_3_3 = local_response_normalization(conv2_3_3)
        pool2_3_3 = max_pool_2d(conv2_3_3, kernel_size=12, strides=2, name='pool2_3_3_s2')

        # Inception 3a
        inception_3a_1_1 = conv_2d(pool2_3_3, 64, 1, activation='relu', name='inception_3a_1_1')
        inception_3a_3_3_reduce = conv_2d(pool2_3_3, 96, 1, activation='relu', name='inception_3a_3_3_reduce')
        inception_3a_3_3 = conv_2d(inception_3a_3_3_reduce, 128, filter_size=12, activation='relu', name='inception_3a_3_3')
        inception_3a_5_5_reduce = conv_2d(pool2_3_3, 16, filter_size=1, activation='relu', name='inception_3a_5_5_reduce')
        inception_3a_5_5 = conv_2d(inception_3a_5_5_reduce, 32, filter_size=15, activation='relu', name='inception_3a_5_5')
        inception_3a_pool = max_pool_2d(pool2_3_3, kernel_size=12, strides=1)
        inception_3a_pool_1_1 = conv_2d(inception_3a_pool, 32, filter_size=1, activation='relu', name='inception_3a_pool_1_1')
        inception_3a_output = merge([inception_3a_1_1, inception_3a_3_3, inception_3a_5_5, inception_3a_pool_1_1], mode='concat', axis=3)

        # Inception 3b
        inception_3b_1_1 = conv_2d(inception_3a_output, 128, filter_size=1, activation='relu', name='inception_3b_1_1')
        inception_3b_3_3_reduce = conv_2d(inception_3a_output, 128, filter_size=1, activation='relu', name='inception_3b_3_3_reduce')
        inception_3b_3_3 = conv_2d(inception_3b_3_3_reduce, 192, filter_size=9, activation='relu', name='inception_3b_3_3')
        inception_3b_5_5_reduce = conv_2d(inception_3a_output, 32, filter_size=1, activation='relu', name='inception_3b_5_5_reduce')
        inception_3b_5_5 = conv_2d(inception_3b_5_5_reduce, 96, filter_size=15, name='inception_3b_5_5')
        inception_3b_pool = max_pool_2d(inception_3a_output, kernel_size=12, strides=1, name='inception_3b_pool')
        inception_3b_pool_1_1 = conv_2d(inception_3b_pool, 64, filter_size=1, activation='relu', name='inception_3b_pool_1_1')
        inception_3b_output = merge([inception_3b_1_1, inception_3b_3_3, inception_3b_5_5, inception_3b_pool_1_1], mode='concat', axis=3, name='inception_3b_output')

        pool3_3_3 = max_pool_2d(inception_3b_output, kernel_size=3, strides=2, name='pool3_3_3')

        # Inception 4a
        inception_4a_1_1 = conv_2d(pool3_3_3, 192, filter_size=1, activation='relu', name='inception_4a_1_1')
        inception_4a_3_3_reduce = conv_2d(pool3_3_3, 96, filter_size=1, activation='relu', name='inception_4a_3_3_reduce')
        inception_4a_3_3 = conv_2d(inception_4a_3_3_reduce, 208, filter_size=3, activation='relu', name='inception_4a_3_3')
        inception_4a_5_5_reduce = conv_2d(pool3_3_3, 16, filter_size=1, activation='relu', name='inception_4a_5_5_reduce')
        inception_4a_5_5 = conv_2d(inception_4a_5_5_reduce, 48, filter_size=5, activation='relu', name='inception_4a_5_5')
        inception_4a_pool = max_pool_2d(pool3_3_3, kernel_size=3, strides=1, name='inception_4a_pool')
        inception_4a_pool_1_1 = conv_2d(inception_4a_pool, 64, filter_size=1, activation='relu', name='inception_4a_pool_1_1')
        inception_4a_output = merge([inception_4a_1_1, inception_4a_3_3, inception_4a_5_5, inception_4a_pool_1_1], mode='concat', axis=3, name='inception_4a_output')

        # Inception 4b
        inception_4b_1_1 = conv_2d(inception_4a_output, 160, filter_size=1, activation='relu', name='inception_4b_1_1')  # name fixed (original reused 'inception_4a_1_1')
        inception_4b_3_3_reduce = conv_2d(inception_4a_output, 112, filter_size=1, activation='relu', name='inception_4b_3_3_reduce')
        inception_4b_3_3 = conv_2d(inception_4b_3_3_reduce, 224, filter_size=3, activation='relu', name='inception_4b_3_3')
        inception_4b_5_5_reduce = conv_2d(inception_4a_output, 24, filter_size=1, activation='relu', name='inception_4b_5_5_reduce')
        inception_4b_5_5 = conv_2d(inception_4b_5_5_reduce, 64, filter_size=5, activation='relu', name='inception_4b_5_5')
        inception_4b_pool = max_pool_2d(inception_4a_output, kernel_size=3, strides=1, name='inception_4b_pool')
        inception_4b_pool_1_1 = conv_2d(inception_4b_pool, 64, filter_size=1, activation='relu', name='inception_4b_pool_1_1')
        inception_4b_output = merge([inception_4b_1_1, inception_4b_3_3, inception_4b_5_5, inception_4b_pool_1_1], mode='concat', axis=3, name='inception_4b_output')

        # Inception 4c
        inception_4c_1_1 = conv_2d(inception_4b_output, 128, filter_size=1, activation='relu', name='inception_4c_1_1')
        inception_4c_3_3_reduce = conv_2d(inception_4b_output, 128, filter_size=1, activation='relu', name='inception_4c_3_3_reduce')
        inception_4c_3_3 = conv_2d(inception_4c_3_3_reduce, 256, filter_size=3, activation='relu', name='inception_4c_3_3')
        inception_4c_5_5_reduce = conv_2d(inception_4b_output, 24, filter_size=1, activation='relu', name='inception_4c_5_5_reduce')
        inception_4c_5_5 = conv_2d(inception_4c_5_5_reduce, 64, filter_size=5, activation='relu', name='inception_4c_5_5')
        inception_4c_pool = max_pool_2d(inception_4b_output, kernel_size=3, strides=1)
        inception_4c_pool_1_1 = conv_2d(inception_4c_pool, 64, filter_size=1, activation='relu', name='inception_4c_pool_1_1')
        inception_4c_output = merge([inception_4c_1_1, inception_4c_3_3, inception_4c_5_5, inception_4c_pool_1_1], mode='concat', axis=3, name='inception_4c_output')

        # Inception 4d
        inception_4d_1_1 = conv_2d(inception_4c_output, 112, filter_size=1, activation='relu', name='inception_4d_1_1')
        inception_4d_3_3_reduce = conv_2d(inception_4c_output, 144, filter_size=1, activation='relu', name='inception_4d_3_3_reduce')
        inception_4d_3_3 = conv_2d(inception_4d_3_3_reduce, 288, filter_size=3, activation='relu', name='inception_4d_3_3')
        inception_4d_5_5_reduce = conv_2d(inception_4c_output, 32, filter_size=1, activation='relu', name='inception_4d_5_5_reduce')
        inception_4d_5_5 = conv_2d(inception_4d_5_5_reduce, 64, filter_size=5, activation='relu', name='inception_4d_5_5')
        inception_4d_pool = max_pool_2d(inception_4c_output, kernel_size=3, strides=1, name='inception_4d_pool')
        inception_4d_pool_1_1 = conv_2d(inception_4d_pool, 64, filter_size=1, activation='relu', name='inception_4d_pool_1_1')
        inception_4d_output = merge([inception_4d_1_1, inception_4d_3_3, inception_4d_5_5, inception_4d_pool_1_1], mode='concat', axis=3, name='inception_4d_output')

        # Inception 4e
        inception_4e_1_1 = conv_2d(inception_4d_output, 256, filter_size=1, activation='relu', name='inception_4e_1_1')
        inception_4e_3_3_reduce = conv_2d(inception_4d_output, 160, filter_size=1, activation='relu', name='inception_4e_3_3_reduce')
        inception_4e_3_3 = conv_2d(inception_4e_3_3_reduce, 320, filter_size=3, activation='relu', name='inception_4e_3_3')
        inception_4e_5_5_reduce = conv_2d(inception_4d_output, 32, filter_size=1, activation='relu', name='inception_4e_5_5_reduce')
        inception_4e_5_5 = conv_2d(inception_4e_5_5_reduce, 128, filter_size=5, activation='relu', name='inception_4e_5_5')
        inception_4e_pool = max_pool_2d(inception_4d_output, kernel_size=3, strides=1, name='inception_4e_pool')
        inception_4e_pool_1_1 = conv_2d(inception_4e_pool, 128, filter_size=1, activation='relu', name='inception_4e_pool_1_1')
        inception_4e_output = merge([inception_4e_1_1, inception_4e_3_3, inception_4e_5_5, inception_4e_pool_1_1], axis=3, mode='concat')

        pool4_3_3 = max_pool_2d(inception_4e_output, kernel_size=3, strides=2, name='pool_3_3')

        # Inception 5a
        inception_5a_1_1 = conv_2d(pool4_3_3, 256, filter_size=1, activation='relu', name='inception_5a_1_1')
        inception_5a_3_3_reduce = conv_2d(pool4_3_3, 160, filter_size=1, activation='relu', name='inception_5a_3_3_reduce')
        inception_5a_3_3 = conv_2d(inception_5a_3_3_reduce, 320, filter_size=3, activation='relu', name='inception_5a_3_3')
        inception_5a_5_5_reduce = conv_2d(pool4_3_3, 32, filter_size=1, activation='relu', name='inception_5a_5_5_reduce')
        inception_5a_5_5 = conv_2d(inception_5a_5_5_reduce, 128, filter_size=5, activation='relu', name='inception_5a_5_5')
        inception_5a_pool = max_pool_2d(pool4_3_3, kernel_size=3, strides=1, name='inception_5a_pool')
        inception_5a_pool_1_1 = conv_2d(inception_5a_pool, 128, filter_size=1, activation='relu', name='inception_5a_pool_1_1')
        inception_5a_output = merge([inception_5a_1_1, inception_5a_3_3, inception_5a_5_5, inception_5a_pool_1_1], axis=3, mode='concat')

        # Inception 5b
        inception_5b_1_1 = conv_2d(inception_5a_output, 384, filter_size=1, activation='relu', name='inception_5b_1_1')
        inception_5b_3_3_reduce = conv_2d(inception_5a_output, 192, filter_size=1, activation='relu', name='inception_5b_3_3_reduce')
        inception_5b_3_3 = conv_2d(inception_5b_3_3_reduce, 384, filter_size=3, activation='relu', name='inception_5b_3_3')
        inception_5b_5_5_reduce = conv_2d(inception_5a_output, 48, filter_size=1, activation='relu', name='inception_5b_5_5_reduce')
        inception_5b_5_5 = conv_2d(inception_5b_5_5_reduce, 128, filter_size=5, activation='relu', name='inception_5b_5_5')
        inception_5b_pool = max_pool_2d(inception_5a_output, kernel_size=3, strides=1, name='inception_5b_pool')
        inception_5b_pool_1_1 = conv_2d(inception_5b_pool, 128, filter_size=1, activation='relu', name='inception_5b_pool_1_1')
        inception_5b_output = merge([inception_5b_1_1, inception_5b_3_3, inception_5b_5_5, inception_5b_pool_1_1], axis=3, mode='concat')

        # Head: average pool, dropout, softmax over the 9 output classes.
        pool5_7_7 = avg_pool_2d(inception_5b_output, kernel_size=7, strides=1)
        pool5_7_7 = dropout(pool5_7_7, 0.4)
        loss = fully_connected(pool5_7_7, output, activation='softmax')
        network = regression(loss, optimizer='momentum',
                             loss='categorical_crossentropy',
                             learning_rate=lr, name='targets')
        model = tflearn.DNN(network, max_checkpoints=0,
                            tensorboard_verbose=0, tensorboard_dir='log')
    return model
The changes are in the first few layers, since this is what needs to handle the increased resolution.
Training Data: 1 million 480x270 color images mapped to keypresses.
Fitment: Interestingly, I found that out-of-sample accuracy tracked in-sample accuracy very well, but *only* on data created alongside the training data. I suspect this is due to the nature of the task: when shuffling data, the "test" frames, despite being unique, are still very similar to training frames. It would probably be wiser to separate testing and training data before shuffling and balancing, so the test set contains not only frames that have never been seen, but whole scenarios that haven't been seen, or at least aren't extremely close to the training data.
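Here is a minimal sketch of that idea, splitting by recording session before any shuffling (my own illustration; the file naming is hypothetical):

import glob
import random

import numpy as np

session_files = sorted(glob.glob('training_data-*.npy'))  # one file per recording session (hypothetical naming)
random.seed(42)
random.shuffle(session_files)

# Hold out whole sessions, so test frames come from drives the model never saw.
split = int(len(session_files) * 0.9)
train_files, test_files = session_files[:split], session_files[split:]

def load(files):
    data = np.concatenate([np.load(f, allow_pickle=True) for f in files])
    np.random.shuffle(data)  # shuffle only within the split
    return data

train_data, test_data = load(train_files), load(test_files)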
There we go, much better. Now, I could have continued training this model, but I was pretty darn tired of training models, so I went ahead and pushed v0.03 live, and was very happy with the results. About 10% of the training data contained waypoint-following data, and it was clear to me that the agent definitely saw the waypoints/markers on the map, but it was struggling. I will also note that the training above was done with a decaying learning rate: I started at 0.001 and ended at 0.00025.
Another major impact on training, besides decreasing the learning rate, was the batch size. Initially, I was using batches of just 500 samples, since that translated to 250MB of data. I then raised this to batches of 2K samples and noticed some improvement. I upped it to 4K, saw more improvement, and finally increased batch sizes to 10K samples, which was about 3-4GB of data per batch. I could probably continue to increase this, but, for now, this seems to be a good number.
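For context, here is a rough sketch (assumed details and file names, not the exact training script) of how this kind of chunked training with a manually decayed learning rate can look with TFLearn:

import glob
import os

import numpy as np
import tensorflow as tf

MODEL_PATH = 'otherception_v0_03.model'                    # hypothetical file names
chunk_paths = sorted(glob.glob('balanced_chunk-*.npy'))    # ~10K samples per file

for lr in [0.001, 0.0005, 0.00025]:                        # decay the learning rate between passes
    tf.reset_default_graph()
    model = otherception3(480, 270, frame_count=1, lr=lr)  # model defined above
    if os.path.exists(MODEL_PATH + '.meta'):
        model.load(MODEL_PATH)                             # resume from the previous pass
    for chunk_path in chunk_paths:
        data = np.load(chunk_path, allow_pickle=True)
        X = np.stack([frame for frame, keys in data])      # frames shaped to match the model input
        y = np.stack([keys for frame, keys in data])       # one-hot key combinations
        model.fit({'input': X}, {'targets': y}, n_epoch=1,
                  validation_set=0.05, show_metric=True,
                  run_id='otherception-v0.03')
        model.save(MODEL_PATH)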
Armed with the confidence that the model was at least capable of seeing the waypoints and paths on the game map, I went back and began making more waypoint-following data, and then trained the model again.
It was here that I discovered that "true" out-of-sample accuracy (no extremely similar frames shared between the out-of-sample and in-sample data) was more like 50%, which gives me a new metric by which to test and tweak future models.
After training the neural network on another ~300K samples of waypoint data, the AI is even better at following waypoints.
Somewhat comically, it's clear the AI may focus too much on the map rather than the game world, but we can certainly work on this. Even when missing an objective, Charles, our agent, will often try to get back on track, and is often successful:
U-turns appear to be the most challenging task for Charles, but he is slowly learning them:
Issues:
Missing objectives frequently has Charles trying to drive through walls to get back on track.
A lot of crashing still occurs, but Charles has significantly improved his handling of the vehicle. I'd still like to see this improved.
Charles performs VERY well at night, and actually much worse in the day. This used to be the opposite, so obviously I'd like to bring day-time performance up to where night performance is now.
Charles continues to exceed my expectations of what is possible with an AI working frame by frame with no memory, and now he even has a purpose as he drives. I would have thought that even an AI like this, as bad as Charles is, would still require some form of short-term memory.
With waypoints, I have begun tracking Charles's time between objectives, and this data could be used to bring in reinforcement learning. That said, I don't believe that is actually the correct path forward at this stage. Eventually, it would make sense, but I suspect that, while running a single instance, reinforcement learning would still take years upon years to see decent results, at least using methods that I can conjure up. I will continue thinking on this, but I suspect Charles will need much more human assistance moving forward.
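Purely as an illustration of what that could look like (not something implemented in the project), a reward derived from time between objectives might be as simple as:

import time

class ObjectiveTimer:
    """Tracks elapsed time per waypoint and emits a simple reward."""

    def __init__(self, timeout=300.0):
        self.timeout = timeout        # seconds before an objective counts as a failure
        self.start = time.time()

    def objective_reached(self):
        elapsed = time.time() - self.start
        self.start = time.time()
        # Faster arrivals earn larger rewards; timing out earns nothing.
        return max(0.0, 1.0 - elapsed / self.timeout)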
Future Considerations:
With model v0.02, color is introduced. With color, minimal model changes should be needed, but we will have to see how it plays out. The only change is that the input layer now has 3 channels for the R, G, and B color values.
It is expected that color will make feature creation and detection much easier for the network, resulting in more robust features as well as fewer mistakes in practice.
It is also expected that color will make learning to drive in other scenarios, like night, rain, and fog, more realistic.
I opted to also explore Google's Inception V3.
Be able to drive with similar or better performance than v0.01, yet also be able to drive in any condition available in the game (rain, smog, fog, storm, night, day, etc.).
Learn to fix more mistakes on its own. v0.01 required two major failsafes: one was a GTA V mod that would simply teleport and reset the player if their coordinates didn't change within a certain period of time. The other used motion detection to determine if the vehicle was stuck and would try to wiggle out of wherever it was, but there are many ways to get permanently stuck in GTA V. Since this agent is streamed live 24/7 for anyone to see, it needs some way to get unstuck.
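As a rough illustration of the second failsafe (details assumed; grab_frame and wiggle_out are hypothetical placeholders for screen capture and the recovery input sequence):

import time

import cv2
import numpy as np

STUCK_SECONDS = 10
MOTION_THRESHOLD = 2.0   # mean absolute pixel difference treated as "no motion"

def is_motionless(prev_frame, frame):
    diff = cv2.absdiff(prev_frame, frame)
    return float(np.mean(diff)) < MOTION_THRESHOLD

def watch_for_stuck(grab_frame, wiggle_out):
    """grab_frame() returns the current screen; wiggle_out() reverses and turns."""
    prev = grab_frame()
    still_since = None
    while True:
        time.sleep(1)
        frame = grab_frame()
        if is_motionless(prev, frame):
            still_since = still_since or time.time()
            if time.time() - still_since > STUCK_SECONDS:
                wiggle_out()
                still_since = None
        else:
            still_since = None
        prev = frame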
For this next version, I trained two neural network models: The same expanded AlexNet from before as well as Google's Inception V3.
Input: 160 x 90 RGB pixel data. I brought height back to 90 to have the more common 16:9 aspect ratio.
Output: keypresses in the form of W, S, A, or D, along with combinations like W+A, W+D, S+A, and S+D
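For reference, here is an illustrative one-hot encoding of the nine output classes (my own ordering, with a "no key" class assumed to round out the nine; the project's exact labels may differ):

# One row per class; the network's softmax output has the same nine slots.
w  = [1, 0, 0, 0, 0, 0, 0, 0, 0]
s  = [0, 1, 0, 0, 0, 0, 0, 0, 0]
a  = [0, 0, 1, 0, 0, 0, 0, 0, 0]
d  = [0, 0, 0, 1, 0, 0, 0, 0, 0]
wa = [0, 0, 0, 0, 1, 0, 0, 0, 0]
wd = [0, 0, 0, 0, 0, 1, 0, 0, 0]
sa = [0, 0, 0, 0, 0, 0, 1, 0, 0]
sd = [0, 0, 0, 0, 0, 0, 0, 1, 0]
nk = [0, 0, 0, 0, 0, 0, 0, 0, 1]  # assumed "no key" class; chance accuracy is ~11.11%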
See the writeup for 0.01 for the expanded AlexNet model. The Inception model used here:
# Imports assume the TFLearn layer API used throughout this project.
import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d, avg_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression


def inception_v3(width, height, frame_count, lr, output=9,
                 model_name='inceptionv3.model', device='gpu', num='0'):
    with tf.device('/{}:{}'.format(device, num)):
        network = input_data(shape=[None, width, height, 3], name='input')

        # Stem
        conv1_7_7 = conv_2d(network, 64, 7, strides=2, activation='relu', name='conv1_7_7_s2')
        pool1_3_3 = max_pool_2d(conv1_7_7, 3, strides=2)
        pool1_3_3 = local_response_normalization(pool1_3_3)
        conv2_3_3_reduce = conv_2d(pool1_3_3, 64, 1, activation='relu', name='conv2_3_3_reduce')
        conv2_3_3 = conv_2d(conv2_3_3_reduce, 192, 3, activation='relu', name='conv2_3_3')
        conv2_3_3 = local_response_normalization(conv2_3_3)
        pool2_3_3 = max_pool_2d(conv2_3_3, kernel_size=3, strides=2, name='pool2_3_3_s2')

        # Inception 3a
        inception_3a_1_1 = conv_2d(pool2_3_3, 64, 1, activation='relu', name='inception_3a_1_1')
        inception_3a_3_3_reduce = conv_2d(pool2_3_3, 96, 1, activation='relu', name='inception_3a_3_3_reduce')
        inception_3a_3_3 = conv_2d(inception_3a_3_3_reduce, 128, filter_size=3, activation='relu', name='inception_3a_3_3')
        inception_3a_5_5_reduce = conv_2d(pool2_3_3, 16, filter_size=1, activation='relu', name='inception_3a_5_5_reduce')
        inception_3a_5_5 = conv_2d(inception_3a_5_5_reduce, 32, filter_size=5, activation='relu', name='inception_3a_5_5')
        inception_3a_pool = max_pool_2d(pool2_3_3, kernel_size=3, strides=1)
        inception_3a_pool_1_1 = conv_2d(inception_3a_pool, 32, filter_size=1, activation='relu', name='inception_3a_pool_1_1')
        inception_3a_output = merge([inception_3a_1_1, inception_3a_3_3, inception_3a_5_5, inception_3a_pool_1_1], mode='concat', axis=3)

        # Inception 3b
        inception_3b_1_1 = conv_2d(inception_3a_output, 128, filter_size=1, activation='relu', name='inception_3b_1_1')
        inception_3b_3_3_reduce = conv_2d(inception_3a_output, 128, filter_size=1, activation='relu', name='inception_3b_3_3_reduce')
        inception_3b_3_3 = conv_2d(inception_3b_3_3_reduce, 192, filter_size=3, activation='relu', name='inception_3b_3_3')
        inception_3b_5_5_reduce = conv_2d(inception_3a_output, 32, filter_size=1, activation='relu', name='inception_3b_5_5_reduce')
        inception_3b_5_5 = conv_2d(inception_3b_5_5_reduce, 96, filter_size=5, name='inception_3b_5_5')
        inception_3b_pool = max_pool_2d(inception_3a_output, kernel_size=3, strides=1, name='inception_3b_pool')
        inception_3b_pool_1_1 = conv_2d(inception_3b_pool, 64, filter_size=1, activation='relu', name='inception_3b_pool_1_1')
        inception_3b_output = merge([inception_3b_1_1, inception_3b_3_3, inception_3b_5_5, inception_3b_pool_1_1], mode='concat', axis=3, name='inception_3b_output')

        pool3_3_3 = max_pool_2d(inception_3b_output, kernel_size=3, strides=2, name='pool3_3_3')

        # Inception 4a
        inception_4a_1_1 = conv_2d(pool3_3_3, 192, filter_size=1, activation='relu', name='inception_4a_1_1')
        inception_4a_3_3_reduce = conv_2d(pool3_3_3, 96, filter_size=1, activation='relu', name='inception_4a_3_3_reduce')
        inception_4a_3_3 = conv_2d(inception_4a_3_3_reduce, 208, filter_size=3, activation='relu', name='inception_4a_3_3')
        inception_4a_5_5_reduce = conv_2d(pool3_3_3, 16, filter_size=1, activation='relu', name='inception_4a_5_5_reduce')
        inception_4a_5_5 = conv_2d(inception_4a_5_5_reduce, 48, filter_size=5, activation='relu', name='inception_4a_5_5')
        inception_4a_pool = max_pool_2d(pool3_3_3, kernel_size=3, strides=1, name='inception_4a_pool')
        inception_4a_pool_1_1 = conv_2d(inception_4a_pool, 64, filter_size=1, activation='relu', name='inception_4a_pool_1_1')
        inception_4a_output = merge([inception_4a_1_1, inception_4a_3_3, inception_4a_5_5, inception_4a_pool_1_1], mode='concat', axis=3, name='inception_4a_output')

        # Inception 4b
        inception_4b_1_1 = conv_2d(inception_4a_output, 160, filter_size=1, activation='relu', name='inception_4b_1_1')  # name fixed (original reused 'inception_4a_1_1')
        inception_4b_3_3_reduce = conv_2d(inception_4a_output, 112, filter_size=1, activation='relu', name='inception_4b_3_3_reduce')
        inception_4b_3_3 = conv_2d(inception_4b_3_3_reduce, 224, filter_size=3, activation='relu', name='inception_4b_3_3')
        inception_4b_5_5_reduce = conv_2d(inception_4a_output, 24, filter_size=1, activation='relu', name='inception_4b_5_5_reduce')
        inception_4b_5_5 = conv_2d(inception_4b_5_5_reduce, 64, filter_size=5, activation='relu', name='inception_4b_5_5')
        inception_4b_pool = max_pool_2d(inception_4a_output, kernel_size=3, strides=1, name='inception_4b_pool')
        inception_4b_pool_1_1 = conv_2d(inception_4b_pool, 64, filter_size=1, activation='relu', name='inception_4b_pool_1_1')
        inception_4b_output = merge([inception_4b_1_1, inception_4b_3_3, inception_4b_5_5, inception_4b_pool_1_1], mode='concat', axis=3, name='inception_4b_output')

        # Inception 4c
        inception_4c_1_1 = conv_2d(inception_4b_output, 128, filter_size=1, activation='relu', name='inception_4c_1_1')
        inception_4c_3_3_reduce = conv_2d(inception_4b_output, 128, filter_size=1, activation='relu', name='inception_4c_3_3_reduce')
        inception_4c_3_3 = conv_2d(inception_4c_3_3_reduce, 256, filter_size=3, activation='relu', name='inception_4c_3_3')
        inception_4c_5_5_reduce = conv_2d(inception_4b_output, 24, filter_size=1, activation='relu', name='inception_4c_5_5_reduce')
        inception_4c_5_5 = conv_2d(inception_4c_5_5_reduce, 64, filter_size=5, activation='relu', name='inception_4c_5_5')
        inception_4c_pool = max_pool_2d(inception_4b_output, kernel_size=3, strides=1)
        inception_4c_pool_1_1 = conv_2d(inception_4c_pool, 64, filter_size=1, activation='relu', name='inception_4c_pool_1_1')
        inception_4c_output = merge([inception_4c_1_1, inception_4c_3_3, inception_4c_5_5, inception_4c_pool_1_1], mode='concat', axis=3, name='inception_4c_output')

        # Inception 4d
        inception_4d_1_1 = conv_2d(inception_4c_output, 112, filter_size=1, activation='relu', name='inception_4d_1_1')
        inception_4d_3_3_reduce = conv_2d(inception_4c_output, 144, filter_size=1, activation='relu', name='inception_4d_3_3_reduce')
        inception_4d_3_3 = conv_2d(inception_4d_3_3_reduce, 288, filter_size=3, activation='relu', name='inception_4d_3_3')
        inception_4d_5_5_reduce = conv_2d(inception_4c_output, 32, filter_size=1, activation='relu', name='inception_4d_5_5_reduce')
        inception_4d_5_5 = conv_2d(inception_4d_5_5_reduce, 64, filter_size=5, activation='relu', name='inception_4d_5_5')
        inception_4d_pool = max_pool_2d(inception_4c_output, kernel_size=3, strides=1, name='inception_4d_pool')
        inception_4d_pool_1_1 = conv_2d(inception_4d_pool, 64, filter_size=1, activation='relu', name='inception_4d_pool_1_1')
        inception_4d_output = merge([inception_4d_1_1, inception_4d_3_3, inception_4d_5_5, inception_4d_pool_1_1], mode='concat', axis=3, name='inception_4d_output')

        # Inception 4e
        inception_4e_1_1 = conv_2d(inception_4d_output, 256, filter_size=1, activation='relu', name='inception_4e_1_1')
        inception_4e_3_3_reduce = conv_2d(inception_4d_output, 160, filter_size=1, activation='relu', name='inception_4e_3_3_reduce')
        inception_4e_3_3 = conv_2d(inception_4e_3_3_reduce, 320, filter_size=3, activation='relu', name='inception_4e_3_3')
        inception_4e_5_5_reduce = conv_2d(inception_4d_output, 32, filter_size=1, activation='relu', name='inception_4e_5_5_reduce')
        inception_4e_5_5 = conv_2d(inception_4e_5_5_reduce, 128, filter_size=5, activation='relu', name='inception_4e_5_5')
        inception_4e_pool = max_pool_2d(inception_4d_output, kernel_size=3, strides=1, name='inception_4e_pool')
        inception_4e_pool_1_1 = conv_2d(inception_4e_pool, 128, filter_size=1, activation='relu', name='inception_4e_pool_1_1')
        inception_4e_output = merge([inception_4e_1_1, inception_4e_3_3, inception_4e_5_5, inception_4e_pool_1_1], axis=3, mode='concat')

        pool4_3_3 = max_pool_2d(inception_4e_output, kernel_size=3, strides=2, name='pool_3_3')

        # Inception 5a
        inception_5a_1_1 = conv_2d(pool4_3_3, 256, filter_size=1, activation='relu', name='inception_5a_1_1')
        inception_5a_3_3_reduce = conv_2d(pool4_3_3, 160, filter_size=1, activation='relu', name='inception_5a_3_3_reduce')
        inception_5a_3_3 = conv_2d(inception_5a_3_3_reduce, 320, filter_size=3, activation='relu', name='inception_5a_3_3')
        inception_5a_5_5_reduce = conv_2d(pool4_3_3, 32, filter_size=1, activation='relu', name='inception_5a_5_5_reduce')
        inception_5a_5_5 = conv_2d(inception_5a_5_5_reduce, 128, filter_size=5, activation='relu', name='inception_5a_5_5')
        inception_5a_pool = max_pool_2d(pool4_3_3, kernel_size=3, strides=1, name='inception_5a_pool')
        inception_5a_pool_1_1 = conv_2d(inception_5a_pool, 128, filter_size=1, activation='relu', name='inception_5a_pool_1_1')
        inception_5a_output = merge([inception_5a_1_1, inception_5a_3_3, inception_5a_5_5, inception_5a_pool_1_1], axis=3, mode='concat')

        # Inception 5b
        inception_5b_1_1 = conv_2d(inception_5a_output, 384, filter_size=1, activation='relu', name='inception_5b_1_1')
        inception_5b_3_3_reduce = conv_2d(inception_5a_output, 192, filter_size=1, activation='relu', name='inception_5b_3_3_reduce')
        inception_5b_3_3 = conv_2d(inception_5b_3_3_reduce, 384, filter_size=3, activation='relu', name='inception_5b_3_3')
        inception_5b_5_5_reduce = conv_2d(inception_5a_output, 48, filter_size=1, activation='relu', name='inception_5b_5_5_reduce')
        inception_5b_5_5 = conv_2d(inception_5b_5_5_reduce, 128, filter_size=5, activation='relu', name='inception_5b_5_5')
        inception_5b_pool = max_pool_2d(inception_5a_output, kernel_size=3, strides=1, name='inception_5b_pool')
        inception_5b_pool_1_1 = conv_2d(inception_5b_pool, 128, filter_size=1, activation='relu', name='inception_5b_pool_1_1')
        inception_5b_output = merge([inception_5b_1_1, inception_5b_3_3, inception_5b_5_5, inception_5b_pool_1_1], axis=3, mode='concat')

        # Head: average pool, dropout, softmax over the output classes.
        pool5_7_7 = avg_pool_2d(inception_5b_output, kernel_size=7, strides=1)
        pool5_7_7 = dropout(pool5_7_7, 0.4)
        loss = fully_connected(pool5_7_7, output, activation='softmax')
        network = regression(loss, optimizer='momentum',
                             loss='categorical_crossentropy',
                             learning_rate=lr, name='targets')
        model = tflearn.DNN(network, max_checkpoints=0,
                            tensorboard_verbose=0, tensorboard_dir='log')
    return model
Training Data: ~500K samples of 160x90 RGB data from playing GTA V in a car at speed at various times of day and with varying weather.
Duration: Expanded AlexNet trained in under 2 days, Inception took 4 days.
Fitment: Both AlexNet and Inception v3 were able to fit the data with no clear overfitment issues.
This is where things get interesting. Inception certainly took much longer to train, but the actual training statistics of both models made them look very similar.
For this project, I have two machines. One machine is used to create data and train the models, and the other is used to actually run the models 24/7 in the game. For my purposes, I call the training machine my "main machine" and the testing machine, the one that runs the stream, the "Charles" machine, since I have named my driving agent "Charles."
So, on the "main" machine, I put both models into action in-game. I tested them both in clear day, in the rain, with police, off-road, in fog, etc. Both models appeared nearly identical in performance, so I decided I would just run each model on the 24/7 stream for a few hours before settling on one.
Upon putting the expanded AlexNet model on the Charles machine, it was clear something was wrong. The agent...sort of...drove, but not well at all. Certainly worse than v0.01, and definitely far worse than it drove on the main machine.
Curious, I stopped the expanded AlexNet model and loaded in the Inception v3 model. To my surprise, the Inception model performed much better, and actually drove well in all conditions. Some features that were present on my "main" machine, however, were lost, such as backing up from walls/crashes.
I went back and compared all graphics settings between the two games and could not find any differences besides one: the vehicle headlights on my main machine were "xenon" headlights, while the ones on the Charles PC were set to stock headlights. Changing this appeared to help nighttime driving a bit, but there were still major differences in quality between machines, despite both running almost identical hardware.
While this is just one example, it suggests to me that the Inception model is more robust to subtle differences, and that the AlexNet model I was using was far too sensitive to tiny changes.
I plan to continue to look into this interesting finding.
Color has indeed allowed the models to drive in more conditions, but it looks like the expanded AlexNet was far too finely tuned to some difference that is not noticeable to me. I still have no idea why it worked fine on the main machine but not on the Charles machine.
I still see this version as a success, and I am still amazed that it works at all, still only making decisions one frame at a time with no memory of what it's been doing or what previous frames were.
Future Considerations:
Moving forward, my plan is to increase the resolution to 480x270, prepare for reinforcement learning, and continue to work on granular control.
The overall goal of this project is to explore general artificial intelligence. When considering environments in which to deploy AI models, games are an obvious choice. I have personally chosen Rockstar's Grand Theft Auto V to begin with. My reasoning is that the GTA V world is an expansive, open world with many tasks and avenues that an agent could plausibly explore and learn about. To begin, however, a simpler task with a definitive objective should be chosen. The starting task I have chosen is to create a self-driving car.
While most actual self-driving cars use all sorts of sensors, such as LIDAR, my plan here is to feed the neural network only visual data. It's understandable that real self-driving cars use everything to their advantage; lives are on the line, and they want the best, safest product.
In this project and game world, my personal goal is not to solve the challenge of self-driving cars with some sort of a la carte solution specific to cars; it is to come up with a model that can be used to drive a car or boat, fly a helicopter or plane, play checkers, or walk around as a pedestrian.
I may add more "senses" besides visual senses in the future, but, for now, I am trying to see just how far we can get with vision alone.
At this stage, my only objective is to see if I can just make a car drive at all with just visual inputs. If successful here, I will begin to fine-tune the objectives.
The goal is not to create a civil self-driving car. This is GTA V after all. To me, the overall objective of the Agent:
To begin, it's fairly obvious that a convolutional neural network should be used. Common sense would also suggest that we need some sort of memory, whether in the form of recurrent layers or something else entirely. To start, however, I decided to keep it simple and just use a ConvNet. At first, I tried simple 2- and 4-layer ConvNets with no real success; the models simply wouldn't train. I then grabbed AlexNet and tried that. It seemed to at least train, but plateaued too early. I finally ended up on a modified AlexNet with many more layers.
Input: 160 x 120 grayscale pixel data from the game
Output: keypresses in the form of W, S, A, or D, along with combinations like W+A, W+D, S+A, and S+D
Python model code (tflearn)
# Imports assume the TFLearn layer API used throughout this project.
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression


def modified_alexnet(width, height, lr, output=3):
    network = input_data(shape=[None, width, height, 1], name='input')
    network = conv_2d(network, 96, 11, strides=4, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = conv_2d(network, 256, 5, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = conv_2d(network, 384, 3, activation='relu')
    network = conv_2d(network, 384, 3, activation='relu')
    network = conv_2d(network, 256, 3, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    # Extra convolutional stack beyond standard AlexNet.
    network = conv_2d(network, 256, 5, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    network = conv_2d(network, 384, 3, activation='relu')
    network = conv_2d(network, 384, 3, activation='relu')
    network = conv_2d(network, 256, 3, activation='relu')
    network = max_pool_2d(network, 3, strides=2)
    network = local_response_normalization(network)
    # Four fully connected layers with dropout, then the softmax output.
    network = fully_connected(network, 4096, activation='tanh')
    network = dropout(network, 0.5)
    network = fully_connected(network, 4096, activation='tanh')
    network = dropout(network, 0.5)
    network = fully_connected(network, 4096, activation='tanh')
    network = dropout(network, 0.5)
    network = fully_connected(network, 4096, activation='tanh')
    network = dropout(network, 0.5)
    network = fully_connected(network, output, activation='softmax')
    network = regression(network, optimizer='momentum',
                         loss='categorical_crossentropy',
                         learning_rate=lr, name='targets')
    model = tflearn.DNN(network, checkpoint_path='model_alexnet',
                        max_checkpoints=1, tensorboard_verbose=0,
                        tensorboard_dir='log')
    return model
Training Data: 521K samples of 160x120 grayscale frames mapped to key presses. This translates to about 2.4 hours of gameplay, but took an actual 20 hours of gameplay to produce, since the data needed to be balanced. About 90% of the raw data is just "go forward," which will cause problems for learning the other keys if left unaddressed. One could also duplicate data to balance, but I chose not to do that this time (a balancing sketch follows this list).
Data consisted only of mid-day, clear-weather driving.
Duration: Training took about 4 days to complete. I used TensorFlow and left TensorBoard reporting fairly verbose to get statistical information back from training. In the future, I would recommend keeping this to a minimum, as the logging takes a considerable amount of time.
Fitment: I was able to train this model to about 90% accuracy, in and out of sample.
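As referenced above, here is a rough sketch of the balancing approach (the sample structure is assumed): undersample the dominant classes, mostly "forward," rather than duplicating the rare ones.

import random

def balance(samples):
    """samples: list of (frame, one_hot_keys) pairs."""
    by_class = {}
    for frame, keys in samples:
        by_class.setdefault(tuple(keys), []).append((frame, keys))
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        random.shuffle(group)
        balanced.extend(group[:smallest])   # throw away the excess, mostly 'W'
    random.shuffle(balanced)
    return balanced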
Somewhat surprisingly, the model actually works. Recall that this is a neural network that sees just one frame at a time, and only the 160x120 grayscale pixels of that frame. Despite this, the model actually drives. Clearly, the lines of the road were discovered, and the AI could drive in the city, in the country, and even on some dirt roads in the game, which it never saw in training. Also interestingly, if the agent found train tracks, it would follow those as well.
Issues: Somewhat comically, the Agent is a massive fan of driving into the ocean. Any time it sees the water, it just goes straight for it. I imagine this is because the water looks like wide-open territory without obstacles. Also, the agent simply cannot see chainlink fences, and often gets stuck at these. In general, the agent still gets into a lot of crashes, and is nowhere near a decent driver, but still surprises me at times.
Given such impressive (at least to me) results, I am going to have to continue this project. I am still pretty darn impressed that the neural network can learn how to drive, and actually drive, one frame at a time, with no memory of what it has been doing and no plans beyond the current frame.
Future Considerations:
A stream of this AI is now available, streaming 24/7 on Twitch.