Camera Pose Estimation Using Convolutional Neural Networks

It’s been quite a while since I last posted here. Since my last post I started a PhD in computer science at the University of Colorado, Boulder. I’ve been working on a lot of fun robotics-related stuff lately, but I haven’t had the time to work on Colin. However, I recently finished a really interesting project and I thought I’d take some time to do a quick write-up on it. The project really encapsulates what I’ve loved about my first semester in grad school. When I started I knew basically nothing about neural networks or pose estimation. But in just a few short weeks I had learned enough to build, train, and test a neural network for camera pose estimation. I absolutely love that this is my job now.

My project involved implementing the method for camera pose estimation presented in this paper. The basic idea is this: given two photos of the same scene taken from different positions, determine how the camera moved between them, both the translation and the rotation in three dimensions. A good solution to this problem has many applications, but the one I’m most interested in is visual odometry, the process of estimating a robot’s change in pose using information from its camera.

In this post I describe how this problem is traditionally addressed, then present the new method and compare it to the traditional approaches. Lastly, I present the code for my implementation of the new method. For the implementation section I assume you have some familiarity with Python 3 and Keras; if not, there are plenty of good tutorials on both (I didn’t know either about a month and a half ago). I also didn’t know anything about neural networks when I started this project. If you’re in the same boat, I’d suggest going through these notes and this tutorial. I found them very helpful.

The Traditional Approach
Using CNNs
Comparing Approaches
The Code

Challenge: compute the relative camera pose change between two photos of the same scene




The Traditional Approach


The usual approach to computing the relative pose change between two images involves feature matching. Briefly: a feature detector such as FAST finds “keypoints” in both input images, and a descriptor such as SIFT or BRIEF is computed for each keypoint. The two sets of descriptors are then compared to find the keypoints common to both images (the intersection of the two sets). Finally, a geometric model relating the matched points, such as a homography or an essential matrix, is estimated, and from it the change in camera pose between the two images is recovered.
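
To make this pipeline concrete, here is a minimal sketch of the traditional approach using OpenCV’s ORB detector and a brute-force matcher, recovering the pose from an essential matrix. The image file names and the intrinsic matrix K are placeholders; in practice you would use your own images and your camera’s actual calibration.

import cv2
import numpy as np

# placeholder intrinsics; substitute your camera's calibration
K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])

img1 = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('frame2.png', cv2.IMREAD_GRAYSCALE)

# detect keypoints and compute binary descriptors in both images
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# brute-force matching with Hamming distance (appropriate for ORB's binary descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# estimate the essential matrix with RANSAC, then recover the relative rotation
# and translation (the translation is only recovered up to scale)
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, inliers = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)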

This approach is computationally intensive: each feature descriptor from one image must be compared against the descriptors from the other to find matches. It also only performs well when a large set of matching feature points can be found between the two images, and there are cases where this is not possible. Scenes with few features yield too few detections. Objects with very repetitive structure produce many similar descriptors, so the matching algorithm finds spurious correspondences between them. And if the change in camera pose between the images is too great, there may not be sufficient overlap between the sets of features from the two images.

Feature Matching

Case in which feature matching is not able to find enough matching features for accurate pose estimation.



Using Convolutional Neural Networks


Because of the limitations of traditional feature matching for relative camera pose estimation, there have been several attempts to employ convolutional neural networks for this purpose.

The method presented in the subject paper feeds each image into a separate branch of a Siamese neural network. A Siamese network contains two or more branches, or subnetworks, whose architectures and weights are identical; this makes it well suited to finding similarities or relationships between its inputs. In this paper each branch has the same structure as the convolutional section of AlexNet, a popular neural network for image classification.

The outputs from the Siamese branches are concatenated and then fed into a single fully-connected layer with 7 outputs. These outputs correspond to the 3D translation vector and the orientation quaternion required to move the camera from its initial pose to its final pose.
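
To make the weight sharing concrete, here is a minimal Keras sketch of this structure. The branch body is truncated (a stand-in Flatten layer replaces most of AlexNet’s convolutional stack and the SPP layer), and the 227x227 RGB input shape is simply AlexNet’s usual input size; the full implementation appears in the code section below.

from keras.models import Model, Sequential
from keras.layers import Conv2D, Dense, Flatten, Input, concatenate

# a single convolutional branch; reusing this one instance on both inputs
# below is what makes the two branches share their architecture and weights
branch = Sequential([
    Conv2D(96, (11, 11), strides=4, activation='relu', input_shape=(227, 227, 3)),
    Flatten()  # stand-in for the remaining convolutional layers and the SPP layer
])

img_a = Input(shape=(227, 227, 3))
img_b = Input(shape=(227, 227, 3))

merged = concatenate([branch(img_a), branch(img_b)])  # Siamese outputs, concatenated
pose = Dense(7)(merged)                               # [tx, ty, tz, qx, qy, qz, qw]

model = Model(inputs=[img_a, img_b], outputs=pose)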

One additional interesting feature of this approach is the pooling technique it applies between the convolutional and fully connected sections of the network. Convolutional layers can theoretically accept inputs of arbitrary size, but the fully connected layers at the end of the network require fixed-size input. Normally, images are cropped or downsampled before being fed into the convolutional section to ensure the input to the fully connected layers is the correct size, which can discard information from the original image. Spatial pyramid pooling (SPP) instead generates a fixed-size representation from an input of arbitrary size. By including an SPP layer between the convolutional and fully connected layers, we give the network the ability to accept arbitrary-sized inputs while letting the convolutional section use all of the information present in the input images. The authors of the paper found that this significantly increased the accuracy of their model.
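
To see why the SPP output size is fixed: each pyramid level l max-pools the final feature maps into an l-by-l grid of bins, so the length of the pooled vector depends only on the pyramid levels and the number of feature maps, not on the input resolution. For the pool levels and 256 feature maps used in the code below:

levels = [1, 2, 3, 6, 13]          # pyramid levels passed to the SPP layer below
channels = 256                     # feature maps from the last convolutional layer
bins = sum(l * l for l in levels)  # 1 + 4 + 9 + 36 + 169 = 219 bins per channel
print(bins * channels)             # 56064-dimensional output, regardless of input size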


Comparing The Approaches


The performance of the convolutional neural network was compared to the OpenCV implementations of two feature matching methods: SURF and ORB. Images from the DTU robot image dataset were used to test all three methods. The DTU dataset was created by photographing several scenes with a camera positioned by a robotic arm, so the ground truth for the change in camera pose is known with high accuracy. This makes evaluating the accuracy of each method quite simple, and it also means the labels used for training the CNN are highly accurate.
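
For intuition on how the 7-element labels (a translation vector plus an orientation quaternion) can be formed from two known camera poses, here is a small sketch. This is not the loading code from my repository; it assumes camera-to-world rotation matrices and world-frame camera positions (conventions vary between calibration formats, so check yours before reusing it) and uses SciPy’s Rotation class for the quaternion conversion.

import numpy as np
from scipy.spatial.transform import Rotation

# relative pose of camera 2 expressed in camera 1's frame;
# R1, R2 are camera-to-world rotations, t1, t2 are camera positions in the world frame
def relative_pose(R1, t1, R2, t2):
    R_rel = R1.T @ R2                              # rotation taking frame 1 to frame 2
    t_rel = R1.T @ (t2 - t1)                       # translation expressed in frame 1
    q_rel = Rotation.from_matrix(R_rel).as_quat()  # quaternion in (x, y, z, w) order
    return np.concatenate([t_rel, q_rel])          # label: [tx, ty, tz, qx, qy, qz, qw]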

Feature matching methods were, in general, more accurate. However, there are certain cases in which the neural network performs better. For larger changes in position and orientation the CNN can slightly outperform the feature-based methods, and it can perform significantly better in cases where it is difficult to find and match features, for example when the images do not contain a sufficient number of features.

Histogram of errors for orientation (left) and translation (right). The above histograms compare 4 different CNN architectures to SURF and ORB.

 


The Code

I’ve posted the Python 3 code for the neural network below. Supporting code for loading data from the image files and visualizing the results can be found in my GitHub repository. The image files themselves can be downloaded from the DTU website.

# camera-pose.py
# Andrew Kramer

# regresses the relative camera pose between two images using the method 
# presented in https://arxiv.org/pdf/1702.01381.pdf

import numpy as np
import keras
import math
import os

from keras.models import Sequential, Model
from keras.layers import Dense, Conv2D, MaxPooling2D, Input
from keras import backend as K
from dataloader import DataLoader
from pooling.spp.SpatialPyramidPooling import SpatialPyramidPooling

beta = 10
epochs = 10

# loss function for training the network: per-sample Euclidean error on the
# translation plus beta times the Euclidean error on the orientation quaternion
def custom_objective(y_true, y_pred):
	error = K.square(y_pred - y_true)
	transMag = K.sqrt(K.sum(error[:, :3], axis=-1))   # translation error magnitude
	orientMag = K.sqrt(K.sum(error[:, 3:], axis=-1))  # orientation error magnitude
	return K.mean(transMag + (beta * orientMag))

def dot_product(v1, v2):
	return sum((a*b) for a,b in zip(v1,v2))

def length(v):
	return math.sqrt(dot_product(v,v))

# mean angular error (in radians) between predicted and ground truth
# translation directions and orientation quaternions
def compute_mean_error(y_true, y_pred):
	trans_error = 0
	orient_error = 0
	for i in range(0,y_true.shape[0]):
		trans_error += math.acos(dot_product(y_true[i,:3], y_pred[i,:3])/
					(length(y_true[i,:3]) * length(y_pred[i,:3])))
		orient_error += math.acos(dot_product(y_true[i,3:], y_pred[i,3:])/
					 (length(y_true[i,3:]) * length(y_pred[i,3:])))
	mean_trans = trans_error / y_true.shape[0]
	mean_orient = orient_error / y_true.shape[0]
	return mean_trans, mean_orient

# define the architecture for the Siamese branches
def create_conv_branch(input_shape):
	model = Sequential()
	model.add(Conv2D(96, kernel_size=(11,11),
					 strides=4, padding='valid',
					 activation='relu',
					 input_shape=input_shape))
	model.add(MaxPooling2D(pool_size=(2,2), strides=2))
	model.add(Conv2D(256, kernel_size=(5,5),
					 strides=1, padding='same',
					 activation='relu'))
	model.add(MaxPooling2D(pool_size=(3,3), strides=1))
	model.add(Conv2D(384, kernel_size=(3,3),
					 strides=1, padding='same',
					 activation='relu'))
	model.add(Conv2D(384, kernel_size=(3,3),
					 strides=1, padding='same',
					 activation='relu'))
	model.add(Conv2D(256, kernel_size=(3,3),
					 strides=1, padding='same',
					 activation='relu'))
	model.add(SpatialPyramidPooling([1,2,3,6,13]))
	return model

if __name__ == "__main__":

	img_rows, img_cols = 227, 227
	category_IDs = list(range(1,25)) # category IDs from which to pull test and training data
	file_name = 'huge_model_spp_10epoch.h5'
	model = None
	# load training and testing data:
	loader = DataLoader(category_IDs, img_rows, img_cols)
	train_data, test_data = loader.get_data()
	train_labels, test_labels = loader.get_labels()
	input_shape = loader.get_input_shape()

	# define structure of convolutional branches
	conv_branch = create_conv_branch(input_shape)
	branch_a = Input(shape=input_shape)
	branch_b = Input(shape=input_shape)

	processed_a = conv_branch(branch_a)
	processed_b = conv_branch(branch_b)

	# concatenate the outputs of the CNN branches
	regression = keras.layers.concatenate([processed_a, processed_b])

	output = Dense(7, kernel_initializer='normal', name='output')(regression)
	model = Model(inputs=[branch_a, branch_b], outputs=[output])

	model.compile(loss=custom_objective, 
				  optimizer=keras.optimizers.Adam(lr=.0001, decay=.00001),
				  metrics=['accuracy'])

	if os.path.isfile(file_name):
		print("model", file_name, "found")
		model.load_weights(file_name)
		print("model loaded from file")
	else:
		model.fit([train_data[:,0], train_data[:,1]], train_labels,
				  batch_size=128,
				  epochs = epochs,
				  validation_split=0.1,
				  shuffle=True)
		model.save_weights(file_name)
		print("model saved as file", file_name)

	pred = model.predict([test_data[:,0], test_data[:,1]])
	test_trans, test_orient = compute_mean_error(test_labels, pred)
	np.savetxt('pred_spp.txt', pred, delimiter=' ')
	np.savetxt('labels_spp.txt', test_labels, delimiter=' ')

	print('*     Mean translation error on test set (radians): %0.2f' % (test_trans))
	print('*     Mean orientation error on test set (radians): %0.2f' % (test_orient))