LIDAR Odometry with ICP

This post is the second in a series of tutorials on SLAM using scanning 2D LIDAR and wheel odometry. The other posts in the series can be found in the links below. The links will be updated as work on the series progresses.

In the previous post I introduced the Intel Research Center (IRC) Dataset and we used it in some basic visualizations. In particular, I used the robot’s odometry to get a rough estimate of its pose over time by simply concatenating the relative transformations between timesteps, T_i, 1 \leq i \leq n:

    \begin{equation*} \hat{T}_n = T_1 T_2 \dots T_n \end{equation*}

I then used the cumulative transform \hat{T}_n to project the laser scan at each timestep into the global coordinate frame: p_g = \hat{T}_n p_l, where p_g is the set of homogeneous scan points in the global frame, p_l is the set of homogeneous points in the robot’s local frame, and \hat{T}_n is the estimated transform from the robot’s local frame to the global frame. This process is visualized in VisualizeMeasurements.py in my github repo:

Visualization of the odometry and lidar measurements together.
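
For reference, the concatenation and projection above boils down to something like the following numpy sketch (illustrative only, with hypothetical inputs; it is not the actual code from VisualizeMeasurements.py):

import numpy as np

# relative_transforms: list of 3x3 homogeneous transforms between consecutive timesteps
# scan_local: Nx3 array of homogeneous scan points [x, y, 1] in the robot's local frame
def project_scan_to_global(relative_transforms, scan_local):
    T = np.identity(3)
    for T_i in relative_transforms:
        T = np.dot(T, T_i)          # accumulate the odometry estimate
    return np.dot(scan_local, T.T)  # p_g = T * p_l for every scan point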

Watching this visualization even over a short time, it’s obvious that the robot’s odometry is very noisy and accumulates drift very quickly. With perfect odometry, the objects measured by the LIDAR would stay static as the robot moves past them. This is clearly not the case. Interestingly, the odometry seems to be fairly reliable for translational motion, but it drifts quickly in rotation. This will be significant later.

How Do We Fix The Drift Problem?

It’s clear to us the robot’s wheel odometry isn’t sufficient to estimate its motion. We know this because we can overlay the robot’s LIDAR scans in our minds and get a sense of how the robot’s estimated motion is deviating from its true motion. If we can do this in our minds, could we tell the robot how to do it? Can the robot use its LIDAR scans to estimate its own motion?

Hopefully you’ve guessed the answer is yes, through a process called scan matching. Basically the goal is to take a new scan from the robot’s LIDAR and find the transformation that best aligns the new scan with either previous scans or some sort of abstracted map. There are many ways to implement this idea and for this tutorial I’m going to demonstrate the simplest method: using the Iterative Closest Point (ICP) algorithm to align the newest LIDAR scan with the previous scan.

Iterative Closest Point In Pictures

The ICP algorithm involves 3 steps: association, transformation, and error evaluation. These are repeated until the scans are aligned satisfactorily. I’ll first demonstrate the process pictorially with an example from the IRC dataset and delve into the math below. Let’s first imagine we want to find the transformation that aligns the two scans pictured below. The previous scan, referred to as the target, is in cyan while the new scan, also called the source, is in magenta. The goal is to find the rigid transformation (rotation and translation) that best aligns the source to the target.

Two scans we wish to align using ICP

Our first step in estimating this transformation is to decide which points in the source scan correspond to the same physical features as points in the target scan. The simplest way to do this is through a nearest neighbor search: each point in the source scan is associated with the nearest point in the target scan. In the image below I’ve found the nearest neighbor in the target scan for each point in the source scan. Associated points are connected with blue lines:

The association step

We can immediately see some mistakes in the nearest neighbor search, but in general the associations pictured will pull the source points in the right direction. The next step in the process is transformation. We find the transformation that, when applied to the source points, minimizes the mean-squared distance between the associated points: \tilde{T}=\underset{T}{\text{argmin}}\frac{1}{N}\sum_i^N \|t_i - Ts_i\|^2 where \tilde{T} is the final estimated transform and t_i and s_i are target points and source points, respectively. The result of this estimation is pictured below:

Result of the first transformation step

After this we evaluate the error in the alignment as e=\frac{1}{N}\sum_i^N \|t_i - Ts_i\|^2 and decide if we need to repeat the above process. In this case our scans still aren’t aligned very well, so we redo the associations with the transformed source points and repeat the process.

Nearest neighbor associations for the transformed source points

After five iterations the algorithm finds a pretty good alignment between the scans:

Scan alignment after the fifth iteration

The Math Details

Mathematically, ICP is pretty straightforward. The association step can be a simple brute-force search for the nearest pairs of points between the source and target scans. If you’re working with large scans, though, it’s a good idea to use a KD tree for this step.
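
To make that concrete, here’s a minimal sketch of the association step using scipy’s KD tree (an illustration only, not the FindCorrespondences method from Align2D.py):

import numpy as np
from scipy.spatial import cKDTree

# source, target: Nx2 and Mx2 arrays of 2D scan points
def find_correspondences(source, target, max_dist=1.0):
    tree = cKDTree(target)
    dists, indices = tree.query(source)   # nearest target point for each source point
    mask = dists < max_dist               # optionally reject distant, likely-spurious matches
    return source[mask], target[indices[mask]]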

The real trick to ICP is in the transformation step. Basically, we compute the cross-covariance between the two point sets, the matrix M.

    \begin{equation*} M = PQ^T  \end{equation*}

Where P is a matrix whose i^\text{th} column is s_i - \mu_s or the i^\text{th} source point expressed relative to the centroid of the source point set \mu_s. Similarly, Q is a matrix whose i^\text{th} column is t_i - \mu_t. Once we have the covariance matrix M, we find the rotation R between the two point clouds using singular value decomposition:

    \begin{align*} M &= UWV^T \\ R &= VU^T \end{align*}

If you’re wondering how you break the M matrix down into U, W, and V^T, know that most linear algebra packages (including matlab and numpy) have functions for SVD. That’s about as far as you need to get into it. If you’re interested though, the wikipedia page has some good details.
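
For example, in numpy the decomposition is a one-liner (a quick illustration with made-up numbers; note that numpy returns W as a vector of singular values):

import numpy as np

M = np.array([[2.0, 0.5],      # a 2x2 cross-covariance matrix (example values)
              [0.3, 1.5]])
U, W, V_t = np.linalg.svd(M)   # M = U * diag(W) * V_t
R = np.dot(V_t.T, U.T)         # rotation estimate, as in the equations above
# (some implementations also check det(R) and flip a sign to rule out reflections)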

We’ve found the rotation R between the point sets; now we just need the translation t. Luckily it’s still simple: it’s the difference between the target centroid and the rotated source centroid, t = \mu_t - R\mu_s.

Once we have our rotation and translation we evaluate the alignment error as e=\frac{1}{N}\sum_i^N \|t_i - (Rs_i + t)\|^2. We use this to determine if we should quit or iterate again. Normally we stop the process if the error at the current step is below a threshold, if the difference between the error at the current step and the previous step’s error is below a threshold, or if we’ve reached a maximum number of iterations.

Below is a visualization of a simple ICP motion estimation algorithm. It simply aligns the newest scan to the previous scan to find the motion of the robot between scans:

Demonstration of motion estimation using ICP

Note that this method for motion estimation works pretty well sometimes. If the robot is near large sections of wall at different angles it can estimate its transformation between scans pretty reliably. This is because it has good environmental cues to its motion in all directions. On the other hand, if the robot is in a mostly straight hallway, there’s really nothing in its measurements that will tell it how it’s moving along the hallway. One alignment is as good as any other as long as the walls line up. In these cases the robot’s estimates of its translation are very poor. Similarly, if there just aren’t a lot of unique, persistent features in the scan, which happens sometimes when the robot approaches corners, there aren’t any good cues for the robot to estimate its rotation.

So, matching successive LIDAR scans via the iterative closest point algorithm can give our robot some information about its own movement. However, this on its own is not enough to provide a reliable motion estimate. Luckily, our robot has wheel odometry in addition to LIDAR. Even better, the two sensors complement each other: ICP is pretty reliable at estimating rotation but poor with translation in some cases, while the wheel odometry gives us very accurate translation but is very unreliable in rotation. Next time, we’ll experiment with fusing information from these two sensors to create a more reliable motion estimate.

The Code

Below you can see an implementation of the ICP algorithm in python. You can find the full class, Align2D.py, in my github repo as well as a demonstration of its use in VisualizeICP.py.

# uses the iterative closest point algorithm to find the
# transformation between the source and target point clouds
# that minimizes the sum of squared errors between nearest 
# neighbors in the two point clouds
# params:
#   max_iter: int, max number of iterations
#   min_delta_err: float, minimum change in alignment error
def AlignICP(self, max_iter, min_delta_err):

	mean_sq_error = 1.0e6 # initialize error as large number
	delta_err = 1.0e6    # change in error (used in stopping condition)
	T = self.init_T
	num_iter = 0         # number of iterations
	tf_source = self.source

	while delta_err > min_delta_err and num_iter < max_iter:

		# find correspondences via nearest-neighbor search
		matched_trg_pts,matched_src_pts,indices = self.FindCorrespondences(tf_source)

		# find alignment between source and corresponding target points via SVD
		# note: svd step doesn't use homogeneous points
		new_T = self.AlignSVD(matched_src_pts, matched_trg_pts)

		# update transformation between point sets
		T = np.dot(T,new_T)

		# apply transformation to the source points
		tf_source = np.dot(self.source,T.T)

		# find mean squared error between transformed source points and target points
		new_err = 0
		for i in range(len(indices)):
			if indices[i] != -1:
				diff = tf_source[i,:2] - self.target[indices[i],:2]
				new_err += np.dot(diff,diff.T)

		new_err /= float(len(matched_trg_pts))

		# update error and calculate delta error
		delta_err = abs(mean_sq_error - new_err)
		mean_sq_error = new_err

		num_iter += 1

	return T

# uses singular value decomposition to find the 
# transformation that aligns the source point cloud to the target
# assumes source and target point clouds are ordered such that 
# corresponding points are at the same indices in each array
#
# params:
#   source: numpy array representing source pointcloud
#   target: numpy array representing target pointcloud
# returns:
#   T: transformation between the two point clouds
def AlignSVD(self, source, target):

	# first find the centroids of both point clouds
	src_centroid = self.GetCentroid(source)
	trg_centroid = self.GetCentroid(target)

	# get the point clouds in reference to their centroids
	source_centered = source - src_centroid
	target_centered = target - trg_centroid

	# get cross covariance matrix M
	M = np.dot(source_centered.T,target_centered)

	# get singular value decomposition of the cross covariance matrix
	U,W,V_t = np.linalg.svd(M)

	# get rotation between the two point clouds
	R = np.dot(V_t.T,U.T)

	# get the translation: difference between the target centroid
	# and the rotated source centroid
	t = trg_centroid - np.dot(R, src_centroid)

	# assemble translation and rotation into a transformation matrix
	T = np.identity(3)
	T[:2,2] = np.squeeze(t)
	T[:2,:2] = R

	return T

def GetCentroid(self, points):
	point_sum = np.sum(points,axis=0)
	return point_sum / float(len(points))

Intro To LIDAR SLAM

I’m two years in to my PhD in robotics and things are going well. I’m working on robotic perception at the NASA Jet Propulsion Laboratory over the summer and I recently had a paper accepted to the conference on Field and Service Robotics. There’s just one problem: I still haven’t won the bet that led me to return to grad school in the first place; I haven’t built a robotic system for autonomous indoor mapping. Today we’ll take a big step toward that goal by starting our development of a 2D LIDAR-based SLAM system.

In this post I’ll propose a strategy for solving the localization and mapping portion of my problem. I’ll also introduce the Intel Research Center (IRC) dataset as a means for evaluating my SLAM efforts. I’ll demonstrate my developments first on the IRC dataset, but the eventual goal will be to use them on a real robot.

The Strategy

The goal of this series is to develop LIDAR-based 2 dimensional SLAM. Of course, numerous open source packages already exist for LIDAR SLAM but, as always, my goal is to understand SLAM on a fundamental level. That’s why I’m building everything from scratch and taking a detailed look at the underlying math.

I start by introducing the dataset below. In subsequent posts I present tutorials on basic methods for LIDAR odometry via matching between LIDAR scans and then matching scans to a persistent map representation. I then develop a method for LIDAR SLAM that fuses information from scan matching and wheel odometry. Lastly I present a full SLAM method which estimates both a globally-consistent pose graph and map. This last step will require all the previously presented methods as well as methods we haven’t discussed such as place recognition and loop closure. Links to all the posts in the series can be found below (these will be populated incrementally as work on this project progresses):

Why LIDAR?

LIDAR is an interesting and versatile sensor. In many ways 2D LIDAR measurements are very similar to the measurements we used in the UTIAS dataset in my EKF SLAM tutorial. As with the UTIAS dataset, the measurement model is simply the range and bearing to the measured landmark or obstacle. No complicated projection or distortion models are required to extract this information, so LIDAR is a pretty gentle introduction to the use of raw sensor data.

Unlike the UTIAS dataset, however, our sensor measurements are not associated with particular landmarks in the environment. Instead a LIDAR measurement is associated with whatever surface happened to get in the way of the laser. This makes data association more difficult. Luckily LIDAR makes up for this ambiguity by giving us much more information per timestep. At a given timestep in the UTIAS dataset we get just a few range and bearing measurements to nearby landmarks. With LIDAR, on the other hand, we get range and bearing measurements to everything in the robot’s vicinity. So instead of just estimating the locations of discrete landmarks we can estimate a continuous map of the whole environment. This is exactly what we need for an indoor mapping robot.

A SICK LMS sensor, similar to the one used to create the IRC Dataset.

Lastly, LIDAR sensors have a (mostly deserved) reputation for being extremely expensive and therefore out of reach for the robotics hobbyist. For instance, the 2D scanning LIDAR used to create the IRC dataset costs thousands of US dollars. How many basement tinkerers have that kind of money to spend on their hobby?

rplidar A3, an inexpensive scanning 2D lidar sensor

Luckily for us, cheaper LIDAR sensors, such as the Rplidar A3, have recently been introduced. These have lower range and lower precision than their more expensive cousins. They are totally sufficient for hobby projects though, and their cost is in the hundreds of dollars rather than the thousands.

Of course, publicly available datasets like the IRC dataset make it possible to do robotics work without spending any money. So let’s start there!

The IRC Dataset

The IRC dataset is an indoor localization and mapping dataset recorded at the Intel Research Center in Seattle in 2003. It is available on the MRPT website but I’d recommend getting it from my github repo instead. The raw data is the same in either case, but my repo has a few helpful scripts for loading, aligning, and visualizing the data. The original dataset has some ambiguities. For instance, it doesn’t specify the angular spacing between the lidar beams or the transformation between the odometry coordinate frame and the lidar coordinate frame. My DataLoader.py script resolves these problems for you!

The DataLoader reads the dataset from the raw text files, converts the odometry measurements into transformation matrices, converts the raw LIDAR measurements into cartesian points in the robot’s coordinate frame, and packages these measurements in Measurement objects. This is a much easier way to interact with the data than reading the text files line by line.
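
To give a sense of what those conversions look like, here’s a rough sketch (hypothetical helper functions, not the actual DataLoader code): the odometry for one timestep becomes a homogeneous transformation matrix, and a scan’s ranges and bearings become cartesian points in the robot frame.

import numpy as np

def odometry_to_transform(dx, dy, dtheta):
    # relative motion over one timestep as a 3x3 homogeneous transform
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0., 0., 1.]])

def scan_to_points(ranges, bearings):
    # convert range/bearing pairs to cartesian points in the robot's frame
    ranges = np.asarray(ranges)
    bearings = np.asarray(bearings)
    return np.column_stack((ranges * np.cos(bearings),
                            ranges * np.sin(bearings)))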

You can see the LIDAR data in the robot’s frame of reference by running my script, VisualizeLaser.py. A short bit of that script’s visualization is shown below:

Raw LIDAR measurements shown in the robot’s frame of reference.

Also, to make sure I got the sensor coordinate frame transformation correct, I made VisualizeMeasurements.py. It plots the pose of the robot in the global coordinate frame calculated by integrating the odometry measurements. We can also see the LIDAR measurements transformed into the global frame. As is shown below, the robot moves while the LIDAR measurements stay (more or less) static. This indicates we have the robot-to-LIDAR transformation correct. The visualization also shows the LIDAR measurements don’t stay completely static. This indicates the odometry is pretty noisy and accumulates drift quickly. We also notice the odometry noise is biased pretty heavily in rotation. This will be significant when we try to use it later.

Visualization of the odometry and lidar measurements together.

Intro to the EKF Step 3: SLAM With Known Correspondence

This tutorial is part of a series on Simultaneous Localization and Mapping (SLAM) using the extended Kalman filter. See the following for other tutorials in the series:

This tutorial builds on the previous two tutorials on localization. When doing SLAM with known measurement correspondence the number of landmarks is still fixed. Also, when we get a range and bearing measurement, we know to which landmark that measurement corresponds. This means we don’t need to use the maximum likelihood measurement association method from the previous tutorial. The big difference here is that we don’t have any prior information on where the landmarks are in the world. So instead of just using measurements to refine our pose estimate, we’ll also use them to estimate the landmark locations. Don’t worry though, the process is remarkably similar to simple localization. There are just a couple of significant changes to highlight.


The State and Covariance Matrices

Since we don’t know where the landmarks are, we need to simultaneously estimate their locations as well as our robot’s pose. This means the landmark locations are now part of the state of the system we’re trying to estimate. So the coordinates of the landmarks are now included in the state vector. When doing simple localization the state vector had dimension 3\times1. Now when doing SLAM the state vector has dimension (2n+3)\times1 where n is the number of landmarks in our problem. In this formulation the coordinates of landmark j (x_j and y_j) are found at indices 2j+2 and 2j+3 of the state vector.

Remember, our estimate of the system’s state is quantified by the state mean vector \mu and covariance \Sigma. So in order to estimate a larger \mu vector we need to keep track of a larger \Sigma matrix as well. Specifically, we’ll expand the covariance matrix from 3\times3 to (2n+3)\times(2n+3) where n is again the number of landmarks.
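
As a concrete numpy sketch of this bookkeeping, assuming n landmarks and a well-known starting pose (the specific variance values here are just illustrative):

import numpy as np

n = 15                                 # number of landmarks
mu = np.zeros(2 * n + 3)               # [x, y, theta, m1_x, m1_y, ..., mn_x, mn_y]
mu[:3] = [0.0, 0.0, 0.0]               # starting pose (assumed known here)

Sigma = np.zeros((2 * n + 3, 2 * n + 3))
Sigma[:3, :3] = np.eye(3) * 0.001      # small uncertainty on the initial pose
Sigma[3:, 3:] = np.eye(2 * n) * 1.0e3  # large uncertainty: landmark locations unknown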


The Update Step

To accommodate the larger size of our state vector we’ll need to make some changes to the EKF algorithm as well. Let’s start with the update step, where we update the pose of the robot based on the control inputs. We still need to update the elements of the state vector and covariance matrix that correspond to the robot’s pose. However, the robot’s control inputs clearly don’t affect the positions of the landmarks. So we need a way to update the robot’s pose without affecting the landmark estimates. The answer here is a special matrix we’ll call F_x. F_x is simply a 3\times(2n+3) matrix whose leftmost block is a 3\times3 identity matrix and whose other entries are zeros:

    \begin{equation*} F_x=\begin{bmatrix} 1 & 0 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & 0 & \dots & 0 \\ 0 & 0 & 1 & 0 & \dots & 0 \end{bmatrix} \end{equation*}

F_x is used to map updates to the robot’s pose mean and covariance to the higher-dimensional space of the full state, which includes the landmark positions. So the update step now looks like this:

    \begin{align*} \bar{\mu}_k &= \mu_{k-1} + F_x^Tu_k' \\ G_k &= I_{2n+3}+F_x^Tg_kF_x \\ \bar{\Sigma}_k &= G_k \Sigma_{k-1} G_k^T + F_x^T R F_x \end{align*}

where u_k' is the motion update calculated from the control input at timestep k (not the raw control input), g_k is the motion Jacobian evaluated at timestep k, I_{2n+3} is a (2n+3)\times(2n+3) identity matrix, and R is the constant process noise covariance.
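
In code, the prediction step with F_x might look like the following numpy sketch (illustrative only; pose_update, g, and R_pose are the 3-dimensional motion update, motion Jacobian, and process noise from the motion model):

import numpy as np

def ekf_slam_predict(mu, Sigma, pose_update, g, R_pose, n):
    # F_x maps 3-dimensional pose quantities into the (2n+3)-dimensional state
    F_x = np.hstack((np.eye(3), np.zeros((3, 2 * n))))
    mu_bar = mu + F_x.T @ pose_update                    # only the pose entries change
    G = np.eye(2 * n + 3) + F_x.T @ g @ F_x              # full-state motion Jacobian
    Sigma_bar = G @ Sigma @ G.T + F_x.T @ R_pose @ F_x   # noise only enters the pose block
    return mu_bar, Sigma_bar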


The Measurement Step

We need to make a similar change in the measurement step. A given measurement corresponds to only one landmark, so that measurement shouldn’t affect the estimates of the other landmarks. However, each measurement also affects the robot’s pose estimate. So we’ll need a way of mapping our corrections from the 2D landmark and 3D pose spaces to the full state space while ensuring only the relevant quantities in the mean vector \mu and covariance matrix \Sigma are affected. For this we need another F matrix. This time it will be F_{xj}, with the subscript xj signifying that it relates the pose x to landmark j. F_{xj} is a 5\times(2n+3) matrix. The 3\times3 block in the upper left is an identity matrix, corresponding to the pose terms in the state and covariance matrix. F_{xj} also has a 2\times2 identity block occupying the lower two rows and starting at column 2j+2, where j is the index of the landmark corresponding to the current measurement. All other entries in F_{xj} are zeros:

    \begin{equation*} F_{xj} = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 & 0 & \dots  \\ 0 & 1 & 0 & \dots & 0 & 0 & \dots  \\ 0 & 0 & 1 & \dots & 0 & 0 & \dots  \\ 0 & 0 & 0 & \dots & 1 & 0 & \dots  \\ 0 & 0 & 0 & \dots & 0 & 1 & \dots  \end{bmatrix} \end{equation*}

This cool matrix allows us to incorporate one measurement at a time, only updating the terms in the state and covariance matrices that correspond to the robot’s pose and the relevant landmark. We do this as follows:

    \begin{align*} H_k &= h_k F_{xj} \\ K &= \bar{\Sigma}_k H_k^T(H_k\bar{\Sigma}_k H_k^T+Q)^{-1}\\ \mu_k &= \bar{\mu}_k + K(z-\hat{z}) \\ \Sigma_k &= (I_{2n+3} - KH_k)\bar{\Sigma}_k \end{align*}

where h_k is the measurement Jacobian. With only these augmentations we can change the localization algorithm with known correspondence to a SLAM algorithm with known correspondence. Not too complicated, right?
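
And a matching numpy sketch of a single measurement correction (again illustrative; h is the 2x5 measurement Jacobian, Q the 2x2 measurement noise, and j the zero-based index of the measured landmark):

import numpy as np

def ekf_slam_correct(mu_bar, Sigma_bar, z, z_hat, h, Q, j, n):
    # F_xj picks out the pose (first 3 states) and landmark j (2 states)
    F_xj = np.zeros((5, 2 * n + 3))
    F_xj[:3, :3] = np.eye(3)
    F_xj[3:, 3 + 2 * j:5 + 2 * j] = np.eye(2)   # landmark j's block in the state vector
    H = h @ F_xj
    K = Sigma_bar @ H.T @ np.linalg.inv(H @ Sigma_bar @ H.T + Q)
    # note: the bearing component of (z - z_hat) should be angle-wrapped in practice
    mu = mu_bar + K @ (z - z_hat)
    Sigma = (np.eye(2 * n + 3) - K @ H) @ Sigma_bar
    return mu, Sigma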


The Complete Algorithm

The code for the complete algorithm is shown below. It is very similar to the code for EKF localization with known correspondence, but with state vector and covariance matrices that now include landmarks and with the update and measurement steps augmented as discussed above. Note also that after the states are estimated, the script calculates the sum of squared errors for both the robot positions and the landmark positions. This gives us a rough way to judge the quality of our filter tuning. I’ll go into a more principled method for Kalman filter tuning in a later post, but this works well enough for demonstration purposes.

addpath ../common;

deltaT = .02;
alphas = [.1 .01 .18 .08 0 0]; % motion model noise parameters

% measurement model noise parameters
Q_t = [11.8 0;
       0    .18];
   
n_robots = 1;
robot_num = 1;
n_landmarks = 15;
[Barcodes, Landmark_Groundtruth, Robots] = loadMRCLAMdataSet(n_robots);
[Robots, timesteps] = sampleMRCLAMdataSet(Robots, deltaT);

% add pose estimate matrix to Robots
Robots{robot_num}.Est = zeros(size(Robots{robot_num}.G,1), 4);

start = 600;
t = Robots{robot_num}.G(start, 1);

% initialize state mean with the groundtruth
% pose at timestep zero
stateMean = [Robots{robot_num}.G(start,2);
            Robots{robot_num}.G(start,3);
            Robots{robot_num}.G(start,4)];
        
% F_x maps the 3x1 pose into the (2*n_landmarks+3)x1 state vector;
% it's also used to map motion updates into the full state
F_x = [eye(3) zeros(3, 2 * n_landmarks)];

stateMean = F_x' * stateMean;

% initialize the pose covariance block with small, nonzero values
% since we have a good estimate of the starting pose
stateCov = zeros(2 * n_landmarks + 3, 2 * n_landmarks + 3);
stateCov(1:3,1:3) = 0.001;

% initialize landmark variances to large values since we have
% no prior information on the locations of the landmarks
for i = 4:n_landmarks * 2 + 3
    stateCov(i,i) = 10;
end
   
measurementIndex = 1;

% set up map between barcodes and landmark IDs
% barcode values are NOT the same as landmark IDs
codeDict = containers.Map(Barcodes(:,2),Barcodes(:,1));

% loop through all odometry and measurement samples
% updating the robot's pose estimate with each step
% reference table 10.1 in Probabilistic Robotics
for i = start:size(Robots{robot_num}.G, 1)
    
    % update time
    t = Robots{robot_num}.G(i, 1);
    
    % get control inputs at current timestep
    u_t = Robots{robot_num}.O(i, 2:end);
    % update robot bearing to last timestep's estimate
    theta = stateMean(3, 1);
    % calculate rotation and translation over last timestep
    rot = deltaT * u_t(2);
    halfRot = rot / 2;
    trans = u_t(1) * deltaT;
    
    % calculate pose update in world frame
    poseUpdate = [trans * cos(theta + halfRot);
                  trans * sin(theta + halfRot);
                  rot];
              
    % updated state mean and constrain bearing
    % to +/- pi
    stateMeanBar = stateMean + F_x' * poseUpdate;
    stateMeanBar(3) = conBear(stateMeanBar(3));
    
    % calculate movement jacobian
    g_t = [0 0 trans * -sin(theta + halfRot);
           0 0 trans * cos(theta + halfRot);
           0 0 0];
    G_t = eye(2 * n_landmarks + 3) + F_x' * g_t * F_x;
     
    % calculate motion covariance in control space
    M_t = [(alphas(1) * abs(u_t(1)) + alphas(2) * abs(u_t(2)))^2 0;
           0 (alphas(3) * abs(u_t(1)) + alphas(4) * abs(u_t(2)))^2];
       
    % calculate Jacobian to transform motion covariance to state space
    V_t = [cos(theta + halfRot) -0.5 * sin(theta + halfRot);
           sin(theta + halfRot) 0.5 * cos(theta + halfRot);
           0 1];
    
    % update state covariance
    R_t = V_t * M_t * V_t';
    stateCovBar = (G_t * stateCov * G_t') + (F_x' * R_t * F_x);
    
    % get measurements for current timestep
    [z, measurementIndex] = getObservations(Robots, robot_num, t, measurementIndex, codeDict);

    % if measurements are available
    if z(3,1) > 1
        S = zeros(size(z,2),3,3);
        zHat = zeros(2, size(z,2));
        % loop over every measurement and use to correct state estimate
        for k = 1:size(z, 2) 
            j = z(3,k);
            
            % if the landmark has never been seen before
            % add it to the state vector
            if stateMeanBar(j*2 + 2) == 0
                landmark_pos = [z(1,k) * cos(z(2,k) + stateMeanBar(3));
                                z(1,k) * sin(z(2,k) + stateMeanBar(3))];
                stateMeanBar(2*j+2:2*j+3) = [stateMeanBar(1) + landmark_pos(1);
                                             stateMeanBar(2) + landmark_pos(2)];
                continue;
            end

            % compute predicted range and bearing to the landmark
            delta = [stateMeanBar(2*j+2) - stateMeanBar(1);
                     stateMeanBar(2*j+3) - stateMeanBar(2)];
            q = delta' * delta;
            r = sqrt(q); % predicted range to landmark
            
            % predicted bearing to landmark
            pred_bear = conBear(atan2(delta(2), delta(1)) - stateMeanBar(3));
            % create predicted measurement
            zHat(:,k) = [r;
                         pred_bear];
            % calculate measurement Jacobian    
            h_t = [-delta(1)/r -delta(2)/r  0   delta(1)/r delta(2)/r;
                   delta(2)/q    -delta(1)/q    -1  -delta(2)/q  delta(1)/q];
                     
            F_1 = [eye(3); zeros(2,3)];
            F_2 = [zeros(3,2); eye(2)];
            F_xj = [F_1 zeros(5,2*j-2) F_2 zeros(5,2*n_landmarks - 2*j)];
            
            H_t = h_t * F_xj;
            
            % compute Kalman gain
            K = stateCovBar * H_t' / (H_t * stateCovBar * H_t' + Q_t);
            
            % incorporate new measurement into state mean and covariance
            stateMeanBar = stateMeanBar + K * (z(1:2,k) - zHat(:,k));
            stateCovBar = (eye(2*n_landmarks+3) - (K * H_t)) * stateCovBar;
        end
    end
    
    % update state mean and covariance
    stateMean = stateMeanBar;
    stateMean(3) = conBear(stateMean(3));
    stateCov = stateCovBar;
    
    % add new pose mean to estimated poses
    Robots{robot_num}.Est(i,:) = [t stateMean(1) stateMean(2) stateMean(3)];
end

% calculate land_loss: sum of squares of error in final landmark position
land_loss = 0;
for i = 1:n_landmarks
    x_diff = stateMean(2 * i + 2) - Landmark_Groundtruth(i, 2);
    y_diff = stateMean(2 * i + 3) - Landmark_Groundtruth(i, 3);
    sq_err = x_diff^2 + y_diff^2;
    land_loss = land_loss + sq_err;
end
disp(land_loss);

p_loss = path_loss(Robots, robot_num, start);
disp(p_loss);
animateMRCLAMdataSet(Robots, Landmark_Groundtruth, timesteps, deltaT);

Intro to the EKF Step 1: Localization With Known Correlation

This tutorial is part of a series on Simultaneous Localization and Mapping (SLAM) using the extended Kalman filter. See the following for other tutorials in the series:

In this tutorial I’ll focus on one of the easiest state estimation problems: localization with known measurement correspondence. The “localization” part means that the robot already knows the positions of all of the landmarks in the environment. The “known correspondence” part means that when the robot measures the range and bearing to a landmark it always knows which landmark it’s measuring.

In this tutorial I’ll explain the EKF algorithm and then demonstrate how it can be implemented using the UTIAS dataset. I’ll break it down into the following sections:


Intro To The Algorithm

First off, why do we need this complicated algorithm for localization? Can’t we just track our position with odometry? This isn’t possible because a robot’s pose over time is not deterministic; it can’t be calculated exactly due to errors (also referred to as noise) in the odometry which accumulate over time.

This is why the Kalman filter represents the robot’s pose at time t as a multivariate normal probability distribution with a mean at \mu_t and covariance of \Sigma_t. The robot’s expected (or most likely) pose is at \mu_t and the uncertainty of that pose is quantified by \Sigma_t. The Kalman filter updates the robot’s pose using the robot’s motion from odometry u_t. Then it corrects for noise in the motion update using the robot’s sensor readings z_t.

One important advantage of the Kalman filter (and other similar algorithms) is that it is recursive. It only requires information about the state of the robot at time t-1 in order to estimate the state at time t. In other words, it doesn’t require information about any of the prior states of the robot from time t_0 to time t-2. This means it requires less memory and processor time to compute!

Below I present a mathematical definition for the extended Kalman filter. Most of it is taken from table 7.2 of Probabilistic Robotics by Thrun, Fox, and Burgard. It will be helpful to follow along in Probabilistic Robotics in addition to this tutorial. I’d also recommend reading chapter 3 of that book to gain a better understanding of Gaussian filters and recursive estimation.



Inputs

  • the mean of the pose estimate from the previous timestep: \mu_{t-1} = \begin{bmatrix} x_{t-1} &y_{t-1} &\theta_{t-1} \end{bmatrix}^T
    • x_{t-1} is the robot’s previous x position in meters
    • y_{t-1} is the robot’s previous y position in meters
    • \theta_{t-1} is the robot’s previous heading in radians
  • the covariance matrix of the previous timestep, \Sigma_{t-1}
    • this represents the uncertainty of the pose estimate
  • the control inputs (or odometry) at the current timestep, u_t=\begin{bmatrix} v_t &\omega_t \end{bmatrix}^T
    • v_t is the robot’s linear velocity in meters/sec
    • \omega_t is the robot’s angular velocity in radians/sec
  • the measurements from the current timestep, z_t = \begin{bmatrix} z_t^1 &z_t^2 &\dots &z_t^k \end{bmatrix} where k is the number of measurements at time t
    • each z_t^i = \begin{bmatrix} r_t^i &\phi_t^i &s_t^i \end{bmatrix}^T where
      • r_t^i is the range to landmark i in meters
      • \phi_t^i is the bearing to landmark i in radians
      • s_t^i is the ID of landmark i
  • c_t is the correspondence between the measurement and landmarks
    • for our purposes this will be ignored for now
  • the map m = \begin{bmatrix}m_1 &m_2 &\dots &m_n \end{bmatrix} where n is the number of landmarks
    • each m_i = \begin{bmatrix} m_{i,x} &m_{i,y} &m_{i,s} \end{bmatrix}^T where
      • m_{i,x} is the x position of landmark i in meters
      • m_{i,y} is the y position of landmark i in meters
      • m_{i,s} is the ID of landmark i



Outputs

  • the updated pose estimate, \mu_t
  • the updated covariance matrix, \Sigma_t
  • the measurement probability, p_{z_t}, which is the likelihood of the measurements z_t given the robot’s estimated pose before measurements were incorporated: \bar{\mu}_t
    • We’ll be ignoring this for now as well



Algorithm

    \begin{flalign*} \text{Loc} &\text{alization with known correspondence}(\mu_{t-1}, \Sigma_{t-1},u_t,z_t,c_t,m): \\ &\theta = \mu_{t-1,\theta} \\ &G_t = \begin{bmatrix} 1 &0 &-v_t \Delta t \sin(\theta + \frac{\omega_t \Delta t}{2})\\ 0 &1 &v_t \Delta t \cos(\theta + \frac{\omega_t \Delta t}{2})\\ 0 &0 &1 \end{bmatrix} \\ &V_t = \begin{bmatrix} \cos(\theta + \frac{\omega_t \Delta t}{2}) &-\frac{1}{2} \sin(\theta + \frac{\omega_t \Delta t}{2}) \\ \sin(\theta + \frac{\omega_t \Delta t}{2}) &\frac{1}{2} \cos(\theta + \frac{\omega_t \Delta t}{2}) \\ 0 &1 \end{bmatrix} \\ &M_t = \begin{bmatrix} \alpha_1 v_{t}^{2} + \alpha_2 \omega_{t}^{2} &0 \\ 0 &\alpha_{3} v_{t}^{2} + \alpha_4 \omega_{t}^{2} \end{bmatrix} \\ &\bar{\mu}_t = \mu_{t-1} + \begin{bmatrix} v_t \Delta t \cos(\theta + \frac{\omega_t \Delta t}{2}) \\ v_t \Delta t \sin(\theta + \frac{\omega_t \Delta t}{2}) \\ \omega_t \Delta t \end{bmatrix} \\ &\bar{\Sigma}_t = G_t \Sigma_{t-1} G_{t}^{T} + V_t M_t V_{t}^{T} \\ &Q_t = \begin{bmatrix} \sigma_{r}^{2} &0 &0 \\ 0 &\sigma_{\phi}^{2} &0 \\ 0 &0 &\sigma_{s}^{2} \end{bmatrix}\\ &\text{for all observed features $z_{t}^{i}=\begin{bmatrix} r_{t}^{i} &\phi_{t}^{i} &s_{t}^{i}\end{bmatrix}^T$ do} \\ \end{flalign*}

    \begin{flalign*} &j=c_{t}^{i}\\ &q = (m_{j,x} - \bar{\mu}_{t,x})^2 + (m_{j,y} - \bar{\mu}_{t,y})^2 \\ &\hat{z}_{t}^{i} = \begin{bmatrix} \sqrt{q} \\ \text{atan2}(m_{j,y} - \bar{\mu}_{t,y} \text{,  }m_{j,x} - \bar{\mu}_{t,x}) - \bar{\mu}_{t, \theta} \\ m_{j,s} \end{bmatrix} \\ &H_{t}^{i} = \begin{bmatrix} -\frac{m_{j,x} - \bar{\mu}_{t,x}}{\sqrt{q}} & -\frac{m_{j,y} - \bar{\mu}_{t,y}}{\sqrt{q}} &0 \\ \frac{m_{j,y} - \bar{\mu}_{t,y}}{q} &-\frac{m_{j,x} - \bar{\mu}_{t,x}}{q} &-1 \\ 0 &0 &0 \end{bmatrix} \\ &S_{t}^{i} = H_{t}^{i} \bar{\Sigma}_t \lbrack H_{t}^{i} \rbrack ^T + Q_t \\ &K_{t}^{i} = \bar{\Sigma}_t \lbrack H_{t}^{i} \rbrack ^T \lbrack S_{t}^{i} \rbrack ^{-1} \\ &\bar{\mu}_t = \bar{\mu}_t + K_{t}^{i}(z_{t}^{i} - \hat{z}_{t}^{i}) \\ &\bar{\Sigma}_t = (I - K_{t}^{i} H_t^i) \bar{\Sigma}_t \\ \end{flalign*}

    \begin{flalign*} &\text{end for} \\ &\mu_t = \bar{\mu}_t \\ &\Sigma_t = \bar{\Sigma}_t \\ &p_{z_t} = \prod_i \det(2 \pi S_t^i)^{-\frac{1}{2}} \exp \{ -\frac{1}{2}(z_t^i - \hat{z}_t^i)^T \lbrack S_t^i \rbrack ^{-1} (z_t^i - \hat{z}_t^i) \} \\ &\text{return } \mu_t, \Sigma_t, p_{z_t} \\ \end{flalign*}



Motion Update

The algorithm is divided into two parts: the motion update and the sensor update. First, in the motion update the odometry information is incorporated into the state estimate. The motion update is based around a motion model g(u_t, x_{t-1}), which calculates a new pose based on the previous pose x_{t-1} and the odometry u_t.

(1)   \begin{equation*} \bar{\mu}_t = g(u_t, x_{t-1}) = \mu_{t-1} + \begin{bmatrix} v_t \Delta t \cos(\theta + \frac{\omega_t \Delta t}{2}) \\ v_t \Delta t \sin(\theta + \frac{\omega_t \Delta t}{2}) \\ \omega_t \Delta t \end{bmatrix} \end{equation*}

In this equation \bar{\mu}_t denotes the estimated updated mean of the pose distribution. The motion model I’m using assumes that the robot rotates in place, then drives in a straight line, then rotates in place again. This is a common approximation and it’s accurate enough as long as \Delta t is small.

In addition to calculating a new mean for our robot’s pose distribution \bar{\mu}_t we need to calculate an estimated covariance for the new pose, \bar{\Sigma}_t. In order to do this we need to multiply the covariance from the last timestep, \Sigma_{t-1}, by the derivative (also called the Jacobian) of the motion model with respect to the previous state x_{t-1}:

(2)   \begin{equation*} G_t = \frac{\partial g(u_t, \mu_{t-1})}{\partial x_{t-1}} = \begin{bmatrix} 1 &0 &-v_t \Delta t \sin(\theta + \frac{\omega_t \Delta t}{2})\\ 0 &1 &v_t \Delta t \cos(\theta + \frac{\omega_t \Delta t}{2})\\ 0 &0 &1 \end{bmatrix} \end{equation*}

The multiplication looks like the following: G_t \Sigma_{t-1} G_t^T. We also need to calculate the additional noise (or uncertainty) associated with the motion that occurred between t-1 and t.

(3)   \begin{equation*} M_t = \begin{bmatrix} \alpha_1 v_{t}^{2} + \alpha_2 \omega_{t}^{2} &0 \\ 0 &\alpha_{3} v_{t}^{2} + \alpha_4 \omega_{t}^{2} \end{bmatrix} \end{equation*}

The alphas \alpha_1 through \alpha_4 are noise parameters that, unfortunately, need to be hand-tuned for the particular robot being used. Also, M_t is in control space and we need to map it into state space. To do this we need the Jacobian of the motion model with respect to the motion parameters u_t:

(4)   \begin{equation*} V_t = \frac{\partial g(u_t, \mu_{t-1})}{\partial u_t} = \begin{bmatrix} \cos(\theta + \frac{\omega_t \Delta t}{2}) &-\frac{1}{2} \sin(\theta + \frac{\omega_t \Delta t}{2}) \\ \sin(\theta + \frac{\omega_t \Delta t}{2}) &\frac{1}{2} \cos(\theta + \frac{\omega_t \Delta t}{2}) \\ 0 &1 \end{bmatrix} \end{equation*}

So we can map the noise M_t to state space with the following multiplication: V_t M_t V_t^T. To get the complete updated covariance \bar{\Sigma}_t we add the transformed covariance from the previous timestep to the additional noise:

(5)   \begin{equation*} \bar{\Sigma}_t = G_t \Sigma_{t-1} G_{t}^{T} + V_t M_t V_{t}^{T} \end{equation*}

So the motion update step updates the robot’s pose mean and covariance based on its odometry. You may have noticed, however, that the covariance only gets larger in this step. This means our uncertainty about the robot’s pose always grows when we incorporate odometry. We can decrease the uncertainty by incorporating sensor information, which is what we do in the next step.

Note: The motion model and associated Jacobians presented in the preceding section are different from those presented in Probabilistic Robotics. The model presented in Probabilistic Robotics assumes that the robot is always moving in an arc and its angular velocity, \omega_t, is never zero. The Jacobians for this model involve dividing by \omega_t. This means that the model breaks if \omega_t is ever equal to zero. This is a problem for us because \omega_t often is equal to zero in the UTIAS dataset. That is why I subbed in my own motion model.
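
Putting equations (1) through (5) together, the whole motion update can be sketched in a few lines (numpy, illustrative only; v and w are the linear and angular velocities from the odometry and alphas are the four noise parameters):

import numpy as np

def motion_update(mu, Sigma, v, w, dt, alphas):
    theta = mu[2]
    half_rot = w * dt / 2.0
    trans = v * dt
    # equation 1: predicted pose mean
    mu_bar = mu + np.array([trans * np.cos(theta + half_rot),
                            trans * np.sin(theta + half_rot),
                            w * dt])
    # equation 2: Jacobian of the motion model w.r.t. the previous pose
    G = np.array([[1.0, 0.0, -trans * np.sin(theta + half_rot)],
                  [0.0, 1.0,  trans * np.cos(theta + half_rot)],
                  [0.0, 0.0,  1.0]])
    # equation 3: motion noise in control space
    M = np.diag([alphas[0] * v**2 + alphas[1] * w**2,
                 alphas[2] * v**2 + alphas[3] * w**2])
    # equation 4: Jacobian of the motion model w.r.t. the control input
    V = np.array([[np.cos(theta + half_rot), -0.5 * np.sin(theta + half_rot)],
                  [np.sin(theta + half_rot),  0.5 * np.cos(theta + half_rot)],
                  [0.0, 1.0]])
    # equation 5: predicted pose covariance
    Sigma_bar = G @ Sigma @ G.T + V @ M @ V.T
    return mu_bar, Sigma_bar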



Sensor Update

The sensor update refines the robot’s pose estimate and decreases its uncertainty. This happens in the for loop of the algorithm. It is run for every measurement that occurs at timestep t.

In the first step of the sensor update we calculate what the range and bearing to the landmark would have been if the estimated pose mean \bar{\mu}_t were correct. This is called the expected measurement \hat{z}_t and it is calculated using the measurement model h_t(x_t,m).

(6)   \begin{equation*} q = (m_{j,x} - \bar{\mu}_{t,x})^2 + (m_{j,y} - \bar{\mu}_{t,y})^2 \end{equation*}

(7)   \begin{equation*} \hat{z}_{t}^{i} = h(x_t, m) = \begin{bmatrix} \sqrt{q} \\ \text{atan2}(m_{j,y} - \bar{\mu}_{t,y} \text{,  }m_{j,x} - \bar{\mu}_{t,x}) - \bar{\mu}_{t, \theta} \\ m_{j,s} \end{bmatrix} \\ \end{equation*}

Remember that m_{j,x}, m_{j,y}, and m_{j,s} are the x position, y position, and ID of landmark j, respectively.

We also need to calculate the additional noise associated with the measurement. To do this we need the Jacobian of the measurement model with respect to the robot’s pose:

(8)   \begin{equation*} H_{t}^{i} = \frac{\partial h(\bar{\mu}_t, m)}{\partial x_t} = \begin{bmatrix} -\frac{m_{j,x} - \bar{\mu}_{t,x}}{\sqrt{q}} & -\frac{m_{j,y} - \bar{\mu}_{t,y}}{\sqrt{q}} &0 \\ \frac{m_{j,y} - \bar{\mu}_{t,y}}{q} &-\frac{m_{j,x} - \bar{\mu}_{t,x}}{q} &-1 \\ 0 &0 &0 \end{bmatrix} \end{equation*}

The measurement noise is calculated by multiplying the prior estimated covariance with the measurement model Jacobian and adding additional noise for the new measurement. The additional noise Q_t is a constant for all measurements. It is a 3 \times 3 diagonal matrix with noise parameters \sigma_r, \sigma_\phi, and \sigma_s. The sigmas represent the noise for the landmark range, bearing, and ID, respectively. Like the alphas in equation (3), the sigmas must be hand-tuned in order to obtain good performance.

(9)   \begin{equation*} S_t^i = H_t^i \bar{\Sigma}_t \lbrack H_t^i \rbrack ^T + Q_t \end{equation*}

S_t^i is then used to calculate the Kalman gain, K_t^i. The Kalman gain determines how heavily the sensor information should be weighted when correcting the estimated new pose \bar{\mu}_t and \bar{\Sigma}_t.

(10)   \begin{equation*} K_{t}^{i} = \bar{\Sigma}_t \lbrack H_{t}^{i} \rbrack ^T \lbrack S_{t}^{i} \rbrack ^{-1} \end{equation*}

The Kalman gain is used to update the estimated pose mean and covariance \bar{\mu}_t and \bar{\Sigma}_t as follows:

(11)   \begin{equation*} \bar{\mu}_t = \bar{\mu}_t + K_{t}^{i}(z_{t}^{i} - \hat{z}_{t}^{i}) \end{equation*}

(12)   \begin{equation*} \bar{\Sigma}_t = (I - K_{t}^{i} H_t^i) \bar{\Sigma}_t \end{equation*}

The amount of correction to the estimated pose mean depends on K_t^i and the difference between the expected measurement \hat{z}_t^i (calculated in equation (7)) and the actual measurement z_t^i. A larger discrepancy between the expected and actual measurement will result in a larger correction.

After the sensor update is complete the pose mean and covariance are updated with the estimated values: \mu_t = \bar{\mu}_t and \Sigma_t = \bar{\Sigma}_t. The measurement probability is also calculated here but we are not making use of this value.
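
For completeness, here’s a matching sketch of one pass through the sensor update (equations (6) through (12)) for a single measurement z = [r, phi, s] of a landmark m = [m_x, m_y, m_s] (numpy, illustrative only):

import numpy as np

def sensor_update(mu_bar, Sigma_bar, z, landmark, Q):
    # equations 6 and 7: expected measurement to the landmark
    dx = landmark[0] - mu_bar[0]
    dy = landmark[1] - mu_bar[1]
    q = dx**2 + dy**2
    z_hat = np.array([np.sqrt(q),
                      np.arctan2(dy, dx) - mu_bar[2],
                      landmark[2]])
    # equation 8: measurement Jacobian
    H = np.array([[-dx / np.sqrt(q), -dy / np.sqrt(q),  0.0],
                  [ dy / q,          -dx / q,          -1.0],
                  [ 0.0,              0.0,              0.0]])
    # equations 9 and 10: innovation covariance and Kalman gain
    S = H @ Sigma_bar @ H.T + Q
    K = Sigma_bar @ H.T @ np.linalg.inv(S)
    # equations 11 and 12 (the bearing difference should be angle-wrapped in practice)
    mu_bar = mu_bar + K @ (z - z_hat)
    Sigma_bar = (np.eye(3) - K @ H) @ Sigma_bar
    return mu_bar, Sigma_bar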



Code

When writing robotics software I’m often frustrated by the separation between theory and software engineering. There are many good theoretical explanations of the extended Kalman filter online. There are also several good examples of code that uses an EKF. But I was unable to find any textbook, academic paper, blog post, or stackexchange thread that did a good job of demonstrating how the math translates into code. I’m hoping my code below will change that. I’ve been careful to explain all major steps and refer to the equations above in the comments. I hope you find it helpful!

Lastly, note that comments in MATLAB begin with a % symbol.

addpath ../common; % adds support functions used in all my EKF programs
                   % check my github for details

deltaT = .02; 
alphas = [.2 .03 .09 .08 0 0]; % robot-dependent motion noise parameters
                               % see equation 3

% robot-dependent sensor noise parameters
% see equation 9
sigma_range = 2;
sigma_bearing = 3;
sigma_id = 1;

Q_t = [sigma_range^2 0 0;
       0 sigma_bearing^2 0;
       0 0 sigma_id^2];

measurement_prob = 0;
n_robots = 1;
robot_num = 1;

% load and resample raw data from UTIAS data set
% The robot's position groundtruth, odometry, and measurements 
% are stored in Robots
[Barcodes, Landmark_Groundtruth, Robots] = loadMRCLAMdataSet(n_robots);
[Robots, timesteps] = sampleMRCLAMdataSet(Robots, deltaT);

% add pose estimate matrix to Robots
% data will be added to this as the program runs
Robots{robot_num}.Est = zeros(size(Robots{robot_num}.G,1), 4);

% initialize time, and pose estimate
start = 600; % start index is set to 600 because earlier data was 
             % found to cause problems
t = Robots{robot_num}.G(start, 1); % set start time

% set starting pose mean to pose groundtruth at start time
poseMean = [Robots{robot_num}.G(start,2);
            Robots{robot_num}.G(start,3);
            Robots{robot_num}.G(start,4)];
poseCov = [0.01 0.01 0.01;
           0.01 0.01 0.01;
           0.01 0.01 0.01];
       
% tracks which measurement is next received
measurementIndex = 1;

% set up map between barcodes and landmark IDs
codeDict = containers.Map(Barcodes(:,2),Barcodes(:,1));

% advance measurement index until the next measurement time 
% is greater than the starting time
while (Robots{robot_num}.M(measurementIndex, 1) < t - .05)
        measurementIndex = measurementIndex + 1;
end

% loop through all odometry and measurement samples
% updating the robot's pose estimate with each step
% reference table 7.2 in Probabilistic Robotics
for i = start:size(Robots{robot_num}.G, 1)
    theta = poseMean(3, 1);
    % update time
    t = Robots{robot_num}.G(i, 1);
    % update movement vector per equation 1
    u_t = [Robots{robot_num}.O(i, 2); Robots{robot_num}.O(i, 3)];

    rot = deltaT * u_t(2);
    halfRot = rot / 2;
    trans = u_t(1) * deltaT;
    
    % calculate the movement Jacobian per equation 2
    G_t = [1 0 trans * -sin(theta + halfRot);
           0 1 trans * cos(theta + halfRot);
           0 0 1];
    % calculate motion covariance in control space per equation 3
    M_t = [(alphas(1) * abs(u_t(1)) + alphas(2) * abs(u_t(2)))^2 0;
           0 (alphas(3) * abs(u_t(1)) + alphas(4) * abs(u_t(2)))^2];
       
    % calculate Jacobian to transform motion covariance to state space
    % per equation 4
    V_t = [cos(theta + halfRot) -0.5 * sin(theta + halfRot);
           sin(theta + halfRot) 0.5 * cos(theta + halfRot);
           0 1];
    
    % calculate pose update
    poseUpdate = [trans * cos(theta + halfRot);
                  trans * sin(theta + halfRot);
                  rot];
              
    % calculate estimated pose mean per equation 1
    poseMeanBar = poseMean + poseUpdate;
    % calculate estimated pose covariance per equation 5
    poseCovBar = G_t * poseCov * G_t' + V_t * M_t * V_t';
    
    
    % get measurements for the current timestep, if any exist
    [z, measurementIndex] = getObservations(Robots, robot_num, t, measurementIndex, codeDict);
    
    % create two matrices for expected measurement and measurement
    % covariance
    S = zeros(size(z,2),3,3);
    zHat = zeros(3, size(z,2));
    
    % if any measurements are available
    if z(3,1) > 1
        for k = 1:size(z, 2) % loop over every measurement
            j = z(3,k);

            % get coordinates of the measured landmark
            m = Landmark_Groundtruth(j, 2:3);

            % compute the expected measurement per equations 6 and 7
            xDist = m(1) - poseMeanBar(1);
            yDist = m(2) - poseMeanBar(2);
            q = xDist^2 + yDist^2;
            
            % constrains expected bearing to between 0 and 2*pi
            pred_bear = conBear(atan2(yDist, xDist) - poseMeanBar(3));
            
            zHat(:,k) = [sqrt(q);
                         pred_bear;
                         j];

            % calculate Jacobian of the measurement model per equation 8
            H = [(-1 * (xDist / sqrt(q))) (-1 * (yDist / sqrt(q))) 0;
                 (yDist / q) (-1 * (xDist / q)) -1;
                 0 0 0];
             
            % compute S per equation 9
            S(k,:,:) = H * poseCovBar * H' + Q_t;

            % compute Kalman gain per equation 10
            K = poseCovBar * H' * inv(squeeze(S(k,:,:)));
                
            % update pose mean and covariance estimates
            % per equations 11 and 12
            poseMeanBar = poseMeanBar + K * (z(:,k) - zHat(:,k));
            poseCovBar = (eye(3) - (K * H)) * poseCovBar;
        end
    end
    
    % update pose mean and covariance
    % constrains heading to between 0 and 2*pi
    poseMean = poseMeanBar;
    poseMean(3) = conBear(poseMean(3));
    poseCov = poseCovBar;
    
    % add pose mean to estimated position vector
    Robots{robot_num}.Est(i,:) = [t poseMean(1) poseMean(2) poseMean(3)];
    
    % calculate error between mean pose and groundtruth
    % for testing only
    %{
    groundtruth = [Robots{robot_num}.G(i, 2); 
                   Robots{robot_num}.G(i, 3); 
                   Robots{robot_num}.G(i, 4)];
    error = groundtruth - poseMean;
    %}
   
end

% display results!
animateMRCLAMdataSet(Robots, Landmark_Groundtruth, timesteps, deltaT);

Note: My tuning of the noise parameters in the code above is workable, but I didn’t spend as much time on them as I did for subsequent iterations of this tutorial. If you’d like to play with them I’m sure you can improve on my tunings!


Camera Pose Estimation Using Convolutional Neural Networks

It’s been quite a while since I last posted here. Since my last post I started a PhD in computer science at the University of Colorado, Boulder. I’ve been working on a lot of fun robotics-related stuff lately, but I haven’t had the time to work on Colin. However, I recently finished a really interesting project and I thought I’d take some time to do a quick write-up on it. The project really encapsulates what I’ve loved about my first semester in grad school. When I started I knew basically nothing about neural networks or pose estimation. But in just a few short weeks I had learned enough to build, train, and test a neural network for camera pose estimation. I absolutely love that this is my job now.

My project involved implementing the method for camera pose estimation presented in this paper. The basic idea is this: given two photos of the same scene taken from different positions we need to determine how the camera moved between those two photos. We must calculate both the translation and rotation in three dimensions. A good solution to this problem has many applications, but the one I’m most interested in is visual odometry, the process of determining a robot’s change in position using information from its camera.

For this post I describe how this problem is traditionally addressed. I then present the new method and compare it to the traditional methods. Lastly I present the code for my implementation of the new method. For the implementation section I assume you have some familiarity with Python 3 and Keras. If not there’s plenty of good tutorials on both. I didn’t know either about a month and a half ago. I also didn’t know anything about neural networks when I started this project. If you’re in the same boat I’d suggest you go through these notes and this tutorial. I found them very helpful.

The Traditional Approach
Using CNNs
Comparing Approaches
The Code

Challenge: compute the relative camera pose change between two photos of the same scene




The Traditional Approach


The normal approach to computing the relative pose change between two images involves feature matching. To be brief, a feature detector such as FAST is used to find “keypoints” in both input images. The algorithm assigns a unique descriptor, such as SIFT or BRIEF, to each keypoint. The sets of descriptors from the two images are then compared to find the keypoints common to both images (the intersection of the two sets). Finally, the geometric relationship between the matched keypoints (a homography for planar scenes, or more generally the essential matrix) is computed. This allows us to calculate the change in camera pose between the two images.
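
For illustration, here’s roughly what that pipeline looks like with OpenCV, using ORB features and the essential matrix for a calibrated camera (a sketch only, not the code used in the paper’s comparison; K is the camera intrinsic matrix):

import cv2
import numpy as np

def relative_pose(img1, img2, K):
    # detect keypoints and compute binary descriptors in both images
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    # match descriptors (Hamming distance for binary ORB descriptors)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # recover rotation and (up-to-scale) translation from the essential matrix
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t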

This approach is computationally intensive. Each feature descriptor must be compared to every other descriptor to find matches. Additionally, in order to perform well it must be able to find a large set of matching feature points between the two images. There are some cases where this is not possible, however. Scenes with little texture yield too few detected features. Objects with very repetitive structure, on the other hand, produce many similar features, and the matching algorithm finds spurious correspondences between them. Lastly, if the change in camera pose between the images is too great, there may not be sufficient overlap between the sets of features from each image.

Feature Matching

Case in which feature matching is not able to find enough matching features for accurate pose estimation.



Using Convolutional Neural Networks


Because of the limitations of traditional feature-matching for relative camera pose estimation there have been several attempts to employ convolutional neural networks for this purpose.

The method presented in the subject paper is to feed each image into a separate branch of a Siamese neural network. A Siamese neural network contains two or more branches, or subnetworks. The architecture and weights of all the branches are identical. Siamese networks are adept at finding similarities or relationships between images. The structure of the Siamese branches for the network presented in this paper is the same as the convolutional section of AlexNet. AlexNet is a popular neural network for image classification.

The outputs from the Siamese branches are concatenated and then fed into a single fully-connected layer with 7 outputs. These outputs correspond to the 3D translation vector and the orientation quaternion required to move the camera from its initial pose to its final pose.

One additional interesting feature of this approach is that it applies a unique pooling technique between the convolutional and fully-connected sections of the network. Convolutional networks can theoretically accept inputs of arbitrary size. However, the fully connected layers at the end of a convolutional network require fixed-size input. Normally images are cropped or downsampled before being fed into the convolutional section of the network to ensure the input to the fully connected layers is the correct size. However, this can also result in a loss of information in the original image. Spatial pyramid pooling (SPP) generates a fixed-size representation from an arbitrary input size. This means that, by including an SPP layer between the convolutional and fully connected layers, we can give our network the ability to accept arbitrary-sized inputs. It also means that the convolutional section of the network is able to use all of the information present in the input images. The authors of this paper found that it significantly increased the accuracy of their model.


Comparing The Approaches


The performance of the convolutional neural network was compared to the OpenCV implementations of two feature matching methods: SURF and ORB. Images from the DTU robot image dataset were used to test all three methods. The DTU dataset was created by taking photos of several scenes using a camera positioned by a robotic arm. This means that we know the groundtruth for the change in camera pose with high accuracy. So evaluation of the accuracy of each method is quite simple. Additionally, it means the labels used for training the CNN are highly accurate.

Feature matching methods were, in general, more accurate. However, there are certain cases in which the neural network performs better. For larger changes in position and orientation the CNN can slightly outperform feature-based methods. Also, the CNN can perform significantly better in cases where it is difficult to find and match features, for example when the images do not contain a sufficient number of features.

Histogram of errors for orientation (left) and translation (right). The above histograms compare 4 different CNN architectures to SURF and ORB.

 


The Code

I’ve posted the Python 3 code for the neural network below. Supporting code for loading data from the image files and visualizing the results can be found on my github repository. Also, the image files themselves can be downloaded from the DTU website.

# camera-pose.py
# Andrew Kramer

# regresses the relative camera pose between two images using the method 
# presented in https://arxiv.org/pdf/1702.01381.pdf

import numpy as np
import keras
import math
import os

from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D, Input, Lambda
from keras import backend as K
from dataloader import DataLoader
from pooling.spp.SpatialPyramidPooling import SpatialPyramidPooling

beta = 10
epochs = 10

# define loss function for training the network;
# y_true and y_pred have shape (batch_size, 7): translation (3) then quaternion (4)
def custom_objective(y_true, y_pred):
	error = K.square(y_pred - y_true)
	# magnitude of the translation and orientation error for each example
	transMag = K.sqrt(K.sum(error[:, :3], axis=-1))
	orientMag = K.sqrt(K.sum(error[:, 3:], axis=-1))
	return K.mean(transMag + (beta * orientMag))

def dot_product(v1, v2):
	return sum((a*b) for a,b in zip(v1,v2))

def length(v):
	return math.sqrt(dot_product(v,v))

# compute the mean angle (in radians) between the predicted and ground-truth
# translation vectors and between the predicted and ground-truth quaternions
def compute_mean_error(y_true, y_pred):
	trans_error = 0
	orient_error = 0
	for i in range(0,y_true.shape[0]):
		trans_error += math.acos(dot_product(y_true[i,:3], y_pred[i,:3])/
					(length(y_true[i,:3]) * length(y_pred[i,:3])))
		orient_error += math.acos(dot_product(y_true[i,3:], y_pred[i,3:])/
					 (length(y_true[i,3:]) * length(y_pred[i,3:])))
	mean_trans = trans_error / y_true.shape[0]
	mean_orient = orient_error / y_true.shape[0]
	return mean_trans, mean_orient

# define the architecture for the Siamese branches
def create_conv_branch(input_shape):
	model = Sequential()
	model.add(Conv2D(96, kernel_size=(11,11),
					 strides=4, padding='valid',
					 activation='relu',
					 input_shape=input_shape))
	model.add(MaxPooling2D(pool_size=(2,2), strides=2))
	model.add(Conv2D(256, kernel_size=(5,5),
					 strides=1, padding='same',
					 activation='relu'))
	model.add(MaxPooling2D(pool_size=(3,3), strides=1))
	model.add(Conv2D(384, kernel_size=(3,3),
					 strides=1, padding='same',
					 activation='relu'))
	model.add(Conv2D(384, kernel_size=(3,3),
					 strides=1, padding='same',
					 activation='relu'))
	model.add(Conv2D(256, kernel_size=(3,3),
					 strides=1, padding='same',
					 activation='relu'))
	model.add(SpatialPyramidPooling([1,2,3,6,13]))
	return model

if __name__ == "__main__":

	img_rows, img_cols = 227, 227
	category_IDs = list(range(1,25)) # category IDs from which to pull test and training data
	file_name = 'huge_model_spp_10epoch.h5'
	model = None
	# load training and testing data:
	loader = DataLoader(category_IDs, img_rows, img_cols)
	train_data, test_data = loader.get_data()
	train_labels, test_labels = loader.get_labels()
	input_shape = loader.get_input_shape()

	# define structure of convolutional branches
	conv_branch = create_conv_branch(input_shape)
	branch_a = Input(shape=input_shape)
	branch_b = Input(shape=input_shape)

	processed_a = conv_branch(branch_a)
	processed_b = conv_branch(branch_b)

	# concatenate the outputs of the CNN branches
	regression = keras.layers.concatenate([processed_a, processed_b])

	output = Dense(7, kernel_initializer='normal', name='output')(regression)
	model = Model(inputs=[branch_a, branch_b], outputs=[output])

	model.compile(loss=custom_objective, 
				  optimizer=keras.optimizers.Adam(lr=.0001, decay=.00001),
				  metrics=['accuracy'])

	if os.path.isfile(file_name):
		print("model", file_name, "found")
		model.load_weights(file_name)
		print("model loaded from file")
	else:
		model.fit([train_data[:,0], train_data[:,1]], train_labels,
				  batch_size=128,
				  epochs = epochs,
				  validation_split=0.1,
				  shuffle=True)
		model.save_weights(file_name)
		print("model saved as file", file_name)

	pred = model.predict([test_data[:,0], test_data[:,1]])
	test_trans, test_orient = compute_mean_error(pred, test_labels)
	np.savetxt('pred_spp.txt', pred, delimiter=' ')
	np.savetxt('labels_spp.txt', test_labels, delimiter=' ')

	print('*     Mean translation error on test set: %0.2f' % (test_trans))
	print('*     Mean orientation error on test set: %0.2f' % (test_orient))


I’m Learning Multithreading!

Since this quarter started I’ve been doing a weekly interview practice session with friends from school. Last week we spent most of our hour on this:

Rotate Matrix: Given an image represented by an NxN matrix, where each pixel in the image is 4 bytes, write a method to rotate the image by 90 degrees. Can you do this in place?

For those interested, it’s exercise 1.7 in the sixth edition of Cracking the Coding Interview by Gayle Laakmann McDowell. It’s a great interview prep book that’s been really helpful since I started job hunting. I had a lot of fun with this one, so I’d like to do a short write-up on our thought process.


Row-to-Column Method

The obvious thing to do here is to create a new NxN matrix and copy each row of the old matrix into a column of the new matrix, as in the graphic below:

The row-to-column rotation method

The above method takes O(n) time and O(n) space, where n is the number of entries in the matrix. Coding it up is pretty simple (there’s a quick sketch below), so after that I’ll move on to the thing I’m actually interested in: the in-place method.
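Here’s a quick Python sketch of the row-to-column method, just to make the idea concrete. It rotates counterclockwise to match the in-place C++ code later in the post.

# Row-to-column rotation by 90 degrees counterclockwise.
# Allocates a whole new N x N matrix, so it uses O(n) extra space.
def rotate_left_copy(matrix):
    n = len(matrix)  # matrix is a list of N rows of N entries
    rotated = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # row i of the old matrix becomes column i of the new one,
            # filled from the bottom up
            rotated[n - 1 - j][i] = matrix[i][j]
    return rotated

print(rotate_left_copy([[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]]))
# [[3, 6, 9], [2, 5, 8], [1, 4, 7]]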


In-Place Method

We can’t get below O(n) time to rotate a matrix since every entry must be moved and there are n entries. However, it is possible to cut the space usage down from O(n) to O(1) by doing the rotation in place. See the graphic below for an illustration of the method:

The in-place rotation method

How do we implement this in code? It’s a little tricky, but the key insight is that the index an entry moves to can be computed directly from its starting index. For a 90 degree counterclockwise rotation, the entry at { i, j } moves to

Destination for { i, j } = { n – j, i }

where n is the last index in the matrix. Equivalently, the entry that ends up at { i, j } comes from { j, n – i }, which is the form the code below uses. The really cool thing about this relation is that it works for any starting index. The graphic above only shows a rotation of the outer layer of the matrix, but the same mapping works for any index in any size of square matrix.
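Here’s a quick Python sanity check of that relation (my own, not part of the program): applying the mapping four times should bring every index back to where it started, since four quarter turns make a full rotation.

# Check the index relation: under a 90 degree counterclockwise rotation
# the entry at (i, j) moves to (n - j, i), where n is the last index.
n = 3  # last index of a 4x4 matrix

def dest(i, j):
    return (n - j, i)

for i in range(n + 1):
    for j in range(n + 1):
        p = (i, j)
        for _ in range(4):  # four quarter turns
            p = dest(*p)
        assert p == (i, j)

print("all", (n + 1) ** 2, "indices return to their start after four rotations")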

The code I post on this blog has been getting longer and longer lately, so I’m only going to post the relevant functions here. If you want to see the whole program you can find it in my GitHub repo.

The program is called matrixRotate.cpp. It takes a .txt file containing the matrix to rotate as a command line argument. The text file should have one int representing the size of the matrix on the first line, and each successive line should have a space-delimited string of ints representing one row of the matrix. The program prints the rotated matrix to the console.
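For example, an input file for a 4x4 matrix (I’ll call it matrix.txt here, though the name is up to you) would look something like this:

4
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16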

The function below takes care of a single step in the matrix rotation: after it completes, 4 matrix entries have been moved.

// accepts three ints, the starting row and column indices
// and the highest index in the matrix, NOT the size
// rotates 4 matrix entries by 90 degrees counterclockwise
void rotateLeftHelper(int row, int column, int n) 
{
  int temp = matrix[row][column];
  for (int i = 0; i < 3; i++) 
  {
    int nextRow = column;
    int nextColumn = n - row;
    matrix[row][column] = matrix[nextRow][nextColumn];
    row = nextRow;
    column = nextColumn;
  }
  matrix[row][column] = temp;
}

To rotate the whole matrix we need to run the above function on every entry in the first row from { 0, 0 } to { 0, n-1 } (where n is again the last index), then on every entry in the second row from { 1, 1 } to { 1, n-2 }, and so on, moving one layer inward each time until every layer has been rotated. The function below will do this for us:

// accepts an int, the matrix size
// rotates the matrix by 90 degrees counterclockwise
void rotateLeft(int n) 
{
  int level = n - 1; // switch n from matrix size to highest index
  for (int i = 0; i < n/2; i++) 
  {
    for (int j = i; j < level; j++) 
    {
      rotateLeftHelper(i, j, n - 1);
    }
    level--;
  }
}

So, working in concert, the above functions will rotate a matrix by 90 degrees counterclockwise, and they’ll do it in place. That’s pretty cool, right? If you’ll recall though, the problem from Cracking the Coding Interview says the matrix represents an image. A raw image from a run-of-the-mill camera can easily have 20 million entries, and doing each rotation in series will tie up the CPU for a long time. Fortunately, each call to the rotateLeftHelper() function above touches its own set of four entries; no two calls ever operate on the same part of the matrix. Do you know what that means? They can run in parallel and asynchronously!


Multi-Threaded Version

Really all we need to do here is create a new thread for every call to the rotateLeftHelper() function instead of running them in series. Cool, right?

The function below implements the in-place matrix rotation method using pthreads. See matrixRotate.cpp in my GitHub repo for the whole program.

// accepts a pointer to a threadParams struct containing three ints, 
// the starting row and column indices
// and the highest index in the matrix, NOT the size
void* rotateLeftHelper(void* start) 
{
  threadParams* startState = (threadParams*) start;
  int row = startState->row;
  int column = startState->column;
  // cout << "starting thread on {" << row <<  ", " << column << "}" << endl;
  int n = startState->n;
  int temp = matrix[row][column];
  for (int i = 0; i < 3; i++) 
  {
    int nextRow = column;
    int nextColumn = n - row;
    matrix[row][column] = matrix[nextRow][nextColumn];
    row = nextRow;
    column = nextColumn;
  }
  matrix[row][column] = temp;
  return NULL;
}

// accepts an int, the matrix size
// rotates a matrix by 90 degrees counterclockwise
void rotateLeft(int n) 
{
  int level = n - 1; // switch n from matrix size to highest index
  // calculate number of threads
  int numThreads = 0;
  for (int i = level; i >= 1; i -= 2) {
    numThreads += i;
  }
  pthread_t *threads;
  threadParams *parameters;
  threads = new pthread_t[numThreads];
  parameters = new threadParams[numThreads];
  int curThread = 0;
  for (int i = 0; i < n/2; i++) 
  {
    for (int j = i; j < level; j++) 
    {
      parameters[curThread].row = i;
      parameters[curThread].column = j;
      parameters[curThread].n = n - 1;
      pthread_create(&threads[curThread], NULL, 
		     &rotateLeftHelper, (void*) &parameters[curThread]);
      curThread++;
    }
    level--;
  }
  // join all threads
  for (int i = 0; i < numThreads; i++) 
  {
    pthread_join(threads[i], NULL);
  }
  delete[] threads;
  delete[] parameters;
}

Further Improvements

So the above code will make use of parallel processing to speed up the matrix rotation. However, creating and destroying a thread for every call to rotateLeftHelper() involves a huge amount of overhead; for a 20-million-entry image that’s millions of threads. We could avoid most of that overhead by using a thread pool with a fixed number of worker threads.

Also, we’re still running the threads on the CPU. Even in a best-case scenario with an 8-core CPU, we’ll get a speedup factor of well under 8. That’s good but not great. We’re also tying up the CPU with a huge number of extremely simple tasks; it would be better to hand those tasks to a much larger number of much simpler processors. Does that sound familiar? That’s a graphics card, and it’s what image processing programs actually use for tasks like this.

I think the best thing I could do for this program would be to figure out how to create threads that run on my graphics card using OpenGL.

That’s all for now though! Feel free to email me with questions.

I Finally Learned Git!

If anyone’s tried to look at my GitHub page, they may have noticed it didn’t contain any files and wasn’t really set up. I’m proud to announce that changed today! I finally learned to use git, and I’ve made a new repo for the examples used on this blog! I’m not going to attempt an in-depth git tutorial, mostly because there are already a lot of great resources out there; I mostly used this one. I do have a bit of advice though: use the command line interface!

I tried to learn git using a graphical interface several months ago and found it really difficult to understand. In the intervening time I’ve gotten a lot more comfortable with Bash, and I recently tried learning git again using the command line interface. Seriously, use the command line if you want to learn git. It makes it much easier to understand how git is structured and how it works.

tl;dr The link to my GitHub profile at the bottom of the page now has a purpose!