Coordinate Transforms
One of the fundamental concepts in mobile robotics is the coordinate transform. It is also a concept that involves a fair amount of math. However, the underlying idea is simple and intuitive, and the math itself is straightforward. The complexity comes from the fact that there are different conventions and different ways to arrive at the solution. As long as we stick to one representation, things stay straightforward.
What and why?
Let us consider a robot which has a LIDAR mounted on it. A LIDAR is a device which scans the environment and produces clouds of points. The data from the LIDAR looks similar to what is shown in the image below.
We can clearly see the road ahead and the obstacles to the side. The three lines you see in the middle (in red, blue and green) form the coordinate frame. The lines correspond to the three axes X (red), Y (blue) and Z (green). The coordinate frame defines the position and orientation of the sensor in 3D space. The points generated by this device are expressed with respect to this coordinate frame. This device is set up such that the points are X-forward. What I mean by that is that when the vehicle moves forward, the points appear along the X direction. We can confirm this by looking at the image: as the vehicle moves, the points appear along the direction of the X axis (red).
Now consider that we add a few more sensors, say a GPS, an IMU and a camera. Each of these sensors has its own internal coordinate frame and generates data with respect to that frame. But how do we associate or make sense of all this data? We do it by expressing the coordinate frames with respect to each other. The expression that relates one coordinate frame to another is called a coordinate transform. Using coordinate transforms we can associate the data generated by these sensors and operate in a common frame of reference. The graphic below shows a cartoon of the vehicle with its sensors on the left and the coordinate frames on the right.
Representation
A coordinate transform consists of two parts: the distance (translation) between the frames and the rotation (orientation) between the frames. The translation can be represented using three numbers corresponding to the three axes. The rotation can also be represented using three numbers corresponding to the tilt about each of the axes. These angles are commonly called roll, pitch and yaw. In total, we need six numbers to represent one coordinate frame with respect to another. Since we need six independent numbers, we say that the system has six degrees of freedom.
The rotation can be represented in different ways. Roll, pitch and yaw is a common representation that is intuitive and easy to visualize. However, there are issues with this representation (gimbal lock) which necessitate the use of other methods. Rotation matrices and quaternions are alternate ways to represent rotation. These alternatives use more than three numbers to represent the three degrees of freedom; the reasoning behind why they work is beyond the scope of this article. Here, we will use rotation matrices, which are 3×3 matrices that encode the three rotation angles.
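As a rough sketch of the idea, here is how one could build a rotation matrix from roll, pitch and yaw with numpy. The function name and the Z-Y-X (yaw, pitch, roll) composition order are just assumptions for illustration; different libraries and sensors use different conventions, so always check which one yours follows.

```python
import numpy as np

def rotation_from_rpy(roll, pitch, yaw):
    """Build a 3x3 rotation matrix from roll, pitch, yaw (in radians).

    Assumes rotations about X (roll), Y (pitch) and Z (yaw), composed
    in the common Z-Y-X order. Other conventions exist.
    """
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)

    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])    # roll about X
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])    # pitch about Y
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])    # yaw about Z

    return Rz @ Ry @ Rx
```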
Homogeneous Transforms
Volumes of books have been written on the different representations of rotation and the methods to manipulate them. Let us just focus on how they are used to represent coordinate frames and transforms. Though the translation and rotation of a coordinate frame can be operated on separately, it is easier to work with them when they are represented together. Together they are represented as a 4×4 matrix called the homogeneous transform, which is of the form

$$H = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$

where R is the 3×3 rotation matrix and t is the 3×1 translation vector.
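To make this concrete, here is a minimal numpy sketch of how the rotation and translation are packed into the 4×4 matrix and how it is applied to a point. The helper name and the numbers are made up for illustration.

```python
import numpy as np

def make_homogeneous(R, t):
    """Pack a 3x3 rotation R and a 3-vector translation t into a 4x4 transform."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = t
    return H

# A point is transformed by appending a 1 (homogeneous coordinates)
# and multiplying with the 4x4 matrix.
H = make_homogeneous(np.eye(3), np.array([1.0, 2.0, 0.5]))
p = np.array([0.0, 0.0, 0.0, 1.0])      # the origin of the source frame
print(H @ p)                             # -> [1.  2.  0.5 1. ]
```

The bottom row [0, 0, 0, 1] is what lets a single matrix multiplication apply both the rotation and the translation at once.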
Coordinate Transforms
Let us now look into coordinate transforms and how to use them to move between coordinate frames. We shall use the previous example of the vehicle with different sensors on it. Let L, I, C and G represent the LIDAR, IMU, Camera and GPS frames respectively. To represent the point P in the LIDAR frame we will use the convention
$$P^L$$
Similarly, the coordinate transform that expresses a point in the LIDAR frame with respect to the camera frame is written as
$$H^C_L$$
Note the placement of the subscript and superscript carefully. We do it this way to make it easy to chain transforms together. That is, if we want to express a point in the LIDAR frame in the IMU frame, we can do something like
$$P^I = H^I_G \cdot H^G_C \cdot H^C_L \cdot P^L$$
If you observe closely, there is an elegant chaining of the frames. Starting from the right, the point is in the LIDAR frame. We premultiply it with a transform that takes it from the LIDAR frame to the camera frame. This is then premultiplied by a transform that takes it from the camera frame to the GPS frame, and so on. The order of multiplication is important; if you change the order, it will not work. On the surface, it is just a bunch of matrix multiplications. However, there is a lot of math going on in the background, all encapsulated in the underlying representation (homogeneous transforms) and matrix multiplication. Imagine doing this with plane trigonometry. It would be a nightmare.
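Here is what that chain looks like in numpy. The transforms below are placeholders (mostly identities with a made-up translation); in a real robot they would come from CAD drawings or calibration as described next.

```python
import numpy as np

# Placeholder extrinsics, named H_<to>_<from> to mirror the H^to_from notation.
H_I_G = np.eye(4)                       # GPS    -> IMU
H_G_C = np.eye(4)                       # Camera -> GPS
H_C_L = np.eye(4)                       # LIDAR  -> Camera
H_C_L[:3, 3] = [0.2, 0.0, -0.1]         # pretend the LIDAR sits 0.2 m ahead and 0.1 m below the camera

p_L = np.array([5.0, 1.0, 0.3, 1.0])    # a LIDAR point in homogeneous coordinates

# Chain right to left, exactly as in the equation above.
p_I = H_I_G @ H_G_C @ H_C_L @ p_L
print(p_I[:3])                          # the same physical point, expressed in the IMU frame
```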
All these H matrices are called coordinate transforms. The next question is how we find them. You could use a physical tape measure to estimate the distances and angles, but that is going to be very inaccurate. If the sensors are mounted based on CAD drawings (which is generally the case with any serious real-world robot), then the measurements can be obtained from the CAD drawings. These measurements are generally a lot better than the ones taken manually. Sometimes they can be obtained using calibration procedures, which can be even better than the CAD drawings. These matrices are also sometimes called extrinsics.
Pose
The part that generally confuses me when using coordinate transforms is the pose. A pose is a 4×4 homogeneous transform that represents the coordinate frame of a sensor or robot. We can get the pose from sensors like an IMU or GPS which track the position/orientation of the robot. Let us say that our robot moved from a position A to the current position B, and let O be the origin frame. Let the output of the IMU at position A be Q and at position B be R. Now, how do you write Q in the superscript-subscript form using the different frames? This is where I generally get confused. Q and R can be written as
$$Q=H^O_A$$
$$R=H^O_B$$
Observe the subscript and superscript carefully. Q is a transform that takes a point from frame A to frame O. I tend to write it the other way around, and that messes up all the calculations. It is important to get this right. In fact, I wrote this post just to get to this point, as I have tripped on it so many times.
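To make the direction of the subscripts concrete, here is a small numpy sketch. The poses and the point are made up; the key step is that going from the origin frame back to frame A uses the inverse of Q.

```python
import numpy as np

# Made-up poses reported at positions A and B: Q = H^O_A, R = H^O_B.
Q = np.eye(4)
Q[:3, 3] = [10.0, 5.0, 0.0]              # robot frame A: 10 m along X, 5 m along Y from the origin
R = np.eye(4)
R[:3, 3] = [12.0, 5.0, 0.0]              # robot frame B: 2 m further along X

# A point measured by the robot while at B, in homogeneous coordinates.
p_B = np.array([1.0, 0.0, 0.0, 1.0])

p_O = R @ p_B                            # H^O_B takes the point from frame B to the origin frame
p_A = np.linalg.inv(Q) @ p_O             # inv(H^O_A) = H^A_O takes it from the origin frame to frame A
print(p_A[:3])                           # -> [3. 0. 0.]
```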
This should be sufficient to get you started with sensor data and coordinate frames. The more you work with coordinate transforms, the more you will explore and understand. Remember, though, that we have just scratched the surface here; there is a lot more to learn. Now that we have a framework, it should be easy to explore and expand our knowledge. If you want me to expand on a particular section, please leave a comment and I will do so if there is enough interest.