Rigid Object Pose Estimation from an Image Using Geometry


3D pose estimation is a fascinating problem in the computer vision field: the goal is to recover the 3D orientation of an object from a 2D image, which can be challenging. Such solutions are mostly used in the automotive industry to detect the orientation of cars in street scenes.

The car industry is mostly focused on autonomous driving problems, and most of the publicly available datasets contain information about rotation around only one axis. Usually this is the Y-axis (yaw angle), while the X-axis (roll angle) and Z-axis (pitch angle) are either not considered at all or are always set to zero. This approximation is sufficient for autonomous driving, but other areas of the car industry require orientation around more than one axis; one interesting example is parking occupancy. For parking occupancy, the cameras that detect cars and the lines delimiting parking spaces can be mounted at different heights, or even view the scene from a bird's-eye perspective, and all of these setups require the full three-axis orientation to be known. That being said, there is a need for an analytical solution that represents the full 3D orientation of an object, concretely a vehicle.


Our proposal uses the Euler angle representation and two coordinate systems (one for the camera and one for the detected object) to define the orientation angle for each axis in 3D space.
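As a minimal sketch of the Euler angle representation, the snippet below composes a rotation matrix from the three angles named above (yaw about Y, roll about X, pitch about Z). The axis convention and composition order `Ry @ Rz @ Rx` are illustrative assumptions, not the paper's definitive choice:

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Compose a 3x3 rotation matrix from Euler angles (radians).

    Assumed convention for illustration: yaw about the Y-axis,
    pitch about the Z-axis, roll about the X-axis,
    applied as R = Ry @ Rz @ Rx.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)

    Ry = np.array([[ cy, 0.0,  sy],          # rotation about Y (yaw)
                   [0.0, 1.0, 0.0],
                   [-sy, 0.0,  cy]])
    Rz = np.array([[ cp, -sp, 0.0],          # rotation about Z (pitch)
                   [ sp,  cp, 0.0],
                   [0.0, 0.0, 1.0]])
    Rx = np.array([[1.0, 0.0, 0.0],          # rotation about X (roll)
                   [0.0,  cr, -sr],
                   [0.0,  sr,  cr]])
    return Ry @ Rz @ Rx
```

Given such a matrix, a point in the object's coordinate system can be mapped into the camera's coordinate system by a single matrix-vector product, which is why one rotation matrix per detection suffices to relate the two systems.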

Our pose estimation neural network infers an object's (vehicle's) dimensions and orientation in 3-dimensional (vector) space from an image. As a foundation, the pose estimation network uses VGG16 as a shared feature network, whose goal is to extract properties from the analyzed images and represent them in vector space. Branches derived from the VGG16 backbone are composed of fully connected layers that predict the outputs of the pose estimation network. For the detection and classification of vehicles, we use the YOLOv4 neural network to find all objects of interest (vehicles) in the image and pass them to the pose estimation network. For semantic segmentation of the image, we use the DeepLabV3+ neural network, whose goal is to mark all entities of interest in the image (curbside, road signs, etc.).