3D POSE ESTIMATION IN THE CONTEXT OF GRIP POSITION FOR PHRI
Mälardalen University
School of Innovation, Design and Engineering
Västerås, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics, 30.0 credits

3D POSE ESTIMATION IN THE CONTEXT OF GRIP POSITION FOR PHRI

Jacob Norman
jnn13008@student.mdh.se

Examiner: Martin Ekström, Mälardalen University, Västerås, Sweden
Supervisor: Fredrik Ekstrand, Mälardalen University, Västerås, Sweden
Supervisor: Joaquín Ballesteros, University of Málaga, Málaga, Spain
Supervisor: Jesus Manuel Gómez de Gabriel, University of Málaga, Málaga, Spain

June 27, 2021
Abstract

For human-robot interaction with the intent to grip a human arm, it is necessary that the ideal gripping location can be identified. In this work, the gripping location is situated on the arm, so it can be extracted from the positions of the wrist and elbow joints. To achieve this, human pose estimation is proposed, as robust methods exist that work both inside and outside of lab environments. One such example is OpenPose, which, thanks to the COCO and MPII datasets, has recorded impressive results in a variety of scenarios in real time. However, most of the images in these datasets are taken with a camera mounted at chest height and show people who are, for the most part, oriented upright. This presents the potential problem that prone humans, which are the primary focus of this project, cannot be detected, especially if seen from an angle that makes the human appear upside down in the camera frame. To remedy this, two different approaches were tested, both aimed at creating a rotation-invariant 2D pose estimation method. The first rotates the COCO training data in an attempt to create a model that can find humans regardless of their orientation in the image. The second adds a RotationNet as a preprocessing step that orients the images correctly so that OpenPose can be used to estimate the 2D pose, before rotating the resulting skeletons back.
Table of Contents

1 Introduction
2 Background
  2.1 Computer Vision
  2.2 Stereo vision
  2.3 CNN
  2.4 Human pose estimation
3 Related work
  3.1 Pose estimation
    3.1.1 Single view RGB image
    3.1.2 Multi view RGB image
    3.1.3 Data sets and evaluation
4 Problem formulation
  4.1 Limitations
  4.2 Constraints
  4.3 Hypothesis
  4.4 Research questions
5 Methodology
6 Method
  6.1 Evaluation of State of the Art
    6.1.1 Modified CMU Panoptic
    6.1.2 Choice of state-of-the-art method
  6.2 Adaption of State-of-the-art
    6.2.1 Training OpenPose
    6.2.2 RotationNet
    6.2.3 CMU panoptic trainable
  6.3 Ethical considerations
7 Implementation
  7.1 Modified CMU panoptic
  7.2 Training OpenPose
  7.3 Trainable CMU
  7.4 RotationNet
    7.4.1 Architecture
    7.4.2 Training
    7.4.3 Evaluation
8 Results
  8.1 Evaluation of applicability of the state of the art
  8.2 State of the art training
  8.3 RotationNet training
  8.4 RotationNet subsystem
9 Discussion
  9.1 Evaluation of applicability of the state of the art
  9.2 Training of the state of the art
  9.3 RotationNet training
  9.4 RotationNet subsystem
10 Goals and research questions
11 Conclusion and Future Work
12 Acknowledgements

List of Figures

1. Different perspectives created by approaching a prone human from different angles.
2. Different perspectives created by approaching a prone human from different heights.
3. Different perspectives created by viewing a prone human from three different distances.
4. The search and rescue/assistive robot Valkyrie at UMA, consisting of a robot manipulator with six degrees of freedom mounted on a mobile platform. A gripper is mounted on the end effector with a camera just above.
5. Gantt chart depicting the initial timeline for the project week by week.
6. Flowchart of the complete system, read left to right, where the two-dimensional (2D) pose estimation block represents the models presented in section 6.2.
7. One of the perspectives of the modified CMU Panoptic dataset, where the first row from left to right shows the original image and the obscured image. On the second row from left to right are the images rotated 90, 180 and 270 degrees respectively. All images have the same number of pixels; the 90 and 270 degree images have been cropped for this figure to reduce its size.
8. Flowchart of the 2D pose estimation using RotationNet and OpenPose, corresponding to the 2D pose estimation block in the flowchart of the whole system in figure 6.
9. Four different frames from the CMU trainable dataset, with the vectors showing the offset rotation plotted to the left and the correctly oriented image (the desired input to OpenPose) to the right. In the left images the blue vector is the line between the pelvis and neck, while the orange line is the vertical vector starting at the pelvis.
10. Architecture of the RotationNet with the MobileNetV2 model summarized into one block. The global average pooling layer and dense (fully connected) layer following the MobileNet interpret the features extracted from the MobileNet to classify the ImageNet dataset; the remaining layers are implemented to adapt the structure to RotationNet.
11. The first six elements in the shuffled and preprocessed dataset used for training. On top of each image is the ground truth rotation of the image, with positive values being counterclockwise and negative values clockwise.
12. Results from OpenPose on the modified CMU Panoptic dataset, which has been cropped or rotated counterclockwise, expressed as MPJPE between all different camera pairs in cm.
13. Results from OpenPose on the modified CMU Panoptic dataset, which has been cropped or rotated counterclockwise, expressed as the percentage of times the wrist or elbow was not found, sorted by camera.
14. Training graphs of OpenPose and MobileNet thin respectively, using rotated COCO images, where the x-axis represents every 500th batch and the y-axis represents the loss value.
15. Histogram showing the error distribution of the RotationNet, where the y-axis represents the frequency expressed in percent and the x-axis the error.
16. Training data of RotationNet with the mean squared error and mean absolute error of the validation data expressed on the y-axis and the epoch expressed on the x-axis.
17. MPJPE of OpenPose and RotationNet respectively, tested on CMU Panoptic trainable testing data, where the y-axis represents MPJPE and the x-axis represents the rotation of the image.
18. MPJPE of OpenPose and RotationNet respectively, tested on CMU Panoptic trainable testing data, where the y-axis represents MPJPE and the x-axis represents the viewpoint expressed in panel index.
19. Frequency of misses of OpenPose and RotationNet respectively, tested on CMU Panoptic trainable testing data, where the y-axis represents the percentage of misses and the x-axis represents the rotation of the image.
20. Frequency of misses of OpenPose and RotationNet respectively, tested on CMU Panoptic trainable testing data, where the y-axis represents the percentage of misses and the x-axis represents the viewpoint expressed in panel index.
21. This image shows the self occlusion present in the CMU Panoptic modified dataset from HD cameras 7, 9, 11 and 15.

List of Tables

1. Definition of done
Acronyms

ROS Robot Operating System
2D two-dimensional
3D three-dimensional
UMA University of Malaga
MDH Mälardalen University
CNN Convolutional Neural Network
DCNN Deep Convolutional Neural Network
MPJPE Mean Per Joint Position Error
HRI Human-Robot Interaction
pHRI Physical Human-Robot Interaction
ReLU Rectified Linear Unit
COCO Common Objects in Context
Jacob Norman Rotation-invariant human pose estimation 1 Introduction Human-Robot Interaction (HRI) is a field of study which focuses on developing robots that are able to interact with humans in various everyday occurrences. The core of the field is physically embodied social robots which result in design challenges around the complex social structures that are inherent to human interaction. For example, robots with a human-inspired design encourage people to interact similarly to how they would in human-human interaction. This can be used to plan the interaction, However, if the human’s expectations of the interaction are not achieved it can result in frustration. For social robots to interact with humans, sensors that interpret the surroundings are necessary, the most common of which are the primary senses used in human- human interaction; vision, audition, and touch [1]. Some application areas for HRI are education, mental and physical health, and, applications in industry, domestic chores, and search and rescue [2]. In areas such as search and rescue, physical human robot contact is necessary and these interactions require the utmost care to not harm the patient even more. Therefore safety should remain the top priority and the system should behave in a predictable manner since there is generally no way to anticipate how the human would react [3, 4]. At the department of “Ingenierı́a de Sistemas y Automatica” 1 at University of Malaga (UMA), Spain (where this thesis was partly conducted) a search and rescue/assistive robot named Valkyrie is being developed. It is a robot manipulator mounted on a mobile platform that allows the robot to move to the location of a human in need and initiate contact by grasping their wrist. From this position, it will be possible to monitor the vital signs of the human, as well as, helping the human up and eventually leading the human to safety in the event of a disaster. This method of HRI is preferable because conveying information to panic-stricken humans is challenging, this method also allows Valkyrie to be used in elder care where it can assist people to stand up after falling in their homes or other environments where there are no other humans around that can help. The aim of this thesis is to investigate the feasibility to detect and reconstruct three-dimensional (3D) poses of humans laying on the ground with a monocular camera so that it could be used in the future to grasp a human arm with a robot manipulator. To achieve this three factors will be considered, firstly the angle from which Valkyrie is approaching the human. This is necessary to investigate because depending on the angle of approach it can appear as if the human is rotated relative to the camera. This is a scenario not found in all applications of 3D pose estimation and could therefore present an area that has to see improvement to realize Valkyrie. Figure 1 shows the different perspectives that this issue creates. Secondly, the elevation of the approach is Figure 1: Different perspectives created by approaching a prone human from different angles. interesting because if a prone human is approached on a flat surface the elevation of the cameras will be somewhere around chest height. This would result in an angle relative to the prone human 1 http://www.uma.es/isa 1
Jacob Norman Rotation-invariant human pose estimation which deviates from the normal circumstance where both camera and subject are at the same elevation. In the scenario that the elevation is not high enough the soles of the feet could be the most prominent feature of the image as opposed to a picture taken from straight above which would never occur when approaching a prone human. Figure 2 shows the different perspectives that this issue creates. Lastly, the occlusion of the camera frame has to be investigated. When Figure 2: Different perspectives created by approaching a prone human from different heights. the mobile manipulator approaches a human arm the other features will be left out of the frame of the camera. Therefore it is important that the proposed method can still identify the arm when no other features are visible. Figure 3: Different perspectives created by viewing a prone human from three different distances. The motivation behind this project is to create a research platform on which HRI can be tested, the researchers at Departamento de Ingenieria de Sistemas y Automatica have a particular interest in HRI where the robot initiates the contact as this is a largely underrepresented field compared to human initiated contact. Furthermore, there are several projects focused on search and rescue robots at UMA, however, none of them are small enough that the they can be used to research human assistive robots. The need for search and rescue robots in general stem from two main sources, firstly, human-robot collaboration has showed to be effective in assembly lines when adopting robots[5]. Secondly adopting robots for search and rescue scenarios will reduce the risk rescue workers put themselves in when they enter hazardous areas to search for survivors. After finding a human Valkyrie can access the life conditions of the human and take appropriate response, such as leading them to safety or sending a signal that immediate medical assistance is necessary while monitoring the vital signs. This way rescue workers would only expose themselves to danger when necessary as opposed to constantly while searching for survivors. Valkyrie can also be used as a research platform for assistive robotics, many elderly live alone and it is not feasible to have help around at all times. With an assistive robot in the home, if one were to fall help would be able to arrive instantly whether the patient was able to signal for help or not. From that point the assistive robot would be able to help the human up or signal for help as well as assessing the condition of the human. The structure of this document following this introduction is first a presentation of the back- 2
ground information that is useful in order to understand the concepts and methods presented in this paper. After the background, the related work is presented, which consists of more up to date methods in selected fields that influenced the decision making in this project. Further on in the report the problem is broken down and the hypothesis, as well as the research questions, are stated. In the methodology section the way in which the problem will be approached is presented, followed by the methods used to solve the problem in the method section of the report. Details of the implementation and descriptive information on how each step was performed are presented next, which allows one to replicate this work, should the need arise. Further on in the report the results are presented, after which they are discussed in the results and discussion sections. The last section also addresses the hypothesis, research questions and future work.

Figure 4: The search and rescue/assistive robot Valkyrie at UMA, consisting of a robot manipulator with six degrees of freedom mounted on a mobile platform. A gripper is mounted on the end effector with a camera just above.
2 Background

This section introduces some of the methods and concepts that are necessary to fully comprehend this thesis. The first of these is computer vision and how an image on the computer screen relates to the objects captured by the camera. The second method of importance is stereo vision, which can capture the depth otherwise lacking in an image. Thirdly, Convolutional Neural Networks (CNNs) are explained, which have revolutionized image processing and are being incorporated with the previously mentioned techniques.

2.1 Computer Vision

In order to find a human and grasp its arm, the robot first needs to detect the human it intends to rescue; this will be done with a camera. The simplest model of a camera is the pinhole camera, which consists of a small hole and a camera house on whose back wall the image is projected upside down [6]. In a modern camera, the image is projected onto sensors that interpret the image into a digital format. This results in four different coordinate systems: the first two are for the object relative to the camera and relative to the base of the robot manipulator, the third is a 2D coordinate system for the image sensors, and lastly there is a coordinate system expressed in pixels for the digital image [7]. Representing an object in pixel coordinates requires a transformation from the camera coordinates. This is done with the intrinsic camera matrix, which describes the internal geometry of the camera. By taking pictures of a checkerboard with a known pattern it is possible to estimate this matrix as well as rectify the lens distortion; this process is called camera calibration. The translation and rotation between the world and camera coordinates are described by the extrinsic camera matrix, and the whole transformation from world coordinates to pixel coordinates is performed with the camera matrix [7]. This relation is expressed by equation 1, where the leftmost vector represents homogeneous pixel coordinates. The matrices from left to right are the affine matrix, the projection matrix, and finally the extrinsic matrix, which includes rotation and translation. The vector on the right side is the world coordinates in homogeneous coordinates.

\[
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R_{3\times 3} & T_{3\times 1} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}
\tag{1}
\]

In this transformation the depth data is lost; however, there are several ways to recreate it, one of which is stereo vision. By using two cameras placed next to each other, the same object can be seen in both image planes. If the translation and rotation between the cameras and the difference in pixel coordinates between the two images are known, it is possible to determine how far away the object is [6]. Another way of recreating the depth of the object is through perspective projection: if the distance from the principal point in the image is known, as well as the focal length together with the size of the object, the distance to the object can be determined. This relationship is expressed by equation 2, where x represents the horizontal distance from the principal point in pixels, X represents the horizontal distance from the principal axis to the object in camera coordinates, f represents the focal length, and Z the distance to the object [7]. The same is true for the Y coordinates.
\[
Z = f\,\frac{X}{x}
\tag{2}
\]

Initially it seems like both of these methods are inapplicable, since weak perspective projection requires the size of the object and stereo vision requires two different perspectives. However, by mounting the camera on the robot manipulator it is possible to move the manipulator to capture multiple perspectives with the camera, and the robot configuration can provide the position of the camera for each frame. Additionally, the right hand side of equation 2 can be seen as the focal length times the scaling of the object, which can be estimated by comparing the 2D and 3D pose [8].
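To make the two relations above concrete, the following is a minimal sketch (not part of the thesis code) of equation 1 and equation 2 in NumPy. The intrinsic values, the world point, and the object size are made-up example numbers, and the affine matrix here only carries the principal point.

```python
import numpy as np

# Example intrinsics (assumed values): focal length f and principal point (cx, cy) in pixels.
f, cx, cy = 800.0, 320.0, 240.0
A = np.array([[1, 0, cx],
              [0, 1, cy],
              [0, 0, 1.0]])                        # affine matrix of equation 1
P = np.array([[f, 0, 0, 0],
              [0, f, 0, 0],
              [0, 0, 1, 0.0]])                     # projection matrix
R, T = np.eye(3), np.zeros((3, 1))                 # extrinsics: camera at the world origin
E = np.vstack([np.hstack([R, T]), [0, 0, 0, 1]])   # extrinsic matrix [R T; 0 1]

Xw = np.array([0.2, 0.1, 2.0, 1.0])                # homogeneous world point (metres)
p = A @ P @ E @ Xw                                 # equation 1
u, v = p[:2] / p[2]                                # divide by the homogeneous coordinate
print(u, v)                                        # pixel coordinates of the projected point

# Equation 2: depth from a known object size.
X_metric = 0.2        # horizontal offset of the point in metres (known size)
x_pixels = u - cx     # horizontal offset from the principal point in pixels
Z = f * X_metric / x_pixels
print(Z)              # recovers the depth of 2.0 m used above
```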
Jacob Norman Rotation-invariant human pose estimation 2.2 Stereo vision The most researched area within stereo vision is stereo matching, it is the practice of finding the same pixel or feature between two or more perspectives. This is necessary to estimate the depth to that feature and may seem like an easy task since humans do it constantly, however, it is a challenge within computer vision [9]. In early works finding and matching features was done using Harris corner detector [6] which is still used and serves as the foundation for several approaches today. A downside to this algorithm, however, is that it responds poorly to changes in scale. Changes in scale are common in real-world applications when changing camera focus or zooming [6]. This problem was remedied by Lowe when SIFT was published [10], which is a feature descriptor that is invariant to translations, rotations, scaling and robust to moderate perspective transformations and illumination. This solution has proven useful for a variety of applications such as object recog- nition and image matching. Another solution to the correspondence problem is non-parametric local transforms [11], by transforming the images and then computing the correspondence problem it is possible to get an approach that is robust against factionalism which is when subsections of the image have their own distinct parameters. This is possible since the correspondence is no longer calculated on the data values but rather the relative order of the data values. For this approach to work, there has to be significant variation between the local transforms and the results in the corresponding area must be similar. Once the correspondence problem has been solved and the same point has been found in both images it is possible to calculate the depth by projecting two rays from the focus of the camera to the object in the image plane. This process is called triangulation and makes it possible to calculate the coordinates of a point in space if the translation, rotation, and camera calibration are known [6]. Initially, this problem seems trivial since finding the intersection between two vectors is straightforward, however, due to noise, the two vectors rarely intersect. Hartley [12] presented an optimal solution to this problem that is invariant to projective transformations and finds the best point of intersection non iteratively. Scharstein and Szeliski [13] present a taxonomy for stereo vision in an attempt to categorize the different components for dense stereo matching approaches, that is approaches that estimate the depth for all pixels in the image. Furthermore, Scharstein and Szeliski devised evaluation metrics for each of the components together with a framework to allow researchers to analyze their algorithms. This serves as a standardized test in the field of dense stereo matching with researchers submitting their algorithms to add them to Middlebury’s database of state of the art algorithms. This is presented on their website [14] and is still maintained. The state of the art in stereo vision today has progressed to the point where all the top methods on Middlebury utilize a Convolutional Neural Network (CNN). 2.3 CNN A CNN is a machine learning algorithm that is most commonly used to extracts information from images by applying a series of filters, the size of the filters varies and the values are tuned to extract information that is relevant to achieve the desired result. 
In the more complex CNN architectures, there can be several hundred layers of filters, pooling layers, and activation functions that extract and condense information [15]. These architectures are referred to as Deep Convolutional Neural Networks (DCNNs) because they have many layers, which increases their complexity. DCNNs can have several million variables, which makes training time-consuming. A downside to CNNs is that they are a black box in the sense that one cannot take the result from one layer and determine whether it is working or not. Within the field of computer vision, deep learning has revolutionized several topics, and DCNNs in particular have had an impact on the field [16]. One major downside to CNNs is that they are dependent on the right data for training. In order to implement a CNN for a specific purpose, a representative dataset is necessary. However, this also allows them to be versatile, and consequently CNNs are used for a variety of applications such as object detection, object classification, segmentation, and pose estimation, to name a few. A CNN is only as good as the dataset used to train it, and in 2009 Deng et al. [17] published a dataset titled ImageNet intended for object recognition, image classification, and automatic object clustering. This led to the creation of several DCNNs that, because of the diversity of ImageNet, are
versatile and thus have been used for a number of different applications. This is in part because of a method called transfer learning, which circumvents the long training time by using the already existing weights learned from ImageNet and fine-tuning them with a different dataset. This takes advantage of the features the model has already learned; instead of re-learning everything, the already known features can be adapted for the new purpose [18].

2.4 Human pose estimation

Human pose estimation is a field of study which attempts to extract the skeleton of a human in either 2D or 3D. This is achieved with a CNN that has fifteen to twenty-five outputs that represent the coordinates of different body parts. A wireframe is then constructed between adjacent parts and a skeletal frame is created. Human pose estimation has also been applied to hands, feet, and the face to include fingers, toes, eyes, ears, etc. in the wireframe. Mean Per Joint Position Error (MPJPE) is used to evaluate models and calculates the mean error over all joints. This requires a dataset with ground truth annotations, of which there are several; however, not all datasets use the same skeleton structure, so the MPJPE of different models is not always directly comparable. Furthermore, there is a lot of ambiguity in the torso and, as a result, getting an accurate estimate of the hips is harder than for the arms.
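As an illustration of the MPJPE metric described above, the following is a small sketch (not taken from the thesis code) that computes the mean per joint position error between a predicted and a ground truth skeleton stored as NumPy arrays; the joint values are hypothetical.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per Joint Position Error.

    pred, gt: arrays of shape (num_joints, dims), where dims is 2 for 2D
    poses (pixels) or 3 for 3D poses (e.g. cm). Returns the mean Euclidean
    distance between corresponding joints.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

# Hypothetical example: a 3-joint skeleton (shoulder, elbow, wrist) in cm.
gt = np.array([[0.0, 0.0, 0.0], [25.0, 0.0, 0.0], [50.0, 0.0, 0.0]])
pred = np.array([[1.0, 0.0, 0.0], [25.0, 2.0, 0.0], [48.0, 0.0, 1.0]])
print(mpjpe(pred, gt))  # average joint error in cm
```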
3 Related work

This section addresses the related work in HRI. First, 3D pose estimation will be investigated to find a human and locate an appropriate position to grip, then ways of controlling the robot manipulator and trajectory generation will be discussed. Lastly, evaluation methods will be investigated together with metrics to compare the systems to related work.

3.1 Pose estimation

Unlike the majority of the research in this area [19], this project will mount a monocular camera on a robot manipulator. This allows the possibility of moving the camera to get images from different perspectives or a sequence of images. With this additional information, methods such as stereo vision can be considered, assuming the object is static. In this section, the methods used in related work will be sorted by input to the model, after which different data sets and evaluation methods will be discussed.

3.1.1 Single view RGB image

Human pose estimation from a single RGB image is a topic that has seen great progress thanks to deep learning [19, 20]. Challenges in this topic include estimating the poses of multiple people in the same image, inferring the locations of occluded limbs, and training a robust model that works outside of lab environments [20, 21]. Since machine learning is the primary method for this problem, it is important that a good dataset is used for training. Currently, there exists no 3D pose dataset with ground truth data gathered outside lab environments [22, 8, 23, 24]. As a result, most research has divided the problem into two separate tasks: 2D pose estimation and 2D to 3D pose conversion. This allows the 2D pose estimator to train on a diverse dataset with ground truth, while the 2D to 3D pose converter can be trained to infer the depth of the joints from a motion capture dataset. While this does not solve the problem, it allows the use of several methods that alleviate it [22, 8, 23, 24].

2D pose estimation. 2D pose estimation is a problem that is largely considered solved; however, when it is used in the process of 3D pose estimation it has a smaller margin of error. This is the case because minor errors in the locations of the 2D body joints can have large consequences in the 3D reconstruction [19], since errors in the 2D pose estimation are amplified when moving to 3D. A 2D pose estimation method that has seen widespread use in 3D pose estimation from a single RGB image is the Stacked Hourglass Network [22] proposed by Newell et al. [25]. By pooling and then up-sampling the image at many different resolutions it is possible to capture features at every scale. This allows the network to understand local features such as faces or hands while simultaneously being able to interpret them together with the rest of the image and identify the pose. Another 2D pose estimation algorithm, developed by Cao et al. [26], has been used before estimating the 3D pose from images of multiple views [27, 28]. OpenPose uses a CNN to predict confidence maps for body parts and affinity fields that encode the association between them; a greedy parser is then used to obtain the resulting 2D pose estimation. This method has been proven to work in real time and can provide the location of fingers as well as facial features.
2D to 3D pose conversion 3D pose estimation from a single image is an ill posed problem because of the 2D nature of the source data, this imposes challenges where the 3D pose estimator has to resolve depth ambiguity while also trying to infer the position from occluded limbs [19]. Among related work, a common approach to this problem is regression [8, 24, 22]. By using a CNN to calculate a heat map of the human, it is possible to create a bounding box around the human which normalises the subjects size and position. Thus freeing the regressor from having to localise the person and estimating the scale. The downside to this approach is that global position information is lost [8]. The evaluation methods for this problem also use a pelvis centred coordinate system in which there is no transformation between the subject and the camera. Dabral et al. [21] realised this was a problem for applications such as action recognition and proposed a weak-perspective projection assumption. This assumes that all points on a 3D object are at 7
Jacob Norman Rotation-invariant human pose estimation roughly the same depth(the depth of the root join) and requires a scaling factor for the object which is estimated by comparing the 2D and 3D poses. A limitation on this method is that it does not work when the human is aligned with the optical axis. Furthermore, it is not intended to be highly accurate, but rather a system to make spatial ordering apparent. Similarly to this approach, Mehta et al. [8] proposed weak perspective projection, however, their approach do not require iterative optimisation which makes it less time-consuming. By using a generalised form of Procrustes analysis to align the projection of the 2D and 3D pose, the translation relative the camera can be described by a linear least squares equation. To improve the robustness on 3D pose estimation outside of lab environments Yasin et al. [23] proposed two separate sources of training data for 2D and 3D pose estimation. The 3D data was gathered from a motion capture database and projected as normalised 2D poses. A 2D pose estimator would then estimate the pictorial structure model and retrieve the nearest normalised 3D pose which would be an estimate of the final pose. By projecting the 3D pose back into 2D the final 3D pose is found by minimising the projection error. Similarly to this Wang et al. [24] also suggested that the final 3D pose should be projected back into 2D and used to improve the result. The major distinction between the two methods is that Yasin et al. utilises a K-nearest neighbour to improve the estimate while Wang et al. feeds the projection error into a CNN. Another approach to solving the lack of a sufficient dataset is Adversarial learning [22], which employs the use of two networks: a generator that creates training samples and a discriminator that tries to distinguish them from real samples. The objective of the generator is to create 3D poses good enough to fool the discriminator into thinking the samples are real. Its architecture is based on the popular stacked hourglass with input both from 3D and 2D annotated data. The 2D to 3D converter in a generative adversarial network can only become as good as the discriminator, therefore a lot of emphases is placed on the discriminator which is based on a multi- source architecture that combines CNNs with input from the image used to generate the data, geometric descriptors and a 2D heat map as well as a depth map. 3.1.2 Multi view RGB image According to Sarafianos et al. resolving depth ambiguities in 3D pose estimation would be a much simpler task if depth information could be obtained from a sensor [19]. Additionally, Amin et al. [29] argues that the search complexity can be reduced significantly by treating this problem as a joint inference problem between two 2D poses as opposed to a single 3D pose. With two different viewpoints available, stereo matching can be used to calculate the depth which unlike methods used for single view depth inference does not rely on estimations. Therefore, the methods presented in this section will be more robust than the ones presented for single view. 2D to 3D pose conversion Both Garcia et al. [28] and Schwartz et al. [27] utilised OpenPose explained in section 3.1.1 for 2D pose estimation. Garcia et al. used the joint locations from open pose to rectify the image and then as features for triangulation. Schwartz et al. also used the joint locations to remove joints which were only visible from one camera. 
A heat map generated from OpenPose was then randomly sampled from and back projected the pixel coordinate as a ray from which a 3D joint hypothesis was constructed from the point closest to all the rays. The 2D pose confidence was then calculated by projecting the 3D position to the 2D heat maps. On top of this Belief propagation was used for posterior estimation and temporal smoothness was used to reduce the jitter between frames. Hofman et al. [30] suggested a reversed approach in which the 2D pose is used to find a set of similar poses in a 3D pose library, the 3D pose is then evaluated by projecting it to the other cameras and comparing it with the 2D poses for each camera. If the error is too large the 2D poses are discarded, otherwise the triangulation and projection error is minimised by trying with more 3D poses and calculating the error. The best ranked results are then optimised with gradient descent. Direct 3D pose estimation One of the problems with 2D to 3D pose reconstruction is that 3D information has to be inferred before the depth can be calculated [27, 31]. Gai et al. [31] proposes a solution to this by first finding the relation between the different views and then estimating the 8
pose. This is done with a ResNet that takes each image from the different views as input and then merges the information in the pooling layer. Regression is then used to estimate the pose and shape of the human, after which an adversarial network is trained to estimate the mesh, whose error is propagated through the entire pipeline. This solution runs in real time and is comparable to similar implementations done on single view RGB images, with the distinction that it calculates the global coordinates of the 3D pose. Gai et al. also discovered that the joint error decreased when the number of viewpoints increased.

3.1.3 Data sets and evaluation

Pose estimation is often implemented with an AI approach that requires large datasets. 3D pose estimation from a single image suffers heavily from the lack of data sets that are both diverse and contain ground truth for joint depth. This is detrimental for machine learning approaches; however, approaches to mitigate this issue exist [22, 23, 24]. 2D pose estimation does not suffer from this issue, since the 2D pose estimators are trained on 2D pose estimation data which is extensive, diverse, and has ground truth. The depth is then calculated through geometry and therefore does not suffer from bad training data. Another downside to the lack of a good data set is that every researcher has to decide which data set to use; this results in several methods that are not directly comparable to each other. This can be a problem when every paper claims to improve on the state of the art by only comparing to the papers that used the same data set. One data set that is recurrent among articles on 3D pose reconstruction is Human3.6M by Ionescu et al. [32], which has a standard evaluation protocol. The metric used to determine the quality of the match is MPJPE, which represents the error between the estimate and the ground truth after Procrustes alignment. Another dataset of interest is CMU Panoptic, which captures humans in a lab environment using a dome mounted with cameras. The dataset includes a total of 31 HD cameras, 480 VGA cameras, and 10 RGB-Depth sensors. Full 3D poses are captured of humans socializing, dancing, playing musical instruments, or showing off a range of motion [33].
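As a sketch of the Procrustes-aligned MPJPE mentioned above (often called PA-MPJPE), the snippet below aligns a predicted 3D skeleton to the ground truth with a similarity transform before measuring the joint error. It is an illustrative implementation under common conventions, not code from the thesis or from the Human3.6M protocol itself.

```python
import numpy as np

def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Align pred (N x 3) to gt (N x 3) with scale, rotation and translation."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the covariance matrix.
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (P ** 2).sum()
    return scale * P @ R.T + mu_g

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE computed after Procrustes alignment of the prediction."""
    aligned = procrustes_align(pred, gt)
    return float(np.mean(np.linalg.norm(aligned - gt, axis=1)))
```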
4 Problem formulation

There are several challenges associated with human pose estimation, the first of which is related to 3D annotated datasets. Since 3D annotated datasets are both expensive and difficult to create outside lab environments, it can be difficult to adapt the model to the desired scenario. In order to build a model that can accurately find the wrist and elbow in the perspectives presented in figures 1, 2, and 3, it is necessary to have a dataset that represents these situations.

4.1 Limitations

To make this thesis manageable in the set time frame, several limitations and constraints have been put on the work. Originally, this endeavour started in Spain, where access to the physical robot was possible; therefore, testing the solution directly on the robot was a big focus. Due to the impact of Covid-19 the project was moved to Sweden and access to the robot was no longer possible. Consequently, the project took on a more theoretical approach. In practice, this meant a lot more focus was put into identifying a good gripping location and the robustness of the solution, as opposed to the interaction between the camera and robot manipulator.

L1 There will be one stationary human lying down in the camera frame.
L2 The exact gripping location will not be identified; instead, the wrist and elbow joints will be identified, as it is assumed the ideal gripping location is on a vector between the joints.
L3 Movement of the mobile manipulator will not be considered.

4.2 Constraints

C1 Images to test the solution will not be taken from a camera mounted on a robot manipulator.
C2 Navigation of the robot manipulator will be considered out of scope for this thesis.
C3 Only solutions which use a monocular camera will be considered.
C4 The images used for this thesis are collected with a monocular camera; therefore, the approach has to take this into consideration.

4.3 Hypothesis

A monocular camera mounted on a robot manipulator provides sufficient information to detect a prone human and identify a suitable gripping location.

4.4 Research questions

RQ1 Can a gripping position be identified regardless of the direction from which Valkyrie approaches a prone human?
RQ2 Is a data set designed to represent a prone human from a multitude of angles necessary to achieve an acceptable estimation of the arm?
Jacob Norman Rotation-invariant human pose estimation 5 Methodology This project will follow Agile guidelines [34] as they are prevalent within the industry with employers inquiring if recent graduates are familiar with it. There are several agile methodologies to choose from, however, since there is only one participant in this project a modified model of SCRUM and feature-driven development has been devised and is explained in the section below. The project starts with a research phase, the goal of which is to develop a better understanding of the problem and finding state-of-the-art solutions, this phase will end with a review of the information after which a solution will be decided upon. The next stage then starts which is implementation-specific planning which consists of creating a backlog of features where every feature has a development and design plan as well as a priority list in which order the features should be completed. The features will also be divided into several different stages that represent core functionality and then future expansions. The next step an iterative design process begins, where, similarly to SCRUM a feature(sprint) is selected and then implemented. The feature is considered completed after each item has been fulfilled in the definition of done (see table 1). When a feature is complete the next feature in the list is selected, however, if the feature fails to meet all the criteria in the definition of done that feature is skipped and instead placed back into step one or two depending on the issue. Similar to SCRUM, this implementation phase will consist of the number of days decided upon during the implementation planning. After the implementation is complete the system will be evaluated as a whole and once that is done finalization of the report and presentation are the last steps in this project. The Gantt chart can be seen in figure 5 Definition of done Feature Functional test passed O Feature evaluated and results recorded O Acceptance criteria met O Feature documented in project report O Table 1: All 4 criteria required to fulfil the definition of done Figure 5: Gantt chart depicting the initial timeline for the project week by week. 11
6 Method

The proposed method to find the ideal gripping location on a prone human consists of first taking a picture, then moving the robot manipulator on which the camera is mounted, and taking another picture. Each of the pictures is then used for 2D pose estimation, and the results are later triangulated. The flowchart of this system can be seen in figure 6.

Figure 6: Flowchart of the complete system, read left to right, where the 2D pose estimation block represents the models presented in section 6.2.

6.1 Evaluation of State of the Art

To simulate a prone human being approached from a multitude of angles, several different datasets were considered. A common theme among them was a focus on upright humans with a camera at chest height, most often in social or sports-related scenarios. This is most commonly done to represent humans for the purpose of action understanding, surveillance, HRI, motion capture, and CGI [35]. As a result, an already existing dataset could not be used for the purpose of this report; instead, a dataset had to be modified in an attempt to represent the scenario. This limits the number of available datasets, since not all datasets are under a license that allows modifications. One dataset that does is CMU Panoptic [36]. Furthermore, the dataset is constructed using a dome mounted with cameras to capture different views. This allows the simulation of approaching the human from different directions.

6.1.1 Modified CMU Panoptic

The CMU Panoptic dataset has been used extensively in research and consists of segments of social situations, range of movement, and dancing. CMU Panoptic is the largest 3D annotated dataset in terms of the number of camera views [37]; unfortunately, there are no segments where the focus is on a human in a prone position. To remedy this, the dataset will be modified by rotating images taken from a range of motion segment by 90, 180, and 270 degrees to represent the pose a prone human would have when approached from the head or sides; this can be seen more clearly in Figure 1. Furthermore, a zoomed-in view of the right arm will also be added to test if the arm is identifiable when the rest of the human is obscured. Figure 7 shows example images taken from the modified dataset. This dataset will be referred to in the report as "Modified CMU Panoptic".
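The following is a minimal sketch of how such a modified set of views could be produced with OpenCV; the file name and the crop box around the right arm are made-up placeholders, not values from the thesis.

```python
import cv2

def make_modified_views(image_path: str, arm_box: tuple) -> dict:
    """Create the rotated and cropped variants used to mimic a prone human.

    arm_box: (x, y, w, h) crop around the right arm, chosen per frame.
    """
    img = cv2.imread(image_path)                  # original 1920x1080 frame
    x, y, w, h = arm_box
    return {
        "original": img,
        "arm_crop": img[y:y + h, x:x + w],        # obscured view showing only the arm
        "rot90": cv2.rotate(img, cv2.ROTATE_90_COUNTERCLOCKWISE),
        "rot180": cv2.rotate(img, cv2.ROTATE_180),
        "rot270": cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE),
    }

# Hypothetical usage on one frame of the range-of-motion sequence.
views = make_modified_views("hd_frame_000101.jpg", arm_box=(900, 400, 300, 300))
for name, im in views.items():
    cv2.imwrite(f"modified_{name}.png", im)
```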
Jacob Norman Rotation-invariant human pose estimation Figure 7: One of the perspective of the modified CMU panoptic dataset where the first row from left to right is the original image and the obscured image. On the second row from left to right are the images that are rotated 90, 180 and 270 degrees respectively. All images have the same amount of pixels, the 90 and 270 degrees have been cropped for this figure to reduce its size. The state-of-the-art method was evaluated based on MPJPE of the wrist and elbow joint. This is the same metric that is used by related work except for the purpose of this report only the wrist and elbow joint of one of the arms is considered. Since the cameras cover a 360 degree perspective around the human only one of the arms is considered since using the closest arm would result in mirrored results from cameras facing each other. The results are segmented by rotation, crop, and placed in a grid to display triangulation between individual cameras. A table graph showing how often all the required joints were not detected is also presented in section 8. 6.1.2 Choice of state-of-the-art method Among the state-of-the-art, there are several interesting methods, some of which are already implemented and are free to use for research purposes and some which are not implemented. When choosing which method to evaluate several factors were considered. Firstly, how difficult would the model be to implement on Valkyrie including training and eventual porting to Robot Operating System (ROS)? Secondly, is this method proven to work in literature? Is the method available with the author’s implementation or does it require implementation and training from scratch? The chosen state-of-the-art method to evaluate is Openpose because it already has a ROS implementation that can integrate with Valkyrie, free to use Caffe model, as well as, several TensorFlow ports which make transfer learning easier. Furthermore, Openpose is well established in the literature and this choice also coincides with the wishes of UMA. 13
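For the pairwise triangulation between camera views mentioned above, a sketch along the following lines could be used, relying on OpenCV's cv2.triangulatePoints. The projection matrices P1 and P2 stand in for the calibrated HD cameras of the dataset and the pixel coordinates are placeholders, not actual calibration values or detections from the thesis.

```python
import cv2
import numpy as np

def triangulate_joint(P1: np.ndarray, P2: np.ndarray,
                      uv1: tuple, uv2: tuple) -> np.ndarray:
    """Triangulate one joint seen in two calibrated views.

    P1, P2: 3x4 camera projection matrices (intrinsics times extrinsics).
    uv1, uv2: pixel coordinates (u, v) of the same joint in each view.
    Returns the 3D point in world coordinates.
    """
    pts1 = np.asarray(uv1, dtype=float).reshape(2, 1)
    pts2 = np.asarray(uv2, dtype=float).reshape(2, 1)
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # 4x1 homogeneous point
    return (X_h[:3] / X_h[3]).ravel()

# Hypothetical usage: the wrist detected by the 2D pose estimator in two views.
# wrist_3d = triangulate_joint(P_cam3, P_cam7, (512.0, 301.0), (488.0, 276.0))
# The MPJPE in cm is then computed against the dataset's 3D ground truth.
```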
Jacob Norman Rotation-invariant human pose estimation 6.2 Adaption of State-of-the-art In an attempt to create a rotation-invariant model two different approaches were tested. The first of which attempts to make Openpose rotation invariant by rotating the input training data that is used to train the model end to end. The intention behind this is if differently orientated humans are present in the training data, hopefully, the CNN can adapt to be able to identify human limbs in all scenarios. The second approach adds a DCNN as a preprocessing step that extracts the orientation of the human which is used to rotate the image before rotating the skeleton back after 2d pose estimation. This has been done previously by Kong et al. [38] to make a hand pose estimation model rotation-invariant. 6.2.1 Training OpenPose The creators of Openpose has provided the training code to train Openpose end to end, on the original model this was done using the Common Objects in Context (COCO) dataset[39]. In an effort to achieve a rotation-invariant 2D pose estimation model the input COCO dataset was rotated randomly in the preprocessing step and used to train both openpose and a MobileNet 2D pose estimation solution. 6.2.2 RotationNet As an alternative to rotation invariant 2D pose estimation, a preprocessing step that attempts to extract the angle of the upper body was created. By taking the MobileNetV2 architecture and adding two fully connected layers separated by an activation layer with the Rectified Linear Unit (ReLU) function, it is possible to create a system that takes an input image and treats it as a regression problem. The desired output of this architecture is the angle at which the image has to be rotated to get the human aligned with the vertical axis of the image. The 2D skeleton can then be extracted using 2D pose estimation after which the skeleton will be rotated back the same amount so that it aligns with the original image. The flowchart of this system is presented in figure 8. Figure 8: Flowchart of the 2D pose estimation using RotationNet and OpenPose, in the flowchart of the whole system in figure 6, 6.2.3 CMU panoptic trainable The ”CMU panoptic trainable” dataset was created using 120 different views of six people doing the same movements. The movements were selected from a sequence where first a series of arm motions are enacted after which the whole upper body 14
Figure 9: Four different frames from the CMU trainable dataset, with the vectors showing the offset rotation plotted to the left and the correctly oriented image (the desired input to OpenPose) to the right. In the left images the blue vector is the line between the pelvis and neck, while the orange line is the vertical vector starting at the pelvis.

is moved. There is only one human in the frame at a time, and each picture has a corresponding 3D skeleton and offset rotation. The offset rotation is extracted with the formula presented in equation 3, where n1 represents a unit vector originating in the pelvis oriented towards the neck and n2 represents a unit vector originating in the pelvis oriented vertically with respect to the image.

\[
\theta = \arccos(n_1 \cdot n_2)
\tag{3}
\]

These vectors are further demonstrated in figure 9, where the left image shows the offset and the right image shows the corrected image that RotationNet is supposed to feed to OpenPose.

6.3 Ethical considerations

According to the license of CMU Panoptic [40], the modified datasets used in this thesis are not allowed to be distributed. No other additional sources of data were collected during this project; therefore, there are no ethical considerations regarding data management in this thesis.
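As an illustration of how the offset rotation of equation 3 could be computed from the annotated pelvis and neck keypoints, the sketch below also uses the 2D cross product to give the angle a sign, in line with the cross product mentioned in section 7.3. The keypoint values and the sign convention are assumptions for the example, not values from the thesis.

```python
import numpy as np

def offset_rotation(pelvis: np.ndarray, neck: np.ndarray) -> float:
    """Signed angle (degrees) between the pelvis-to-neck vector and the
    vertical image axis, i.e. how much the image must be rotated so that
    the upper body is upright."""
    v = neck - pelvis
    n1 = v / np.linalg.norm(v)              # unit vector pelvis -> neck
    n2 = np.array([0.0, -1.0])              # "up" in image coordinates (y grows downwards)
    angle = np.degrees(np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0)))  # equation 3
    sign = np.sign(n1[0] * n2[1] - n1[1] * n2[0])  # 2D cross product gives the direction
    return sign * angle

# Placeholder pixel coordinates of the pelvis and neck in one frame.
print(offset_rotation(np.array([320.0, 400.0]), np.array([360.0, 250.0])))
```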
7 Implementation

This section explains how each of the methods was realized and provides more in-depth descriptions.

7.1 Modified CMU panoptic

The modified dataset was created using 25 HD views from the CMU Panoptic pose1 sample [41], which depicts one human moving his arms for 101 frames. These images were then modified so that for each frame there exists one normal image, one cropped around the right arm, and three rotated by 90, 180, and 270 degrees. The images that were rotated were not cropped; instead, they have the resolution 1080x1920, as opposed to 1920x1080.

7.2 Training OpenPose

In the preprocessing step used for the end-to-end training of OpenPose, several image augmentations are made. These include random scaling, rotation, flipping, and cropping. In addition to creating more robust models that can handle non-perfect images, this also reduces the risk of overfitting by artificially increasing the dataset. With random variables in the preprocessing, the dataset can be used for several epochs without seeing the same image twice. This was taken advantage of when training OpenPose to be rotation-invariant, because all that was necessary to feed rotated training images was to increase the maximum and minimum allowed rotation in the preprocessing function. The hardware used for this project was a Jetson Xavier AGX; unfortunately, the machine learning framework Caffe, which OpenPose is built on, does not support cuDNN 8.0 (NVIDIA CUDA Deep Neural Network library) [42]. As a substitute, a TensorFlow port which recreated all the original preprocessing [43] was used and was allowed to train for 10 days on the Jetson Xavier platform. After the OpenPose training was interrupted, an ImageNet model implemented in the same git repository was trained for 7 days.

7.3 Trainable CMU

The CMU trainable dataset is also created from the CMU Panoptic range of motion pose 1; however, CMU trainable consists of 1851 frames as opposed to the 101 in the modified dataset. These frames were hand-selected from six different subjects performing a series of range of motion movements, including moving the arms and upper body. The images are captured at a resolution of 640x480 from 120 different camera views. In total there are around 220000 images, split 60-20-20 for training, validation, and testing. Each of the images has a ground truth 3D skeleton, as well as an offset rotation which was calculated using the cross product of the vector between the pelvis and neck and a vertical vector. The testing dataset was then rotated randomly between -179.99 and 180.00 degrees and this was added to the offset rotation to create a ground truth rotation. The rotation cropped the images so the resulting resolution stayed the same after the rotation, and the empty spaces were filled with black pixels. Bilinear interpolation was used to avoid artifacts created by the rotation. A copy of the testing data was also
made and rescaled to 224x224 before it was rotated; this was done so that the RotationNet could be tested on data that was preprocessed the same way as the training and validation data. Both of the models described in section 7.2 were trained with a batch size of 16, because it was the highest possible value without running out of memory.

7.4 RotationNet

The RotationNet was implemented in TensorFlow 1.15 [44] using the Keras [45] implementation of MobileNetV2, which ends with 1000 outputs that represent the different classes in ImageNet. On top of this was a dense layer that reduced the number of outputs to 32, followed by an activation layer with the ReLU function, followed by another dense layer that reduces the total number of outputs to one. The architecture can be seen in figure 10.

7.4.1 Architecture

RotationNet is based on MobileNetV2 and only adds three layers to its architecture: a fully connected layer which reduces the number of variables from the 1000-class output of ImageNet models down to 32, an activation layer with a ReLU, and a second fully connected layer that brings the total number of outputs down to one. MobileNet is 157 layers deep and is considered a DCNN; when using the MobileNetV2 architecture with 224x224 input it has a total of 3.4 million variables, which puts it at the lower end of ImageNet models. The goal of using MobileNetV2 is to use transfer learning to adapt the features already learned from training on ImageNet and fine-tune them to find the correct orientation.

7.4.2 Training

The training images were first resized to 224x224, normalized, and then rotated randomly between -179.99 and 180 degrees, which was then added to the offset rotation native to the image. This was done in an effort to artificially increase the size of the dataset and prevent overfitting. The dataset was then randomly indexed into a buffer the size of the dataset in order to shuffle the images. Figure 11 shows a grid of six images after the preprocessing step. The input resolution of 224x224 was used since that is the native resolution on which MobileNetV2 was trained, so to take advantage of the pre-trained weights this input resolution was necessary. A batch size of 128 was used during the training, meaning inference on 128 images was run and the weights were updated to minimize the error on all of them. This value was used to add as many rotations as possible to each batch so that the model would converge, and it was the largest power of two possible with the hardware of this project. The loss function used was mean squared error; this is a common loss function for regression problems and was chosen because it punishes bad estimates with a higher loss value compared to mean absolute error. To prevent the model from unlearning all previous knowledge when the new weights are tuned from random initialization, the model is trained in stages: during the first pass, only the weights of the three last layers are updated, and after that, all weights are updated. Throughout the training the validation loss was monitored
Figure 10: Architecture of the RotationNet with the MobileNetV2 model summarized into one block. The global average pooling layer and dense (fully connected) layer following the MobileNet interpret the features extracted from the MobileNet to classify the ImageNet dataset; the remaining layers are implemented to adapt the structure to RotationNet.
Figure 11: The first six elements in the shuffled and preprocessed dataset used for training. On top of each image is the ground truth rotation of the image, with positive values being counterclockwise and negative values clockwise.

and when three epochs passed without an improvement, the training was interrupted and the weights from the best performing epoch were saved.

7.4.3 Evaluation

The RotationNet was individually evaluated based on the mean absolute error, standard deviation, and variance in order to get an understanding of how successful the training had been and what result one can expect from the RotationNet. The entire subsystem shown in figure 8 was then compared to OpenPose; both OpenPose and RotationNet were tested on the images of the CMU Panoptic trainable dataset which had been designated for testing and had not previously been seen by any of the algorithms. The results compared between the two models were the MPJPE of the image coordinates, resulting in MPJPE expressed in pixel values, and the frequency of misses, similarly to the evaluation of the state of the art using the modified CMU Panoptic. These results were then sorted by panel index, where each panel had 5-7 cameras, and by rotation of the input image.
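The following is a minimal Keras sketch of the RotationNet architecture and the staged training described in sections 7.4.1 and 7.4.2 (MobileNetV2 backbone with its 1000-class head, Dense(32), ReLU, Dense(1), mean squared error loss, and early stopping on the validation loss). It is an illustrative reconstruction, not the thesis code: the optimizer, learning rates, epoch counts, and the dataset objects train_ds and val_ds (assumed to yield pairs of 224x224 images and rotation angles) are assumptions.

```python
import tensorflow as tf

def build_rotation_net() -> tf.keras.Model:
    # MobileNetV2 backbone with its 1000-class ImageNet head kept, as in section 7.4.
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=True, weights="imagenet")
    x = tf.keras.layers.Dense(32)(backbone.output)   # 1000 -> 32
    x = tf.keras.layers.Activation("relu")(x)
    angle = tf.keras.layers.Dense(1)(x)              # single regressed rotation angle
    return tf.keras.Model(backbone.input, angle)

model = build_rotation_net()

# Stage 1: train only the three added layers so the pre-trained features are kept.
for layer in model.layers[:-3]:
    layer.trainable = False
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])

# Stage 2: unfreeze everything and fine-tune the whole network.
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse", metrics=["mae"])
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])
```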