Multi-Sensor Context-Aware Based Chatbot Model: An Application of Humanoid Companion Robot
Ping-Huan Kuo 1,2, Ssu-Ting Lin 3, Jun Hu 3 and Chiou-Jye Huang 4,*

1 Department of Mechanical Engineering, National Chung Cheng University, Chiayi 62102, Taiwan; phkuo@ccu.edu.tw
2 Advanced Institute of Manufacturing with High-Tech Innovations (AIM-HI), National Chung Cheng University, Chiayi 62102, Taiwan
3 Department of Intelligent Robotics, National Pingtung University, Pingtung 90004, Taiwan; cbc105010@gmail.com (S.-T.L.); wuorsut@gmail.com (J.H.)
4 Department of Data Science and Big Data Analytics, Providence University, Taichung 43301, Taiwan
* Correspondence: cjh1007@gm.pu.edu.tw; Tel.: +886-4-2632-8001 (ext. 15124)

Sensors 2021, 21, 5132. https://doi.org/10.3390/s21155132
Academic Editor: Eui Chul Lee. Received: 15 June 2021; Accepted: 27 July 2021; Published: 29 July 2021.

Abstract: In the field of natural language processing, previous studies have generally analyzed sound signals and provided related responses. However, in many conversation scenarios, image information is also vital; without it, misunderstandings may occur and lead to wrong responses. To address this problem, this study proposes a recurrent neural network (RNN) based multi-sensor context-aware chatbot technology. The proposed chatbot model incorporates image information with sound signals and gives appropriate responses to the user. To improve the performance of the proposed model, the long short-term memory (LSTM) structure is replaced by a gated recurrent unit (GRU). Moreover, a VGG16 model is chosen as the feature extractor for the image information. The experimental results demonstrate that integrating the sound and image information obtained by the sound sensor and image sensor of a companion robot is helpful to the chatbot model proposed in this study. The feasibility of the proposed technology was also confirmed in the experiments.

Keywords: multi-sensor fusion; natural language processing; context-aware computing; chatbot; companion robot

1. Introduction

Many studies have proposed robots that can chat with people, and opportunities to use chatbots will continue to increase. In addition to accompanying older adults living alone, chatbots are also applied to medical treatment. For example, individuals may be afraid of public spaces due to mental illness; under these circumstances, chatbots can be used to provide them with companionship and health care. Numerous studies of chatbots have been reported. Traditional chatbots only use the preceding and following sentences for training. However, such training results in low diversity, and accuracy is also low. This traditional approach also renders chatbots unable to answer questions flexibly. For example, when asked "What do you see?", traditional chatbots can only answer "I don't know" due to the lack of image information. The aforementioned problem is only one example; numerous others are similar.

Numerous studies regarding chatbots have been published. Ref. [1] proposed a hybrid chatbot and applied it to communication software. This technology combines a retrieval-based model and a QANet model and can answer user questions through an E-Learning system. In addition to supporting English, the chatbot can also converse in Chinese and has performed well in experiments. Ref. [2] also applied chatbots in digital learning. In that study, the role of the proposed chatbot is to resolve Visual, Auditory, Read, Write, and Kinesthetic (VARK) information throughout the process of
responding to the learner. Furthermore, the proposed chatbot also records the learner's beta brainwaves and uses machine learning algorithms to classify learning orientations. Ref. [3] explored many studies on chatbots used in Facebook Messenger, mainly discussing how teachers use chatbots to supplement their teaching. The article also stated that the use of chatbots to assist in teaching is not yet mature and, therefore, chatbots as a field of research offer considerable room for development. Ref. [4] proposed a neural network-based chatbot that integrates personal data for analysis and estimation and uses the useful data to return appropriate responses. The main architecture for this technology is LSTM [5], and experiment results have substantiated the feasibility of this technology. In Ref. [6], bidirectional recurrent neural network (BiRNN) technology was used as the basis for a chatbot, and a GRU [7] was used as the core of the technology. The system architecture was exceedingly large and required substantial computing resources, but it was verified favorably in experiments. Ref. [8] proposed a chatbot called Xatkit and stated that although it requires improvement, it can currently be applied in some real-world scenarios and could be applied in many venues in the future.

Chatbots have diverse applications. To address the massive demand for manpower in customer service, some companies have proposed automated customer service chatbots. However, automated customer service chatbots sometimes misunderstand sentences. Ref. [9] presented research into automated customer service chatbots and stated that although chatbot technology is not yet fully mature, demand for this technology is notably high. Moreover, cognitive behavioral therapy (CBT) is a common method of therapy for patients with panic disorder. Ref. [10] explored the feasibility of using chatbots on mobile devices for CBT. The study found that using chatbots for CBT was both feasible and effective, pioneering the application of chatbots in medicine.

Deep learning as a field of research has burgeoned in the early 21st century. Its technological developments are diverse and have been applied to many research topics [11–15]. The present paper features deep learning techniques to resolve questions that chatbots cannot answer due to environmental contexts. The first step toward this type of practical learning is to enable bots to understand the present environment through visual images and to describe the environment through text. The most direct method is to design bots that use image caption technologies to understand their environments. The original purpose of image captioning was, and is, to automatically generate annotations for an image. For example, if an image is entered as input, the image caption model can write a sentence about the image based on the image content. Therefore, in this article, the described method enables the bot to view its present environment clearly and to know about any object in its environment. Conventional chatbot models typically are only able to perform model training based on context; this type of model training is unable to consider image data. Even if the model can understand its present environment, it cannot incorporate the understood image data into dialogue responses. In this study, the proposed method can deepen the chatbot model's consideration of the results of its previous conversations and, together with the current dialogue, can train the chatbot.
A statement generated by image caption technology can also be an input for the chatbot and can be incorporated into the next dialogue. This method enables the chatbot to answer questions more accurately; the highly diverse responses vary based on the present environment. The four major contributions of this work are: (1) integrating image and sound data to achieve further precision and practicality in chatbot and human communications; (2) adjusting and optimizing a framework based on a commonly used LSTM model to train the image caption and chatbot models efficiently, thus improving scenario recognition abilities; (3) comparing the original LSTM-based model with the adjusted, GRU-based version and verifying that the training efficiency was improved by the adjustments; and (4) testing the trained models on a humanoid robot in a series of experiments to verify that the models proposed in this paper and their applications are effective and feasible.
The remaining sections of this paper are organized as follows. Section 2 details the operational concepts and process of this paper. Section 3 introduces the models and algorithmic framework being proposed. Section 4 presents and compares the experiment results. The discussions are addressed in Section 5. Section 6 is the conclusion.

2. System Architecture

In the past, chatting with bots was typically only possible through direct voice or text conversations using conventional chatbot models. A typical bot was unable to consider its surrounding environment through its vision when responding. The present research relies on chat training with images that the bot can see. To allow the chatbot models to integrate image data into their input sets, this paper proposes an improved model that integrates image captions, natural language processing, and similar technologies; this model enables the bot to chat and respond according to the environment it sees. The feasibility and practicality of the model were substantiated by the experiment results. In the future, this method can be applied to chatbots and home care, as well as to increase the precision of a typical chatbot's verbal expressions and to make typical chatbots more aware of reality when conversing with humans.

The overview of the proposed framework is illustrated in Figure 1. Prior to undergoing any training, the chatbot can be said to be like a baby, incapable of anything. But after going through correct learning, the chatbot slowly becomes more human-like and gains the ability to communicate with people. The training process resembles the process by which adults teach children to recognize objects, typically holding a picture and telling the child, "This is a dog".
Although the ability to express oneself in words may not be entirely present during the initial training, the rate of failure can be reduced through continuous training. Therefore, we can use this process to enable robots to develop cognitive abilities regarding their surroundings. This process is the same as teaching children to speak, that is, developing their ability to speak through continuous conversations and correcting their mistakes. Teaching robots is the same: although the bot may answer incorrectly in the beginning, through continuous learning it can learn to have normal conversations with people. The concept map of this paper is presented in Figure 2; the chatbot can perform analyses based on current image data. When the user asks questions, not only is the chatbot able to make considerations based on the environmental data it currently detects, the bot is also able to generate a corresponding response based on the conversation context. The benefit of this method is in the integration of multiple pieces of data, allowing the chatbot to make better responses focusing on relevant questions. In this manner, the bot's responses become more precise and more in line with reality.

Figure 1. An overview of the proposed framework (image sensor, sound sensor, and robot action).
Figure 2. The concept of the proposed framework.

This paper can be split into two major areas. The first is picture scenario analysis, and the second is natural language processing. The system framework is presented in Figure 3. When the entire system is operating, we consider two inputs: image data and voice inputs. For the image inputs, we first perform feature extraction through the VGG model. This reduces the dimensionality of the image data, avoiding training problems resulting from overly large input dimensions. Following the feature extraction, we then use the image caption model to generate a paragraph of text that matches the image data. For the voice inputs, we must first convert the sound signals into text messages through voice-to-text conversion technologies. Then the chatbot model must generate appropriate responses based on the text messages converted from the user's voice and the text information about the pictures.
In this context, the output data type is text. Therefore, for the user to be able to hear the chatbot's response, we use text-to-voice conversion technologies to convert the text message into sounds, either to send to the user or to enable the bot to perform a corresponding action. In this manner, the whole system framework not only considers signals from sound, it also incorporates considerations of images. Therefore, the proposed method improves chatbot affinity by enabling chatbots to generate responses that are more in line with reality and to interact better with humans.

Figure 3. The architecture of the system (image sensor → VGG model → image caption model; sound sensor → speech to text → chatbot model → response → text to speech → robot action).

To validate the method proposed in this paper, we applied the trained models with a humanoid robot, then verified the efficacy of the method through a series of experiments. Humanoid robots are the closest robot type to humans and are therefore suitable for chatting and companionship purposes; therefore, a humanoid robot was chosen for the experiments in this paper. The humanoid robot used in this paper is presented in Figure 4; the humanoid robot is 51 cm tall and weighs approximately 3.5 kg.
It has 20 degrees of freedom and has an inbuilt Logitech C920 HD Pro Webcam and microphone. The robot is powered by XM430 motors and equipped with an Intel Core i3 processor and an OpenCR controller. Its inertial measurement unit comprises a three-axis gyroscope, a three-axis acceleration sensor, and a three-axis magnetometer. The operating system is Ubuntu.

Figure 4. The architecture of the humanoid robot: (a) the front view; and (b) the side view (labeled parts: webcam, microphone).
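The end-to-end loop of Figure 3 can be summarized in a few lines of code. The sketch below is illustrative only: the speech_recognition and pyttsx3 packages are assumed stand-ins for the speech-to-text and text-to-speech stages, and caption() and respond() are hypothetical wrappers around the image caption and chatbot models introduced in Section 3.

```python
# A minimal sketch of the Figure 3 pipeline, under the assumptions stated above.
import cv2                       # image sensor access
import pyttsx3                   # text-to-speech stand-in
import speech_recognition as sr  # speech-to-text stand-in

recognizer, tts, camera = sr.Recognizer(), pyttsx3.init(), cv2.VideoCapture(0)

while True:
    with sr.Microphone() as mic:                # sound sensor
        heard = recognizer.recognize_google(recognizer.listen(mic))
    ok, frame = camera.read()                   # image sensor
    scene_text = caption(frame) if ok else ""   # image caption model (Section 3)
    reply = respond(heard, extra=scene_text)    # chatbot model (Section 3)
    tts.say(reply)                              # speak the generated response
    tts.runAndWait()
```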
3. The Proposed Model

This section introduces the framework of the proposed model in detail. Because most of the data processed in this work was time-series data, RNNs constituted an appropriate framework. The most well-known RNN framework is the commonly used LSTM. The literature currently features many LSTM variations, such as the GRU framework. The GRU framework is simpler than the LSTM framework, but in actual operation the GRU can achieve results that are equal to (if not better than) the results of LSTM. Also, because the GRU framework is simpler, under normal conditions GRU has better training efficiency than LSTM. To compare the two RNN models, we modified and enhanced the original LSTM RNN in the hope of achieving better convergence efficiency than the original neural network.

Many studies have performed in-depth explorations of chatbots, and this paper takes the framework from reference [16] as the basis for improvements. In the original version, an adversarial learning method for generative conversational agents (GCA) is applied for the chatbot model. Such a learning process originally belonged to the sequence-to-sequence training regimes. However, such an approach can also transfer to an end-to-end training process by applying the specific structure. The chatbot model can generate an appropriate sentence by considering the dialogue history. The main core of the GCA is LSTM, and the model also considers the current answer and the previous conversation history. The concept for this paper was also implemented with favorable results in the Keras framework [17], as presented in [18].
The Keras framework is a deep learning library that includes many application programming interfaces (APIs) for building the corresponding models, and it is one of the most widely used deep learning frameworks. In [18], Keras is used for the implementation of the chatbot model. Building on the previous work in [18], this study improves the model structure and achieves better performance than the original version. The detailed framework of the chatbot is exhibited in Figure 5; Figure 5a presents the original LSTM version [18], whereas Figure 5b displays the GRU version that was optimized in this study. In addition to the LSTM-to-GRU transformation, the differences between the two versions include the addition of two dense (fully connected) layers in the new version. The parameters are listed in Table 1. We also added dropout to prevent overfitting and used Adam as the optimizer and categorical cross entropy as the loss function. A more detailed comparison of the two versions is laid out in the next section.
Figure 5. Architecture of the chatbot: (a) LSTM version; and (b) enhanced GRU version.

Table 1. Parameter settings of the chatbot.

Parameter      Value
Dropout Rate   0.25
GRU Cell       300
Optimizer      Adam
Loss Function  Categorical Cross Entropy

The chatbot model proposed in this study was based mainly on the concept of a generative conversational agent (GCA) [16]. Records of a conversation with a chatbot can be stored using the $\bar{x}$ vectors and one-hot encoding. Accordingly, the following equations are defined:

$X = [\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_{S_s}]$ (1)

$Y = [\bar{y}_1, \bar{y}_2, \bar{y}_3, \ldots, \bar{y}_{S_s}]$ (2)

$E_c = W_e X$ (3)

$E_a = W_e Y$ (4)

where $W_e$ represents the embedding layer parameters, $E_c$ represents the corresponding conversation record, and $E_a$ represents the unfinished response. An RNN is then used to extract the embedding vectors of the conversation record and response, as illustrated in the following equations:

$e_c = \Gamma_c(E_c; W_c)$ (5)

$e_a = \Gamma_a(E_a; W_a)$ (6)

$e = [e_c \; e_a]$ (7)

$h = \sigma(W_1 e + b_1)$ (8)

$P = \varphi(W_2 h + b_2)$ (9)

where $W_1$ and $W_2$ are synaptic weights, $b_1$ and $b_2$ are biases, $\sigma$ is the Rectified Linear Unit (ReLU), and $\varphi$ is the Softmax activation function. Finally, the index with the largest probability value is identified from $P$ before the next iteration is started. Accordingly, the most suitable response can be output.
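To make the structure of Figure 5b concrete, the following Keras sketch rebuilds the enhanced GRU chatbot. It is a minimal reconstruction that assumes the tensor sizes legible in the figure (sequence length 50, embedding dimension 100, vocabulary of 7000 words) together with the Table 1 settings; the authors' exact code may differ.

```python
# A hedged Keras sketch of the enhanced GRU chatbot in Figure 5b.
from tensorflow.keras.layers import (Input, Embedding, GRU, Dense,
                                     Dropout, Concatenate)
from tensorflow.keras.models import Model

SEQ_LEN, EMB_DIM, VOCAB = 50, 100, 7000     # sizes read from Figure 5b

context_in = Input(shape=(SEQ_LEN,))        # previous sentence from the other party
answer_in = Input(shape=(SEQ_LEN,))         # the unfinished response so far

embed = Embedding(VOCAB, EMB_DIM)           # shared token embedding, Equations (3)-(4)
enc_c = GRU(300)(embed(context_in))         # 300 GRU cells (Table 1), Equation (5)
enc_a = GRU(300)(embed(answer_in))          # Equation (6)

# The two dense (fully connected) layers added in the enhanced version
dense_c = Dense(3500, activation="relu")(Dropout(0.25)(enc_c))
dense_a = Dense(3500, activation="relu")(Dropout(0.25)(enc_a))

merged = Concatenate()([dense_c, dense_a])              # Equation (7)
next_word = Dense(VOCAB, activation="softmax")(merged)  # Equation (9): P

chat_model = Model([context_in, answer_in], next_word)
chat_model.compile(optimizer="adam", loss="categorical_crossentropy")
```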
The operational method of the chatbot used in this paper is displayed in Figure 6. The figure displays the two input sources of the chatbot. The left-hand input source is the previous sentence spoken by the other party, and the right-hand input source is a collection of previous output results. Because of the unique advancements of the proposed training, the chatbot model can notably understand and analyze the contextual relationships of the sentences. Finally, we export the generated dialogue word-by-word in sequence, then assemble the words into a complete sentence. When the chatbot expresses the output results for the user to hear, the user may respond a second time based on the answer just received. Next, in addition to the second response from the user, the chatbot's own previous response is added to the chatbot's left-hand input. This additional input is exceedingly helpful to the chatbot's ability to understand the contextual meaning. Furthermore, this additional input can be substituted with the output text produced from the image data. This allows the chatbot to become familiar with the present image data, enabling the chatbot to assess the image content and generate the best response. Through repeated training, the chatbot model can fully grasp the relationship of the conversation with the user; in this manner, the chatbot can become more fluent in conversations with people, narrowing the gap between chatbots' and humans' communication skills.

Figure 6. The workflow of the designed chatbot model (example: the context "How was the weather?" yields, word by word, "It", "is", "hot"; the extra information slot takes the previous sentence or the image caption).
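The word-by-word loop in Figure 6 amounts to greedy decoding: at each step the model receives the context and the partial answer, and the index with the largest probability in P (Equation (9)) selects the next word. The sketch below is an assumed implementation; the word2idx/idx2word tables and the endseq end-of-sentence token are illustrative, not taken from the paper.

```python
# A hedged sketch of the Figure 6 response loop (greedy decoding).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def respond(context_text, chat_model, word2idx, idx2word,
            max_len=50, max_words=30):
    context = pad_sequences([[word2idx.get(w, 0)
                              for w in context_text.lower().split()]],
                            maxlen=max_len)
    answer_ids = []                              # "previous output" starts empty
    for _ in range(max_words):
        prev = pad_sequences([answer_ids], maxlen=max_len)
        probs = chat_model.predict([context, prev])[0]
        next_id = int(np.argmax(probs))          # largest probability index in P
        if idx2word.get(next_id, "endseq") == "endseq":
            break                                # assumed end-of-sentence token
        answer_ids.append(next_id)
    return " ".join(idx2word[i] for i in answer_ids)
```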
Image caption technology has been a topic of in-depth exploration in many papers, and this paper builds on the framework in references [19,20]. The concept discussed in this paper was also implemented with favorable results in the Keras framework [17], as demonstrated in [21]. A detailed framework of our image caption model is presented in Figure 7, with Figure 7a as the original LSTM version [21] and Figure 7b as the optimized GRU version used in this paper. The original LSTM framework was exchanged for a GRU framework to reduce the number of training parameters; the concatenation layer that had originally been in the middle was implemented through a tensor adding method. The improved approach, in addition to being able to consider input data from both sides, can also eliminate half of the parameters for this layer and improve training efficiency. Furthermore, batch normalization (BN) techniques were added to the model; BN can adaptively adjust the value range of the input data, reducing situations in which the activation function cannot be calculated as a result of the value of the previous layer being too great or too small. The parameters of the image caption model discussed in this paper can be found in Table 2.
We also added dropout to prevent overfitting, and we chose Adam as the optimizer and categorical cross entropy as the loss function. The next section compares these two versions in greater detail.

Figure 7. Architecture of the image caption: (a) LSTM version; and (b) enhanced GRU version.

Table 2. Parameter settings of the image caption model.

Parameter      Value
Dropout Rate   0.5
GRU Cell       500
Optimizer      Adam
Loss Function  Categorical Cross Entropy
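The layer shapes in Figure 7b are legible enough to reconstruct the enhanced caption model: a 4096-dimensional VGG16 feature vector is projected to 256 dimensions and repeated across the 34 caption time steps, added (rather than concatenated) to the embedded partial caption, batch-normalized, and passed through a 500-cell GRU. The sketch below is a minimal reconstruction under those assumptions, not the authors' exact code.

```python
# A hedged Keras sketch of the enhanced GRU image caption model (Figure 7b).
from tensorflow.keras.layers import (Input, Dense, Dropout, RepeatVector,
                                     Embedding, Add, BatchNormalization, GRU)
from tensorflow.keras.models import Model

CAP_LEN, VOCAB = 34, 7549                  # sizes read from Figure 7b

img_in = Input(shape=(4096,))              # VGG16 feature vector
img = Dense(256, activation="relu")(Dropout(0.5)(img_in))
img = RepeatVector(CAP_LEN)(img)           # tile the features across time steps

txt_in = Input(shape=(CAP_LEN,))           # partial caption as token indices
txt = Embedding(VOCAB, 256)(txt_in)

merged = Add()([img, txt])                 # tensor addition replaces concatenation
merged = BatchNormalization()(merged)      # keeps activations in a workable range
hidden = GRU(500)(merged)                  # 500 GRU cells (Table 2)
out = Dense(VOCAB, activation="softmax")(hidden)  # next caption word

caption_model = Model([img_in, txt_in], out)
caption_model.compile(optimizer="adam", loss="categorical_crossentropy")
```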
extraction If we Ifinput an entire we input picture an entire into into picture the image caption the image modelmodel caption from thefrombeginning, this may the beginning, thisresult in training may result issuesissues in training from overly from overly large input data. data. large input The useTheofuse theoffeature extraction the feature method extraction reduces method the input reduces dimensions the input dimensions and canandpresent the picture’s can present data data the picture’s in thein most streamlined the most streamlined andandadequate adequate manner manneras as vec- vectors. tors. The Theimage imagecaption captionmodel’s model’sright-hand right-handinputs inputs areare aa series series of of one-hot one-hot vectors vectors that that rep- represent the current sequence of the output text. After repeated iterative processes, the image caption model describes the image in text by generating the text word-by-word to present a complete sentence.
The model training process discussed in this paper is presented in Figure 9. The detailed pseudocode is listed in Algorithm 1. The details are addressed in Appendix A.

Algorithm 1: The procedure of the proposed system.
1. Data preprocessing (image caption)
2. Image caption model initialization
3. for all training epochs
4.   for all image caption training samples
5.     Training process
6.   end for
7. end for
8. Data preprocessing (chatbot)
9. Chatbot model initialization
10. for all training epochs
11.   for all chatbot training samples
12.     Training process
13.   end for
14. end for

The process can be split into two major sections: the first section is the image caption model, and the second is the natural language processing part. In the image caption model, we first perform preprocessing on the data by consolidating and coding all the image data and the labeled text data. Then, after we initialize the model, we conduct model training. In the training process, we must observe whether the convergence conditions have been satisfied; if not, then training must be continued. If the image caption model training is complete, then the process enters its next stage. In the natural language processing part, the first and most crucial task is still the data preprocessing, followed by the initialization of the chatbot model. Next, training of the chatbot model must be performed using the processed data. Similarly, if the chatbot model does not satisfy the convergence conditions, the training must continue. Conversely, if the chatbot model has satisfied the convergence conditions, the training immediately ends. When the convergence conditions are met, the entire training process is complete. However, the image caption model must be trained first because this model plays a key role in incorporating image information in the chatbot input. If the image caption model is inadequately trained, the image information input into the chatbot becomes meaningless and may result in erroneous interpretation. After the training of the image caption model is completed, it can be used to convert image information to textual information, which is then fed to the input end of the chatbot. This conversion approach enables a chatbot model to be constructed without substantial modification. This also allows the model to interpret image information, which is a primary focus of this study. The problem of overfitting was considered during the training process, which was terminated immediately when overfitting occurred.

Figure 9. System flowchart of the training process.
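In Keras, the two training stages of Algorithm 1, including the early stop on overfitting mentioned above, can be expressed with an EarlyStopping callback. The dataset variables below are illustrative placeholders, and the epoch count and patience are assumptions rather than values reported in the paper.

```python
# A hedged sketch of the two-stage training in Algorithm 1 / Figure 9.
from tensorflow.keras.callbacks import EarlyStopping

stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

# Stage 1: the image caption model must converge first, because its output
# text later becomes part of the chatbot input.
caption_model.fit([img_features, caption_seqs], caption_targets,
                  epochs=50, validation_split=0.1, callbacks=[stop])

# Stage 2: train the chatbot on dialogue pairs augmented with caption text.
chat_model.fit([context_seqs, prev_output_seqs], next_word_targets,
               epochs=50, validation_split=0.1, callbacks=[stop])
```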
4. The Experimental Results

The first part of this section is a detailed explanation of the training data. Next, we present training and comparisons based on the models proposed in the previous section. Finally, we describe how the consolidated functions were integrated with the humanoid robot to test the effectiveness of the proposed model and analyze its feasibility.

4.1. The Training Data

The databank for sentence training in this paper consisted of data sets of dialogue from online English classes [18]. Although image caption technology typically uses the COCO data set [23], which involves a substantial amount of data, as the training data, this experiment used the Flickr 8k Dataset [24], which has a smaller data volume, so that the model training results can be used with actual robots that may not have rapid computing abilities. The Flickr 8k Dataset has 8000 color images in total, and each image has five captions (see Figure 10 for the training data). From the images, we can see that each image has a chunk of text that describes the picture; these descriptions are man-made captions. Therefore, each image has various appropriate descriptions. Accordingly, one of the key points of this paper is that image data must be described in text and incorporated into the chatbot's considerations. In this manner, conversations between humans and bots can become more precise and more in line with actual scenarios. Furthermore, in addition to the data mentioned, some new images were also added to the experiment in pursuit of greater realism and a more accurate representation of the experimental scenario.
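Preparing the Flickr 8k captions for training typically means collecting the five captions per image and wrapping each in start and end markers before tokenizing. The sketch below assumes the commonly distributed Flickr8k.token.txt file layout; the startseq/endseq markers carry over from the earlier sketches and are not taken from the paper.

```python
# A hedged sketch of Flickr 8k caption preprocessing.
from collections import defaultdict
from tensorflow.keras.preprocessing.text import Tokenizer

captions = defaultdict(list)
with open("Flickr8k.token.txt") as f:        # assumed layout: "<image>#<n>\t<caption>"
    for line in f:
        img_key, text = line.rstrip("\n").split("\t")
        captions[img_key.split("#")[0]].append(f"startseq {text.lower()} endseq")

tokenizer = Tokenizer()
tokenizer.fit_on_texts([c for caps in captions.values() for c in caps])
vocab_size = len(tokenizer.word_index) + 1   # on the order of the 7549 seen in Figure 7
```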
Figure 10. Training samples in Flickr 8k [24]: (a) caption: "A dog at the beach"; and (b) caption: "A little girl covered in paint sits in front of a painted rainbow with her hands in a bowl."

4.2. The Experimental Results

To discuss the strengths and weaknesses of the proposed model and the original LSTM model in depth, we made the following comparisons.
The chatbot training models are compared in Figure 11, whereas the image caption training models are compared in Figure 12. The figures demonstrate that during the chatbot model training process, the proposed method is faster and more stable and has a lower final training error rate. This also substantiates that the model proposed in this paper outperforms the original and conventional frameworks. During the training process of the image caption model, the performance of the updated model proposed in this study was likewise slightly better. Although the error value sometimes increased slightly during the training process, overall it did not affect convergence. The final convergence results proved that the method proposed in this paper significantly outperformed conventional chatbot models. Conventional chatbot models, even in the later stages of training, exhibit trends of increasingly higher error values. This phenomenon once more substantiates the stability and validity of the method proposed in this paper.

To verify the practicality of the method proposed in this study, we incorporated a humanoid robot system into the following experiment and integrated the trained model into the humanoid robot. The experiment was split into a total of four scenarios; each scenario was matched with a picture to represent the present situation. The bot, based on the situation and user dialogue in the moment, made a corresponding response. Therefore, the bot's inputs were not limited to the user's voice data, but also included image data. The experimental results are depicted in Figures 13–16.

Figure 11. The learning curve of the chatbot model.
Figure 12. The learning curve of the image caption model.

Figure 13. Experimental result in scenario A.
Figure 14. Experimental result in scenario B.

Figure 15. Experimental result in scenario C.
Figure 16. Experimental result in scenario D.

The experiment combined image captioning with natural language processing technology. The robot was expected to give corresponding responses based on related image and sound information. In the experimental procedure, the robot used a microphone to capture sound and converted the sound into text through the speech-to-text (STT) technique. Then, the robot converted the obtained image information into text by using a deep artificial neural network. The sound and image information was converted into textual information for storage and then input to the natural language processing model. The robot was able to give reasonable responses according to the information converted from image and sound. After the robot computed suitable sentences, it used the text-to-speech (TTS) technique to convert the textual information into sound. Thus, the text was read aloud to allow the user to understand the robot's response.

In some instances, the robot's responses were not perfect. These imperfect responses may have arisen because the databank of image data and dialogue data was insufficient and, therefore, the robot was unable to fully consider all circumstances relevant to the required response. In the future, the model may process dialogue records that combine images; the model may then integrate larger and more comprehensive chatbot training databanks to increase the fluency of communications between humans and robots.

5. Discussion

In this experiment, the algorithm developed was integrated with a humanoid robot to facilitate conversation with humans. The robot used its built-in microphone to capture users' voice signals in addition to using the image sensor installed on its head to capture image information in the surrounding environment.
Because the robot used in the experiment features multiple degrees of freedom, it is capable of aligning its sight with that of the user to converse. Moreover, it can turn its head to observe the surrounding environment while conversing. Accordingly, the robot uses the image–text conversion model to incorporate relevant textual information into the chatbot input. The content of the conversation revealed that the response provided by the robot was associated with the image information it received. This indicated that the robot does not merely consider the relationships between the sentences in a dialogue; it also takes the received image information into account.