Multi-Sensor Context-Aware Based Chatbot Model: An Application of Humanoid Companion Robot

Ping-Huan Kuo 1,2, Ssu-Ting Lin 3, Jun Hu 3 and Chiou-Jye Huang 4,*

 1 Department of Mechanical Engineering, National Chung Cheng University, Chiayi 62102, Taiwan;
 phkuo@ccu.edu.tw
 2 Advanced Institute of Manufacturing with High-Tech Innovations (AIM-HI), National Chung Cheng University,
 Chiayi 62102, Taiwan
 3 Department of Intelligent Robotics, National Pingtung University, Pingtung 90004, Taiwan;
 cbc105010@gmail.com (S.-T.L.); wuorsut@gmail.com (J.H.)
 4 Department of Data Science and Big Data Analytics, Providence University, Taichung 43301, Taiwan
 * Correspondence: cjh1007@gm.pu.edu.tw; Tel.: +886-4-2632-8001 (ext. 15124)

 Abstract: In the field of natural language processing, previous studies have generally analyzed
 sound signals and provided related responses. However, in various conversation scenarios, image
 information is also vital. Without image information, misunderstandings may occur and lead
 to wrong responses. In order to address this problem, this study proposes a recurrent neural
 network (RNN)-based multi-sensor context-aware chatbot technology. The proposed chatbot model
 incorporates image information with sound signals and gives appropriate responses to the user.
 In order to improve the performance of the proposed model, the long short-term memory (LSTM)
 structure is replaced by a gated recurrent unit (GRU). Moreover, a VGG16 model is also chosen as a
 feature extractor for the image information. The experimental results demonstrate that the integrative
 technology of sound and image information, which are obtained by the image sensor and sound
 sensor in a companion robot, is helpful for the chatbot model proposed in this study. The feasibility
 of the proposed technology was also confirmed in the experiment.

 Keywords: multi-sensor fusion; natural language processing; context-aware computing; chatbot;
 companion robot

 1. Introduction
 Many studies have proposed robots that can chat with people, and opportunities to use
 chatbots will continue to increase. In addition to accompanying older adults living alone,
 chatbots are also applied to medical treatment. For example, individuals may be afraid
 of public spaces due to mental illness; under these circumstances, chatbots can be used
 to provide them with companionship and health care. Numerous studies of chatbots have
 been reported. Traditional chatbots use only the preceding and following sentences for
 training. However, such training results in low diversity and low accuracy, and it leaves
 chatbots unable to answer questions flexibly. For example, when asked "What do you see?",
 a traditional chatbot can only answer "I don't know" due to the lack of image information.
 The aforementioned problem is only one example; numerous others are similar.
 Numerous studies regarding chatbots have been published. Ref. [1] proposed and
 applied a hybrid chatbot to communication software. This technology combines a retrieval-
 based model and a QANet model and can answer user questions through an E-Learning
 system. In addition to supporting English, this technology can also converse in Chinese
 and has performed well in experiments. Ref. [2] also applied chatbots in digital learning.
 In that study, the role of the proposed chatbot is to resolve Visual,
 Auditory, Read, Write, and Kinesthetic (VARK) information throughout the process of
 responding to the learner. Furthermore, the proposed chatbot also records the learner’s
 beta brainwaves and uses machine learning algorithms to classify learning orientations.
 Ref. [3] explored many studies on chatbots used in Facebook Messenger, mainly discussing
 how teachers use chatbots to supplement their teaching. The article also stated that the
 use of chatbots to assist in teaching is not yet mature and, therefore, chatbots as a field of
 research offer considerable room for development. Ref. [4] proposed a neural network-
 based chatbot that integrates personal data for analysis and estimation and uses the useful
 data to return appropriate responses. The main architecture for this technology is LSTM [5],
 and experimental results have substantiated the feasibility of this technology. In Ref. [6],
 bidirectional recurrent neural network (BiRNN) technology was used as the basis for a
 chatbot, and a GRU [7] was used as the core of the technology. The system architecture
 was exceedingly large and required substantial computing resources, but was verified
 favorably in experiments. Ref. [8] proposed a chatbot called Xatkit and stated that although
 it requires improvement, it currently can be applied in some real-world scenarios, and
 could be applied in many venues in the future.
 Chatbots have diverse applications. To address the massive demand for manpower
 in customer service, some companies have proposed automated customer service chat-
 bots. However, automated customer service chatbots sometimes misunderstand sentences.
 Ref. [9] presented research into automated customer service chatbots and stated that al-
 though chatbot technology is not yet fully mature, demand for this technology is notably
 high. Moreover, cognitive behavioral therapy (CBT) is a common method of therapy
 for patients with panic disorder. Ref. [10] explored the feasibility of using chatbots on
 mobile devices for CBT. The study found that using chatbots for CBT was both feasible
 and effective, pioneering the application of chatbots in medicine.
 Deep learning as a field of research has burgeoned since the early 21st century. Its techno-
 logical developments are diverse and have been applied to many research topics [11–15].
 The present paper features deep learning techniques to resolve questions that chatbots
 cannot answer due to environmental contexts. The first step toward this type of practical
 learning is to enable bots to understand the present environment through visual images
 and to describe the environment through text. The most direct method is to design bots that
 use image caption technologies to understand their environments. The original purpose
 of image captions was, and is, to automatically generate annotations for an image. For
 example, if an image is entered as input, the image caption model can write a sentence
 about the image based on the image status. Therefore, in this article, the described method
 enables the bot to view its present environment clearly and to know about any object in
 its environment. Conventional chatbot models typically are only able to perform model
 training based on context; this type of model training is unable to consider image data.
 Even if the model can understand its present environment, it cannot incorporate the un-
 derstood image data into dialogue responses. In this study, the proposed method can
 deepen the chatbot model’s considerations of the results of its previous conversations, and
 together with the current dialogue, can train the chatbot. A statement generated by image
 caption technology can also be an input for the chatbot and can be incorporated into the
 next dialogue. This method enables the chatbot to answer questions more accurately; the
 highly diverse responses vary based on the present environment.
 The four major contributions of this work are: (1) integrating image and sound
 data to achieve further precision and practicality in chatbot and human communications;
 (2) adjusting and optimizing a framework based on a commonly used LSTM model to
 train the image caption and chatbot model efficiently, thus improving scenario recognition
 abilities; (3) comparing the original LSTM-based model with the adjusted, GRU-based
 version and verifying that the training efficiency was improved by the adjustments; and
 (4) testing the trained models on a humanoid robot in a series of experiments to verify that
 both models proposed in this paper and their applications are effective
 and feasible.

 The remaining sections of this paper are as follows. Section 2 details the operational
 concepts and process of this paper. Section 3 introduces the models and the algorithmic
 framework being proposed. Section 4 presents and compares the experimental results.
 The discussion is presented in Section 5. Section 6 is the conclusion.

 2. System Architecture
 In the past, chatting with bots was typically only possible through direct voice or text
 conversations using conventional chatbot models. A typical bot was unable to consider its
 surrounding environment through its vision when responding. The present research relies
 on chat training with images that the bot can see. To allow the chatbot models to integrate
 image data into input sets, this paper proposes an improved model that integrates image
 captions, natural language processing, and similar technologies; this model enables the bot
 to chat and respond according to the environment it sees. The feasibility and practicality of
 the model were substantiated by the experimental results. In the future, this method can be
 applied to chatbots and home care, as well as to increase the precision of typical chatbots'
 verbal expressions and to make typical chatbots more aware of reality when conversing
 with humans.
 The overview of the proposed framework is illustrated in Figure 1. Prior to undergoing
 any training, the chatbot can be said to be like a baby, incapable of anything. After going
 through correct learning, however, the chatbot slowly becomes more human-like and gains
 the ability to communicate with people. The training process resembles the process by
 which adults teach children to recognize objects, typically holding a picture and telling the
 child, "This is a dog". Although the ability to express oneself in words may not be entirely
 present during the initial training, the rate of failure can be reduced through continuous
 training. Therefore, we can use this process to enable robots to develop cognitive abilities
 regarding their surroundings. This process is the same as teaching children to speak, that
 is, developing their ability to speak through continuous conversations and correcting their
 mistakes. Teaching robots is the same: although the bot may answer incorrectly in the
 beginning, it can learn through continuous practice to hold normal conversations with
 people. The concept map of this paper is presented in Figure 2; the chatbot can perform
 analyses based on current image data. When the user asks questions, the chatbot is not only
 able to make considerations based on the environmental data it currently detects, it is also
 able to generate a corresponding response based on the conversation context. The benefit
 of this method is in the integration of multiple pieces of data, allowing the chatbot to better
 focus its responses on the relevant questions. In this manner, the bot's responses become
 more precise and more in line with reality.

 Figure 1. An overview of the proposed framework. (Blocks: Image Sensor, Sound Sensor, Robot Action.)

 Figure 2. The concept of the proposed framework.

 This paper can be split into two major areas. The first is picture scenario analysis, and
 the second is natural language processing. The system framework is presented in Figure 3.
 When the entire system is operating, we consider two inputs: image data and voice inputs.
 For the image inputs, we first perform feature extraction through the VGG model. This
 reduces the dimensionality of the image data, avoiding training problems resulting from
 overly large input dimensions. Following the feature extraction, we then use the image
 caption model to generate a paragraph of text that matches the image data. For the voice
 inputs, we must first convert the sound signals into text messages through voice-to-text
 conversion technologies. Then the chatbot model must generate appropriate responses
 based on the text messages converted from the user's voice and the text information about
 the pictures. In this context, the output data type is text. Therefore, for the user to be able
 to hear the chatbot's response, we use text-to-voice conversion technologies to convert the
 text message into sounds, either to send to the user or to enable the bot to perform a
 corresponding action. In this manner, the whole system framework not only considers
 signals from sound, it also incorporates considerations of images. Therefore, the proposed
 method improves chatbot affinity by enabling chatbots to generate responses that are more
 in line with reality and to interact better with humans.

 Figure 3. The architecture of the system. (Pipeline: Image Sensor → VGG Model → Image Caption Model; Sound Sensor → Speech to Text → Chatbot Model; Response → Text to Speech → Robot Action.)
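
 To make the data flow above concrete, the following is a minimal sketch of one interaction cycle under the architecture of Figure 3. It is not the authors' code: the caption_model and chatbot_model wrappers and their describe/reply methods are illustrative assumptions, and the speech_recognition and pyttsx3 packages stand in for the STT and TTS components, which the paper does not name.

```python
# Hedged sketch of one interaction cycle in Figure 3 (not the authors' code).
# `caption_model.describe` and `chatbot_model.reply` are assumed wrappers.
import speech_recognition as sr  # stand-in speech-to-text component
import pyttsx3                   # stand-in text-to-speech component

recognizer = sr.Recognizer()
tts_engine = pyttsx3.init()

def interaction_cycle(frame, caption_model, chatbot_model):
    # Image sensor -> VGG feature extraction -> image caption (text).
    scene_text = caption_model.describe(frame)        # e.g., "man is skiing"

    # Sound sensor -> speech-to-text.
    with sr.Microphone() as mic:
        audio = recognizer.listen(mic)
    user_text = recognizer.recognize_google(audio)

    # The chatbot fuses the user's utterance with the scene description.
    reply = chatbot_model.reply(context=user_text, extra_info=scene_text)

    # Text-to-speech: read the response back to the user.
    tts_engine.say(reply)
    tts_engine.runAndWait()
    return reply
```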

 To validate the method proposed in this paper, we applied the trained models with a
 humanoid robot, then verified the efficacy of the method through a series of experiments.
 Humanoid robots are the closest robot type to humans and are therefore suitable for
 chatting and companionship purposes; therefore, a humanoid robot was chosen for the
 experiments in this paper. The humanoid robot used in this paper is presented in Figure 4;
 it is 51 cm tall and weighs approximately 3.5 kg. It has 20 degrees of freedom and has an
 inbuilt Logitech C920 HD Pro Webcam and microphone. The robot is powered by an XM430
 motor and installed with an Intel Core i3 processor and an OpenCR controller. Its inertial
 measurement unit involves a three-axis gyroscope, a three-axis acceleration sensor, and a
 three-axis magnetometer. The operating system is Ubuntu.

 Figure 4. The architecture of the humanoid robot: (a) the front view; and (b) the side view. (Labeled components: webcam and microphone.)

 3. The Proposed Model
 This section introduces the framework of the proposed model in detail. Because most
 of the data processed in this work was time-series data, RNNs constituted an appropriate
 framework. The most well-known RNN framework is the commonly used LSTM. The
 literature currently features many LSTM variations, such as the GRU framework. The GRU
 framework is simpler than the LSTM framework, but in actual operation the GRU can
 achieve results that are equal to (if not better than) the results of LSTM. Also, because the
 GRU framework is simpler, under normal conditions GRU has better training efficiency
 than LSTM. To compare the two RNN models, we modified and enhanced the original
 LSTM RNN in the hopes of achieving better convergence efficiency than the original
 neural network.
 Many studies have performed in-depth explorations of chatbots, and this paper takes
 the framework from reference [16] as the basis for improvements. In the original version, an
 adversarial learning method for generative conversational agents (GCA) is applied for the
 chatbot model. Such a learning process originally belonged to the sequence-to-sequence
 training regimes. However, this approach can also be transferred to an end-to-end training
 process by applying the specific structure. The chatbot model can generate an appropriate
 sentence by considering the dialogue history. The main core of GCA is LSTM, and the
 model considers both the current answer and the previous conversation history. The
 concept for this paper was also implemented with favorable results in the Keras framework
 [17], as presented in [18]. The Keras framework is a deep learning library which includes
 many Application Programming Interfaces (APIs) for building the corresponding models.
 Keras is also one of the most used deep learning frameworks. In [18], Keras is used for the
 implementation of the chatbot model. Based on the previous work in [18], this study
 improves the model structure and gives better performance than the original version. The
 detailed framework of the chatbot is exhibited in Figure 5; Figure 5a presents the original
 LSTM version [18]; Figure 5b displays the GRU version that was optimized in this study.
 In addition to the LSTM-to-GRU transformation, the differences between the two versions
 include the addition of two dense (fully connected) layers to the new version. The
 parameters are listed in Table 1. We also added a dropout layer to prevent overfitting and
 used Adam for the optimizer and categorical cross entropy for the loss function. A more
 detailed comparison of the two versions is laid out in the next section.

 Figure 5. Architecture of the chatbot: (a) LSTM version; and (b) enhanced GRU version. (In (a), two input sequences (50) are embedded (50, 100) and encoded by two LSTMs (300 each); the concatenated vector (600) passes through dense layers of 3500 and 7000 units. In (b), GRUs (300 each) replace the LSTMs, a dense layer (3500) is added on each branch, and the concatenated result (7000) feeds a final dense layer (7000).)

 Table 1. Parameter settings of the chatbot.

 Parameter          Value
 Dropout Rate       0.25
 GRU Cell           300
 Optimizer          Adam
 Loss Function      Categorical Cross Entropy
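
 For concreteness, a minimal Keras sketch of the enhanced GRU version in Figure 5b follows. The layer sizes are taken from the figure and Table 1; whether the embedding is shared between the two branches, the activation functions, and the exact dropout placement are assumptions.

```python
# Minimal sketch of the enhanced GRU chatbot of Figure 5b (assumptions noted above).
from tensorflow.keras.layers import (Input, Embedding, GRU, Dense,
                                     Dropout, concatenate)
from tensorflow.keras.models import Model

SEQ_LEN = 50       # input sequence length (Figure 5)
VOCAB_SIZE = 7000  # output softmax size (Figure 5)
EMBED_DIM = 100
GRU_CELLS = 300    # Table 1

context_in = Input(shape=(SEQ_LEN,))  # dialogue history
answer_in = Input(shape=(SEQ_LEN,))   # response generated so far

embed = Embedding(VOCAB_SIZE, EMBED_DIM)  # assumed shared between branches
context_vec = GRU(GRU_CELLS)(embed(context_in))
answer_vec = GRU(GRU_CELLS)(embed(answer_in))

# The two dense layers added in the GRU version, one per branch.
context_feat = Dense(3500, activation='relu')(Dropout(0.25)(context_vec))
answer_feat = Dense(3500, activation='relu')(Dropout(0.25)(answer_vec))

merged = concatenate([context_feat, answer_feat])            # (7000,)
next_word = Dense(VOCAB_SIZE, activation='softmax')(merged)  # next-word distribution

chatbot_model = Model([context_in, answer_in], next_word)
chatbot_model.compile(optimizer='adam', loss='categorical_crossentropy')
```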

 The chatbot model proposed in this study was based mainly on the concept of a
 generative conversational agent (GCA) [16]. Records of a conversation with a chatbot can
 be stored using the x vector and one-hot encoding. Accordingly, the following equations
 are defined:

 X = [x_1, x_2, x_3, ..., x_{S_s}] (1)

 Y = [y_1, y_2, y_3, ..., y_{S_s}] (2)

 E_c = W_e X (3)

 E_a = W_e Y (4)

 where W_e represents the embedding layer parameter, E_c represents the corresponding
 conversation record, and E_a represents the unfinished response. An RNN is then used to
 extract the embedding vectors of the conversation record and response, as illustrated in the
 following equations:

 e_c = Γ_c(E_c; W_c) (5)

 e_a = Γ_a(E_a; W_a) (6)

 e = [e_c, e_a] (7)

 h = σ(W_1 e + b_1) (8)

 P = ϕ(W_2 h + b_2) (9)

 where W_1 and W_2 are synaptic weights, b_1 and b_2 are biases, σ is the Rectified Linear
 Unit (ReLU), and ϕ is the Softmax activation function. Finally, the index with the largest
 probability value is identified from P before the next iteration is started. Accordingly, the
 most suitable response can be output.
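
 As a small worked illustration of Equations (1)–(4): multiplying the embedding matrix W_e by a one-hot vector simply selects one of its rows, so the one-hot-encoded records can be embedded by indexing. All sizes below are assumptions chosen for the example.

```python
# Illustration of Equations (1)-(4); sizes are assumed for the example.
import numpy as np

VOCAB_SIZE, EMBED_DIM, SEQ_LEN = 7000, 100, 50
rng = np.random.default_rng(0)
W_e = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))  # embedding layer parameter

X = rng.integers(0, VOCAB_SIZE, SEQ_LEN)  # conversation record, Equation (1)
Y = rng.integers(0, VOCAB_SIZE, SEQ_LEN)  # unfinished response, Equation (2)

# Indexing rows of W_e equals multiplying W_e by one-hot vectors.
E_c = W_e[X]  # Equation (3): shape (50, 100)
E_a = W_e[Y]  # Equation (4): shape (50, 100)
```
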
 The operational method of the chatbot used in this paper is displayed in Figure 6.
 The figure displays the two input sources of the chatbot. The left-hand input source is the
 previous sentence spoken by the other party, and the right-hand input source is a collection
 of previous output results. Because of the unique advancements of the proposed training,
 the chatbot model can notably understand and analyze the contextual relationships of
 the sentences. Finally, we export the generated dialogue word-by-word in sequence, then
 assemble the words into a complete sentence. When the chatbot expresses the output
 results for the user to hear, the user may respond a second time based on the answer just
 received. Next, in addition to the second response from the user, the chatbot's own previous
 response is added to the chatbot's left-hand input. This additional input is exceedingly
 helpful to the chatbot's ability to understand the contextual meaning. Furthermore, this
 additional input can be substituted for the output text produced by the image data. This
 allows the chatbot to become familiar with the present image data, enabling the chatbot to
 assess the image content and generate the best response. Through repeated training, the
 chatbot model can fully grasp the relationship of the conversation with the user; in this
 manner, the chatbot can become more fluent in conversations with people, narrowing the
 gap between chatbots' and humans' communication skills.

 Figure 6. The workflow of the designed chatbot model. (Step 1: given the context "How was the weather?", the model emits "It", "is", "hot" one word per pass, with each previous output fed back in. Step 2: given the context "It is hot. Do you need some drink?" plus extra information from the previous sentence or image caption, it emits "I", "like", "coke".)
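 A minimal decoding sketch of this loop follows, assuming the chatbot_model built above and a fitted Keras Tokenizer; the end-of-sentence token name and the helper interface are assumptions.

```python
# Hedged sketch of the word-by-word generation loop in Figure 6.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

SEQ_LEN = 50

def generate_reply(chatbot_model, tokenizer, context, max_words=20):
    """Greedily emit one word per pass, feeding previous outputs back in."""
    context_seq = pad_sequences(
        tokenizer.texts_to_sequences([context]), maxlen=SEQ_LEN)
    reply_words = []
    for _ in range(max_words):
        partial_seq = pad_sequences(
            tokenizer.texts_to_sequences([" ".join(reply_words)]),
            maxlen=SEQ_LEN)
        probs = chatbot_model.predict([context_seq, partial_seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "eos":  # assumed end-of-sentence marker
            break
        reply_words.append(word)
    return " ".join(reply_words)
```
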
 Image caption technology has been a topic of in-depth exploration in many papers,
 and this paper builds on the framework in references [19,20]. The concept discussed in
 this paper was also implemented with favorable results in the Keras framework [17], as
 demonstrated in [21]. A detailed framework of our image caption model is presented in
 Figure 7, with Figure 7a as the original LSTM version [21] and Figure 7b as the optimized
 GRU version used in this paper. The original LSTM framework was exchanged for a GRU
 framework to reduce the number of training parameters; the concatenation layer that had
 originally been in the middle was implemented through a tensor adding method. The
 improved approach, in addition to being able to consider input data from both sides, can
 also eliminate half of the parameters for this layer and improve training efficiency.
 Furthermore, batch normalization (BN) techniques were added to the model; BN can
 adaptively adjust the value range of the input data, reducing situations in which the
 activation function cannot be calculated as a result of the value of the previous layer being
 too great or too small. The parameters of the image caption model discussed in this paper
 can be found in Table 2. We also added dropout to prevent overfitting; we chose Adam for
 the optimizer and categorical cross entropy for the output loss function. The next section
 compares these two versions in greater detail.
 Figure 7. Architecture of the image caption: (a) LSTM version; and (b) enhanced GRU version. (In (a), the image feature (4096) passes through dropout and a dense layer (256) and is repeated over 34 time steps; the text input (34) is embedded (34, 256); the two are concatenated (34, 512) and fed through an LSTM (500) and a dense output layer (7549). In (b), the concatenation is replaced by tensor addition (34, 256) followed by batch normalization, a GRU (500), and a dense output layer (7549).)

 Table 2. Parameter settings of the image caption model.

 Parameter          Value
 Dropout Rate       0.5
 GRU Cell           500
 Optimizer          Adam
 Loss Function      Categorical Cross Entropy
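
 For concreteness, a minimal Keras sketch of the enhanced GRU image caption model in Figure 7b follows; the layer sizes come from the figure and Table 2, while the activation choices are assumptions.

```python
# Minimal sketch of the enhanced GRU image caption model of Figure 7b.
from tensorflow.keras.layers import (Input, Dropout, Dense, RepeatVector,
                                     Embedding, Add, BatchNormalization, GRU)
from tensorflow.keras.models import Model

FEAT_DIM = 4096     # VGG16 feature size (Figure 7)
CAP_LEN = 34        # caption sequence length (Figure 7)
VOCAB_SIZE = 7549   # caption vocabulary / softmax size (Figure 7)

# Image branch: VGG16 feature -> dropout -> dense -> repeated over time steps.
img_in = Input(shape=(FEAT_DIM,))
img_feat = Dense(256, activation='relu')(Dropout(0.5)(img_in))
img_seq = RepeatVector(CAP_LEN)(img_feat)             # (34, 256)

# Text branch: partial caption as integer tokens -> embedding.
txt_in = Input(shape=(CAP_LEN,))
txt_seq = Embedding(VOCAB_SIZE, 256)(txt_in)          # (34, 256)

# Tensor addition instead of concatenation halves this layer's width;
# batch normalization then keeps activations in a trainable range.
merged = BatchNormalization()(Add()([img_seq, txt_seq]))
hidden = GRU(500)(merged)                             # Table 2
next_word = Dense(VOCAB_SIZE, activation='softmax')(hidden)

caption_model = Model([img_in, txt_in], next_word)
caption_model.compile(optimizer='adam', loss='categorical_crossentropy')
```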

 Figure 8 presents the operations of the image caption model; the figure demonstrates
 that the image caption model can also be split into left-hand and right-hand inputs. The
 left-hand inputs are image-related; after an image is input into the model, we must first
 use VGG16 [22] to perform the task of feature extraction. The VGG16 model used in this
 paper is a complete model that has already been trained using a large volume of picture
 data. The final output layer of the original was not necessary for this work; but because
 this model had already been trained, the final output layer was eliminated, and the other
 parts were treated as excellent feature extraction tools. If we input an entire picture into
 the image caption model from the beginning, this may result in training issues from overly
 large input data. The use of the feature extraction method reduces the input dimensions
 and can present the picture's data in the most streamlined and adequate manner as
 vectors. The image caption model's right-hand inputs are a series of one-hot vectors that
 represent the current sequence of the output text. After repeated iterative processes, the
 image caption model describes the image in text by generating the text word-by-word to
 present a complete sentence.
 present a complete sentence.

 Figure 8. Workflow of the designed image caption model. (At each pass, the VGG16 feature and a one-hot encoding of the words generated so far, e.g., [1, 0, 0, …0], enter the image caption model, which emits the next word: "man", "is", "skiing".)

 The model training process discussed in this paper is presented in Figure 9. The
 detailed pseudocode is listed in Algorithm 1. The details are addressed in Appendix A.

 Algorithm 1: The procedure of the proposed system.
 1.  Data preprocessing (image caption)
 2.  Image caption model initialization
 3.  for all training epochs
 4.      for all image caption training samples
 5.          Training process
 6.      end for
 7.  end for
 8.  Data preprocessing (chatbot)
 9.  Chatbot model initialization
 10. for all training epochs
 11.     for all chatbot training samples
 12.         Training process
 13.     end for
 14. end for

 The process can be split into two major sections: the first is the image caption
 model, and the second is the natural language processing part. In the image caption
 model, we first perform preprocessing on the data by consolidating and coding all the
 image data and the labeled text data. Then, after we initialize the model, we conduct model
 training. In the training process, we must observe whether the convergence conditions have
 been satisfied; if not, then training must be continued. If the text caption model training is
 complete, then the process enters its next stage. In the natural language processing part,
 the first and most crucial task is still the data preprocessing, followed by the initialization
 of the chatbot model. Next, training of the chatbot model must be performed using the
 processed data. Similarly, if the chatbot model does not satisfy the convergence conditions,
 the training must continue. Conversely, if the chatbot model has satisfied the convergence
 conditions, the training immediately ends. When the convergence conditions are met, the
 entire training process is complete. However, the image caption model must be trained first
 because this model plays a key role in incorporating image information in the chatbot input.
 If the image caption model is inadequately trained, the image information input into the
 chatbot becomes meaningless and may result in erroneous interpretation. After the training
 of the image caption model is completed, it can be used to convert image information to
 textual information, which is then fed to the input end of the chatbot. This conversion
 approach enables a chatbot model to be constructed without substantial modification. This
 also allows the model to interpret image information, which is a primary focus of this
 study. The problem of overfitting was considered during the training process, which was
 terminated immediately when overfitting occurred.
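
 A compact sketch of this two-stage procedure follows, assuming the caption_model and chatbot_model sketched earlier and illustratively named preprocessed training arrays; Keras EarlyStopping stands in for the overfitting check described above.

```python
# Hedged sketch of Algorithm 1's two training stages with an overfitting stop.
from tensorflow.keras.callbacks import EarlyStopping

stop_on_overfit = EarlyStopping(monitor="val_loss", patience=3,
                                restore_best_weights=True)

# Stage 1: the image caption model is trained first, because its text
# output later feeds the chatbot input.
caption_model.fit([image_features, partial_captions], next_caption_words,
                  validation_split=0.1, epochs=100,
                  callbacks=[stop_on_overfit])

# Stage 2: the chatbot is trained on dialogue pairs (captions join the context).
chatbot_model.fit([context_sequences, answer_sequences], next_reply_words,
                  validation_split=0.1, epochs=100,
                  callbacks=[stop_on_overfit])
```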

 Figure 9. System flowchart of the training process.

 4. The Experimental Results
 The first part of this section is a detailed explanation of the training data. Next, we
 present training and comparisons based on the model proposed in the previous section.
 Finally, we describe how the consolidated functions were integrated with the humanoid
 robot to test the effectiveness of the proposed model and analyze the feasibility.

 4.1. The Training Data
 The databank for sentence training in this paper consisted of data sets of dialogue from
 online English classes [18]. Although image caption technology typically uses COCO data
 sets [23], which involve substantial amounts of data, as the training data, this experiment
 used the Flickr 8k Dataset [24], which has a smaller data volume, so that the model training
 results can be used with actual robots that may not have rapid computing abilities. The
 Flickr 8k Dataset has 8000 color images in total, and each image has five captions (see
 Figure 10 for the training data). From the images, we can see that each image has a chunk
 of text that describes the picture; these descriptions are man-made captions, so each image
 has various appropriate descriptions. Accordingly, one of the key points of this
 paper is that image data must be described in text and incorporated into the chatbot’s
 considerations. In this manner, conversations between humans and bots can become
 more precise and more in line with actual scenarios. Furthermore, in addition to the data
 mentioned, some new images were also added to the experiment in pursuit of greater
 realism and more accurate representation of the experimental scenario.

 Figure 10. Training samples in Flickr 8k [24]: (a) caption: "A dog at the beach"; and (b) caption: "A little girl covered in paint sits in front of a painted rainbow with her hands in a bowl".

 4.2. The Experimental Results
 To discuss the strengths and weaknesses of the proposed model and the original LSTM
 model in depth, we made the following comparisons. The chatbot training models are
 compared in Figure 11, whereas the image caption training models are compared in
 Figure 12. The figures demonstrate that during the chatbot model training process, the
 proposed method is faster and more stable and has a lower final training error rate. This
 also substantiates that the model proposed in this paper outperforms the original and
 conventional frameworks. During the training process of the image caption model, the
 performance of the updated model proposed in this study was also slightly better.
 Although the error value sometimes increased slightly during the training process, overall
 it did not affect convergence. The final convergence results proved that the method
 proposed in this paper significantly outperformed conventional chatbot models.
 Conventional chatbot models, even in the later stages of training, exhibit trends of
 increasingly higher error values. This phenomenon once more substantiates that the
 stability and the validity of the method proposed in this paper are excellent.

 Figure 11. The learning curve of the chatbot model.

 Figure 12. The learning curve of the image caption model.

 To verify the practicality of the method proposed in this study, we incorporated a
 humanoid robot system into the following experiment and integrated the trained model
 into the humanoid robot. The experiment was split into a total of four scenarios; each
 scenario was matched with a picture to represent the present situation. The bot, based on
 the situation and user dialogue in the moment, made a corresponding response. Therefore,
 the bot's inputs were not limited to the user's voice data, but also included image data.
 The experimental results are depicted in Figures 13–16.
 Figure 13. Experimental result in scenario A. (Panels (a)–(h).)

 Figure 14. Experimental result in scenario B. (Panels (a)–(f).)

 Figure 15. Experimental result in scenario C. (Panels (a)–(f).)

 Figure 16. Experimental result in scenario D. (Panels (a)–(f).)

 The experiment combined image caption with natural language processing technology.
 The robot was expected to give corresponding responses through related image and sound
 information. In the experimental procedure, the robot used a microphone to capture
 sound and converted sound into text through the speech-to-text (STT) technique. Then,
 the robot converted the obtained image information into text by using a deep artificial
 neural network. Sound and image information was converted into textual information for
 storage and then inputted to the natural language processing model. The robot was able to
 give reasonable responses according to the information converted from image and sound.
 After the robot computed suitable sentences, it used the text-to-speech (TTS) technique to
 convert textual information into sound information. Thus, the text was read aloud to allow
 the user to understand the robot's response.
 In some instances, the robot's responses were not perfect. These imperfect responses
 may have arisen because the databank of image data and dialogue data was insufficient
 and, therefore, the robot was unable to fully consider all circumstances relevant to the
 required response. In the future, the model may process dialogue records that combine
 images; the model may then integrate larger and more comprehensive chatbot training
 databanks to increase the fluency of communications between humans and robots.

 5. Discussion
 In this experiment, the algorithm developed was integrated with a humanoid robot to
 facilitate conversation with humans. The robot used its built-in microphone to capture
 users' voice signals in addition to using the image sensor installed on its head to capture
 image information in the surrounding environment. Because the robot used in the
 experiment features multiple degrees of freedom, it is capable of aligning its sight with that
 of the user to converse. Moreover, it can turn its head to observe the surrounding
 environment while conversing. Accordingly, the robot uses the image–text conversion
 model to incorporate relevant textual information into the chatbot input. The content of
 the conversation revealed that the response provided by the robot was associated with the
 image information it received. This indicated that the robot does not merely consider the
 relationships between