Multi-Sensor Context-Aware Based Chatbot Model: An Application of Humanoid Companion Robot
Ping-Huan Kuo 1,2, Ssu-Ting Lin 3, Jun Hu 3 and Chiou-Jye Huang 4,*

1 Department of Mechanical Engineering, National Chung Cheng University, Chiayi 62102, Taiwan; phkuo@ccu.edu.tw
2 Advanced Institute of Manufacturing with High-Tech Innovations (AIM-HI), National Chung Cheng University, Chiayi 62102, Taiwan
3 Department of Intelligent Robotics, National Pingtung University, Pingtung 90004, Taiwan; cbc105010@gmail.com (S.-T.L.); wuorsut@gmail.com (J.H.)
4 Department of Data Science and Big Data Analytics, Providence University, Taichung 43301, Taiwan
* Correspondence: cjh1007@gm.pu.edu.tw; Tel.: +886-4-2632-8001 (ext. 15124)

Sensors 2021, 21, 5132. https://doi.org/10.3390/s21155132
Academic Editor: Eui Chul Lee. Received: 15 June 2021; Accepted: 27 July 2021; Published: 29 July 2021.

Abstract: In the field of natural language processing, previous studies have generally analyzed sound signals and provided related responses. However, in many conversation scenarios, image information is also vital; without it, misunderstandings may occur and lead to wrong responses. To address this problem, this study proposes a recurrent neural network (RNN) based multi-sensor context-aware chatbot technology. The proposed chatbot model incorporates image information with sound signals and gives appropriate responses to the user. To improve the performance of the proposed model, the long short-term memory (LSTM) structure is replaced by a gated recurrent unit (GRU). Moreover, a VGG16 model is chosen as the feature extractor for the image information. The experimental results demonstrate that integrating the sound and image information obtained by the sound sensor and image sensor of a companion robot is helpful to the chatbot model proposed in this study. The feasibility of the proposed technology was also confirmed in the experiments.

Keywords: multi-sensor fusion; natural language processing; context-aware computing; chatbot; companion robot

1. Introduction

Many studies have proposed robots that can chat with people, and opportunities to use chatbots will continue to increase. In addition to accompanying older adults living alone, chatbots are also applied to medical treatment. For example, individuals may be afraid of public spaces due to mental illness; under these circumstances, chatbots can be used to provide them with companionship and health care. Numerous studies of chatbots have been reported. Traditional chatbots only use the preceding and following sentences for training. However, such training results in low diversity, and accuracy is also low. This traditional approach also renders chatbots unable to answer questions flexibly. For example, when asked "What do you see?", traditional chatbots can only answer "I don't know" due to the lack of image information. The aforementioned problem is only one example; numerous others are similar.

Numerous studies regarding chatbots have been published. Ref. [1] proposed a hybrid chatbot and applied it to communication software. This technology combines a retrieval-based model and a QANet model and can answer user questions through an E-Learning system. In addition to supporting English, the chatbot can also converse in Chinese and has performed well in experiments. Ref. [2] also applied chatbots in digital learning. In that study, the role of the proposed chatbot is to resolve Visual, Auditory, Read, Write, and Kinesthetic (VARK) information throughout the process of
responding to the learner. Furthermore, the proposed chatbot also records the learner's beta brainwaves and uses machine learning algorithms to classify learning orientations. Ref. [3] explored many studies on chatbots used in Facebook Messenger, mainly discussing how teachers use chatbots to supplement their teaching. The article also stated that the use of chatbots to assist in teaching is not yet mature and, therefore, chatbots as a field of research offer considerable room for development. Ref. [4] proposed a neural network-based chatbot that integrates personal data for analysis and estimation and uses the useful data to return appropriate responses. The main architecture for this technology is LSTM [5], and experiment results have substantiated the feasibility of this technology. In Ref. [6], bidirectional recurrent neural network (BiRNN) technology was used as the basis for a chatbot, and a GRU [7] was used as the core of the technology. The system architecture was exceedingly large and required substantial computing resources, but it was verified favorably in experiments. Ref. [8] proposed a chatbot called Xatkit and stated that although it requires improvement, it can currently be applied in some real-world scenarios and could be applied in many venues in the future.

Chatbots have diverse applications. To address the massive demand for manpower in customer service, some companies have proposed automated customer service chatbots. However, automated customer service chatbots sometimes misunderstand sentences. Ref. [9] presented research into automated customer service chatbots and stated that although chatbot technology is not yet fully mature, demand for this technology is notably high. Moreover, cognitive behavioral therapy (CBT) is a common method of therapy for patients with panic disorder. Ref. [10] explored the feasibility of using chatbots on mobile devices for CBT. The study found that using chatbots for CBT was both feasible and effective, pioneering the application of chatbots in medicine.

Deep learning as a field of research has burgeoned in the early 21st century. Its technological developments are diverse and have been applied to many research topics [11–15]. The present paper features deep learning techniques to resolve questions that chatbots cannot answer due to environmental contexts. The first step toward this type of practical learning is to enable bots to understand the present environment through visual images and to describe the environment through text. The most direct method is to design bots that use image caption technologies to understand their environments. The original purpose of image captioning was, and is, to automatically generate annotations for an image. For example, if an image is entered as input, the image caption model can write a sentence about the image based on the image content. Therefore, in this article, the described method enables the bot to view its present environment clearly and to know about any object in its environment. Conventional chatbot models typically are only able to perform model training based on context; this type of model training is unable to consider image data. Even if the model can understand its present environment, it cannot incorporate the understood image data into dialogue responses. In this study, the proposed method can deepen the chatbot model's consideration of the results of its previous conversations and, together with the current dialogue, can train the chatbot.
A statement generated by image caption technology can also be an input for the chatbot and can be incorporated into the next dialogue. This method enables the chatbot to answer questions more accurately; the highly diverse responses vary based on the present environment. The four major contributions of this work are: (1) integrating image and sound data to achieve further precision and practicality in chatbot and human communications; (2) adjusting and optimizing a framework based on a commonly used LSTM model to train the image caption and chatbot models efficiently, thus improving scenario recognition abilities; (3) comparing the original LSTM-based model with the adjusted, GRU-based version and verifying that the training efficiency was improved by the adjustments; and (4) testing the trained models on a humanoid robot in a series of experiments to verify that the models proposed in this paper and their applications are effective and feasible.
The remaining sections of this paper are organized as follows. Section 2 details the operational concepts and process of this paper. Section 3 introduces the models and algorithmic framework being proposed. Section 4 presents and compares the experiment results. The discussions are addressed in Section 5. Section 6 is the conclusion.

2. System Architecture

In the past, chatting with bots was typically only possible through direct voice or text conversations using conventional chatbot models. A typical bot was unable to consider its surrounding environment through its vision when responding. The present research relies on chat training with images that the bot can see. To allow the chatbot models to integrate image data into their input sets, this paper proposes an improved model that integrates image captions, natural language processing, and similar technologies; this model enables the bot to chat and respond according to the environment it sees. The feasibility and practicality of the model were substantiated by the experiment results. In the future, this method can be applied to chatbots and home care, as well as to increase the precision of a typical chatbot's verbal expressions and to make typical chatbots more aware of reality when conversing with humans.

The overview of the proposed framework is illustrated in Figure 1. Prior to undergoing any training, the chatbot can be said to be like a baby, incapable of anything. But after going through correct learning, the chatbot slowly becomes more human-like and gains the ability to communicate with people. The training process resembles the process by which adults teach children to recognize objects, typically holding a picture and telling the child, "This is a dog".
Although the ability to express oneself in words may not be entirely present during the initial training, the rate of failure can be reduced through continuous training. Therefore, we can use this process to enable robots to develop cognitive abilities regarding their surroundings. This process is the same as teaching children to speak, that is, developing their ability to speak through continuous conversations and correcting their mistakes. Teaching robots is the same: although the bot may answer incorrectly in the beginning, through continuous learning it can learn to have normal conversations with people. The concept map of this paper is presented in Figure 2; the chatbot can perform analyses based on current image data. When the user asks questions, not only is the chatbot able to make considerations based on the environmental data it currently detects, the bot is also able to generate a corresponding response based on the conversation context. The benefit of this method is in the integration of multiple pieces of data, allowing the chatbot to make better responses focusing on relevant questions. In this manner, the bot's responses become more precise and more in line with reality.

Figure 1. An overview of the proposed framework (image sensor, sound sensor, and robot action).
Figure 2. The concept of the proposed framework.

This paper can be split into two major areas. The first is picture scenario analysis, and the second is natural language processing. The system framework is presented in Figure 3. When the entire system is operating, we consider two inputs: image data and voice inputs. For the image inputs, we first perform feature extraction through the VGG model. This reduces the dimensionality of the image data, avoiding training problems resulting from overly large input dimensions. Following the feature extraction, we then use the image caption model to generate a paragraph of text that matches the image data. For the voice inputs, we must first convert the sound signals into text messages through voice-to-text conversion technologies. Then the chatbot model must generate appropriate responses based on the text messages converted from the user's voice and the text information about the pictures.
In this context, the output data type is text. Therefore, for the user to be able to hear the chatbot's response, we use text-to-voice conversion technologies to convert the text message into sounds, either to send to the user or to enable the bot to perform a corresponding action. In this manner, the whole system framework not only considers signals from sound, it also incorporates considerations of images. Therefore, the proposed method improves chatbot affinity by enabling chatbots to generate responses that are more in line with reality and to interact better with humans.

Figure 3. The architecture of the system (image sensor → VGG model → image caption model; sound sensor → speech to text → chatbot model → response → text to speech → robot action).

To validate the method proposed in this paper, we applied the trained models with a humanoid robot, then verified the efficacy of the method through a series of experiments. Humanoid robots are the closest robot type to humans and are therefore suitable for chatting and companionship purposes; therefore, a humanoid robot was chosen for the experiments in this paper. The humanoid robot used in this paper is presented in Figure 4; the humanoid robot is 51 cm tall and weighs approximately 3.5 kg.
It has 20 degrees of freedom and has an inbuilt Logitech C920 HD Pro Webcam and microphone. The robot is powered by XM430 motors and equipped with an Intel Core i3 processor and an OpenCR controller. Its inertial measurement unit comprises a three-axis gyroscope, a three-axis acceleration sensor, and a three-axis magnetometer. The operating system is Ubuntu.

Figure 4. The architecture of the humanoid robot: (a) the front view; and (b) the side view (labeled parts: webcam, microphone).
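The end-to-end loop of Figure 3 can be summarized in a few lines of code. The sketch below is illustrative only: the speech_recognition and pyttsx3 packages are assumed stand-ins for the speech-to-text and text-to-speech stages, and caption() and respond() are hypothetical wrappers around the image caption and chatbot models introduced in Section 3.

```python
# A minimal sketch of the Figure 3 pipeline, under the assumptions stated above.
import cv2                       # image sensor access
import pyttsx3                   # text-to-speech stand-in
import speech_recognition as sr  # speech-to-text stand-in

recognizer, tts, camera = sr.Recognizer(), pyttsx3.init(), cv2.VideoCapture(0)

while True:
    with sr.Microphone() as mic:                # sound sensor
        heard = recognizer.recognize_google(recognizer.listen(mic))
    ok, frame = camera.read()                   # image sensor
    scene_text = caption(frame) if ok else ""   # image caption model (Section 3)
    reply = respond(heard, extra=scene_text)    # chatbot model (Section 3)
    tts.say(reply)                              # speak the generated response
    tts.runAndWait()
```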
3. The Proposed Model

This section introduces the framework of the proposed model in detail. Because most of the data processed in this work was time-series data, RNNs constituted an appropriate framework. The most well-known RNN framework is the commonly used LSTM. The literature currently features many LSTM variations, such as the GRU framework. The GRU framework is simpler than the LSTM framework, but in actual operation the GRU can achieve results that are equal to (if not better than) the results of LSTM. Also, because the GRU framework is simpler, under normal conditions GRU has better training efficiency than LSTM. To compare the two RNN models, we modified and enhanced the original LSTM RNN in the hope of achieving better convergence efficiency than the original neural network.

Many studies have performed in-depth explorations of chatbots, and this paper takes the framework from reference [16] as the basis for improvements. In the original version, an adversarial learning method for generative conversational agents (GCA) is applied for the chatbot model. Such a learning process originally belonged to the sequence-to-sequence training regimes. However, such an approach can also transfer to an end-to-end training process by applying the specific structure. The chatbot model can generate an appropriate sentence by considering the dialogue history. The main core of the GCA is LSTM, and the model also considers the current answer and the previous conversation history. The concept for this paper was also implemented with favorable results in the Keras framework [17], as presented in [18].
The Keras framework is a deep learning library that includes many application programming interfaces (APIs) for building the corresponding models, and it is one of the most widely used deep learning frameworks. In [18], Keras is used for the implementation of the chatbot model. Building on the previous work in [18], this study improves the model structure and achieves better performance than the original version. The detailed framework of the chatbot is exhibited in Figure 5; Figure 5a presents the original LSTM version [18], whereas Figure 5b displays the GRU version that was optimized in this study. In addition to the LSTM-to-GRU transformation, the differences between the two versions include the addition of two dense (fully connected) layers in the new version. The parameters are listed in Table 1. We also added dropout to prevent overfitting and used Adam as the optimizer and categorical cross entropy as the loss function. A more detailed comparison of the two versions is laid out in the next section.
Figure 5. Architecture of the chatbot: (a) LSTM version; and (b) enhanced GRU version.

Table 1. Parameter settings of the chatbot.

Parameter      Value
Dropout Rate   0.25
GRU Cell       300
Optimizer      Adam
Loss Function  Categorical Cross Entropy

The chatbot model proposed in this study was based mainly on the concept of a generative conversational agent (GCA) [16]. Records of a conversation with a chatbot can be stored using the $\bar{x}$ vectors and one-hot encoding. Accordingly, the following equations are defined:

$X = [\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_{S_s}]$ (1)

$Y = [\bar{y}_1, \bar{y}_2, \bar{y}_3, \ldots, \bar{y}_{S_s}]$ (2)

$E_c = W_e X$ (3)

$E_a = W_e Y$ (4)

where $W_e$ represents the embedding layer parameters, $E_c$ represents the corresponding conversation record, and $E_a$ represents the unfinished response. An RNN is then used to extract the embedding vectors of the conversation record and response, as illustrated in the following equations:

$e_c = \Gamma_c(E_c; W_c)$ (5)

$e_a = \Gamma_a(E_a; W_a)$ (6)

$e = [e_c \; e_a]$ (7)

$h = \sigma(W_1 e + b_1)$ (8)

$P = \varphi(W_2 h + b_2)$ (9)

where $W_1$ and $W_2$ are synaptic weights, $b_1$ and $b_2$ are biases, $\sigma$ is the Rectified Linear Unit (ReLU), and $\varphi$ is the Softmax activation function. Finally, the index with the largest probability value is identified from $P$ before the next iteration is started. Accordingly, the most suitable response can be output.
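To make the structure of Figure 5b concrete, the following Keras sketch rebuilds the enhanced GRU chatbot. It is a minimal reconstruction that assumes the tensor sizes legible in the figure (sequence length 50, embedding dimension 100, vocabulary of 7000 words) together with the Table 1 settings; the authors' exact code may differ.

```python
# A hedged Keras sketch of the enhanced GRU chatbot in Figure 5b.
from tensorflow.keras.layers import (Input, Embedding, GRU, Dense,
                                     Dropout, Concatenate)
from tensorflow.keras.models import Model

SEQ_LEN, EMB_DIM, VOCAB = 50, 100, 7000     # sizes read from Figure 5b

context_in = Input(shape=(SEQ_LEN,))        # previous sentence from the other party
answer_in = Input(shape=(SEQ_LEN,))         # the unfinished response so far

embed = Embedding(VOCAB, EMB_DIM)           # shared token embedding, Equations (3)-(4)
enc_c = GRU(300)(embed(context_in))         # 300 GRU cells (Table 1), Equation (5)
enc_a = GRU(300)(embed(answer_in))          # Equation (6)

# The two dense (fully connected) layers added in the enhanced version
dense_c = Dense(3500, activation="relu")(Dropout(0.25)(enc_c))
dense_a = Dense(3500, activation="relu")(Dropout(0.25)(enc_a))

merged = Concatenate()([dense_c, dense_a])              # Equation (7)
next_word = Dense(VOCAB, activation="softmax")(merged)  # Equation (9): P

chat_model = Model([context_in, answer_in], next_word)
chat_model.compile(optimizer="adam", loss="categorical_crossentropy")
```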
The operational method of the chatbot used in this paper is displayed in Figure 6. The figure displays the two input sources of the chatbot. The left-hand input source is the previous sentence spoken by the other party, and the right-hand input source is a collection of previous output results. Because of the unique advancements of the proposed training, the chatbot model can notably understand and analyze the contextual relationships of the sentences. Finally, we export the generated dialogue word-by-word in sequence, then assemble the words into a complete sentence. When the chatbot expresses the output results for the user to hear, the user may respond a second time based on the answer just received. Next, in addition to the second response from the user, the chatbot's own previous response is added to the chatbot's left-hand input. This additional input is exceedingly helpful to the chatbot's ability to understand the contextual meaning. Furthermore, this additional input can be substituted with the output text produced from the image data. This allows the chatbot to become familiar with the present image data, enabling the chatbot to assess the image content and generate the best response. Through repeated training, the chatbot model can fully grasp the relationship of the conversation with the user; in this manner, the chatbot can become more fluent in conversations with people, narrowing the gap between chatbots' and humans' communication skills.

Figure 6. The workflow of the designed chatbot model (example: the context "How was the weather?" yields, word by word, "It", "is", "hot"; the extra information slot takes the previous sentence or the image caption).
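The word-by-word loop in Figure 6 amounts to greedy decoding: at each step the model receives the context and the partial answer, and the index with the largest probability in P (Equation (9)) selects the next word. The sketch below is an assumed implementation; the word2idx/idx2word tables and the endseq end-of-sentence token are illustrative, not taken from the paper.

```python
# A hedged sketch of the Figure 6 response loop (greedy decoding).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def respond(context_text, chat_model, word2idx, idx2word,
            max_len=50, max_words=30):
    context = pad_sequences([[word2idx.get(w, 0)
                              for w in context_text.lower().split()]],
                            maxlen=max_len)
    answer_ids = []                              # "previous output" starts empty
    for _ in range(max_words):
        prev = pad_sequences([answer_ids], maxlen=max_len)
        probs = chat_model.predict([context, prev])[0]
        next_id = int(np.argmax(probs))          # largest probability index in P
        if idx2word.get(next_id, "endseq") == "endseq":
            break                                # assumed end-of-sentence token
        answer_ids.append(next_id)
    return " ".join(idx2word[i] for i in answer_ids)
```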
Image caption technology has been a topic of in-depth exploration in many papers, and this paper builds on the framework in references [19,20]. The concept discussed in this paper was also implemented with favorable results in the Keras framework [17], as demonstrated in [21]. A detailed framework of our image caption model is presented in Figure 7, with Figure 7a as the original LSTM version [21] and Figure 7b as the optimized GRU version used in this paper. The original LSTM framework was exchanged for a GRU framework to reduce the number of training parameters; the concatenation layer that had originally been in the middle was implemented through a tensor adding method. The improved approach, in addition to being able to consider input data from both sides, can also eliminate half of the parameters for this layer and improve training efficiency. Furthermore, batch normalization (BN) techniques were added to the model; BN can adaptively adjust the value range of the input data, reducing situations in which the activation function cannot be calculated as a result of the value of the previous layer being too great or too small. The parameters of the image caption model discussed in this paper can be found in Table 2.
We also added dropout to prevent overfitting, and we chose Adam as the optimizer and categorical cross entropy as the loss function. The next section compares these two versions in greater detail.

Figure 7. Architecture of the image caption: (a) LSTM version; and (b) enhanced GRU version.

Table 2. Parameter settings of the image caption model.

Parameter      Value
Dropout Rate   0.5
GRU Cell       500
Optimizer      Adam
Loss Function  Categorical Cross Entropy
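The layer shapes in Figure 7b are legible enough to reconstruct the enhanced caption model: a 4096-dimensional VGG16 feature vector is projected to 256 dimensions and repeated across the 34 caption time steps, added (rather than concatenated) to the embedded partial caption, batch-normalized, and passed through a 500-cell GRU. The sketch below is a minimal reconstruction under those assumptions, not the authors' exact code.

```python
# A hedged Keras sketch of the enhanced GRU image caption model (Figure 7b).
from tensorflow.keras.layers import (Input, Dense, Dropout, RepeatVector,
                                     Embedding, Add, BatchNormalization, GRU)
from tensorflow.keras.models import Model

CAP_LEN, VOCAB = 34, 7549                  # sizes read from Figure 7b

img_in = Input(shape=(4096,))              # VGG16 feature vector
img = Dense(256, activation="relu")(Dropout(0.5)(img_in))
img = RepeatVector(CAP_LEN)(img)           # tile the features across time steps

txt_in = Input(shape=(CAP_LEN,))           # partial caption as token indices
txt = Embedding(VOCAB, 256)(txt_in)

merged = Add()([img, txt])                 # tensor addition replaces concatenation
merged = BatchNormalization()(merged)      # keeps activations in a workable range
hidden = GRU(500)(merged)                  # 500 GRU cells (Table 2)
out = Dense(VOCAB, activation="softmax")(hidden)  # next caption word

caption_model = Model([img_in, txt_in], out)
caption_model.compile(optimizer="adam", loss="categorical_crossentropy")
```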
extraction If we Ifinput an entire we input picture an entire into into picture the image caption the image modelmodel caption from thefrombeginning, this may the beginning, thisresult in training may result issuesissues in training from overly from overly large input data. data. large input The useTheofuse theoffeature extraction the feature method extraction reduces method the input reduces dimensions the input dimensions and canandpresent the picture’s can present data data the picture’s in thein most streamlined the most streamlined andandadequate adequate manner manneras as vec- vectors. tors. The Theimage imagecaption captionmodel’s model’sright-hand right-handinputs inputs areare aa series series of of one-hot one-hot vectors vectors that that rep- represent the current sequence of the output text. After repeated iterative processes, the image caption model describes the image in text by generating the text word-by-word to present a complete sentence.
The model training process discussed in this paper is presented in Figure 9. The detailed pseudocode is listed in Algorithm 1. The details are addressed in Appendix A.

Algorithm 1: The procedure of the proposed system.
1. Data preprocessing (image caption)
2. Image caption model initialization
3. for all training epochs
4.   for all image caption training samples
5.     Training process
6.   end for
7. end for
8. Data preprocessing (chatbot)
9. Chatbot model initialization
10. for all training epochs
11.   for all chatbot training samples
12.     Training process
13.   end for
14. end for

The process can be split into two major sections: the first section is the image caption model, and the second is the natural language processing part. In the image caption model, we first perform preprocessing on the data by consolidating and coding all the image data and the labeled text data. Then, after we initialize the model, we conduct model training. In the training process, we must observe whether the convergence conditions have been satisfied; if not, then training must be continued. If the image caption model training is complete, then the process enters its next stage. In the natural language processing part, the first and most crucial task is still the data preprocessing, followed by the initialization of the chatbot model. Next, training of the chatbot model must be performed using the processed data. Similarly, if the chatbot model does not satisfy the convergence conditions, the training must continue. Conversely, if the chatbot model has satisfied the convergence conditions, the training immediately ends. When the convergence conditions are met, the entire training process is complete. However, the image caption model must be trained first because this model plays a key role in incorporating image information in the chatbot input. If the image caption model is inadequately trained, the image information input into the chatbot becomes meaningless and may result in erroneous interpretation. After the training of the image caption model is completed, it can be used to convert image information to textual information, which is then fed to the input end of the chatbot. This conversion approach enables a chatbot model to be constructed without substantial modification. This also allows the model to interpret image information, which is a primary focus of this study. The problem of overfitting was considered during the training process, which was terminated immediately when overfitting occurred.

Figure 9. System flowchart of the training process.
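In Keras, the two training stages of Algorithm 1, including the early stop on overfitting mentioned above, can be expressed with an EarlyStopping callback. The dataset variables below are illustrative placeholders, and the epoch count and patience are assumptions rather than values reported in the paper.

```python
# A hedged sketch of the two-stage training in Algorithm 1 / Figure 9.
from tensorflow.keras.callbacks import EarlyStopping

stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

# Stage 1: the image caption model must converge first, because its output
# text later becomes part of the chatbot input.
caption_model.fit([img_features, caption_seqs], caption_targets,
                  epochs=50, validation_split=0.1, callbacks=[stop])

# Stage 2: train the chatbot on dialogue pairs augmented with caption text.
chat_model.fit([context_seqs, prev_output_seqs], next_word_targets,
               epochs=50, validation_split=0.1, callbacks=[stop])
```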
4. The Experimental Results

The first part of this section is a detailed explanation of the training data. Next, we present training and comparisons based on the models proposed in the previous section. Finally, we describe how the consolidated functions were integrated with the humanoid robot to test the effectiveness of the proposed model and analyze its feasibility.

4.1. The Training Data

The databank for sentence training in this paper consisted of data sets of dialogue from online English classes [18]. Although image caption technology typically uses the COCO data set [23], which involves a substantial amount of data, as the training data, this experiment used the Flickr 8k Dataset [24], which has a smaller data volume, so that the model training results can be used with actual robots that may not have rapid computing abilities. The Flickr 8k Dataset has 8000 color images in total, and each image has five captions (see Figure 10 for the training data). From the images, we can see that each image has a chunk of text that describes the picture; these descriptions are man-made captions. Therefore, each image has various appropriate descriptions. Accordingly, one of the key points of this paper is that image data must be described in text and incorporated into the chatbot's considerations. In this manner, conversations between humans and bots can become more precise and more in line with actual scenarios. Furthermore, in addition to the data mentioned, some new images were also added to the experiment in pursuit of greater realism and a more accurate representation of the experimental scenario.
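Preparing the Flickr 8k captions for training typically means collecting the five captions per image and wrapping each in start and end markers before tokenizing. The sketch below assumes the commonly distributed Flickr8k.token.txt file layout; the startseq/endseq markers carry over from the earlier sketches and are not taken from the paper.

```python
# A hedged sketch of Flickr 8k caption preprocessing.
from collections import defaultdict
from tensorflow.keras.preprocessing.text import Tokenizer

captions = defaultdict(list)
with open("Flickr8k.token.txt") as f:        # assumed layout: "<image>#<n>\t<caption>"
    for line in f:
        img_key, text = line.rstrip("\n").split("\t")
        captions[img_key.split("#")[0]].append(f"startseq {text.lower()} endseq")

tokenizer = Tokenizer()
tokenizer.fit_on_texts([c for caps in captions.values() for c in caps])
vocab_size = len(tokenizer.word_index) + 1   # on the order of the 7549 seen in Figure 7
```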
Figure 10. Training samples in Flickr 8k [24]: (a) caption: "A dog at the beach"; and (b) caption: "A little girl covered in paint sits in front of a painted rainbow with her hands in a bowl."

4.2. The Experimental Results

To discuss the strengths and weaknesses of the proposed model and the original LSTM model in depth, we made the following comparisons.
The chatbot training models are compared in Figure 11, whereas the image caption training models are compared in Figure 12. The figures demonstrate that during the chatbot model training process, the proposed method is faster and more stable and has a lower final training error rate. This also substantiates that the model proposed in this paper outperforms the original and conventional frameworks. During the training process of the image caption model, the performance of the updated model proposed in this study was likewise slightly better. Although the error value sometimes increased slightly during the training process, overall it did not affect convergence. The final convergence results proved that the method proposed in this paper significantly outperformed conventional chatbot models. Conventional chatbot models, even in the later stages of training, exhibit trends of increasingly higher error values. This phenomenon once more substantiates the stability and validity of the method proposed in this paper.

To verify the practicality of the method proposed in this study, we incorporated a humanoid robot system into the following experiment and integrated the trained model into the humanoid robot. The experiment was split into a total of four scenarios; each scenario was matched with a picture to represent the present situation. The bot, based on the situation and user dialogue in the moment, made a corresponding response. Therefore, the bot's inputs were not limited to the user's voice data, but also included image data. The experimental results are depicted in Figures 13–16.

Figure 11. The learning curve of the chatbot model.
Figure 12. The learning curve of the image caption model.

Figure 13. Experimental result in scenario A.
Figure 14. Experimental result in scenario B.

Figure 15. Experimental result in scenario C.
Figure 16. Experimental result in scenario D.

The experiment combined image captioning with natural language processing technology. The robot was expected to give corresponding responses based on related image and sound information. In the experimental procedure, the robot used a microphone to capture sound and converted the sound into text through the speech-to-text (STT) technique. Then, the robot converted the obtained image information into text by using a deep artificial neural network. The sound and image information was converted into textual information for storage and then input to the natural language processing model. The robot was able to give reasonable responses according to the information converted from image and sound. After the robot computed suitable sentences, it used the text-to-speech (TTS) technique to convert the textual information into sound. Thus, the text was read aloud to allow the user to understand the robot's response.

In some instances, the robot's responses were not perfect. These imperfect responses may have arisen because the databank of image data and dialogue data was insufficient and, therefore, the robot was unable to fully consider all circumstances relevant to the required response. In the future, the model may process dialogue records that combine images; the model may then integrate larger and more comprehensive chatbot training databanks to increase the fluency of communications between humans and robots.

5. Discussion

In this experiment, the algorithm developed was integrated with a humanoid robot to facilitate conversation with humans. The robot used its built-in microphone to capture users' voice signals in addition to using the image sensor installed on its head to capture image information in the surrounding environment.
Because the robot used in the experiment features multiple degrees of freedom, it is capable of aligning its sight with that of the user to converse. Moreover, it can turn its head to observe the surrounding environment while conversing. Accordingly, the robot uses the image–text conversion model to incorporate relevant textual information into the chatbot input. The content of the conversation revealed that the response provided by the robot was associated with the image information it received. This indicated that the robot does not merely consider the relationships between the sentences in a dialogue; it also takes the received image information into account.