Deep-Learning-Based Pupil Center Detection and Tracking Technology for Visible-Light Wearable Gaze Tracking Devices
applied sciences (Article)

Deep-Learning-Based Pupil Center Detection and Tracking Technology for Visible-Light Wearable Gaze Tracking Devices

Wei-Liang Ou, Tzu-Ling Kuo, Chin-Chieh Chang and Chih-Peng Fan *

Department of Electrical Engineering, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan; soft96123@gmail.com (W.-L.O.); claire5883claire5883@gmail.com (T.-L.K.); g107064065@mail.nchu.edu.tw (C.-C.C.)
* Correspondence: cpfan@dragon.nchu.edu.tw

Abstract: In this study, for the application of visible-light wearable eye trackers, a pupil tracking methodology based on deep-learning technology is developed. By applying deep-learning object detection technology based on the You Only Look Once (YOLO) model, the proposed pupil tracking method can effectively estimate and predict the center of the pupil in the visible-light mode. By using the developed YOLOv3-tiny-based model to test the pupil tracking performance, the detection accuracy is as high as 80%, and the recall rate is close to 83%. In addition, the average visible-light pupil tracking errors of the proposed YOLO-based deep-learning design are smaller than 2 pixels for the training mode and 5 pixels for the cross-person test, which are much smaller than those of the previous ellipse fitting design without using deep-learning technology under the same visible-light conditions. After combination with the calibration process, the average gaze tracking errors of the proposed YOLOv3-tiny-based pupil tracking models are smaller than 2.9 and 3.5 degrees in the training and testing modes, respectively, and the proposed visible-light wearable gaze tracking system runs at up to 20 frames per second (FPS) on the GPU-based software embedded platform.

Keywords: deep-learning; YOLOv3-tiny; visible-light; pupil tracking; gaze tracker; wearable eye tracker

Citation: Ou, W.-L.; Kuo, T.-L.; Chang, C.-C.; Fan, C.-P. Deep-Learning-Based Pupil Center Detection and Tracking Technology for Visible-Light Wearable Gaze Tracking Devices. Appl. Sci. 2021, 11, 851. https://doi.org/10.3390/app11020851

Received: 29 November 2020; Accepted: 12 January 2021; Published: 18 January 2021

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Recently, wearable gaze tracking devices (also called eye trackers) have started to become more widely used in human-computer interaction (HCI) applications. Gaze tracking devices have been well-utilized to evaluate people's attention deficits, vision, and cognitive information and processes [1–6]. In [4], eye-tracking technology was applied as a methodology to examine hidden cognitive processes. The authors developed a program source code debugging procedure that was inspected by using the eye-tracking method. The eye-tracking-based analysis was elaborated to test the debug procedure and record the major parameters of eye movement. In [5], an eye-movement tracking device was used to observe and analyze a complex cognitive process for C# programming. By evaluating the knowledge level and the eye movement parameters, the test subjects were involved to analyze the readability and intelligibility of the query and method in the Language-Integrated Query (LINQ) declarative query of the C# programming language. In [6], the forms and efficiency of the debugging parts of software development were observed by tracking eye movements with the participation of test subjects.
The routes of gaze of the test subjects were observed continuously during the procedure, and the coherences were decided by the assessment of the recorded metrics.

Many up-to-date consumer products have applied embedded gaze tracking technology in their designs. Through the use of gaze tracking devices, the technology improves the functions and applications of intuitive human-computer interaction. Gaze tracking designs that use an image sensor based on near-infrared (NIR) light were utilized in [7–16]. Some previous eye-tracking designs used high-contrast eye images, which are recorded by
active NIR image sensors. NIR image sensor-based technology can achieve high accuracy when used indoors. However, the NIR-based design is not suitable for use outdoors or in sunlight. Moreover, long-term operation may have the potential to hurt the user's eyes. When eye trackers are considered for general and long-term use, visible-light pupil tracking technology becomes a suitable solution. Due to the characteristics of the selected sensor module, the contrast of the visible-light eye image may be insufficient, and the nearfield eye image may contain some random types of noise. In general, visible-light eye images usually include insufficient contrast and large random noise. In order to overcome this inconvenience, other methods [17–19] have been developed for visible-light-based gaze tracking design. For the previous wearable eye tracking designs in [15,19] with the visible-light mode, the estimation of the pupil position was easily disturbed by light and shadow occlusion, and the prediction error of the pupil position thus becomes serious. Figure 1 illustrates some of the visible-light pupil tracking results obtained by the previous ellipse fitting-based methods in [15,19]. Thus, a large pupil tracking error will lead to inaccuracy of the gaze tracking operation. In [20,21], for indoor applications, the designers developed eye tracking devices based on visible-light image sensors. A previous study [20] performed experiments indoors, but showed that the device could not work normally outdoors. When the wearable eye-tracking device is to be used correctly in all indoor and outdoor environments, a method based on the visible-light image sensor is sufficient.
Figure 1. Visible-light pupil tracking results by the previous ellipse fitting-based method.

Related Works

In recent years, the progress of deep-learning technologies has become very important, especially in the fields of computer vision and pattern recognition. Through deep-learning technology, inference models can be trained to realize real-time object detection and recognition. With the deep-learning-based designs in [22–37], the eye tracking systems could detect the location of eyes or pupils, regardless of whether visible-light or near-infrared image sensors were used. Table 1 gives an overview of the deep-learning-based gaze tracking designs. From the viewpoint of device setup, the deep-learning-based gaze tracking designs in [22–37] can be divided into two different styles: the non-wearable and the wearable types.

For the deep-learning-based non-wearable eye tracking designs, in [22], the authors proposed a deep-learning-based gaze estimation scheme to estimate the gaze direction from a single face image. The proposed gaze estimation design was based on applying multiple convolutional neural networks (CNNs) to learn the model for gaze estimation from the eye images. The proposed design provided accurate gaze estimation for users with different head poses by including the head pose information in the gaze estimation framework. Compared with the previous methods, the developed gaze estimation design increased the accuracy of appearance-based gaze estimation under head pose variations.

In [23], the purpose of the work was to use CNNs to classify eye tracking data. Firstly, a CNN was used to classify two different web interfaces for browsing news data. Secondly, a CNN was utilized to classify the nationalities of users.
In addition, data-preprocessing and feature-engineering techniques were applied. In [24], for emerging consumer electronic products, an accurate and efficient eye gaze estimation design was important, and
the products were required to remain reliable in difficult environments with low power consumption and low hardware cost. In that paper, a new hardware-friendly CNN model with minimal computational requirements was developed and assessed for efficient appearance-based gaze estimation. The proposed CNN model was tested and compared against existing appearance-based CNN approaches, achieving better eye gaze accuracy with large reductions in computational complexity.

In [25], iRiter technology was developed, which can assist paralyzed people to write on a screen by using only their eye movements. iRiter precisely detects the movement of the eye pupil by referring to the reflections of external near-infrared (IR) signals. Then, a deep multilayer perceptron model was used for the calibration process. By the position vector of the pupil area, the coordinates of the pupil were mapped to the given gaze position on the screen.

In [26], the authors proposed a gaze zone estimation design using deep-learning technology. Compared with traditional designs, the developed method did not need a calibration procedure. In the proposed method, a Kinect was used to capture video of a computer user, a Haar cascade classifier was applied to detect the face and eye regions, and data on the eye region were used to estimate the gaze zones on the monitor by the trained CNN. The proposed calibration-free method achieved a high accuracy for human-computer interaction applications.

In [27], which focused on healthcare applications, high accuracy in the semantic segmentation of medical images was important; a key issue for training the CNNs was obtaining large-scale and precisely annotated images, and the authors sought to address the lack of annotated data by using an eye tracking method. The hypothesis was that segmentation masks generated with the help of eye tracking would be very similar to those rendered by hand annotation, and the results demonstrated that the eye tracking method created segmentation masks that were suitable for deep-learning-based semantic segmentation.

In [29], for the conditions of unconstrained gaze estimation, most of the gaze datasets were collected under laboratory conditions, and the previous methods were not evaluated across multiple datasets. In this work, the authors studied key challenges, including the target gaze range, illumination conditions, and facial appearance variation. The study showed that image resolution and the use of both eyes affected gaze estimation performance. Moreover, the authors proposed the first deep appearance-based gaze estimation method, i.e., GazeNet, which improved the state of the art by 22% for the cross-dataset evaluation.

In [30], a subconscious response influenced the estimation of human gaze through factors related to human mental activity. The previous methodologies were based only on the use of gaze statistical modeling and classifiers for assessing images, without referring to a user's interests. In this work, the authors introduced a large-scale annotated gaze dataset, which was suitable for training deep-learning models. Then, a novel deep-learning-based method was used to capture gaze patterns for assessing image objects with respect to the user's preferences. The experiments demonstrated that the proposed method performed effectively by considering key factors related to human gaze behavior.
In [31], the authors proposed a two-step training network, named the Gaze Estimator, to raise the gaze estimation accuracy on mobile devices. In the first step, an eye landmark localization network was trained on the 300W-LP dataset, and in the second step, a gaze estimation model was trained on the GazeCapture dataset. In [28], the network localized the eye precisely in the image, and the CNN was robust when facial expressions and occlusions happened; the inputs of the gaze estimation network were eye images and eye grids.

In [32], since many previous methods were based on a single camera and focused on either gaze point estimation or gaze direction estimation, the authors proposed an effective multitask method for gaze point estimation with multi-view cameras. Specifically, the authors analyzed the close relationship between gaze point estimation and gaze direction estimation, and used a partially shared convolutional neural network to estimate
the gaze direction and gaze point. Moreover, the authors introduced a new multi-view gaze tracking dataset consisting of multi-view eye images of different subjects. In [33], the U2Eyes dataset was introduced; U2Eyes is a binocular database of synthesized images regenerating real gaze tracking scripts with 2D/3D labels. The U2Eyes dataset can be used with machine- and deep-learning technologies for eye tracker designs.

In [34], for appearance-based gaze estimation techniques, using only a single camera for gaze estimation limits the application field to short distances. To overcome this issue, the authors developed a new long-distance gaze estimation design. In the training phase, the Learning-based Single Camera eye tracker (LSC eye tracker) acquired gaze data with a commercial eye tracker, the face appearance images were captured by a long-distance camera, and deep convolutional neural network models were used to learn the mapping from appearance images to gazes. In the application phase, the LSC eye tracker predicted gazes based on the appearance images acquired by the single camera and the trained CNN models with effective accuracy and performance.

In [35], the authors introduced an effective eye-tracking approach using a deep-learning-based method. The location of the face relative to the computer was obtained by detecting color from the infrared LED with OpenCV, and the user's gaze position was inferred by the YOLOv3 model. In [36], the authors developed a low-cost and accurate remote eye-tracking device that used an industrial prototype smartphone with integrated infrared illumination and camera. The proposed design used a 3D gaze-estimation model that enabled accurate point-of-gaze (PoG) estimation with free head and device motion. To accurately determine the input eye features, the design applied convolutional neural networks with a novel center-of-mass output layer. The hybrid method, which used artificial illumination, a 3D gaze-estimation model, and a CNN feature extractor, achieved better accuracy than the existing eye-tracking methods on smartphones.

On the other hand, for the deep-learning-based wearable eye tracking designs, in [28], the eye tracking method with video-oculography (VOG) images needed to provide an accurate localization of the pupil with artifacts and under naturalistic low-light conditions. The authors proposed a fully convolutional neural network (FCNN) for the segmentation of the pupil area that was trained on 3946 hand-annotated VOG images. The FCNN output simultaneously performed pupil center localization, elliptical contour estimation, and blink detection with a single network on commercial workstations with GPU acceleration. Compared with existing methods, the developed FCNN-based pupil segmentation design was accurate and robust for new VOG datasets. In [34], to conquer the obstructions and noise in nearfield eye images, the nearfield visible-light eye tracker design utilized deep-learning-based gaze tracking technologies to solve the problem.

In order to make the pupil tracking estimation more accurate, this work uses the YOLOv3-tiny [38] network-based deep-learning design to detect and track the position of the pupil. The proposed pupil detector is a well-trained deep-learning network model.
Compared with the previous designs, the proposed method reduces the pupil tracking errors caused by light reflection and shadow interference in the visible-light mode, and it also improves the accuracy of the calibration process of the entire eye tracking system. Therefore, the gaze tracking function of the wearable eye tracker becomes robust under visible-light conditions.
Table 1. Overview of the deep-learning-based gaze tracking designs.

| Deep-Learning-Based Methods | Dataset Used | Operational Mode/Setup | Eye/Pupil Tracking Method | Gaze Estimation and Calibration Scheme |
|---|---|---|---|---|
| Sun et al. [22] | UT Multiview | Visible-light mode / Non-wearable | Uses a facial landmarks-based design to locate eye regions | Multiple pose-based and VGG-like CNN models for gaze estimation |
| Lemley et al. [24] | MPII Gaze | Visible-light mode / Non-wearable | CNN-based eye detection | Joint eye-gaze CNN architecture with both eyes for gaze estimation |
| Cha et al. [26] | Self-made dataset | Visible-light mode / Non-wearable | Uses the OpenCV method to extract the face region and eye region | GoogleNetV1-based gaze estimation |
| Zhang et al. [29] | MPII Gaze | Visible-light mode / Non-wearable | Uses face alignment and 3D face model fitting to find eye zones | VGGNet-16-based GazeNet for gaze estimation |
| Lian et al. [32] | ShanghaiTechGaze and MPII Gaze | Visible-light mode / Non-wearable | Uses the ResNet-34 model for eye feature extraction | CNN-based multi-view and multi-task gaze estimation |
| Li et al. [34] | Self-made dataset | Visible-light mode / Non-wearable | Uses the OpenFace CLM framework/Face++/YOLOv3 to extract the facial ROI and landmarks | Infrared-LED-based calibration |
| Rakhmatulin et al. [35] | Self-made dataset | Visible-light mode / Non-wearable | Uses YOLOv3 to detect the eyeball and eye's corners | Infrared-LED-based calibration |
| Brousseau et al. [36] | Self-made dataset | Infrared mode / Non-wearable | Eye-feature locator CNN model | 3D gaze-estimation model-based design |
| Yiu et al. [28] | German center for vertigo and balance disorders | Near-eye infrared mode / Wearable | Uses a fully convolutional neural network for pupil segmentation and ellipse fitting to locate the eye center | 3D eyeball model, marker, and projector-assisted calibration |
| Proposed design | Self-made dataset | Near-eye visible-light mode / Wearable | Uses the YOLOv3-tiny-based lite model to detect the pupil zone | Marker and affine transform-based calibration |

The main contribution of this study is that the pupil location in nearfield visible-light eye images is detected precisely by the YOLOv3-tiny-based deep-learning technology. To our best knowledge from surveying related studies, the novelty of this work is to provide the leading reference design of deep-learning-based pupil tracking technology using nearfield visible-light eye images for the application of wearable gaze tracking devices. Moreover, the other novelties and advantages of the proposed deep-learning-based pupil tracking technology are described as follows:
(1) By using a nearfield visible-light eye image dataset for training, the used deep-learning models achieve real-time and accurate detection of the pupil position in the visible-light mode.
(2) The proposed design detects the pupil object at any eyeball movement condition, which is more effective than traditional image processing methods without deep-learning technologies in the near-eye visible-light mode.
(3) The proposed pupil tracking technology can efficiently overcome the light and shadow interferences in the near-eye visible-light mode, and the detection accuracy of the pupil's
location is higher than that of previous wearable gaze tracking designs without deep-learning technologies.

To enhance the detection accuracy, we used the YOLOv3-tiny-based deep-learning model to detect the pupil's object in the visible-light near-eye images. We also compared its detection results with those of other designs, including methods that do not use deep-learning technologies. The pupil detection performance was evaluated in terms of precision, recall, and pupil tracking errors. With the proposed YOLOv3-tiny-based deep-learning method, the trained YOLO-based deep-learning models achieve sufficient precision with small average pupil tracking errors, which are less than 2 pixels for the training mode and 5 pixels for the cross-person test under the near-eye visible-light conditions.

2. Background of Wearable Eye Tracker Design

Figure 2 depicts the general two-stage operational flow of the wearable gaze tracker with the four-point calibration scheme. The wearable gaze tracking device can use an NIR or visible-light image sensor-based gaze tracking design, which can apply appearance-, model-, or learning-based technology. First, the gaze tracking device estimates and tracks the pupil's center to prepare for the calibration process. For the NIR image sensor-based design, for example, the first-stage computations involve the processes for pupil location, extraction of the pupil's region of interest (ROI), the binarization process with the two-stage scheme in the pupil's ROI, and the fast scheme of ellipse fitting with the binarized pupil's edge points [15]. Then, the pupil's center can be detected effectively.
For the visible-light image sensor-based design, for example, by using the gradients-based iris center location technology [21] or the deep-learning-based methodology [37], the gaze tracking device realizes an effective iris center tracking capability. Next, at the second processing stage, the function and process for calibration and gaze prediction can be done. To increase the accuracy of gaze tracking, the head movement effect can be compensated by the motion measurement device [39]. In the following sections, for the proposed wearable gaze tracking design, the first-stage processing procedure uses the deep-learning-based pupil tracking technology with a visible-light nearfield eye image sensor. Next, the second-stage process achieves the function of calibration for the gaze tracking and prediction work.

Figure 2. The general processing flow of the wearable gaze tracker.

3. Proposed Methods

The proposed wearable eye tracking method is divided into several steps as follows: (1) Collecting the visible-light nearfield eye images in datasets and labeling these eye image datasets; (2) Choosing and designing the deep-learning network architecture for pupil object detection; (3) Training and testing the trained inference model; (4) Using the deep-learning model to infer and detect the pupil's object, and estimating the center coordinate of the pupil box; (5) Following the calibration process and predicting the gaze points. Figure 3 depicts the computational process of the proposed visible-light gaze tracking design based on the YOLO deep-learning model. In the proposed design, a visible-light camera module is used to capture nearfield eye images. The design uses a deep-learning network
based on YOLO to train an inference model to detect the pupil object, and then, the proposed system calculates the pupil center from the detected pupil box for the subsequent calibration and gaze tracking process. Compared with the previous image processing methods for tracking the pupil in the visible-light mode, the proposed method can detect the pupil object at any eyeball position, and it is also not affected by light and shadow interferences in real applications.

Figure 3. The operational flow of the proposed visible-light based wearable gaze tracker.

Figure 4 demonstrates the scheme of the self-made wearable eye tracking device. In the proposed wearable device, both the outward-view scene camera and the nearfield eye camera are required to provide the whole function of the wearable gaze tracker. The proposed design uses deep-learning-based technology to detect and track the pupil zone, estimates the pupil's center by using the detected pupil object, and then uses the coordinate information to perform and predict the gaze points.

Figure 4. The self-made visible-light wearable gaze tracking device.

3.1. Pre-Processing for Pupil Center Estimation and Tracking

In the pre-processing work, the visible-light nearfield eye images are recorded for datasets, and the collected eye image datasets will be labeled for training and testing the trained inference model.
Moreover, we choose several YOLO-based deep-learning models for pupil object detection. By using the YOLO-based deep-learning model to detect the pupil's object, the proposed system can track and estimate the center coordinate of the pupil box for calibration and gaze prediction.
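As a rough illustration of this per-frame flow, the following Python sketch assumes a hypothetical detect_pupil_boxes() helper that wraps the trained YOLOv3-tiny detector and returns candidate boxes with confidences; only the highest-confidence box is kept, and its midpoint is taken as the pupil center.

```python
def pupil_center_from_detections(detections):
    """Pick the most confident pupil box and return its center (cx, cy) in pixels.

    `detections` is a list of (x_min, y_min, width, height, confidence) tuples,
    e.g., the output of a hypothetical detect_pupil_boxes(frame) wrapper around
    the trained YOLOv3-tiny model. Returns None when no pupil is detected
    (for example, during a blink).
    """
    if not detections:
        return None
    x, y, w, h, _ = max(detections, key=lambda d: d[-1])
    return (x + w / 2.0, y + h / 2.0)

# Example with dummy detections for a 640x480 near-eye frame.
boxes = [(300, 220, 28, 26, 0.91), (120, 400, 30, 30, 0.12)]
print(pupil_center_from_detections(boxes))  # -> (314.0, 233.0)
```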
3.1.1. Generation of Training and Validation Datasets

In order to collect nearfield eye images from eight different users for training and testing, each user used a self-made wearable eye tracker to emulate the gaze tracking operations, after which the users rotated their gaze around the screen to capture the nearfield eye images with various eyeball positions. The collected and recorded near-eye images cover all possible eyeball positions. Figure 5 illustrates the nearfield visible-light eye images collected for training and in-person tests. In Figure 5, the numbers of near-eye images from person 1, person 2, person 3, person 4, person 5, person 6, person 7, and person 8 are 849, 987, 734, 826, 977, 896, 886, and 939, respectively. Moreover, the near-eye images captured from person 9 to person 16 are used for the cross-person test. Thus, the nearfield eye images from person 1 to person 8 without closed eyes are selected, of which approximately 900 images per person are used for training and testing. The image datasets consist of a total of 7094 eye patterns. Next, these selected eye images go through a labeling process before training the inference model for pupil object detection and tracking. To increase the number of images in the training phase, the data augmentation process, which randomly changes the angle, saturation, exposure, and hue of the selected image dataset, is enabled by using the YOLO framework. After the 160,000 iterations for training the YOLOv3-tiny-based models, the total number of images used for training is up to 10 million.

Figure 5. Collected datasets of nearfield visible-light eye images.
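The darknet/YOLO framework applies this augmentation internally during training; the snippet below is only a minimal OpenCV-based sketch of the same idea (random rotation plus random saturation, exposure, and hue jitter), with the jitter ranges chosen arbitrarily for illustration rather than taken from the paper.

```python
import cv2
import numpy as np

def augment_eye_image(img, max_angle=10, sat=1.5, exp=1.5, hue=0.05):
    """Randomly rotate and jitter saturation/exposure/hue of a BGR eye image."""
    h, w = img.shape[:2]
    # Random rotation around the image center.
    # Note: rotating the image also moves the labeled pupil box, which must be updated accordingly.
    angle = np.random.uniform(-max_angle, max_angle)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REPLICATE)
    # Random saturation/exposure/hue jitter in HSV space.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= np.random.uniform(1.0 / sat, sat)                       # saturation
    hsv[..., 2] *= np.random.uniform(1.0 / exp, exp)                       # exposure (value)
    hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-hue, hue) * 180) % 180  # hue shift
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```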
Before training the model, the designer must identify the correct image object (or answer) in the image datasets, in order to train the model what to learn and where the correct object is located in the whole image; this is commonly known as the ground truth, so the labeling process must be done before the training process. The position of the pupil center varies in the different nearfield eye images. Figure 6 demonstrates the labeling process for the location of the pupil's ROI object. For the subsequent learning process of the pupil zone detector, the size of the labeled box also affects the detection accuracy of the network architecture. In our experiments, firstly, a labeled box with a size of 10 × 10 pixels was tested, and then, a labeled box with a size ranging from 20 × 20 pixels to 30 × 30 pixels was adopted. The size of the labeled bounding box is closely related to the number of features analyzed by the deep-learning architecture. In the labeling process [40], in order to improve the accuracy of the ground truth, the ellipse fitting method was applied for the pre-positioning of the pupil's center, after which we used a manual method to further correct the coordinates of the pupil center in each selected nearfield eye image.
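For reference, darknet-style YOLO training expects one text file per image containing normalized box coordinates; the sketch below converts a corrected pupil center and a chosen box size (for example, 24 × 24 pixels) into that `class x_center y_center width height` format. The file name and box size here are illustrative assumptions, not values prescribed by the paper.

```python
def yolo_label_line(cx_px, cy_px, box_px, img_w, img_h, class_id=0):
    """Return one darknet-style label line: 'class cx cy w h', all normalized to [0, 1]."""
    return "{} {:.6f} {:.6f} {:.6f} {:.6f}".format(
        class_id, cx_px / img_w, cy_px / img_h, box_px / img_w, box_px / img_h)

# A pupil centered at (322, 239) in a 640x480 near-eye image, with a 24x24-pixel box.
line = yolo_label_line(322, 239, 24, 640, 480)
with open("eye_0001.txt", "w") as f:   # one .txt per image, same basename as the image
    f.write(line + "\n")
print(line)  # -> 0 0.503125 0.497917 0.037500 0.050000
```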
Before the training process, all labeled images (i.e., 7094 eye images) are randomly divided into a training part, a verification part, and a test part, with corresponding ratios of 8:1:1. Therefore, the number of labeled eye images used for training is 5676, and the numbers of labeled images used for verification and for testing are 709 each.

Figure 6. The labeling process for the location of the pupil's ROI.
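A minimal sketch of such an 8:1:1 random split is shown below; the list of image file names and the fixed random seed are illustrative assumptions.

```python
import random

def split_8_1_1(items, seed=0):
    """Shuffle labeled images and split them into train/verification/test parts at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = len(items) // 10                 # 10% for verification
    n_test = len(items) // 10                # 10% for testing
    n_train = len(items) - n_val - n_test    # remaining ~80% for training
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

paths = ["eye_{:04d}.jpg".format(i) for i in range(7094)]  # placeholder file names
train, val, test = split_8_1_1(paths)
print(len(train), len(val), len(test))  # -> 5676 709 709
```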
3.1.2. The Used YOLO-Based Models and the Training/Validation Process

The proposed design uses the YOLOv3-tiny deep-learning model. The reasons for choosing a network model based on YOLOv3-tiny are: (1) the YOLOv3-tiny-based model achieves good detection and recognition performance at the same time; (2) the YOLOv3-tiny-based model is effective for small object detection, and the pupil in the nearfield eye image is also a very small object. In addition, because the number of convolutional layers in YOLOv3-tiny is small, the inference model based on YOLOv3-tiny can achieve real-time operation on the software-based embedded platform.

In the original YOLOv3-tiny model, as shown in Figure 7, each output layer uses three anchor boxes to predict the bounding boxes in a grid cell. However, only one pupil object category needs to be detected in each nearfield eye image, so the number of required anchor boxes can be reduced. Figure 8 illustrates the network architecture of the YOLOv3-tiny model with only one anchor. Besides the reduction of the number of required anchor boxes, since the image features of the detected pupil object are not intricate, the number of convolutional layers can also be reduced. Figure 9 shows the network architecture of the YOLOv3-tiny model with one anchor and a one-way path. In Figure 9, since the number of convolutional layers is reduced, the computational complexity can also be reduced effectively.

Before the labeled eye images are fed into the YOLOv3-tiny-based deep-learning models, the parameters used for training the deep-learning networks must be set properly, e.g., batch, decay, subdivision, momentum, etc., after which the inference models for pupil center tracking will be trained accurately. In our training design for the models, the parameters Width/Height, Channels, Batch, Max_batches, Subdivision, Momentum, Decay, and Learning_rate are set to 416, 3, 64, 500200, 2, 0.9, 0.0005, and 0.001, respectively.

In the architecture of the YOLOv3-tiny model, 13 convolution layers with different filter numbers and pooling layers with different strides are established to allow the YOLOv3-tiny model to have a high enough learning capability for training from images. In addition, the YOLOv3-tiny-based model not only uses the convolutional neural network, but also adds upsampling and concatenation calculation methods. During the learning process, the YOLO model uses the training dataset for training, uses the images in the validation dataset to test the model, observes whether the model is indeed learning correctly, uses the weight back-propagation method to gradually converge the weights, and observes whether the weights converge with the loss function.
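As a side note on the one-anchor modification: in darknet-style YOLO configurations, the convolutional layer directly before each YOLO detection layer has filters = (classes + 5) × anchors_per_scale, so reducing the anchors per scale also shrinks those layers. The check below restates this standard rule together with the training hyperparameters quoted above; it is a bookkeeping sketch, not code from the paper.

```python
# Training hyperparameters reported above, expressed with darknet-style [net] key names.
net_cfg = {"width": 416, "height": 416, "channels": 3, "batch": 64,
           "subdivisions": 2, "momentum": 0.9, "decay": 0.0005,
           "learning_rate": 0.001, "max_batches": 500200}

def yolo_layer_filters(num_classes, anchors_per_scale):
    """Filters of the conv layer feeding a YOLO head: (classes + 5) * anchors."""
    return (num_classes + 5) * anchors_per_scale

print(yolo_layer_filters(1, 3))  # original YOLOv3-tiny head (one pupil class): 18 filters per scale
print(yolo_layer_filters(1, 1))  # one-anchor pupil detector: 6 filters per scale
```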
Figure 7. The network architecture of the original YOLOv3-tiny model.

Figure 8. The network architecture of the YOLOv3-tiny model with one anchor.
Figure 9. The network architecture of the YOLOv3-tiny model with one anchor and a one-way path.

Figure 10 depicts the training and testing processes of the YOLO-based models for tracking pupils. In the detection and classification process of the YOLO model, the grid cell, anchor box, and activation function are used to determine the position of the object bounding box in the image. Therefore, the well-trained model can detect the position of the pupil in the image with the bounding box. The detected pupil box can be compared with the ground truth to generate the intersection over union (IoU), precision, and recall values; we use these values to judge the quality of the model training.

Figure 10. The training and testing processes by the YOLO-based models for tracking the pupil.
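As a reference for how the box-overlap score is computed, the following sketch evaluates the IoU between a detected pupil box and its ground-truth box; boxes are given as (x_min, y_min, width, height) in pixels, a convention chosen here purely for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap rectangle between the two boxes.
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = max(0.0, min(ax + aw, bx + bw) - ix)
    ih = max(0.0, min(ay + ah, by + bh) - iy)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Detected pupil box vs. its labeled ground-truth box.
print(iou((300, 220, 28, 26), (302, 222, 26, 26)))  # -> 0.8
```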
Table 2 presents the confusion matrix used for the performance evaluation of classification. In pattern recognition for information retrieval and classification, recall is referred to as the true positive rate or sensitivity, and precision is referred to as the positive predictive value.

Table 2. Confusion matrix for classification.

| | Actual Positives | Actual Negatives |
|---|---|---|
| Predicted Positives | True Positives (TP) | False Positives (FP) |
| Predicted Negatives | False Negatives (FN) | True Negatives (TN) |
Both precision and recall are based on an understanding and measurement of relevance. Based on Table 2, the definition of precision is expressed as follows:

Precision = TP/(TP + FP), (1)

and recall is defined as follows:

Recall = TP/(TP + FN). (2)

After using the YOLO-based deep-learning model to detect the pupil's ROI object, the midpoint coordinate of the pupil's bounding box can be found, and this midpoint coordinate value is the estimated pupil center point. The estimated pupil center coordinates are evaluated with the Euclidean distance by comparing the estimated coordinates with the ground-truth coordinates, and the tracking errors of the pupil's center are thereby obtained for the performance evaluation. Using the pupil's central coordinates for the coordinate conversion of the calibration, the proposed wearable device can correctly estimate and predict the outward gaze points.
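A small evaluation sketch following Equations (1) and (2) and the Euclidean-distance error is given below; how a detection is counted as a true positive (for example, by an IoU threshold) is left to the evaluation protocol and is not fixed here.

```python
import math

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), as in Equations (1) and (2)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def center_error(pred_center, gt_center):
    """Euclidean distance (in pixels) between predicted and ground-truth pupil centers."""
    return math.hypot(pred_center[0] - gt_center[0], pred_center[1] - gt_center[1])

# Illustrative counts and coordinates only.
print(precision_recall(tp=80, fp=20, fn=17))                   # -> (0.8, ~0.825)
print(round(center_error((314.0, 233.0), (315.2, 231.6)), 2))  # -> 1.84 pixels
```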
3.2. Calibration and Gaze Tracking Processes

Before using the wearable eye tracker to predict the gaze points, the gaze tracking device must undergo a calibration procedure. Because the appearance of each user's eyes and the distance to the near-eye camera are different, the calibration must be done to confirm the coordinate mapping scheme between the pupil's center and the outward scene, after which the wearable gaze tracker can correctly predict the gaze points corresponding to the outward scene. Figure 11 depicts the calibration process used by the proposed wearable gaze tracker. The steps of the applied calibration process are described as follows:

Figure 11. The used calibration process of the proposed wearable gaze tracker.

In the first step, the system collects the correspondence between the center point coordinates of the pupils captured by the nearfield eye camera and the gaze points simulated by the outward camera. Before calibration, in addition to recording the positions of the center points of the pupil movement, the trajectory of the markers captured by the outward camera must also be recorded. Figure 12 shows the trajectory diagram of tracking the pupil's center with the nearfield eye image for calibration. In addition, Figure 13 reveals the corresponding trajectory diagram of tracking the four markers' center points with the outward image for calibration.
Figure 12. Diagram of tracking the pupil's center with the nearfield eye camera for calibration.

In Figure 12, the yellow line indicates the sequence of the movement path when gazing at the markers, and the red rectangle indicates the range of the eye movement. In Figure 13, when the calibration process is executed, the user gazes at the four markers sequentially from ID 1, ID 2, ID 3, to ID 4, and the system records the coordinates of the pupil movement path while recording the gazed coordinates of the outward scene with markers. Then, the proposed system performs the coordinate conversion to achieve the calibration effect.

Figure 13. Diagram of tracking the four markers' center points with the outward camera for calibration. ① When the user gazes at the marker ID1 for calibration. ② When the user gazes at the marker ID2 for calibration. ③ When the user gazes at the marker ID3 for calibration. ④ When the user gazes at the marker ID4 for calibration.

The pseudo-affine transform is used for the coordinate conversion between the pupil and the outward scene. The equation of the pseudo-affine conversion is expressed as follows:

$$\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & a_3 & a_4 \\ a_5 & a_6 & a_7 & a_8 \end{bmatrix} \begin{bmatrix} X_p \\ Y_p \\ X_p Y_p \\ 1 \end{bmatrix}, \quad (3)$$

where (X, Y) is the outward coordinate in the markers' center point coordinate system, and (X_p, Y_p) is the nearfield eye coordinate in the pupils' center point coordinate system.
In (3), the unknown transformation coefficients a_1~a_8 need to be solved from the coordinate values computed and estimated from (X, Y) and (X_p, Y_p). In order to carry out the calculation of the transformation parameters a_1~a_8 for calibration, Equation (4) describes the pseudo-affine transformation with four-point calibration, which is used to solve the transformation coefficients of the coordinate conversion. The equation of the pseudo-affine conversion with four-point calibration, obtained by stacking Equation (3) for the four marker/pupil correspondences, is expressed as follows:

$$\begin{bmatrix} X_1 \\ Y_1 \\ X_2 \\ Y_2 \\ X_3 \\ Y_3 \\ X_4 \\ Y_4 \end{bmatrix} = \begin{bmatrix} X_{p1} & Y_{p1} & X_{p1}Y_{p1} & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & X_{p1} & Y_{p1} & X_{p1}Y_{p1} & 1 \\ X_{p2} & Y_{p2} & X_{p2}Y_{p2} & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & X_{p2} & Y_{p2} & X_{p2}Y_{p2} & 1 \\ X_{p3} & Y_{p3} & X_{p3}Y_{p3} & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & X_{p3} & Y_{p3} & X_{p3}Y_{p3} & 1 \\ X_{p4} & Y_{p4} & X_{p4}Y_{p4} & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & X_{p4} & Y_{p4} & X_{p4}Y_{p4} & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \end{bmatrix}, \quad (4)$$

where (X_i, Y_i) and (X_{pi}, Y_{pi}), i = 1, ..., 4, are the marker center and pupil center coordinates recorded when the user gazes at marker ID i.
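A numerical sketch of this four-point calibration is given below: it builds the 8 × 8 system of Equation (4) from four (pupil center, marker center) pairs, solves for a_1~a_8, and then maps a new pupil center to an outward gaze point with Equation (3). The sample coordinates are made up for illustration.

```python
import numpy as np

def solve_pseudo_affine(pupil_pts, scene_pts):
    """Solve a1..a8 of Eq. (4) from four (Xp, Yp) pupil centers and (X, Y) marker centers."""
    a_mat = np.zeros((8, 8))
    b_vec = np.zeros(8)
    for i, ((xp, yp), (x, y)) in enumerate(zip(pupil_pts, scene_pts)):
        row = [xp, yp, xp * yp, 1.0]
        a_mat[2 * i, 0:4] = row       # X_i equation
        a_mat[2 * i + 1, 4:8] = row   # Y_i equation
        b_vec[2 * i] = x
        b_vec[2 * i + 1] = y
    return np.linalg.solve(a_mat, b_vec)

def map_gaze(coeffs, xp, yp):
    """Apply Eq. (3): map a pupil center (Xp, Yp) to an outward gaze point (X, Y)."""
    basis = np.array([xp, yp, xp * yp, 1.0])
    return float(coeffs[0:4] @ basis), float(coeffs[4:8] @ basis)

# Illustrative calibration pairs: pupil centers vs. the four marker centers (pixels).
pupil = [(250, 180), (390, 185), (395, 300), (255, 305)]
scene = [(100, 80), (540, 80), (540, 400), (100, 400)]
coeffs = solve_pseudo_affine(pupil, scene)
print(map_gaze(coeffs, 320, 240))  # predicted gaze point for a new pupil center
```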