Recognition of Traffic Lights in Live Video Streams on Mobile Devices
Jan Roters, Xiaoyi Jiang, and Kai Rothaus

(J. Roters, X. Jiang, and K. Rothaus are with the Department of Mathematics and Computer Science, University of Münster, Einsteinstrasse 62, 48149 Münster, Germany. E-mail: {jan.roters;xjiang;kai.rothaus}@uni-muenster.de)

Abstract—A mobile computer vision system is presented that helps visually impaired pedestrians cross roads. The system detects pedestrian lights in the environment and gives feedback about the current phase of the crucial light. For this purpose the live video stream of a mobile phone is analyzed in four steps: localization, classification, video analysis, and time-based verification. In particular, the temporal analysis alleviates inherent problems such as occlusions (by vehicles) and falsified colors, and further increases the decision certainty over a period of time. Due to the limited resources of mobile devices, very efficient and precise algorithms have to be developed to ensure the reliability and interactivity of the system. A prototype system was implemented on a Nokia N95 mobile phone and tested in real environments. It was trained to detect German traffic lights. For training and testing the prototype, we generated image and video databases including manually specified ground-truth meta-data. These databases, described in this paper, are publicly available for the research community. A quantitative performance analysis is provided to demonstrate the reliability and interactivity of the prototype system.

I. INTRODUCTION

Sightless people are limited in mobility. In Germany alone, the number of people with visual disabilities increased by about 50% from 1985 to 2007. It is thus more important than ever to develop assistance systems that help visually impaired people participate in everyday life. In this work, a system for mobile devices is presented that helps people with visual impairment cross roads with nearby traffic lights. Since guide dogs are too expensive and pedestrian lights are rarely equipped with acoustic or haptic signals, small mobile devices offer a cheap and handy alternative. Our contacts with an organization of visually impaired people clearly confirmed such a need, which initialized and motivated our work.

Our research has been motivated by two aspects: (1) the demand for cheap and easy-to-use assistance systems that help visually impaired people participate in everyday life, and (2) the possibilities of mobile vision offered by modern mobile computing devices equipped with cameras (e.g. smartphones or PDAs with camera).

Mobile phones are becoming ubiquitous [1]. According to the International Telecommunication Union, mobile subscriptions rose from 1 billion in 2002 to approximately 4.6 billion at the end of 2009. In the Western world there are more mobile phones than inhabitants, and almost every current mobile device has a built-in camera.

Recently, mobile devices have attracted substantial attention in the computer vision and multimedia community and have become an active research field [2]–[4]. Due to the increasing computational power and memory capacity, more and more complex algorithms can run directly on mobile devices.

In this work we present a mobile vision system that detects pedestrian lights in live video streams to help pedestrians with visual impairment cross roads. Thereby, we are faced with several challenges. Pedestrian lights stand on the opposite side of the street in an unknown environment. There can be more than one traffic light, or even distracting lights from surrounding objects. Moreover, general issues typical of real-world applications, such as awkward light and weather conditions, are essential in our system as well. Due to the limited resources of mobile devices, very efficient algorithms have to be developed. The traffic lights have to be recognized in low-resolution, low-quality video streams. For this purpose we define features of the traffic lights and analyze single video frames to obtain the locations of all visible lights in the field of view. Afterwards, we classify them to identify the crucial light, i.e. the light that matters to the user. To increase the detection performance we extend our approach to consecutive frames using video analysis on the live video stream. For these steps we pursue two main objectives:

1) Interactivity: The system should perform fast, so that the user learns within a short time whether it is safe to pass the pedestrian crossing or not.
2) Reliability: A false positive feedback of a green light (i.e. a red traffic light is shown but the user gets a positive feedback to walk) must be avoided under all circumstances.

As a proof of concept, a prototype system was developed on a Nokia N95 mobile phone that is able to give the user feedback within a few seconds in real field tests. We tested the prototype in real environments, i.e. normal situations, different lighting (sunlight and dusk), and awkward weather conditions (rainfall and snowfall).

Several interesting works have been reported in the field of mobile vision for supporting people with visual disabilities. For instance, Liu [5] presented a currency reader that can identify the value of U.S. paper currency. Wachenfeld et al. [6] used a mobile phone to read barcodes and to obtain related additional product information from the internet. A system for helping blind people choose clothes is presented in [7]. The prototype reported in [8] is a machine that reads a document to a person with visual impairment and
responds to voice commands for control. Many such works have been reported at the conference series on Computers and Accessibility (ASSETS), Computers Helping People with Special Needs (ICCHP), Human-Computer Interaction (HCI), and other relevant conferences.

An early system that detects traffic lights was presented in 2004 by Aranda and Mares [9]. It used a portable PC in a backpack equipped with a digital camera and a pair of earphones. The mobile system 'Crosswatch' [10] helps pedestrians at traffic intersections with zebra crossings orient themselves in the correct direction. In that work a prototype was developed that runs at interactive frame rates on a Nokia camera phone. There also exist options for maintaining the mobility of sightless people in indoor environments. In [11] a mobile navigation system is presented which uses special color markers to guide the user in a prepared indoor environment. Traffic light detection is not only helpful for pedestrians but also an important task for driver assistance systems. In [12] traffic light detection is used for estimating crossroads to give the driver additional guidance information. Another vision-based traffic light detection system is presented in [13], where, however, a static camera position is assumed to simplify the detection approach.

This paper extends our previous work [14] on single-frame analysis to video analysis and time-based verification. To the best of our knowledge, this work is the first prototype of traffic light analysis reported in the literature that works on a mobile phone and thus has the potential of being used by visually impaired people. In addition to the overall system architecture, we need to carefully design all system components to cope with the very limited resources of current mobile phones on the one hand and to achieve sufficient performance in both accuracy and time on the other. In particular, the video analysis and time-based verification introduced in this paper substantially improve the performance of traffic light identification.

The remainder of this paper is organized as follows. In Section II, we concretize the external restrictions and specify the challenges and problems of real-world conditions. Furthermore, we define the design of traffic lights that should be detected. The system architecture is presented in Section III. We discuss the mobile device used for the prototype and give further information about the image/video databases, which we generated for training and testing purposes and made publicly available. Thereafter, the four steps of the algorithm are described: localization (Sec. IV) and classification (Sec. V) on single frames of the video stream, the extension to video analysis (Sec. VI), and time-based verification (Sec. VII). In Section VIII we give example results and a quantitative performance assessment of our traffic light detection approach. Possible extensions are outlined and conclusions are given in Section IX.

II. PROBLEM SPECIFICATION

In this section we discuss how to use the program on the mobile device to get traffic lights into the field of view. Furthermore, the problems and restrictions one faces when detecting pedestrian lights are discussed. Also, the German pedestrian lights used in our design and field tests are specified.

Fig. 1. Program usage: pedestrian standing at the traffic light pole holding the mobile device in an upright position

A. Program Usage

To use the program we assume that the pedestrian knows the path to walk. For instance, these paths are trained by orientation and mobility specialists to improve the skills needed to walk independently with a white cane and to stay aware of the surroundings and one's orientation.

At traffic intersections the orientation and mobility specialists usually teach their clients the locations of the traffic lights with acoustic or haptic signals. Nevertheless, without those attached signals the pedestrian has to know the location of the traffic light pole on his side of the street. To use the traffic light detection the pedestrian has to be in range of the traffic light pole (see Fig. 1).

To get the traffic light into the field of view, the mobile device is held in an upright position in the approximate direction of the traffic light on the other side of the street. Since the user only knows an approximate direction, the mobile device may be panned slowly a few degrees left or right until the device tells the user that a traffic light has been detected. Furthermore, the user may take a few steps left or right to get another perspective.

An example of the program usage is shown in a demonstration video on the authors' website at http://cvpr.uni-muenster.de/research/pedestrianlights.

B. Real World Conditions

The development of a mobile vision system that detects pedestrian lights very accurately is challenging due to several real-world conditions. The chosen mobile capture device limits the possibilities of computer vision algorithms in several aspects:

1) The resolution of the capture device is relatively low.
2) Mobile devices often provide only poor image quality, e.g. falsified colors and unsharp images due to automatic white balance and auto focus.
3) Computation power and memory resources are restricted.

Not only the capture device, but also the objects to be captured impose restrictions by design and location:

4) Pedestrian lights have different appearances in different countries and even for different manufacturers.
5) The distance to the pedestrian lights varies between approximately 4 and 24 meters. Therefore, the scale of a traffic light in an image is very small (see Fig. 2(a) and (b)).
6) There may be many traffic lights in the image but only one is crucial (see Fig. 2(c) and (d)).

Sight and light conditions may complicate the traffic light detection:

7) Traffic lights can be temporarily occluded by vehicles (see Fig. 2(e)).
8) Traffic lights can be hardly visible in bad weather situations like fog, heavy rain, or snowfall.
9) The illumination condition varies between night and daylight. Thus, the captured colors of one traffic light depend on the capture time (see Fig. 3).

Finally, the user of the system could hold the mobile capture device in an unfavorable position:

10) The image could have been captured with a non-negligible rotation (see Fig. 2(f)).

Fig. 2. Challenges of detecting traffic lights in images: (a) minimal distance, (b) maximal distance, (c+d) two traffic lights, (e) occlusion, and (f) rotation (from [14])

Fig. 3. Difficult illumination: (a) dusk, (b) frontlighting, and (c) night

Problems related to camera failure and awkward ambient light situations will not be discussed, as they are beyond the scope of this work. Furthermore, awkward weather conditions are excluded.

C. Specification of Pedestrian Lights in Germany

Due to the different appearances of pedestrian lights in different countries or, to some extent, even cities, we restrict our system to one chosen pedestrian light design. It should be possible to adapt to other designs and to choose the correct pedestrian light recognition system according to the GPS signal of the mobile phone. The necessary adaptations will be discussed in Section IX.

Fig. 4. Examples of pedestrian light design in Germany: (a) two lights, (b) three lights, (c) three lights and an additional "please wait" sign

Our prototype system was trained to detect pedestrian lights that occur in most German cities (see Fig. 4). For the remainder of this paper the following features of a pedestrian light are assumed to be valid preconditions:

1) Shape: rectangular with an aspect ratio of 1/2, 1/3, or 1/4.
2) Color arrangement: at the bottom there is one green light; at the top/middle there are one or two red lights. At the top there is an optional blinking white light, but in our approach we ignore this light.
3) Circuitry: either the red or the green light is switched on.
4) Background: the majority of the traffic light is dark.
5) Design: the possible shapes of the green or red lights are limited.
6) Installation: mounted on a vertical pole at a height of approximately 2.15 meters and a distance between 4 and 24 meters.

These preconditions can be collected in a compact configuration record, as sketched below.
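For illustration only (the paper publishes no code), the preconditions above can be encoded in a single record. The field names below are our own; the values are those stated in the list:

```python
from dataclasses import dataclass

# Hypothetical encoding of the German pedestrian light preconditions.
@dataclass(frozen=True)
class GermanPedestrianLightSpec:
    aspect_ratios: tuple = (1/2, 1/3, 1/4)  # 1) shape: width/height of the housing
    red_lamps: tuple = (1, 2)               # 2) one or two red lamps at the top/middle
    green_lamps: int = 1                    # 2) one green lamp at the bottom
    exclusive_circuitry: bool = True        # 3) exactly one of red/green is on
    dark_background: bool = True            # 4) housing is predominantly dark
    mounting_height_m: float = 2.15         # 6) lamp height on the pole
    distance_range_m: tuple = (4.0, 24.0)   # 6) distance to the pedestrian

SPEC = GermanPedestrianLightSpec()
```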
III. SYSTEM ARCHITECTURE

Our traffic light detection pipeline (see Fig. 5) consists of two concurrent steps and an additional step that combines them. In the first concurrent step we identify the crucial pedestrian light in the field of view in the most recent frame of the video stream. It consists of localization followed by classification. In the localization step we try to filter out all image regions that may contain traffic lights. The classification step decides which regions contain pedestrian lights and which light is crucial to the user.

Fig. 5. Overview of the traffic light detection pipeline. On the left the input frames of the live video stream are presented. In the middle the single frame analysis (top) and the extension to video analysis (bottom) are shown. At the right both results are compared and a feedback is generated.

The video analysis concurrently computes the crucial light location independently from the traffic light identification. For this purpose the location of the crucial light in the previous frame is used to track its location in the most recent frame.

Time-based verification helps us improve the pedestrian's safety. Since we expect the crucial light locations computed by the different approaches to be similar, this verification step compares the results of the concurrent steps. After a short period of successful comparisons a feedback for the user is generated.

A. Prototype System

As a proof of concept, a prototype system was developed for a Nokia N95 mobile phone. In the community of blind people, Nokia mobile phones are very common due to the large variety of available software, e.g. screen readers, mobile reading and shopping assistants.

The N95 is equipped with a 330 MHz ARM processor and 18 MB of available RAM. A built-in autofocus camera takes photographs with up to 5 MP. This device offers three capture modes:

1) Take photographs (up to 2592 × 1944) automatically when the previous one is finished.
2) Use the video stream with a resolution up to 640 × 480.
3) Take the viewfinder video stream with 320 × 240 resolution. This is the stream that is shown on the display while recording videos or taking pictures.

Due to the preparation time between two photographs, mode 1) cannot provide an interactive facility, even with a low capture resolution. The video stream 2) is encoded in YUV 420 planar format, which has the major drawback that only every fourth pixel carries its own correct color (chrominance) value; only the luminance is stored per pixel. As we will see later, we need the correct color values during the localization (Sec. IV-A). Our work is directly based on the detection of the red and green traffic light colors. Thus, the RGB color model is more intuitive and we use the video stream of the viewfinder 3), which provides this RGB data.

B. Pedestrian Light Databases

We have built up two databases for training and testing purposes. One holds images and the other holds video sequences. Both contain pedestrian crossings with traffic lights and were captured from positions where pedestrians have to wait for a green signal. Both databases are publicly available (at http://cvpr.uni-muenster.de/research/pedestrianlights) for the research community.

A ground truth segmentation was made manually, storing all visible pedestrian lights. Furthermore, the crucial light is marked and the phases of the traffic lights (red or green) are given. In Table I the statistics of the databases are presented. The total number of images is shown, which is divided into the number of images with a crucial red and green light, respectively, and the images without a crucial light. The number of images without a crucial light is composed of the images without any traffic light and the images with at least one traffic light, but without a crucial one. Furthermore, in each database there are images with a dangerous constellation, i.e. a crucial red light and an additional green light.

The video database is made up of 14 image sequences. Each sequence represents a video stream with approximately 8 frames per second and between 99 and 853 images.

A skeletal view of the per-frame control flow of this pipeline is sketched below.
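The following minimal Python skeleton mirrors the pipeline of Fig. 5. It is our own illustration, not the authors' code: the function names are hypothetical, and the stubs only mark where the components of Sections IV–VII plug in.

```python
from collections import deque

def localize(frame):                 # Sec. IV: color/size/background filters
    return []                        # -> list of candidate boxes

def classify(candidates):            # Sec. V: pick the crucial light
    return None                      # -> (box, color) or None

def track(prev_frame, frame, box):   # Sec. VI: motion estimation
    return None                      # -> predicted box or None

def process_stream(frames, queue_size=10):
    states = deque(maxlen=queue_size)            # the "state queue" of Sec. VII
    prev_frame, prev_box = None, None
    for frame in frames:
        identified = classify(localize(frame))
        tracked = None
        if prev_frame is not None and prev_box is not None:
            tracked = track(prev_frame, frame, prev_box)
        states.append((identified, tracked))     # Sec. VII compares both results
        prev_frame = frame
        if identified is not None:
            prev_box = identified[0]
```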
IV. LOCALIZATION OF PEDESTRIAN LIGHTS

The localization approach presented in this section can be considered as filter and refinement operations on single frames of the video stream. As mentioned before, traffic lights have specific features (i.e. shape, arrangement, circuitry, design, background, installation). All these features could be used in a special filter algorithm to localize traffic light candidates. Although a parallel combination scheme of the used filters can achieve a highly accurate recognition rate, i.e. high reliability, its computational cost would be too high to ensure interactivity. Note that it is much faster to verify whether a feature is valid for a specific candidate than to inspect all possible image regions according to that feature. In this section we thus present an approach to localizing possible traffic lights in low-resolution images with a sequential architecture (see Fig. 6). This architecture provides interactivity, but also high reliability. Furthermore, it is robust against the scale of traffic lights and also against rotation (to some degree).

As a first step of our localization procedure, a red and a green color filter are used (Sec. IV-A). After a connected component analysis we check the size and the circuitry to reduce false positives (Sec. IV-B). In Section IV-C we explain the next step: examination of the background color. The optional last step is a shape-based segmentation of the pedestrian light (see Sec. IV-D). At the end of this section we optimize the parameters of the traffic light localization (see Sec. IV-E) and investigate the rotational robustness (see Sec. IV-F).
TABLE I
Ground truth statistics of the image and the video database

                                          image database   video database
images (total)                                  501             5635
images with red crucial light                   309             3822
images with green crucial light                 184             1675
images without a crucial light                    8              138
images without a crucial light,
  but with another light                          5               20
images without any traffic light                  3              118
images with dangerous constellations              9              127
images with more than one traffic light         165             4262
red lights (total)                              424             6891
green lights (total)                            244             2888

A. Red and Green Color Filter

The most significant feature of traffic lights is the bright color of the lamps. Due to the increased use of LED lights in traffic lights, the color is very specific. In this step we search for such colors in the region of interest, i.e. the limited region when the vertical line filter is applied, or otherwise the whole image. Therefore, the color of each pixel is checked against some filter rules. We use the RGB color space, since it is the default color space on most mobile devices and a conversion to another color space is time-consuming.

Figure 7 shows a plot of red (a) and green (d) traffic light colors, which are extracted from the ground truth. In the following we explain how to establish the color filters for the traffic lights based on the extracted colors in three steps: (1) analyze the color distribution of the ground truth, (2) design fast and valuable parameterized filter rules, (3) optimize the parameters.

(1) Analyze the data: One portion of the red color samples in Figure 7(a) is distributed along the gray axis of the RGB cube (one cluster near black and one cluster along the axis itself). Another is located along the red color, and the rest of the samples is introduced by noise. So we estimate a Gaussian mixture model in 3D with four contributions: a black cluster, a gray cluster, a red cluster, and a noise cluster (see Fig. 7(b)). Since the most significant colors for detecting red lights should be the red colors, we only keep the Gaussian distribution of the red cluster (see Fig. 7(c)).

The green color samples (see Fig. 7(d)) are distributed in three significant portions. Similar to the red distribution, we estimate a Gaussian mixture model in 3D with three contributions. One cluster is near the gray axis of the RGB cube and another cluster contains values with low intensities (see Fig. 7(e)). Only the remaining cluster contains the green colors that occur in the lamps of the traffic lights. Thus, only this cluster of the Gaussian distribution is kept for the green light (see Fig. 7(f)).

(2) Design the filter rules: Here we only discuss the color filter for the red traffic lights; similar filter rules apply for the green traffic lights. The Gaussian distribution of the red cluster is defined by its mean color µ = (0.48, 0.06, 0.07) and the three eigenvectors v1, v2, and v3 corresponding to the eigenvalues λ1 = 0.0590, λ2 = 0.0032, λ3 = 0.0005.

A color c = (r, g, b) is considered a red traffic light color if and only if the following three rules are fulfilled:

  I_red(c) := c · v1 ≥ th_red,1              (1)
  (c − µ) · v2 ≤ th_red,2 · I_red(c)         (2)
  |(c − µ) · v3| ≤ th_red,3                  (3)

This means that the red intensity I_red, which is the distribution along the dominant axis, should be lower bounded (Eq. (1)). Furthermore, the distance to the red intensity axis along v2 should be limited toward the gray diagonal (Eq. (2)). The third rule is motivated by the observation that the distribution along v3 is very tight. More precisely, the distance of c along this direction is thresholded (Eq. (3)). A direct code transcription of these rules follows at the end of this subsection.

The resulting red traffic light region in the RGB cube is wedge-shaped with a missing apex. In Figure 7, examples are shown for the red 7(c) and the green 7(f) color clusters with thresholds th1 = 0.20, th2 = 0.25, and th3 = 0.07.

(3) Optimize parameters: The image database was divided into two disjoint sets, the training set and the validation set. To optimize the parameters we apply the whole localization approach on the training data with different parameter settings and take the best one (see Sec. IV-E).

The responses of the color filters are represented by a binary image, where 1 corresponds to a positive filter result and 0 to a pixel which, according to its color, is not part of a traffic light lamp. As a post-processing step, we apply a morphological closing and compute the connected components.
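Rules (1)–(3) translate directly into code. The sketch below is our reading of the filter; the eigenvectors v1, v2, v3 are not printed in the paper and are therefore passed as arguments, while the default thresholds are the optimized red values from Sec. IV-E:

```python
import numpy as np

MU_RED = np.array([0.48, 0.06, 0.07])  # mean color of the red cluster

def is_red_light_color(c, v1, v2, v3, th1=0.3, th2=0.15, th3=0.028):
    """c: RGB color with channels scaled to [0, 1]; v1..v3: cluster eigenvectors."""
    c = np.asarray(c, dtype=float)
    i_red = c @ v1                          # Eq. (1): intensity along the dominant axis
    if i_red < th1:
        return False
    if (c - MU_RED) @ v2 > th2 * i_red:     # Eq. (2): bound toward the gray diagonal
        return False
    return abs((c - MU_RED) @ v3) <= th3    # Eq. (3): tight third direction
```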
Fig. 6. Sequential combination scheme for localization from left to right: (1) input color image, (2) color filter response in green and red, resp., (3) color regions after pruning, (4) dark filter response in black, search region in blue, initial bounding boxes in light blue, (5) localized traffic lights.

Fig. 7. Red (a) and green (d) traffic light colors from ground truth. Clustering of the red (b) and green (e) samples visualized by the mean colors of the respective cluster. Complete filter for red (c) and green (f) colors.

B. Segmentation using Size and Circuitry

During the last step we have identified pixels that have the desired color to be part of a traffic light lamp. These pixels are already grouped into connected components.

We assume that the crucial traffic light is between 4 and 24 meters away (see Sec. II-B). In our setting, with the small and fixed focal length of the mobile cameras, this range corresponds to a width of the traffic light between 2.5 and 15 pixels. Due to the known possible aspect ratios of 1/2, 1/3, or 1/4 (see Sec. II-C) we also know the possible corresponding heights. These parameters can be utilized to filter out regions that are too small or too large by thresholding the size of the connected components.

Due to the circuitry we know that exclusively the red or the green light is switched on. Connected components featuring both red and green pixels cannot be part of a valid traffic light. Furthermore, vertically neighbored connected components of different colors represent dangerous constellations. Thus, all such candidates are refused.

As a post-processing step we merge two red connected components that are vertically neighbored, since a red light may consist of two lamps. The size of the merged components is again checked against the size constraint.

C. Background Color Filter

The result of the last step is a set of connected components of adequate sizes and colors. We know that the green lamp under a red light is switched off and vice versa. This fact enables us to implement a background filter, which inspects the image region under a red light candidate and above a green one.

In our system we define the search region to have the same size as the connected component it belongs to (half the height if two red components were merged). If there are no dark pixels within this search region, we can refuse the candidate. In our implementation this filter is simply defined as

  I(p) ≤ th_red,dark   or, resp.,   I(p) ≤ th_green,dark      (4)

where I(p) = (R(p) + G(p) + B(p))/3 is the intensity of the pixel p. Furthermore, th_red,dark and th_green,dark are darkness thresholds. The result of this step is a so-called initial bounding box: a box around each traffic light candidate. The candidate is given by the connected components of the color sample and the search region of the background color filter. Both checks are sketched below.
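A minimal sketch of the size check and of the background filter of Eq. (4), under our own assumptions about the data representation (candidate boxes as (x, y, w, h) tuples; images as float RGB arrays in [0, 1]):

```python
import numpy as np

def size_ok(width_px, min_w=2.5, max_w=15.0):
    # Lamp width implied by the 4-24 m distance range at fixed focal length.
    return min_w <= width_px <= max_w

def background_ok(rgb, box, color, th_dark=0.19):
    """rgb: HxWx3 float image in [0, 1]; box: (x, y, w, h) of the lamp component.
    The search region has the size of the component and lies below a red lamp
    (over the switched-off green lamp) or above a green lamp."""
    x, y, w, h = box
    if color == "red":
        region = rgb[y + h : y + 2 * h, x : x + w]
    else:
        region = rgb[max(0, y - h) : y, x : x + w]
    if region.size == 0:
        return False
    intensity = region.mean(axis=2)            # Eq. (4): I(p) = (R + G + B) / 3
    return bool((intensity <= th_dark).any())  # refuse if no dark pixel exists
```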
D. Shape-Based Segmentation

We have already localized possible traffic light candidates by their lamp color, size, arrangement, and background color. In this last step we aim to segment the traffic lights according to their rectangular shapes. Firstly, we assume that the rotation angle of the capture is fairly low (about ±10°). A traffic light region should fulfill the following constraints:

1) Traffic light and background are contained.
2) The aspect ratio is between 1/4 and 1/2.
3) Many pixels (e.g. 80%) are either light or background.
4) The width of the region lies between 2 and 15 pixels.

To ease the computation we consider axis-parallel rectangular regions only. The task can be modeled as an optimization: find the region of maximal size which fulfills all constraints. This optimization is, however, time-consuming, since many possible regions have to be considered for each traffic light. Therefore, in our implementation we use a fast but suboptimal region growing approach. The initial bounding box (see Sec. IV-C) is first simultaneously expanded to the left and the right. We stop if the left or right border consists of too many non-background pixels. After computing the vertical boundaries, we apply an analogous technique to find the top and the bottom of the traffic light.

Even using a suboptimal but fast optimization strategy, this last step decreases the performance so much that an interactive application is impossible on our hardware. Furthermore, the computation of the borders is somewhat non-robust. Since the profit of this segmentation is negligible compared to its computational costs, we abandon the segmentation step. In future settings the segmentation might be profitable; for instance, we need a segmented region for a model-based verification [14]. Therefore, we keep the segmentation as an optional step in our localization pipeline.

E. Parameter Optimization of Traffic Light Localization

In this section the optimization of the parameters of our localization approach is discussed. Our traffic light detection algorithm depends on eight main parameters, four color parameters in each case (red and green light, resp.). These two parameter groups are optimized separately. In our experiments we subsample each parameter into 10 steps, getting 10^4 different parameter settings for each color. With our ground truth, we measure the quality of a setting by counting the number of correctly detected traffic lights (TP), falsely detected traffic lights (FP), and missed traffic lights (FN). A traffic light is detected correctly if the initial bounding box lies completely within the segmented bounding box from the ground truth. For this comparison the segmented box from the ground truth is extended by 2 pixels in each direction to tolerate small deviations.

We have divided the image database into two disjoint sets. The first set (300 images) is used for training. With the remaining 201 images we verify the performance of our approach. In the following we optimize the parameter groups for red and subsequently for green traffic lights using the training set of our ground truth database. Finally, we validate these optimizations on the validation set.

1) Optimize Parameters for Red Traffic Lights: Missing a red sign could cause serious problems. So our optimization criterion is to maximize the precision with a bounded miss rate. Fig. 8(a) shows the performance of the investigated red parameter settings. We demand a recall (also called true positive rate)

  R = TP/(TP + FN)               (5)

of at least 75% and choose the setting with the best precision (also called positive predictive rate or, occasionally, detection rate)

  P = TP/(TP + FP).              (6)

The result of our optimization are the parameters th_red,1 = 0.3, th_red,2 = 0.15, th_red,3 = 0.028, th_red,dark = 0.19. With a recall of 76.0%, a precision of 89.5% is achieved. This optimized performance is visualized as a black asterisk in Fig. 8(a).

2) Optimize Parameters for Green Traffic Lights: The optimization of the green parameter set depends on a bounded precision. The precision equals 100% if and only if we have detected no false green light. We allow at most 1.5% FP (i.e. P ≥ 98.5%) and choose the parameter vector yielding the best recall. Fig. 8(b) shows the performance of the investigated green parameter settings. The best thresholds of the green filter are: th_green,1 = 0.2, th_green,2 = 0.15, th_green,3 = 0.05, th_green,dark = 0.19. With these parameters we achieve a recall of about 85.0% (see the black asterisk in Fig. 8(b)).

Fig. 8. Recall and precision for the localization of (a) red and (b) green traffic lights (from [14])

3) Validate the Localization Results: As mentioned before, the validation set consists of 201 images, which are not used during the parameter optimization. We validate the localization approach with the optimized parameters on all visible traffic lights in the images of the validation set. For all red lights of the validation set we achieved a recall of R = 71.8% and a precision of P = 87%. For the green traffic lights a recall of R = 83.3% and a precision of P = 92.6% were achieved. The true positives, false positives and false negatives are listed in Table II.

Overall, false negative and false positive detections occur for 90 of the 267 traffic lights in the validation set of 201 images, which equals an error of 33.7%. This error seems to be very high. It is mostly caused by very small, undetected traffic lights in the background, which were mostly not the crucial light. To obtain a deeper insight we investigated the crucial light detection error ratio. The error of missing the crucial red traffic light is about 3.8% (5 lights missed out of 132 crucial lights). In comparison, 8 out of 66 crucial green lights have been missed (approx. 12.1%). Consequently, the error of missing the crucial traffic light is considerably lower.
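For reference, Eqs. (5) and (6) and the red selection criterion of Sec. IV-E.1 (recall ≥ 75%, then best precision) can be written down compactly. This sketch assumes the grid-search results are available as (params, TP, FP, FN) tuples, which is our own assumption about the bookkeeping:

```python
def recall(tp, fn):        # Eq. (5), also called true positive rate
    return tp / (tp + fn)

def precision(tp, fp):     # Eq. (6), also called positive predictive rate
    return tp / (tp + fp)

def pick_red_setting(results, min_recall=0.75):
    """results: iterable of (params, tp, fp, fn) measured on the training set.
    Keep settings with recall >= 75% and take the one with best precision."""
    feasible = [(precision(tp, fp), i, params)
                for i, (params, tp, fp, fn) in enumerate(results)
                if recall(tp, fn) >= min_recall]
    return max(feasible)[2] if feasible else None
```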
TABLE II
True positives, false positives and false negatives of the localization step for 177 red and 90 green traffic lights of the validation set consisting of 201 images

                 red    green
true positive    127     75
false positive    19      6
false negative    50     15

F. Rotational Robustness

Experiments on the rotational robustness showed that a rotation angle of ±10° only slightly affects the performance of our approach (see Fig. 9). For these tests we have rotated the images in both directions with linear subsampling. We report the angular range in which the result remains stable.

Including all images (training and validation set) in this test scenario, we can identify 328 (i.e. 77.4%) of the red and 206 (i.e. 84.4%) of the green traffic lights with no rotation. If the images are rotated by at most ±10°, we recognize 254 red and 180 green traffic lights. This means that the localization remains stable for 77.4% of the red and 87.4% of the green lights in comparison to the case with no rotation. There are several reasons why rotation affects the localization result:

1) The search region of the background color filter (see Sec. IV-C) contains more (bright) pixels that do not belong to the traffic light region. This situation appears mostly when the traffic lights are far away and the search region is small.
2) When two red components are merged, the width grows with image rotation, so that the size filter (see Sec. IV-B) may refuse candidates.

Fig. 9. Rotational robustness of the localization approach for red traffic lights (similar for green lights)

V. CLASSIFICATION OF PEDESTRIAN LIGHTS

The localization procedure (Sec. IV) results in a set of traffic light candidates TLC1, ..., TLCk. In this section we discuss how to select the correct candidate (see Fig. 10) in the current frame of the video stream. The features we can use are the position and size of the traffic light candidates in the image. If the segmentation step of the localization pipeline is left out, we use the initial bounding box as segmentation.

Fig. 10. Sequential combination scheme for classification: selection of the crucial traffic light in the image

In this section we describe how to select the traffic light that is crucial for the pedestrian (Sec. V-A). Furthermore, the performance of the identification approach is presented in Section V-B. At the end of this section some example results are shown and discussed (Sec. V-C).

A. Selection of the Crucial Light

By reason of perspective, the important traffic light should be the biggest and highest of all traffic lights in the image. These two simple criteria are used to select the crucial traffic light. More precisely, we report a traffic light candidate TLCi as crucial if all of the following constraints are true (a code sketch of this rule follows at the end of this subsection):

• TLCi is the broadest traffic light.
• TLCi has the smallest distance from the top of the image.
• No other traffic light has a distance from the top of the image similar to TLCi.

For the third point, we consider two traffic lights to have a similar distance from the top of the image if the difference is less than 10 pixels.

The color of such a traffic light TLCi is obvious, since the region contains exactly one type of traffic light color, either red or green. In the case that there exists no TLCi for which all constraints are fulfilled, we have found no pedestrian light.

There can be different failures. The catastrophic error is that a green light is reported during a red phase. Reporting no traffic light or a false red report are errors that reduce the convenience but do not affect the user's safety.
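A direct transcription of the three selection constraints, in our own code, with candidate boxes assumed to be (x, y, w, h) tuples and y measured from the top of the image:

```python
def select_crucial(candidates, min_top_gap=10):
    """Returns the crucial candidate or None if no candidate satisfies all rules."""
    if not candidates:
        return None
    widest = max(candidates, key=lambda b: b[2])
    highest = min(candidates, key=lambda b: b[1])
    if widest is not highest:                 # must be broadest AND closest to top
        return None
    # No other candidate may have a similar distance to the top (< 10 pixels).
    for b in candidates:
        if b is not highest and abs(b[1] - highest[1]) < min_top_gap:
            return None
    return highest
```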
B. Performance of Classification

In Section IV-E we optimized the parameters of the localization based on recall and precision. In this section the performance of identifying the crucial pedestrian light is presented on the training set.

The performance for detecting the crucial traffic light is presented in Fig. 11 using ROC-curves. Here, the true positive rate is plotted against the number of false positives. Furthermore, the standard deviation is visualized by the vertical lines. Our optimized parameter setting (the black asterisk) leads to a stable recognition of the crucial traffic light. As desired, the number of false positives is very small in the case of green light detection. We report in 2 cases a wrong crucial green light (precision of 98.1%) and keep a recall (i.e. true positive rate) of 86.3%. The performance of the red traffic light detection is similar: we classify in 4 cases false red traffic lights (precision of 97.4%) and achieve a recall of 86.3%.
Fig. 11. ROC-curves for detecting the crucial (a) red and (b) green traffic light. The light gray markers represent the performance of each parameter set, the black line the mean values, and the vertical gray lines the standard deviations. The black asterisk is the optimized parameter set.

C. Results of Traffic Light Identification

Our validation set consists of 201 images, which were not used during the parameter optimization. We fixed the parameters and applied the approach to this validation set. For red traffic lights we yield a precision of 96.5% and a recall of 83.3%. The precision for green traffic lights is 98.3% and the recall is 90.8%. We report 5 wrong crucial traffic lights and falsely report no traffic light in 28 of the verification images. This corresponds to an overall miss rate of 16.4%.

Fig. 12 depicts some results produced with our approach to traffic light identification. Thereby, we put a white frame around all traffic light candidates and an additional blue frame around the reported crucial one. In the first two results (a-b), perfect recognitions are presented, even in dark illumination conditions (a) or with a bright traffic light color (b).

However, there are still some limitations, which we present in Figure 12(c-d). If traffic lamps are captured with low saturation (c), the traffic light can be missed. Sometimes the scene is contradictory (d).

Sometimes noisy objects are detected as traffic light candidates (see Fig. 12(e-f)). Objects in trees (e) can be identified as traffic light candidates. Such situations are much more difficult, since the objects may be placed above the crucial traffic light. A template matching could decrease such false positives; currently, template matching is not integrated in our system. Another situation in which an additional template matching step could be helpful is transversely mounted street traffic lights (see Fig. 12(f)).

Some problems (e.g. (d) and (f)) are introduced by a poor perspective angle and can be corrected by changing the viewpoint. This is shown in (g-h). In the next section we discuss an extension to the video stream, which among other things reduces the effect of poor perspective.

VI. VIDEO ANALYSIS

The identification of the crucial traffic light in single images was described in Sections IV and V. The traffic lights in the image were localized and thereafter the crucial light was selected. In this section the traffic light detection is extended from single images to video streams for the following reasons:

1) Temporary Occlusion: Objects that occlude the crucial traffic light are big vehicles. After a few moments these vehicles will have passed the crossing and the detection can be repeated.
2) Falsified Colors: In some situations the automatic illumination correction falsifies the traffic light colors. By moving the camera and repeating the traffic light identification, a result may still be obtained. Even slight movements give the camera the chance to readjust the automated camera settings, like white balance and exposure.
3) Contradictory Scene: Two traffic lights close to each other may be contradictory (see Fig. 2(d)). In such situations no feedback can be given, since a feedback could be very dangerous. By changing the perspective the scene may be resolved so that a decision is possible.
4) Repeating Results: In a video stream the same identification result may repeat. If this happens a few times successively, it increases the certainty that the result is correct.

To make use of the video stream we want to track the crucial light between consecutive frames to improve the performance of the system. The tracking could be used in two different ways: (1) track the crucial traffic light in the following frames to save the computation time of re-localization and re-classification; (2) apply the localization and classification in every frame and, in addition, track the crucial light between two consecutive frames, then compare the two determined positions of the traffic light. Whereas the first approach improves the interactivity, the second improves the reliability. In our system we choose way (2), the time-based verification, since false positive detections must be avoided under all circumstances.

In our setting the distance between the mobile device and the traffic light is at least 4 meters. Since we assume that the user does not change his position very fast and the rotation angle is at most ±10°, we can neglect 3D perspective view changes, scaling, and rotation in the tracking approach. Thus, we only have to deal with translation between two frames and can use motion estimation algorithms to get the position of the crucial light in a new frame.

For the remainder of this section the objective is to estimate a motion vector which defines the translation between two consecutive frames in the proximity of the crucial light. With this vector the location of the crucial light in the new frame is determined easily.
Fig. 12. Results of the localization and classification. The found traffic lights are marked with a white border. An additional blue border marks the crucial light. (a-b) perfect result, the crucial traffic light was located and classified correctly. (c) no traffic light reported due to failure of localization. (d) decision could not be made due to classification. (e-f) noisy objects. Change of perspective with different result between (g) and (h). More results are presented in [14].

Approaches to estimating the motion vector using phase correlation [15] or more complex methods like the determination of optical flow [16] cannot be used interactively on mobile devices due to their high computational burden. To estimate the image difference between two frames, we thus compute feature points in the first frame around the crucial light location and search for corresponding points in the following frame. An applicable algorithm for motion estimation on hand-held devices was presented in [17], in which a multi-resolution scheme is used to search features in the image. However, since in our system another complex algorithm (for traffic light localization and classification) has to work on each frame, an even faster approach is needed. For our purpose we use the KLT tracker [18], since it detects features that are good to track. We reduce the computation time of the tracker by only searching for good features in a small area around the crucial traffic light candidate (30 pixels in each direction). To match the feature points we define the features as small fixed-size areas around the points. These features are searched in the second frame within a specified radius around the initial position of the feature point. We correlate the features using the sum of absolute differences.

In our setting we use several thresholds. To estimate good parameter values we tested different settings. These thresholds were not optimized using the ground truth meta-data, but were set with the following trade-off: we want to have between 5 and 10 frames per second for the whole approach, and the tracking should be subjectively stable. The final parameter set was acquired in live field tests of our prototype system. The size of the small areas around the points is 5 × 5. We search for the 5 best feature points and search within a small radius of 30 pixels in each direction for the matched position. The displacement vectors are the differences between the old positions and the new ones. These displacement vectors are combined into one single displacement vector that describes the image translation. The resulting vector is the mean of all similar displacement vectors if and only if at least 3 displacement vectors have nearly the same values, i.e. a maximum Euclidean difference of 4. Otherwise, if fewer than 3 such vectors have been found, the motion estimation has failed.

With the presented approach and the given thresholds, a stable motion estimation was realized. After this computation we get the location of the crucial traffic light in the recent frame independently of the traffic light identification step (see Fig. 5). The combination of the displacement vectors is sketched below.
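The combination of the per-feature displacement vectors into one translation vector can be sketched as follows; the KLT feature detection and SAD matching themselves are omitted, and the thresholds are the ones quoted above:

```python
import numpy as np

def combine_displacements(vectors, min_support=3, max_dist=4.0):
    """vectors: list of (dx, dy) displacements of the matched feature points.
    Returns the mean of the largest group of mutually similar vectors, or
    None if fewer than 3 similar vectors exist (motion estimation failed)."""
    v = np.asarray(vectors, dtype=float).reshape(-1, 2)
    best = None
    for i in range(len(v)):
        # Vectors within a Euclidean distance of 4 count as "nearly the same".
        mask = np.linalg.norm(v - v[i], axis=1) <= max_dist
        if mask.sum() >= min_support and (best is None or mask.sum() > best.sum()):
            best = mask
    if best is None:
        return None
    return tuple(v[best].mean(axis=0))
```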
VII. TIME-BASED VERIFICATION

In this section we verify the results of the concurrent steps of our traffic light detection system (see Sections IV, V and VI). Thereby, the main focus is the reduction of false positive detections. For this purpose we introduce the state queue, which allows a verification over time.

We have to combine two results: (1) the identified crucial pedestrian light from the localization (Sec. IV) and classification (Sec. V) steps; (2) the tracked traffic light location from the video analysis (Sec. VI). Based on our observations we suppose that the locations match if their distance is less than 5 pixels. Four scenarios are possible:

1) Traffic light identification and video analysis are successful, and the location of the crucial light from the identification step matches the estimated one from the video analysis.
2) Traffic light identification and video analysis are successful, but the locations differ.
3) Video analysis succeeds but traffic light identification fails (i.e. localization or classification).
4) Video analysis fails (i.e. the motion could not be estimated).

These scenarios are mapped to the state queue in the following way. Case 1) is the only positive one. Such results are mapped to a red or green state, depending on the current traffic light phase of the recent frame.

Although the traffic light identification in the recent frame fails, case 3) is not critical, because the motion estimation is successful and the traffic light can possibly be verified in the following frames. These results are represented by a black state.

The remaining two cases are critical and are mapped to a blue state. In case 4) it is impossible to verify the identification result, due to the failed motion estimation, which represents the basis of our verification approach. Thus, we are not sure enough whether the detected pedestrian light is the crucial one. If case 2) occurs, the identification or the tracking detected a false crucial light. For example, the motion estimation may point to the recent crucial traffic light in red, but the identification result may point to a green light in the background.

With these states we can verify the traffic light detection over time. For this purpose we build a queue that stores the states of the last SQsize combination results, called the state queue. It differs from a normal queue in the following way: access to all elements is allowed, and when the queue is full and a new state is pushed, the oldest element is removed automatically.

A feedback to stay (color c = red) or walk (color c = green) is given if and only if the crucial light of color c is identified in the recent video frame and the following conditions are fulfilled:

1) At least SQmin correct traffic light detections with the same color c are required, counted from the last inserted red or green state.
2) These occurrences must not be interrupted by a blue state.
3) Between these occurrences the color c must not switch from red to green or from green to red.

If at least one of the conditions is not fulfilled, no feedback is given and the pedestrian should wait for a feedback. Consequently, black states do not directly influence the state queue. Only when there are more than SQsize − SQmin black states is no feedback given as a result of these states.
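A compact sketch of the state queue and the feedback conditions, as we read the description above (our own implementation; SQ_SIZE and SQ_MIN anticipate the configuration used in Sec. VIII):

```python
from collections import deque

SQ_SIZE, SQ_MIN = 10, 5   # configuration used for the results in Sec. VIII

class StateQueue:
    """States: 'red'/'green' (case 1), 'black' (case 3), 'blue' (cases 2 and 4)."""
    def __init__(self, size=SQ_SIZE):
        self.states = deque(maxlen=size)   # a full queue drops the oldest state

    def push(self, state):
        self.states.append(state)

    def feedback(self, current_color):
        """current_color: crucial light color in the most recent frame, or None."""
        if current_color not in ("red", "green"):
            return None
        count = 0
        for s in reversed(self.states):    # walk back from the newest state
            if s == "blue":                # condition 2: no blue interruption
                return None
            if s in ("red", "green"):
                if s != current_color:     # condition 3: the color must not switch
                    return None
                count += 1                 # condition 1: count matching detections
                if count >= SQ_MIN:
                    return current_color
            # 'black' states neither count nor interrupt
        return None
```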
VIII. RESULTS

In this section we present results of our traffic light recognition system. In particular, we discuss the results of the traffic light identification on single images in comparison to the additional video analysis and time-based verification.

TABLE III
Recall and precision of the whole system for the video database

            red     green
recall      52.4%   55.3%
precision   100%    100%

The state queue for the presented results was configured with SQsize = 10 and SQmin = 5 for the following reasons. On the one hand, a feedback should be given within one second (with optimal detection), because we want to compute at least 5 frames per second. On the other hand, a larger state queue size SQsize may store outdated feedbacks. With SQsize = 10 and the minimum of 5 frames per second, the given feedback is at most one second old.

Example results of the whole traffic light detection approach with applied video analysis and time-based verification are shown in Figures 13, 14 and 15. For visualizing the results, the dark box in the images shows the crucial traffic light that was detected by the localization and classification approach. The blue box represents the result of the motion estimation. Below each image the state queue is shown with its color states, and the feedback appears in the bottom area.

In the introduction of the paper we declared our two main objectives: reliability and interactivity. In the following we discuss the results with regard to these two objectives. A short video that demonstrates the working of our system in a real environment can be found at http://cvpr.uni-muenster.de/research/pedestrianlights.

A. Reliability

As mentioned before, reliability is the most important design criterion for the traffic light localization and classification in single images. We have optimized the parameter values to prevent false positive green light detections under all circumstances and therefore to achieve a high precision, whereby dangerous feedbacks were reduced to a minimum. Moreover, we achieved a high recall for red traffic lights, so that red lights would not be missed.

With the additional video analysis our system reached a better precision for green and red traffic lights, but a lower recall (see Table III). With video analysis and time-based verification we observed 0 false positive feedbacks in 5635 frames, neither for red nor for green traffic lights, i.e. the system responds very safely. The drawback of this reliability improvement is the number of false negative feedbacks, which limits the interactivity (see Sec. VIII-B).

TABLE IV
Comparison of false positive green light detections between video analysis and single image analysis. The bold marked numbers show the improvement of video analysis: 6.9% of the detected green lights were falsely detected with single image analysis and rejected by video analysis. The video analysis did not produce extra error (0%) when the single image analysis correctly detected green lights.

                                        single images        single images
                                        not false positive   false positive
video sequences not false positive          93.1%                6.9%
video sequences false positive               0%                  0%

Table IV shows another representation of the results. The bold printed results present the reliability improvement of the system with and without video analysis and time-based verification. In 93.1% of all images neither the single image analysis nor the video analysis gave a false positive feedback of a green crucial light. In 6.9% the single image analysis falsely detected a green crucial traffic light, whereas the video analysis rejected it. Since the precision is 100%, both remaining values are 0%. This means that the video analysis did not produce extra errors when the single image analysis correctly detected green lights. Moreover, there is no image in which both detected a false candidate.

This power of temporal analysis was not only observed in working with the video database, but also in many additional tests under real conditions using the prototype system. In numerous such field tests we did not observe a single situation where a false positive response was produced.

An example showing very good detection results is given in Figure 13. During the phase switch no feedback is given for SQmin frames.

Figure 14 shows a situation in which the system would have failed without video analysis and verification. In 3 of the 6 frames the system would have given a feedback to walk, although the crucial light is red. Due to the video analysis these false positive responses are prevented.

B. Interactivity

The second main objective of our system is interactivity. We measure the interactivity by the number of feedbacks of the system. With video analysis and time-based verification, feedbacks are given as described in Section VII. Each result of the bare traffic light identification step is interpreted as a feedback in our interactivity results.

The additional steps in our system with temporal analysis potentially reduce the interactivity, since we have to wait for at least SQmin verified frames to give a feedback (see Fig. 13). Furthermore, if the motion estimation fails, the verification starts from scratch.

The interactivity of the system is presented in Table V. There, the number of frames between two consecutive feedbacks is measured. It shows that the overall frame count between two feedbacks is 1.8 on average, with a standard deviation of 9.2 frames, for the system with the additional video analysis and time-based verification. With our assumption of between 5 and 10 frames per second and with a stable traffic light recognition, a feedback is normally given within 2 seconds. The mean interactivity of the bare traffic light identification step is similar with 1.1 frames, but the standard deviation is 18 times smaller. Whereas the whole system normally provides a feedback within 2 seconds, the bare traffic light identification normally gives between 4 and 8 feedbacks per second.

Of the 14 available sequences there are 2 which seem to be outliers (see Table V). Except for sequences 1 and 4 there are small means and standard deviations. Furthermore, the maximum frame count between two feedbacks is less than 38.

Sequences 1 and 4 indicate that there are situations in which our system does not provide an interactive feedback. For 397 frames (sequence 1), i.e. between 40 and 80 seconds, the user would not get a response from the system. This is due to fast changing false positive and false negative detections and, furthermore, due to the reactions of the motion estimation; see Figure 15 for an example of sequence 1. It is important to note that although the interactivity is decreased in this case, no false positive feedbacks are given. This is a correct decision for safety reasons related to our main design criterion.

Figure 14 (seq. 4) shows another case where the video analysis is beneficial. In (b) and (c) false crucial green lights were identified and refused by the video analysis. The false candidate of (c) is even verified once in (d), which is refused by the time-based verification. Without video analysis, 3 false feedbacks would have been given (a-d). Instead, our system decides to remain without feedbacks and to wait for more reliable detections.

IX. DISCUSSION AND CONCLUSIONS

A system was presented for detecting traffic lights for visually impaired pedestrians on a mobile device. As a proof of concept, a prototype designed for German pedestrian lights was developed for a Nokia N95 mobile phone and tested in real environments. It runs at about 5 to 10 frames per second, so that in general a feedback is given within a few seconds. We tested this prototype in several situations, e.g. rainfall, snowfall, dusk, frontlighting, etc. On the one hand we did not observe a false positive feedback, but on the other hand the number of missed traffic lights increased considerably. In our field tests the power consumption turned out not to be a problem: the mobile device with our prototype system was active for about 2 hours without running out of battery.

Several challenges have been tackled: low image quality and resolution, restricted computational power and memory resources, scalability, rotational robustness, temporary occlusion, and the selection of the crucial traffic light. In particular, the temporal analysis turns out to be powerful in enhancing the system performance. Overall, a good trade-off between interactivity and reliability has been achieved.

With enough caution, the presented prototype in its current state would in fact improve the safety of visually impaired pedestrians. For instance, if the user gets the signal to walk, he could signal his intention to walk to the drivers by holding his white cane over the street at a higher angle that can be seen by the drivers. With such a signalization the user would be safer using the prototype than having no information about the phase of the traffic lights.

Working with devices of limited resources like mobile phones is always an art of making compromises between