Human Whistle Detection and Frequency Estimation
Mikael Nilsson, Josef Ström Bartůněk, Jörgen Nordberg, and Ingvar Claesson
Department of Signal Processing, School of Engineering, Blekinge Institute of Technology
Box 520, SE-372 25 Ronneby, Sweden
E-mails: mkn@bth.se, jsb@bth.se, jno@bth.se, icl@bth.se

Abstract

Human whistling could be a way to activate different kinds of devices, for example to turn a light on and off in a smart room. Therefore, in this paper a human whistle detection and frequency estimation system is presented. Further, an investigation of human whistling and a robust non-linear feature extraction is presented. A system that is robust to sensor changes and various noise situations is proposed using these features. Experiments in various noise situations are conducted.

1 Introduction

Most humans have the ability to whistle. Human whistling typically produces signals dominated by a single frequency with a distinct characterization, although harmonics might occur. Whistling is produced by means of a constant airflow from the lungs. The air is moderated by the tongue, lips, teeth or fingers to create turbulence, and the mouth acts as a resonant chamber. Whistling can be considered a simple way of communication between humans, typically to attract attention.

General whistling, human or non-human, is a means of communication in various situations; for example, dolphins whistle and referees whistle in soccer games [8, 11]. Furthermore, due to the characteristics of whistling, there is a close connection to single tone detection [3, 4]. However, human whistling is an underexposed and remarkable area of expression; it is raw and direct. The applicability of human whistling detection can be manifold. It can be used as an aid for people with disabilities to activate alarms, lights or other devices in a smart room. Some work has addressed the usefulness of human whistle [1, 7]. However, to the best of the authors' knowledge, no detailed analysis or description of a digital detection algorithm can be found for human whistling.

In this paper, the human whistle is investigated by time-frequency analysis. Results from the analysis are used to propose a robust feature extraction scheme for whistle detection. To achieve robustness, the non-linear technique called the Successive Mean Quantization Transform (SMQT) [9, 10] is used in this paper. The SMQT has properties that reveal the underlying structure in data; hence it will reduce or remove dissimilarities due to the different sensors used. Experiments are conducted on real signals and the system is investigated under noisy situations.

The paper is organized as follows. In the next section the analysis of human whistling is performed. Section 3 presents a short description of the Successive Mean Quantization Transform (SMQT). Section 4 discusses the proposed feature extraction. Section 5 presents decision rules given the features. In Section 6 experimental results for the proposed whistle detector are highlighted. Conclusions are given in Section 7.

2 Human Whistle Characteristics

Human whistles vary in frequency. Hence it is desirable to find a typical frequency range for human whistling. To do so, a database of different people whistling has been created. The database consists of 20 randomly selected test subjects, which experimental results have shown to be enough for this initial study. The test subjects were told to whistle melodies and to try to achieve as high and as low a whistle frequency as possible. All these recordings were performed in a noise-free environment. Note that the recorded signals are typically non-stationary, since the signal may contain no whistle or different whistle frequencies at different times. The signal potentially containing human whistle is denoted $s(n)$, where $n$ is the discrete time index. The sampling rate used in this paper is 48 kHz.

The aim of this analysis is to find the lower and upper frequency limits for human whistle. In the analysis of human whistle the Power Spectral Density (PSD) is estimated using Welch's method [6]. The mean PSD is calculated using a block size of 512 samples, 50% overlap and a Hamming window, see Fig. 1.

Figure 1. Average Power Spectral Density from whistle database. (Axes: frequency in kHz, 0 to 24; magnitude in dB.)

From Fig. 1, it can be seen that the human whistle is typically located in the range of 500-5000 Hz. Of course, some people might exceed these limits, such as trained whistlers. However, within these limits it is possible for most people to produce whistles. Even if the signal of interest is typically below 5000 Hz, it is still interesting to have information about higher frequencies, since in some signals, e.g. music, whistle-like sounds can occur, but information from the higher frequencies can then avoid such false detections.

Given these limits, two order-100 Hamming window based Finite Impulse Response (FIR) filters are designed [5]: one bandpass filter ($H_{bp}$) and one bandstop filter ($H_{bs}$), both using 500-5000 Hz as passband and stopband respectively, see Fig. 2.

Figure 2. Filters $H_{bp}$ and $H_{bs}$. (Panels: magnitude in dB and phase in radians versus frequency in kHz for each filter.)

3 Description of the SMQT

Part of the calculation of the feature vectors involves using the Successive Mean Quantization Transform (SMQT) [9, 10]. A short description of the SMQT is given here for convenience. Note that the description uses set theory notation.

Let $x$ be a data point and $D(x)$ be a set of $|D(x)| = D$ data points. The value of the data point will be denoted $V(x)$. The SMQT has only one parameter input, the level $L$. The output set from the transform is denoted $M(x)$. The transform of level $L$ from $D(x)$ to $M(x)$ is denoted

$$\mathrm{SMQT}_L : D(x) \rightarrow M(x) \tag{1}$$

The $\mathrm{SMQT}_L$ function can be described by a binary tree where the vertices are Mean Quantization Units (MQUs). An MQU consists of three steps: a mean calculation, a quantization and a split of the input set.

The first step of the MQU finds the mean value of the data, denoted $\overline{V}(x)$, according to

$$\overline{V}(x) = \frac{1}{|D|} \sum_{x \in D} V(x) \tag{2}$$

The second step uses the mean to quantize the values of the data points into $\{0, 1\}$. Let a comparison function be defined as

$$\xi\big(V(y), \overline{V}(x)\big) = \begin{cases} 1, & \text{if } V(y) > \overline{V}(x) \\ 0, & \text{else} \end{cases} \tag{3}$$

where $y \in D$. Further, let $\otimes$ denote concatenation; then

$$U(x) = \bigotimes_{y \in D} \xi\big(V(y), \overline{V}(x)\big) \tag{4}$$

is the mean quantized data set. The set $U(x)$ is the main output from an MQU. The third step splits the input set into two subsets

$$\begin{aligned} D_0(x) &= \{x \mid V(x) \le \overline{V}(x),\ \forall x \in D\} \\ D_1(x) &= \{x \mid V(x) > \overline{V}(x),\ \forall x \in D\} \end{aligned} \tag{5}$$

where $D_0(x)$ propagates left and $D_1(x)$ right in the binary tree, see Fig. 3.

Figure 3. The operation of one Mean Quantization Unit (MQU). (Input $D(x)$; output $U(x)$; subsets $D_0(x)$ and $D_1(x)$ are passed to the next two MQUs.)

The MQU constitutes the main computing unit for the SMQT. The first level transform, $\mathrm{SMQT}_1$, is based on the
output from a single MQU, where $U$ is the output set at the root node. The outputs in the binary tree need extended notation. Let the output set from one MQU in the tree be denoted $U_{(l,n)}$, where $l = 1, 2, \ldots, L$ is the current level and $n = 1, 2, \ldots, 2^{(l-1)}$ is the output number for the MQU at level $l$, see Fig. 4.

Figure 4. The Successive Mean Quantization Transform (SMQT) as a binary tree of Mean Quantization Units (MQUs). (Level 1, weight $2^{L-1}$: $U_{(1,1)}$. Level 2, weight $2^{L-2}$: $U_{(2,1)}$, $U_{(2,2)}$. Level 3, weight $2^{L-3}$: $U_{(3,1)}$ to $U_{(3,4)}$. And so on.)

Weighting of the values of the data points in the $U_{(l,n)}$ sets is performed and the final $\mathrm{SMQT}_L$ is found by adding the results. The weighting is performed by $2^{L-l}$ at each level $l$. Hence, the result for the $\mathrm{SMQT}_L$ can be found as

$$M(x) = \Big\{ x \;\Big|\; V(x) = \sum_{l=1}^{L} \sum_{n=1}^{2^{l-1}} V\big(u_{(l,n)}\big) \cdot 2^{L-l},\ \forall x \in M,\ \forall u_{(l,n)} \in U_{(l,n)} \Big\} \tag{6}$$

As a consequence of this weighting, the number of quantization levels, denoted $Q_L$, for a structure of level $L$ will be $Q_L = 2^L$.

4 Calculation of Feature Vectors

Given the human whistle characteristics, we extract robust features for whistle detection. To extract these features, the signal $s(n)$ will undergo the steps outlined in Fig. 5.

Figure 5. The steps from signal $s(n)$ to the feature vectors $p_{bp}(t)$ and $p_{bs}(t)$. (Blocks: Frame Blocking, SMQT, Normalization, FFT, $H_{bp}$ and $H_{bs}$, $|\cdot|$.)

The first step divides $s(n)$ into non-overlapping blocks of size 512. To extract robust features on these blocks, a Successive Mean Quantization Transform (SMQT) of level $L$ ($\mathrm{SMQT}_L$) is applied to each block, as described in the previous section. In this paper $L = 8$ is used at all times. The SMQT will make the features robust to various sensor changes. It reduces or removes the effect of different microphones, different dynamic range, bias shift and gain shift. The output from the SMQT is in the range $[0 \ldots 255]$ (since $L = 8$). Normalization of the result from the SMQT is performed by

$$\frac{b(t) - 2^{L-1}}{2^{L-1}} \tag{7}$$

where $b(t)$ is the SMQT result from block $t$. This normalization ensures that the values in the result are guaranteed to be in the range $[-1, 1]$.

A 512-point Fast Fourier Transform (FFT) is applied to the normalized result. Further, the bandlimiting filters $H_{bp}(k)$ and $H_{bs}(k)$ are applied to the FFT output, where $k$ denotes the discrete frequency. Finally, the absolute value of the filtered results yields the feature vectors $p_{bp}(t)$ and $p_{bs}(t)$. A truncation to 256 values is performed due to the symmetry of the FFT on real signals. Hence, two feature vectors, $p_{bp}(t)$ and $p_{bs}(t)$, of size 256 are found for every block $t$.

5 Detection and Frequency Estimation

To detect human whistle, the feature vectors $p_{bp}(t)$ and $p_{bs}(t)$ will be examined, see Fig. 5. The following observations are the motivation for the design of the robust whistle detector:

I The largest value in $p_{bp}(t)$ should typically be larger than the mean of $p_{bs}(t)$ in the presence of whistle.

II In the presence of whistle, $p_{bp}(t)$ typically has a few very dominant values.

III If I and II are not fulfilled, there is no distinct whistle tone or it is totally drowned in noise.

IV To avoid noisy single block detections, a smoothing of the detection results should be performed. Such noisy detections could typically occur if music is present in the signal.

In order to implement the above points, it is necessary to extract the following quantities for each block $t$. The maximum value, $\max p_{bp}(t)$, as well as the vector index of the
maximum value, $\arg\max p_{bp}(t)$, are found from $p_{bp}(t)$. Note that the index of the maximum value can be considered to be an estimate of the discrete fundamental frequency index for the whistle, provided a whistle sound is present. Further, the mean value, $\operatorname{mean} p_{bs}(t)$, is calculated from $p_{bs}(t)$. A ratio is found as

$$\gamma(t) = \frac{\max p_{bp}(t)}{\operatorname{mean} p_{bs}(t) + 1} \tag{8}$$

For non-whistle blocks $\gamma(t)$ will have lower values than if whistle is present. A threshold on $\gamma(t)$ will be the first requirement in the decision whether whistle is present or not

$$a(t) = \begin{cases} 1, & \text{if } \gamma(t) > \theta_\gamma \\ 0, & \text{else} \end{cases} \tag{9}$$

In II we consider the transformation

$$\tilde{p}_{bp}(t) = p_{bp}(t) - \min p_{bp}(t) + 1/N \tag{10}$$

which yields a vector $\tilde{p}_{bp}(t)$ in which the smallest value is $1/N$. Furthermore, consider the normalized vector

$$v(t) = \frac{\tilde{p}_{bp}(t)}{\operatorname{sum} \tilde{p}_{bp}(t)} \tag{11}$$

and also the uniform probability vector

$$\bar{v}(t) = \left[ \frac{1}{N}\ \frac{1}{N}\ \ldots\ \frac{1}{N} \right]^{T}_{1 \times N} \tag{12}$$

where $N$ is the size of $p_{bp}(t)$ ($N = 256$ in this case). Note that the vectors found, $v(t)$ and $\bar{v}(t)$, can now be considered to be probability vectors. Clearly, if $v(t)$ and $\bar{v}(t)$ are similar, there is no clear peak in $v(t)$ and the block is not to be considered a whistle block. The obvious question is now how to create a similarity measure. One way is to use the Jensen difference [2, 12]. The Jensen difference is based on the Shannon entropy (the block index $t$ is dropped for simplicity)

$$H(v) = -\sum_{i=1}^{N} v_i \log(v_i) \tag{13}$$

and is defined as

$$J(v, \bar{v}) = H\!\left(\frac{v + \bar{v}}{2}\right) - \frac{1}{2}\big(H(v) + H(\bar{v})\big) \tag{14}$$

The Jensen difference is always nonnegative and becomes zero only if $v = \bar{v}$ [12]. To simplify notation, $J(v(t), \bar{v}(t))$ is denoted by $J(t)$. The decision for accepting a block as a whistle block is made by setting a threshold on the Jensen difference, that is

$$b(t) = \begin{cases} 1, & \text{if } J(t) > \theta_J \\ 0, & \text{else} \end{cases} \tag{15}$$

To summarize I and II, and thereby check the statement in III, the detection function $c(t)$ is found as

$$c(t) = a(t)\,b(t) \tag{16}$$

which implies that both $a(t)$ and $b(t)$ should be one for the whistle decision to be valid. Further, the decision function $c(t)$ will be smoothed over time to make IV valid. This smoothing over the last $D$ blocks is performed by

$$d(t) = \begin{cases} 1, & \text{if } \sum_{\tau = t-D+1}^{t} c(\tau) > D/2 \\ 0, & \text{else} \end{cases} \tag{17}$$

where $d(t)$ constitutes the final whistle detection function.

6 Experiments

The system described in Sections 4 and 5 is implemented and runs in real time on a Pentium 2.13 GHz computer. The parameters $\theta_\gamma = 25$, $\theta_J = 0.45$ and $D = 50$ are experimentally chosen. During the experiments, different background noises have been tested, including music, white noise, babble and car noise, see Fig. 6. In order to evaluate the whistle detector, the actual whistle location, denoted $d_o(t)$, is included in the figure. The detection is typically delayed by a few blocks due to the smoothing according to Eq. (17). Different microphones and sound cards were also used to analyze the performance. The parameters chosen are found to be robust to changes in sensor and platform, mainly due to the use of the SMQT. Some typical false detections occur in the presence of music with long-duration tones in the frequency band 500-5000 Hz (which are very similar to whistle). Single-frequency screams have also been found to cause false detections in some cases.

Hardware consisting of a simple microcontroller, a high voltage relay and numerous analog and digital components was constructed. The hardware acts as an electronic switch controlling the electricity flow in the cable. The switch is controlled by the computer through a COM serial port interface. A video clip, www.asb.tek.bth.se/staff/jsb/whistle/whistle.html, created by the authors demonstrates a live scenario where a lamp is turned on and off by a whistle in the room. Simultaneously, different types of sounds are played in the room, acting as a noise source. This particular hardware can also be used in other applications where various devices need a power supply.
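The chain from Sections 3 to 5 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`smqt`, `features`, `jensen`, `block_decision`, `smooth`) are hypothetical, idealized brick-wall band masks stand in for the order-100 FIR responses $H_{bp}(k)$ and $H_{bs}(k)$, the natural logarithm is assumed in Eq. (13), and the smoothing window is simply shortened for the first $D-1$ blocks.

```python
import numpy as np

FS, N_FFT = 48000, 512  # sampling rate and block size from the text


def smqt(block, L=8):
    """SMQT of level L for a 1-D block; returns integers in [0, 2**L - 1]."""
    block = np.asarray(block, dtype=float)
    out = np.zeros(block.size, dtype=int)

    def mqu(idx, level):
        # One Mean Quantization Unit: mean, quantize, split (Eqs. 2-5).
        if level > L or idx.size == 0:
            return
        above = block[idx] > block[idx].mean()
        out[idx[above]] += 2 ** (L - level)  # weight 2^(L-l), Eq. (6)
        mqu(idx[~above], level + 1)          # D0 propagates left
        mqu(idx[above], level + 1)           # D1 propagates right

    mqu(np.arange(block.size), 1)
    return out


def features(block, L=8):
    """Return (p_bp, p_bs), each of size 256, for one 512-sample block."""
    b = (smqt(block, L) - 2 ** (L - 1)) / 2 ** (L - 1)   # Eq. (7)
    spec = np.abs(np.fft.rfft(b, N_FFT))[:N_FFT // 2]     # keep 256 bins
    freqs = np.arange(N_FFT // 2) * FS / N_FFT
    # ASSUMPTION: ideal 500-5000 Hz masks replace the FIR filter responses.
    in_band = (freqs >= 500) & (freqs <= 5000)
    return spec * in_band, spec * ~in_band


def entropy(v):
    return -np.sum(v * np.log(v))            # Eq. (13), natural log assumed


def jensen(p_bp):
    """Jensen difference between the normalized p_bp and a uniform vector."""
    n = p_bp.size
    p = p_bp - p_bp.min() + 1.0 / n          # Eq. (10)
    v = p / p.sum()                          # Eq. (11)
    u = np.full(n, 1.0 / n)                  # Eq. (12)
    return entropy((v + u) / 2) - 0.5 * (entropy(v) + entropy(u))  # Eq. (14)


def block_decision(block, theta_gamma=25.0, theta_j=0.45):
    """Return (c(t), frequency estimate in Hz) for one block."""
    p_bp, p_bs = features(block)
    gamma = p_bp.max() / (p_bs.mean() + 1)   # Eq. (8)
    a = gamma > theta_gamma                  # Eq. (9)
    b = jensen(p_bp) > theta_j               # Eq. (15)
    k = int(p_bp.argmax())                   # discrete frequency index
    return int(a and b), k * FS / N_FFT      # Eq. (16)


def smooth(c, D=50):
    """Majority-style smoothing over the last D block decisions, Eq. (17)."""
    c = np.asarray(c)
    d = np.zeros_like(c)
    for t in range(len(c)):
        d[t] = int(c[max(0, t - D + 1): t + 1].sum() > D / 2)
    return d
```

Because each MQU only compares values with the mean of its own subset, `smqt(x)` and `smqt(3 * x + 7)` give the same result, which is the gain and bias robustness claimed in Section 4. Feeding `block_decision` a 512-sample block of a 1500 Hz sine places the frequency estimate on bin 16, i.e. 1500 Hz, since the bin spacing is 48000/512 = 93.75 Hz.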
Figure 6. Human whistle signal in various noise situations. (Panels over block index: spectrogram with frequency index $k$ and magnitude in dB; actual whistle location $d_o(t)$; detector output, labeled d(n) in the plot. Noise segments: Ambient/Clean, Car, Babble, White, Music.)

7 Conclusion

Human whistling was investigated from a database with collected whistle sounds. The typical frequency range for human whistle was found to be 500-5000 Hz. Given this knowledge, a feature extraction technique was proposed. The feature vectors were further analyzed to achieve detection and frequency estimation of human whistling. The final system runs in real time and was capable of detecting human whistle during various noise situations.

References

[1] M. Böhlen and J. T. Rinker. Unexpected, Unremarkable, and Ambivalent OR How The Universal Whistling Machine Activates Language Remainders. In Computational Semiotics for Games and New Media, COSIGN2004, 2004.

[2] J. Burbea and C. Rao. On the Convexity of Some Divergence Measures Based on Entropy Functions. IEEE Trans. Information Theory, IT-28(3):489–495, 1982.

[3] Y. Chan, Q. Ma, H. So, and R. Inkol. Evaluation of various FFT methods for single tone detection and frequency estimation. In IEEE 1997 Canadian Conference on, volume 1, pages 211–214, May 1997.

[4] J. Dubnowski, J. French, and L. Rabiner. Tone detection for automatic control of audio tape drives. In IEEE Transactions on Acoustics, Speech, and Signal Processing, volume 24, pages 212–215, June 1976.

[5] J. G. Proakis and D. G. Manolakis. Digital Signal Processing. Prentice-Hall, 1996. ISBN 0-13-394338-9.

[6] M. H. Hayes. Statistical Digital Signal Processing and Modeling. Wiley & Sons Inc., 1996. ISBN 0-471-59431-8.

[7] K. Kanagisawa, A. Ohya, and S. Yuta. An operator interface for an autonomous mobile robot using whistle sound and a source direction detection system. In Proceedings of the 1995 IEEE IECON 21st International Conference on Industrial Electronics, Control, and Instrumentation, volume 2, pages 1118–1123, November 1995.

[8] S. Lefevre, B. Maillard, and N. Vincent. 3 classes segmentation for analysis of football audio sequences. In 14th International Conference on Digital Signal Processing, volume 2, pages 975–978, July 2002.

[9] M. Nilsson, M. Dahl, and I. Claesson. The successive mean quantization transform. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 429–432, March 2005.

[10] M. Nilsson, M. Dahl, and I. Claesson. Gray-scale image enhancement using the SMQT. Accepted and presented at IEEE International Conference on Image Processing (ICIP), Genova 2005.

[11] P. Tyack, W. Williams, and G. Cunningham. Time-frequency fine structure of dolphin whistles. In Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, pages 17–20, October 1992.

[12] R. Vergin and D. O'Shaughnessy. On the Use of Some Divergence Measures in Speaker Recognition. In Proceedings of ICASSP, pages 309–312, 1999.