Assisted video sequences indexing: shot detection and motion analysis based on interest points
Emmanuel Etiévent (corresponding author), Frank Lebourgeois, Jean-Michel Jolion
Laboratoire Reconnaissance de Formes et Vision, INSA Lyon, bât. 403, 20 avenue Albert Einstein, 69621 Villeurbanne cedex
etievent@rfv.insa-lyon.fr

♦ Keywords

Video indexing; motion analysis; interest points; shot cut detection.

♦ Abstract

This work deals with content-based video indexing. It is part of a multidisciplinary project about television archives. We focus on semi-automatic analysis of compressed video, mainly as a means of assisting semantic indexing, i.e. we take into account the interaction between automatic analysis and the operator. First, we have developed such an assistant for shot cut detection, using adaptive thresholding. Then, we have considered the possible applications of motion analysis and moving object detection: assisting moving object indexing, summarising videos, and allowing image and motion queries. We propose an approach based on interest points, specifically with a multiresolution contrast-based detector, and we test different types of interest point detectors. This approach does not require a full spatiotemporal segmentation.
INTRODUCTION

Video indexing consists in describing the content of audiovisual sequences from a video database to allow their retrieval: this concerns television archives, digital libraries, video servers, and digital broadcasting. As with text document indexing, the purpose is to allow content-based retrieval instead of relying only on a bibliographic record. Video content includes, for instance, the characters, objects, dialogues, and specific events occurring in a video. It can be described at two complementary levels:

• The semantic description, which is the level the user understands: it allows concept-based retrieval and usually requires human interpretation. However, some aspects can be assisted by automatic analysis of the video (and of the sound track), like finding the sequence structure (shots and scenes), analysing camera and object motion, detecting and recognising characters, and recognising speech.

• The physical characterisation of images, objects, and also motion: extracting visual features allows retrieval based on visual similarity, by comparing the features. This approach only complements the previous one, because it bears no direct relation to the user's conception, which is based on the semantic level. However, this method is automatic, and it allows the use of an image or a sketch for queries by example: this is useful for searching for a specific object and for exploiting visual features, like shape or texture, which are difficult to describe by semantic means.

The work we present in this paper is part of the Sésame¹ project (Audiovisual Sequences and Multimedia Exploration System). The target users are information officers for the indexing part and, for instance, journalists for the retrieval part. The aim is to use more complete information than the bibliographic records used nowadays for operations like retrieving, browsing, analysing or editing videos. Since we do not restrict the type of videos (films, reports, news, TV programmes), we have to rely on the interpretation capacity of the operator, which is very difficult to model. We work on MPEG-compressed video, which is compulsory for realistic applications. The project involves several complementary fields: knowledge modelling to organise video annotation [Prié 98], databases for storing and querying this complex annotation structure and the image features [Decleir 98], high-performance parallel architectures [Mostéfaoui 97], and semi-automatic image analysis [Lebourgeois 98]. An integration prototype will allow an actual evaluation. We present here two different aspects: shot detection, and a prospective study of motion analysis applied to indexing.

¹ This work is partially supported by France Télécom (through CNET/CCETT), research contract N° 96 ME 17.
1. SHOT DETECTION

A shot is a video segment defined by the continuity of camera shooting. Shots result from video editing operations like cutting, assembling, or introducing transition effects. Shot detection is based on shot transition detection: cuts introduce a discontinuity which can be detected through the discontinuity of an image feature, or of a similarity measure between two frames. We use a classical similarity measure based on the histogram difference on each colour plane, with an adaptive threshold² [Faudemay 97] that takes into account the local variability of the measure within a shot. The threshold computation is based on the standard deviation in a window centred on each point, excluding the considered point itself. Tests show that using a 5+5 frame window does not disturb the thresholding, so short shots are not missed (a half-second shot is possible, for instance, in a film announcement).

² An additional fixed, loose threshold would be useful anyway, to avoid false alarms in shots where the variability is very low and an insignificant perturbation creates a small peak.
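To make this concrete, here is a minimal sketch of the measure and of the adaptive threshold described above, assuming 8-bit RGB frames stored as NumPy arrays; the number of histogram bins and the scale factor k are illustrative choices, not values from the paper.

```python
import numpy as np

def histogram_difference(frame_a, frame_b, bins=64):
    """Sum of absolute histogram differences over the three colour planes,
    normalised by the image size."""
    d = 0.0
    for c in range(3):
        ha, _ = np.histogram(frame_a[..., c], bins=bins, range=(0, 256))
        hb, _ = np.histogram(frame_b[..., c], bins=bins, range=(0, 256))
        d += np.abs(ha - hb).sum()
    return d / frame_a[..., 0].size

def adaptive_threshold(measures, i, half_window=5, k=3.0):
    """Threshold at point i: k times the standard deviation of the measure
    in a 5+5 window centred on i, excluding the considered point itself."""
    lo, hi = max(0, i - half_window), min(len(measures), i + half_window + 1)
    window = [m for j, m in enumerate(measures[lo:hi], start=lo) if j != i]
    return k * np.std(window)

def detect_cuts(frames):
    """Flag a cut between frames i and i+1 when the inter-frame measure
    exceeds its locally adapted threshold."""
    measures = [histogram_difference(a, b) for a, b in zip(frames, frames[1:])]
    return [i for i in range(len(measures))
            if measures[i] > adaptive_threshold(measures, i)]
```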
1.1. Semi-automatic operation

As said before, a specific point of our approach is to involve the operator in validating the results. Only uncertain cases need validation (this is known as rejection in decision theory). The uncertain cases are determined according to a tolerance, which the operator tunes if the default one does not suit the analysed video; see Figure 1.

Figure 1: Thresholding with tunable rejection. [The plot shows the detection probability against the computed threshold: values clearly above the threshold are certain events, values clearly below are non-events, and a tunable tolerance band around the threshold delimits the uncertain cases.]

We consider several constraints:
• Computations and validation are performed asynchronously, to avoid waiting times.
• The default rejection tolerance is set quite high, to avoid missing difficult shot cuts.
• The operator should be able to tune the tolerance at any time, depending on the results.

As a consequence, instead of binary answers, the computation step provides a detection probability, which allows the uncertain cases to be determined with a tunable tolerance at validation time, as in the sketch below. Generally speaking, for other types of semi-automatic analysis, the analysis of the raw results should be independent of the raw computations, so that the operator can play a role in the analysis.
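A small sketch of this decoupling: the computation step stores a soft score per candidate (here assumed to be the ratio of the measure to its adaptive threshold), and the validation step classifies it later with an operator-tunable tolerance, so the tolerance can be changed at any time without recomputing.

```python
def classify(score, tolerance=0.2):
    """score: measure divided by its adaptive threshold (computed earlier).
    tolerance: operator-tunable width of the rejection band of Figure 1."""
    if score > 1.0 + tolerance:
        return "event"        # certain cut, no validation needed
    if score < 1.0 - tolerance:
        return "no event"
    return "uncertain"        # submitted to the operator for validation

# Raising the tolerance widens the uncertain band (fewer missed cuts but
# more validations); lowering it narrows the band.
print([classify(s) for s in (0.3, 0.9, 1.1, 2.4)])
# -> ['no event', 'uncertain', 'uncertain', 'event']
```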
Figure 2: Examples of shot cut detection, with the start frame and end frame of each shot (see the frame numbers). A shot cut at frame 499, within motion, needs validation. (INA archives)

1.2. Comparison of detection methods

Many methods have been developed. The list below gathers results from several works for which numeric results are published (this means they are evaluated on different sequences, and some are based on few data). For each method, we give the missed and false detection rates and the number of transitions in the test data:

• [Yeo 95], image difference with a frame step: gradual transitions (14%) missed, (57%) false, out of 7; cuts 7% missed, 7% false, out of 41; video: 1 MPEG report.
• [Corridoni 95], image ratio: dissolves (20%) missed, (20%) false, out of 4; fades 0 missed, 0 false, out of 29; cuts 3% missed, 3% false, out of 181; videos: films, ads.
• [Joly 94], variation type of individual pixels: gradual transitions (0%) missed, (17%) false, out of 18; cuts (1%) missed, (2%) false, out of 306; videos: films.
• [Zabih 95], edge matching: gradual transitions (0) missed, (27%) false, out of 11; cuts 2.5% missed, 12% false, out of 118; videos: short MPEG videos.
• [Shen 97], edge matching and motion compensation: gradual transitions 8% missed, (4%) false, out of 98; cuts 4% missed, (4%) false, out of 187 (with motion); videos: clips, films, television.
• [Xiong 96], several methods compared on grey-level and colour images (missed/false cuts, for 3864 frames and optimised thresholds): pairwise comparison 5/4 (grey level), 2/15 (colour); likelihood 10/48, XX; global histogram 14/78, 9/46; local histogram 7/66, 3/2; net comparison 3/1, 0/0; videos: short sequences with motion and perturbations.
• This paper, histogram and adaptive threshold: gradual transitions XX; cuts 1 missed, 2 false, for 2284 frames and 37 cuts; videos: MPEG report and film announcement.

If we use rejection with a tolerance of ±20% of the threshold, 20 cases need validation, including 7 due to motion and 3 due to fades.

[Yeo 95] and [Shen 97] work directly in the MPEG compressed domain and are very efficient. Gradual transitions are less studied than cuts (they are less common in videos) and need improvement, though the last method claims quite good results. One concern is motion, which modifies the images and hence causes variations of the shot transition detection measures. We will come back to this point in the next section. For a less biased comparison, note that the LIP6 laboratory of Paris 6 University, France, is now comparing algorithms on a common video base containing one hundred hours.

2. MOTION AND MOVING OBJECTS ANALYSIS FOR VIDEO INDEXING

2.1. Assisting video description

Semi-automatic motion analysis and moving object detection can simplify several tasks of video description.

♦ Objects' temporal presence

Object tracking automates the detection of the interval where an object is present, as sketched below. This applies to objects selected manually, to objects detected by their motion, and to special cases like face detection (which works only for front views, so tracking the detected faces recovers the moments when the characters turn their heads). A further step consists in comparing all the detected objects to check their recurrence along the video.
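The sketch below illustrates one possible way of turning per-frame tracking results into presence intervals; the gap-bridging heuristic and its max_gap parameter are assumptions for illustration, not part of the paper.

```python
def presence_intervals(tracked, max_gap=10):
    """tracked: dict mapping frame index -> True if the object was found.
    Gaps shorter than max_gap frames (brief occlusion, a face turning away)
    are bridged into a single interval."""
    frames = sorted(f for f, present in tracked.items() if present)
    intervals = []
    for f in frames:
        if intervals and f - intervals[-1][1] <= max_gap:
            intervals[-1][1] = f          # extend the current interval
        else:
            intervals.append([f, f])      # start a new interval
    return [tuple(iv) for iv in intervals]

print(presence_intervals({0: True, 1: True, 2: False, 3: True, 30: True}))
# -> [(0, 3), (30, 30)]
```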
♦ Summarising videos

Summarising shots gives condensed views of videos. A shot is summarised by a fixed background and the objects in motion (plus a sound track summary). In case of camera motion, the background is defined by several images or by a reconstructed view (the images are warped back, inverting the motion [Taniguchi 97]). Objects are characterised by their motion and, optionally, by several sufficiently different views.

♦ Camera motion

Camera motion carries meaning with respect to the film structure. It is derived from global motion parameters [Xiong 97], which are also computed for object motion.

♦ Shot transitions

Another shot transition detection method relies on detecting motion discontinuities [Gelgon 97]: transition detection becomes more robust to large motion, and it avoids preliminary computations with a separate transition detection algorithm.

2.2. Image queries

When objects and background are separated, features extracted from them allow similarity retrieval [Benayoun 98]. The operator can select the most significant elements to index.

2.3. Motion queries

The first step is to establish what can be useful for motion queries: using a track for queries by example, based on a video sample or a sketch? Or describing motion more simply and more semantically, with words? That is:
• significant motion (as opposed to static shots; useful for navigating within the video),
• motion features (horizontal or vertical motion, depth motion, speed, regularity),
• motion events, like a start or a change of direction,
• interaction between objects [Delis 98] [Courtney 97].

This means defining classes, with the problem of determining the limits between them; the toy sketch below illustrates the issue.
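For illustration only, the following sketch computes one such coarse class from an object's frame-to-frame displacement vectors; the classes and the thresholds marking the limits between them are arbitrary placeholders.

```python
import numpy as np

def motion_class(displacements, still_thresh=0.5):
    """displacements: (N, 2) per-frame (dx, dy) object displacements, pixels.
    Returns one of three coarse classes; the threshold is a placeholder."""
    mean = np.asarray(displacements, float).mean(axis=0)
    if np.hypot(mean[0], mean[1]) < still_thresh:
        return "static"   # no significant motion
    return "horizontal" if abs(mean[0]) >= abs(mean[1]) else "vertical"

print(motion_class([[3.0, 0.2], [2.8, -0.1]]))  # -> horizontal
```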
3. INTEREST POINTS, MOTION AND OBJECTS FOR VIDEO INDEXING

3.1. The tool: interest points

Our laboratory has worked on interest points [Bres 99]; here is a brief overview. Interest points are defined by two-dimensional signal variations in their neighbourhood, for instance at corners, as opposed to the 1D variation of basic edges. They describe an image by a small number of points, so they allow fast image comparison and compact storage. That is why they are used for image matching, in robotics, and also for image indexing [Schmid 97] (see section 3.2.1, Computing motion). We use three detection algorithms (see the demonstration on the [Jolion 98] web site): the Plessey detector [Harris 88], SUSAN [Smith 97], and a multiresolution contrast detector [Bres 99]. The first two are based on geometric models, which are well adapted to corner detection, while the latter is not, and is more appropriate for natural images. The SUSAN detector is much faster than the others, but it is not very robust to JPEG compression effects [Bres 99], which raises doubts for MPEG videos.

For videos, matching interest points from one image to the next within a shot gives motion vectors, which are the basis for motion analysis. This method should be fast compared to pixel-based methods (optical flow or spatiotemporal segmentation) or to more complex matching (edges, curvature points).

3.2. Interest points, motion and objects

Figure 3 shows the temporal superposition of interest points (the points of the first frames appear darker), next to one of the original images.

Figure 3: Rotating dancer.

3.2.1. Computing motion

♦ Point cluster tracking

In special cases with well-defined objects, interest points are grouped into clusters corresponding to objects or parts of objects. A fast method consists in clustering the set of points (with morphological methods, for instance) and following the clusters. A consistency measure is then needed to detect the difficult cases and apply a more powerful method (for instance, motion consistency over a given duration).

♦ Point matching

Many methods exist, for instance in robotics (edge or corner matching in artificial images; stereovision [Cédras 93] [Serra 96]). For robustness, tracking should take several frames into account.
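The following is a minimal matching sketch under a small-motion assumption: each point carries a descriptor vector (for instance the differential invariants of [Schmid 97], or multiresolution contrast values), points and descriptors are NumPy arrays, and the two gating thresholds are illustrative.

```python
import numpy as np

def match_points(pts_a, desc_a, pts_b, desc_b, max_disp=30.0, max_dist=0.5):
    """Greedy nearest-neighbour matching between two frames.
    pts_*: (N, 2) point positions; desc_*: (N, D) descriptors.
    Returns (i, j) index pairs; pts_b[j] - pts_a[i] are the motion vectors."""
    matches = []
    for i in range(len(pts_a)):
        # Gate the candidates spatially: assume bounded inter-frame motion.
        near = [j for j in range(len(pts_b))
                if np.linalg.norm(pts_b[j] - pts_a[i]) <= max_disp]
        if not near:
            continue  # point lost (occlusion, detector instability)
        # Keep the candidate whose descriptor is the most similar.
        j = min(near, key=lambda j: np.linalg.norm(desc_b[j] - desc_a[i]))
        if np.linalg.norm(desc_b[j] - desc_a[i]) <= max_dist:
            matches.append((i, j))
    return matches
```

Chaining such matches over several frames, and discarding pairs whose motion is inconsistent over a given duration, provides the robustness mentioned above.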
Comparing local measures associated with the interest points that are robust to noise, geometric transforms and masking, like the differential invariants of [Schmid 97], improves the matching. As a comparison with differential invariants, we are testing the invariance of the multiresolution contrast.

3.2.2. Interest points for video indexing

Interest points support all the elements we discussed in section 2, "Motion and moving objects analysis for video indexing". Let us focus on some of them.

♦ Moving object tracking

The purpose is to determine the time interval where an object is present. Object motion is obtained by compensating the global motion. Rigid object detection is based on the similar motion of the object's points. It is more difficult with non-rigid objects and, due to the variety of the considered videos, detection cannot be perfect. Therefore an operator has to validate and correct the results. For an indexing system, we consider different modes: batch processing, or more interactive operation. In either case, to avoid waiting times, it is far preferable for the interaction steps and computation steps to work independently on a whole video segment rather than object by object. Notice that, for an approximate display of an object, showing a region containing its interest points is enough for human understanding.

▪ On-demand analysis

In case the operator is interested in only some of the objects (the most significant ones) and does not want to run a full computation, we have the following steps:
• outlining all these objects manually, in one frame each,
• extracting and tracking the interest points included in these regions,
• asking the operator to validate and correct the ambiguous cases (displaying the object at the beginning and at the end of its trajectory, to check that it is the same one).

▪ Batch analysis of a sequence

The computation step on the whole sequence includes extracting the interest points, computing the motion, and grouping points according to motion similarity to detect objects, as sketched below; the validation step is as described above.
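A sketch of the object detection step of the batch analysis: fit a global (camera) motion to all matched points, then flag the points whose motion deviates from it. The translation-only model, its median estimate, and the residual threshold are simplifying assumptions; a fuller parametric model (e.g. affine, as in [Xiong 97]) would be used the same way.

```python
import numpy as np

def object_points(src, dst, resid_thresh=2.0):
    """src, dst: (N, 2) matched point positions in two consecutive frames.
    Returns the indices of points inconsistent with the dominant motion."""
    vectors = dst - src
    # Robust global motion estimate: the median displacement, which ignores
    # a minority of independently moving object points.
    global_motion = np.median(vectors, axis=0)
    residuals = np.linalg.norm(vectors - global_motion, axis=1)
    return np.nonzero(residuals > resid_thresh)[0]

src = np.array([[10, 10], [50, 20], [80, 70], [40, 40]], float)
dst = src + [3.0, 0.0]        # the camera pans right by 3 pixels...
dst[3] += [0.0, 12.0]         # ...while one point moves down independently
print(object_points(src, dst))  # -> [3]
```

Grouping the flagged points by motion similarity (for instance by clustering their displacement vectors) then yields the object candidates submitted to the operator.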
♦ Moving object characterisation

Interest points and the associated invariants are a way of characterising objects, for:
• classifying similar objects from a video, to assist the process of naming the objects,
• querying a video database by example. We only need to store several sufficiently different views of an object from the whole sequence (or even none, if the object is already indexed and similar views are already stored³).

Characterising objects needs more accuracy than tracking. First, the interest point thresholding can be adapted to the object, to get more points locally. Then, the operator must now also correct the detected object shapes if they overlap other objects⁴. We emphasise that the whole process does not need a full spatiotemporal segmentation at the pixel level.

³ For that, a classification of the whole (and huge) database is not needed, since we can reach the other possible instances of an already indexed object through the semantic annotation database.
⁴ But if some parts have no interest points, it does not matter to add them, because they will not play any role in similarity queries.

3.3. Comparing interest point detectors

From one image to the next, interest points change because of MPEG coding, object distortions, and background variations when the object moves (which modify the local invariants associated with points on the edge of the object). Matching requires reasonably stable points (in their number and location, and in the stability of their invariants). As a first step, we compare the temporal and spatial stability of the interest point detectors with a simple matching algorithm under global or small motion (by studying the variability of the motion vectors within one frame). A second step consists in comparing the results of a real tracking algorithm.
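For reference, the stability measure histogrammed in Figure 4 (the rate of change of the number of interest points between consecutive frames) can be computed as follows; the per-frame point counts would come from any of the three detectors above.

```python
def rate_of_change(counts):
    """counts: number of interest points detected in each frame.
    Returns the rate of change (%) between consecutive frames."""
    return [100.0 * abs(b - a) / max(a, 1) for a, b in zip(counts, counts[1:])]

print([round(r, 2) for r in rate_of_change([120, 118, 125, 60])])
# -> [1.67, 5.93, 52.0]
```

As in Figure 4, the values at shot cuts would be removed before building the histogram.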
The one-minute report of Figure 4 shows small variations in general (it uses the multiresolution contrast detector and a fixed threshold; the frames associated with shot cuts are removed). We plan to test quite long video sequences from the archives of the French National Audiovisual Institute (INA).

Figure 4: Histogram of the rate of change (%) of the number of interest points between two frames.

CONCLUSION

We have developed a shot cut detection assistant, using adaptive thresholding and taking the interaction with the operator into account. Concerning motion analysis, we have considered the possible applications for video indexing: assisting moving object indexing, summarising videos, and allowing image and motion queries. We have proposed an approach based on interest points, specifically with a multiresolution contrast-based detector, for analysing motion and for detecting and characterising objects; this approach does not require a full spatiotemporal segmentation. Experimental results will be presented at the conference and included in the final version of the paper.

REFERENCES

♦ Sésame

[Decleir 98] C. Decleir, M.S. Hacid, J. Kouloumdjian (1998) A Generic Model for Video Content Based Retrieval; Symposium on Applied Computing, ACM, 458-459.
[Mostéfaoui 97] A. Mostéfaoui, L. Brunie (1997) Exploiting Data Structures in a High Performance Video Server for TV Archives; Digital Media Information Base (DMIB'97), ACM-SIGMOD, World Scientific, 159-166.
[Prié 98] Y. Prié, A. Mille, J.M. Pinon (1998) AI-STRATA: A User-centered Model for Content-based Description and Retrieval of Audiovisual Sequences; First Int. Advanced Multimedia Content Processing Conf., 143-152.
[Lebourgeois 98] F. Lebourgeois, J.M. Jolion, P. Awart (1998) Toward a Video Description for Indexation; 14th IAPR Int. Conf. on Pattern Recognition, Brisbane, August 1998, vol. I, 912-915.

♦ Shot detection

[Corridoni 95] M. Corridoni, A. Del Bimbo (1995) Automatic Video Segmentation through Editing Analysis; Technical Report, Firenze University, http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=sw3E2P4hwhzy&d=6975.
[Faudemay 97] P. Faudemay, L. Chen, C. Montacié, M.J. Caraty, X. Tu (1997) Segmentation multi-canaux de vidéos en séquences; Coresa 97.
[Joly 94] P. Joly, P. Aigrain (1994) The Automatic Real-Time Analysis of Film Editing and Transition Effects and its Applications; Computers & Graphics, Vol. 18, No. 1, 93-103.
[Shen 97] B. Shen (1997) HDH Based Compressed Video Cut Detection; HP Labs Technical Report HPL-97-142, http://www.hpl.hp.com/techreports/97/HPL-97-142.html.
[Xiong 96] W. Xiong, J. Chung-Mong Lee, R.H. Ma (1996) Automatic Video Data Structuring through Shot Partitioning and Key Frame Computing; Technical Report, http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=44Cx2P4hwhzy&d=22080.
[Yeo 95] B.L. Yeo, B. Liu (1995) Rapid Scene Analysis on Compressed Video; IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, 533-544.
[Zabih 95] R. Zabih, J. Miller, K. Mai (1995) A Feature-Based Algorithm for Detecting and Classifying Scene Breaks; ACM Multimedia 1995, http://simon.cs.cornell.edu/Info/People/rdz/dissolve.html.

♦ Interest points

[Bres 99] S. Bres, J.M. Jolion (1999) Detection of Interest Points for Image Indexation; Visual'99, Amsterdam, June 2-4, http://rfv.insa-lyon.fr/~jolion/PS/visual99.ps.gz.
[Harris 88] C. Harris, M. Stephens (1988) A Combined Corner and Edge Detector; Proc. of the 4th Alvey Vision Conf., 147-151.
[Jolion 98] Interest points demo: http://rfv.insa-lyon.fr/~jolion/Cours/ptint.html.
[Schmid 96] C. Schmid (1996) Appariement d'images par invariants locaux de niveaux de gris; PhD thesis, INP Grenoble.
[Schmid 97] C. Schmid, R. Mohr (1997) Local Grayvalue Invariants for Image Retrieval; IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(5), 530-535.
[Smith 97] S.M. Smith, J.M. Brady (1997) SUSAN - A New Approach to Low Level Image Processing; Int. Journal of Computer Vision, 23(1), 45-78.

♦ Motion

[Benayoun 98] S. Benayoun, H. Bernard, P. Bertolino, P. Bouthemy, M. Gelgon, R. Mohr, C. Schmid, F. Spindler (1998) Structuration de vidéos pour des interfaces de consultation avancées; Coresa 98, 205.
[Cédras 93] C. Cédras, M. Shah (1993) Motion-Based Recognition: A Survey; Technical Report, http://www.nzdl.org/cgi-bin/Kniles?c=cstr&d=7153.
[Courtney 97] J.D. Courtney (1997) Automatic Video Indexing via Object Motion Analysis; Pattern Recognition, 1997.
[Delis 98] V. Delis, D. Papadias, N. Mamoulis (1998) Assessing Multimedia Similarity; ACM Multimedia 98, Session 7C: Content-Based Retrieval Systems, http://www.acm.org/sigmm/MM98/electronic_proceedings/delis/index.html.
[Gelgon 97] M. Gelgon, P. Bouthemy, G. Fabrice (1997) A Unified Approach to Shot Change Detection and Camera Motion Characterization; Technical Report RR-3304, INRIA Rennes, http://www.inria.fr/RRRT/RR-3304.html.
[Serra 96] B. Serra (1996) Reconnaissance et localisation d'objets cartographiques 3D en vision aérienne dynamique; PhD thesis, Université de Nice, 150-185.
[Taniguchi 97] Y. Taniguchi, A. Akutsu, Y. Tonomura (1997) Panorama Excerpts: Extracting and Packing Panoramas for Video Browsing; ACM Multimedia 97, http://www1.acm.org:81/sigmm/MM97/papers/taniguchi/tani.html.
[Xiong 97] W. Xiong, J.C.M. Lee (1997) Efficient Scene Change Detection and Camera Motion Annotation for Video Classification; Technical Report HKUST-CS97-16, http://www.nzdl.org/cgi-bin/gw?a=targetdoc&c=cstr&z=2Ess2P4hwhzy&d=22748.