2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM)

ViTag: Automatic Video Tagging Using Segmentation and Conceptual Inference

Abhishek A. Patwardhan, Santanu Das, Sakshi Varshney, Maunendra Sankar Desarkar
Department of Computer Sc. & Engineering, IIT Hyderabad, Hyderabad, India
Email: {cs15mtech11015, cs15mtech11018, cs16resch01002, maunendra}@iith.ac.in

Debi Prosad Dogra
School of Electrical Sciences, IIT Bhubaneswar, Bhubaneswar, India
Email: dpdogra@iitbbs.ac.in

Abstract—The massive increase in multimedia data has created a need for an effective organization strategy. Multimedia collections are organized based on attributes such as domain, index terms, content description, owners, etc. Typically, the index term is a prominent attribute for effective video retrieval systems. In this paper, we present a new approach to automatic video tagging referred to as ViTag. Our analysis relies upon various image similarity metrics to automatically extract key-frames. For each key-frame, raw tags are generated by performing reverse image tagging. The final step analyzes the raw tags in order to discover hidden semantic information. On a dataset of 103 videos belonging to 13 domains derived from various YouTube categories, we are able to generate tags with 65.51% accuracy. We also rank the generated tags based upon the number of proper nouns present in them. The geometric mean of Reciprocal Rank estimated over the entire collection has been found to be 0.873.

Keywords—video content analysis, video tagging, video organization, video information retrieval

Figure 1: Overview of the proposed ViTag architecture.

I. INTRODUCTION

Finding a match for a user-submitted query is challenging on large multimedia data. To reduce the empirical search, video hosting websites often allow users to attach a description with the video. However, descriptions or index terms can be ambiguous, irrelevant, insufficient, or even empty. This creates the necessity for an automatic video tagger. In this paper, we present an automatic video tagging tool referred to as ViTag. It involves a video segmentation step that extracts distinct, representative frames from the input video through a hierarchical combination of various image similarity metrics. In the next step, the raw tags obtained from the segmented video frames are investigated to estimate semantic similarity information. Finally, we annotate the input video by combining the raw tags with the inferred tags.

In accomplishing this, we make the following contributions: (i) a hierarchical combination of three image similarity metrics to design a video segmentation algorithm, (ii) a conceptual inference heuristic to automatically infer generic tags from raw tags, and (iii) a fully automatic, end-to-end, and open-source tool that outputs tags solely by analyzing the input video. The approach implemented within the ViTag framework is outlined in Figure 1.

A. Related work

Automatic video tagging research is growing. Siersdorfer et al. [1] have devised a technique based on content redundancy of videos. However, their approach requires querying an external video collection to generate tags for the video in question. Our approach exploits semantic similarity information [2]. Moxley et al. [3] perform a search using three attributes (frames, text, and concepts) to find matching videos out of a collection of videos. The approach needs automatic speech recognition, and therefore it seems difficult to apply to generic videos from challenging domains like animation, songs, games, etc. Toderici et al. [4] have trained a classifier that learns the association of audio-visual features of a video with its tags. Machine learning based approaches are also promising, but they come with higher training and tuning overheads. Borth et al. [5] have extracted key-frames for video summarization using k-means clustering to group similar frames into a cluster. Yao et al. [6] have tagged videos by mining user search behavior. Their method requires dynamic information about users' behavior. The probabilistic model-based method proposed in [7] involves a two-step process, i.e., video analysis followed by querying a classification framework to generate tags.
The rest of the paper is organized as follows. Section II presents the overall methodology and implementation details. In Section III, we present the results. Finally, in Section IV, we provide conclusions and future work.

II. PROPOSED VITAG FRAMEWORK

A. Video Feature Extraction

ViTag first extracts the key-frames and feeds them as inputs to the reverse image tagger that generates raw tags. The process of reading dissimilar frames from an input video is outlined in Algorithm 1. The threshold value can be set empirically.

Algorithm 1 Selection of dissimilar frames
Require: Video V
 1: Output frame sequence K = ∅
 2: prev ← first frame in V
 3: for all frames f ∈ V do
 4:   score ← compute_mean_square_error(f, prev)
 5:   if score > threshold then
 6:     K = K ∪ f
 7:     prev ← f
 8:   end if
 9: end for
10: return K

The algorithm consists of two stages. The input sequence of frames is partitioned into fixed-size, non-overlapping windows. Within each window, we estimate the similarity of two successive frames using features such as Mean Square Error (MSE), SIFT, and the Structural Similarity Index (SSIM). This results in a similarity vector (V) for each window. A value V_i in the similarity vector depicts the similarity score between two adjacent frames (F_i, F_{i+1}) within the window. The input to the video segmentation process, as depicted in Figure 2, is the set of frames selected by Algorithm 1. We analyze the similarity vector of each window so as to select a single representative frame for that window.

Figure 2: Complete key-frame extraction module.

Intuitively, we wish to select the one frame that contains maximum information. Note that a frame can be considered to contain maximum information in two cases: when it matches both of its neighbors with the highest matching score (it is representative of the window), or when its matching scores with the adjacent frames are low (it contains unique information). To capture both cases, the heuristic discussed in Algorithm 2 is used. The algorithm picks the frame contributing to the maximum score. If the minimum score turns out to be less than a threshold value, the heuristic assumes the existence of a frame containing unique information; hence we select the frame contributing to the minimum score.

Algorithm 2 Selecting a representative frame for a given window
Require: Frames[1..N]: window
Require: Scores[1..N-1]: similarity scores for adjacent frames within the window
 1: maxVal, maxInd ← MAX(Scores[1..N-1])
 2: minVal, minInd ← MIN(Scores[1..N-1])
 3: if minVal > threshold then
 4:   if Scores[maxInd-1] < Scores[maxInd+1] then
 5:     select ← maxInd+1
 6:   else
 7:     select ← maxInd
 8:   end if
 9: else
10:   if Scores[minInd-1] < Scores[minInd+1] then
11:     select ← minInd+1
12:   else
13:     select ← minInd
14:   end if
15: end if
16: return Frames[select]
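To make the two heuristics concrete, the following is a minimal Python sketch of Algorithms 1 and 2, assuming an OpenCV/NumPy environment and an MSE-only similarity score for brevity (the actual system combines MSE, SIFT, and SSIM hierarchically). The function names, default thresholds, and boundary handling are illustrative assumptions, not the released ViTag code.

```python
# Illustrative sketch of Algorithms 1 and 2 (not the authors' released code).
# Only an MSE-based score is used here; the paper combines MSE, SIFT and SSIM.
import cv2
import numpy as np

def mse(a, b):
    """Mean squared error between two equally sized grayscale frames."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def select_dissimilar_frames(video_path, threshold=500.0):
    """Algorithm 1: keep a frame only if it differs enough from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    kept, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is None or mse(gray, prev) > threshold:
            kept.append(frame)
            prev = gray
    cap.release()
    return kept

def pick_representative(frames, scores, threshold=0.5):
    """Algorithm 2: choose one frame per window.

    scores[i] is the similarity between frames[i] and frames[i+1],
    so len(frames) == len(scores) + 1.
    """
    if min(scores) > threshold:
        idx = int(np.argmax(scores))   # every pair matches well: take the best-matching edge
    else:
        idx = int(np.argmin(scores))   # a unique frame exists: take the worst-matching edge
    # An edge joins two frames; compare the neighbouring edges (where they exist)
    # to decide which endpoint of the selected edge to keep, as in the paper.
    left = scores[idx - 1] if idx - 1 >= 0 else float("inf")
    right = scores[idx + 1] if idx + 1 < len(scores) else float("inf")
    return frames[idx + 1] if left < right else frames[idx]
```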
B. Raw Tag Generation

The second phase of ViTag receives the key-frames and obtains raw tags by querying the reverse image tagger. The reverse image search engine provides a list of web pages and the key terms associated with the query image. Such a search technique is discussed in [8]. Typically, such algorithms employ techniques including maximally stable extremal regions (MSER) detection [9], object detection [10], vocabulary trees [11], etc. We initially encode the frame within a query and fire it to the search engine. The responses are then parsed to extract tags. The steps are shown in Fig. 3.

Figure 3: Reverse image tagging methodology used in our work.

C. Conceptual Inference

After performing a reverse image search for the key-frames, we post-process the obtained tags in order to infer more generic tags. We achieve this by adding an extra module of conceptual inference that refers to an external knowledge source built on top of various concepts on the web. Such a representation is referred to as a concept graph. Formally, a concept graph [12] is a knowledge representation structure storing a weighted association of natural language words with (abstract) concept terms.

D. Semantic Similarity using Bipartite Graph

Let T be the set of (unique) raw tags obtained and C be the set of (unique) concept terms obtained by querying each raw tag from T against the concept graph engine. A directed bipartite graph G(T, C) with edges E: T → C represents a mapping of raw tags to various concept terms. We label each edge e(t, c) with a score w such that w: E → [0, 1]. The score w on an edge represents how likely a concept c is to be associated with a tag t. Each score w is obtained by querying the concept graph engine. We need to identify a set K ⊆ C such that each c ∈ K is associated with a large number of incoming edges from T. To find K, it is important to obtain the relative importance of each c ∈ C. Thus, we need to find a score vector (say V) of length equal to the cardinality of C. Once we obtain V, it is easy to select the top r entries for some r ∈ N by simply sorting the vector V. In order to compute the value V_i for i ∈ C, we sum up the weights of the incoming edges for node i. Formally,

    V_i = \sum_{u \in T,\; e(u,i) \in E} W_{ui}.    (1)

Furthermore, it is expected that many of the tags in T will be semantically similar to others. We model this situation within graph G by inserting extra edges between pairs of nodes from the set T. Due to this construction, G is no longer a bipartite graph. We refer to these newly inserted edges as semantic edges to distinguish them from the edges originally present in G. We insert semantic edges into G in the following two cases: (i) for each pair of tags (t1, t2) ∈ T, we compute a semantic similarity score, add the semantic edges E(t1, t2) and E(t2, t1), and label them with the similarity score obtained; (ii) for every multi-word tag m ∈ T, we check for the presence of each individual word w within the set T; if it exists, we add an edge E(m, w) labelled with a score equal to the reciprocal of the total number of words in m. This allows us to capture the semantic similarity of a multi-word tag. After augmenting graph G with the semantic edges, we revise the score vector to reflect the changes made in G. To compute the revised value of V, we use (2):

    V_i = V_i + \sum_{u \in T,\; e(u,i) \in E} \;\sum_{x \in T,\; e(x,u) \in E} S_{xu} \cdot W_{ui}.    (2)

We revise each entry in V with the product of two weights, i.e., (i) the weight W_{ui} connecting node u ∈ T to node i ∈ C, and (ii) the weight S_{xu} of the semantic edge connecting node u to node x ∈ T. After revising V, we sort it in descending order and select the top r entries. The semantic similarity metric used in frameworks such as NLTK [13] fails to capture the semantic similarity between commonly occurring words like iPhone and gadget. We fix this issue by referring to the concept graph engine.
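The scoring step of equations (1) and (2) can be illustrated with a short sketch. The dictionary-based edge representation, variable names, and toy weights below are assumptions made for illustration; they are not taken from the paper's implementation.

```python
# Illustrative sketch of the concept-scoring step, Eqs. (1) and (2).
# tag_concept_w[(t, c)] holds W_tc (tag -> concept edge weight) and
# semantic_s[(x, u)] holds S_xu (tag -> tag semantic-edge weight);
# both structures are hypothetical.
from collections import defaultdict

def score_concepts(tag_concept_w, semantic_s, top_r=5):
    # Eq. (1): V_i = sum of incoming edge weights W_ui for each concept i.
    V = defaultdict(float)
    for (u, i), w_ui in tag_concept_w.items():
        V[i] += w_ui

    # Eq. (2): add S_xu * W_ui for every semantic edge (x, u) feeding tag u.
    for (u, i), w_ui in tag_concept_w.items():
        for (x, uu), s_xu in semantic_s.items():
            if uu == u:
                V[i] += s_xu * w_ui

    # Keep the r highest-scoring concepts as inferred generic tags.
    return sorted(V.items(), key=lambda kv: kv[1], reverse=True)[:top_r]

# Toy example with made-up weights:
tag_concept_w = {("iPhone", "gadget"): 0.9, ("screenshot", "gadget"): 0.4,
                 ("iPhone", "brand"): 0.6}
semantic_s = {("screenshot", "iPhone"): 0.5}
print(score_concepts(tag_concept_w, semantic_s, top_r=2))
```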
E. Implementation Details

The video segmentation algorithm can be accelerated by parallelizing its computations. We achieve this by applying the classic loop transformation known as loop tiling. We query the host architecture to obtain the total number of processing cores (denoted as p) available on the system, tile the iterations of the parallel loop by a factor of p, and run all the iterations within each tile in parallel. Python 3 has been used to implement ViTag. For computing the SIFT and SSIM scores, we have used the OpenCV library. Our implementation uses the Google Reverse Image Search engine [14] to obtain the raw tags of the key-frames. The conceptual inference heuristic is based on the Microsoft Concept Graph utility [12], [15]. The implementation and datasets are available at [16], [17].
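One possible reading of this loop-tiling parallelization is sketched below with Python's multiprocessing module: the per-window loop is split into tiles of p iterations, and the iterations inside each tile run in parallel on p cores. The helper score_window is a hypothetical stand-in for the MSE/SIFT/SSIM computation; the released implementation may organize this differently.

```python
# Sketch of tiling the per-window loop by the core count p and running the
# iterations of each tile in parallel (assumed structure, not the paper's code).
import os
from multiprocessing import Pool

def score_window(window):
    # Hypothetical stand-in for the per-window MSE/SIFT/SSIM similarity computation.
    return sum(window) / len(window)

def process_windows(windows):
    p = os.cpu_count() or 1                                 # number of processing cores
    results = []
    with Pool(processes=p) as pool:
        for start in range(0, len(windows), p):             # tile the loop by a factor of p
            tile = windows[start:start + p]
            results.extend(pool.map(score_window, tile))    # a tile's iterations run in parallel
    return results

if __name__ == "__main__":
    print(process_windows([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]))
```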
Domain                | No. of videos | Description                                           | Examples
Tourism               | 8             | Diverse tourist places, Seven wonders                 | Statue of Liberty, Gateway of India
Products              | 7             | Product reviews, advertisements                       | iPhone review, Nike shoes ad
Ceremony              | 8             | Popular events, ceremonies, major disasters           | Oscar award functions, Japan tsunami
Famous persons        | 10            | Documentaries on famous persons, artists in concerts  | Indian leaders, Jennifer Lopez
Entertainment         | 9             | Songs from popular movies, TV shows                   | Abraham Lincoln, Mr. Bean
Speech                | 7             | Recent speeches by prominent personalities            | Kofi Annan, Barack Obama
Animations            | 8             | Popular animation movies, cartoon series              | Tom and Jerry, Kung Fu Panda, Frozen
Wildlife              | 7             | Videos/documentaries on animal species                | Peacock, Butterfly, Kangaroo, Desert
Geography and Climate | 8             | Weather forecasting videos, videos covering maps      | Continents of the world, Weather forecast
Vehicles              | 8             | Famous bikes, cars and automobiles                    | Sports bike, Lamborghini car
Science and Education | 8             | Lecture series, videos on general awareness           | FPGA, Social media effects
Video Games           | 8             | Popular computer and mobile games                     | Counter Strike, Dangerous Dave
Sports                | 7             | Videos of popular tournaments                         | Tennis, Cricket, Football, Chess
Total                 | 103           |                                                       |

Table I: Details of the video dataset used in the evaluation of ViTag

III. RESULTS AND EVALUATION

We have used a video collection created from YouTube.com for the experiments. While creating the collection, we studied the various video categories available on YouTube.com. It organizes videos into 31 different categories, many of which we merged, finally arriving at 13 distinct domains. For each domain, we selected videos based upon the inclusive opinion of each person in the group. For each popular content item, we selected a random video obtained by searching the website. We selected videos with lengths between 50 seconds and 4 minutes. A total of 103 videos have been collected, with each domain consisting of approximately 8 videos. Table I describes the details. We have tagged each video using ViTag. Furthermore, by using a natural language processing package (NLTK), we are able to reason whether a tag contains proper nouns or not. This information enables us to rank the tags. We have also used reciprocal rank as another metric for evaluation.
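The paper does not spell out the exact ranking rule, but one plausible NLTK-based realization of "rank tags by the number of proper nouns they contain" is sketched below; the POS-tag choice and tie-breaking are assumptions.

```python
# Hypothetical sketch: rank tags by how many proper nouns NLTK finds in them.
# May require one-time downloads, e.g. nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger").
import nltk

def proper_noun_count(tag_text):
    tokens = nltk.word_tokenize(tag_text)
    return sum(1 for _, pos in nltk.pos_tag(tokens) if pos in ("NNP", "NNPS"))

def rank_tags(tags):
    # Tags containing more proper nouns are ranked higher; ties keep input order.
    return sorted(tags, key=proper_noun_count, reverse=True)

print(rank_tags(["Statue of Liberty", "famous landmark", "Eiffel Tower France"]))
```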
Fig. 4 shows the mean precision attained for each video domain. The mean has been computed as the geometric mean of the precision values attained by all videos belonging to a particular domain. For the Animations category, almost 77% of the generated tags are precise. For videos belonging to product reviews and advertisements, ViTag obtains a minimum precision of 57.36%. For videos belonging to domains like tourism, wildlife, animation, and events/ceremonies, ViTag attains more than 70% precision. A summary of the tag precision is presented in Table III.

Figure 4: Tag precision recorded across various domains.

It is also important to investigate how many videos lie in a particular precision interval. Fig. 5 shows the number of videos attaining each precision interval. ViTag attains full (100%) precision for 11.5% of the videos. For 70 out of 103 videos, it is able to generate more than 60% relevant tags. Fig. 6 summarizes the accuracy of ViTag. The plot is obtained by sorting the precision of all the videos in descending order. The rectangle with a dashed line shows the ideal accuracy, i.e., a precision of one for all the videos in the collection. Around 55% of the ideal accuracy is achieved by ViTag.

We estimate the effectiveness of the conceptual inference heuristic using a second metric of binary relevance for inferred tags. For 4 out of the 103 videos, the conceptual inference heuristic could not infer any extra tag. For 43 of the remaining 99 videos, the conceptual inference heuristic inferred meaningful tags; it inferred vague tags for the remaining 56 videos. Table II lists videos for which conceptual inference generates meaningful tags. The table also depicts cases where conceptual inference fails to infer relevant tags.
Outcome            | Domain         | Sample video description | Some raw tags                                   | Auto-inferred tags
Positive inference | Famous Persons | Indian freedom fighters  | Mahatma Gandhi, Bhagat Singh, freedom fighters  | leader, person
Positive inference | Wildlife       | Dancing peacock          | peacock, peafowl                                | bird
Positive inference | Tourism        | Eiffel Tower             | france eiffel tower, eiffel tower night view    | famous landmark, sight
Negative inference | Ceremony       | Christian wedding        | woman, event, female, facial expression         | group, information
Negative inference | Geography      | Weather forecast         | map, planet, world, earth                       | object, material
Negative inference | Product        | iPhone review            | gadget, iPhone, Screenshot                      | item, factor

Table II: Effectiveness of the conceptual inference heuristic

Figure 5: Number of videos vs. tag-precision interval.

Figure 6: Tag precision for the entire video collection in sorted order. The bounding rectangle depicts the ideal accuracy.

                 | Tag Precision | Reciprocal Rank
Geometric mean   | 0.6467        | 0.873
Arithmetic mean  | 0.6491        | 0.905
Median           | 0.6389        | 1.0

Table III: Summary with Tag Precision and Reciprocal Rank as the evaluation metrics

We have also evaluated ViTag using the reciprocal rank metric. Fig. 7 shows the reciprocal rank for all the videos in the collection, sorted in decreasing order. As seen from the figure, ViTag covers 85% of the ideal-case scenario. Table III summarizes the statistical results with reciprocal rank as the metric of evaluation.

Figure 7: Reciprocal rank for all videos of the dataset shown in sorted order. The bounding rectangle depicts the ideal case.

In summary, our observations are as follows. For the entire collection comprising 103 videos, ViTag has generated 696 tags, out of which 456 tags are precise. It attains about 65.51% accuracy using precision as a metric. In 43.4% of the cases, the conceptual inference heuristic has inferred valuable generic tags. ViTag obtains 87.3% accuracy as per the reciprocal rank metric. We believe the evaluation results are interesting enough to reflect the potential of our approach; however, we think there is scope for improvement, as discussed in the future work. A snapshot of the user interface of ViTag is presented in Fig. 8.

Figure 8: A snapshot of the user interface of ViTag.
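The summary statistics reported above and in Table III (geometric mean, arithmetic mean, and median of the per-video precision and reciprocal-rank values) can be reproduced with a short sketch like the following; the per-video scores in the example are made up for illustration and are not values from the dataset.

```python
# Sketch of the Table III summary statistics over hypothetical per-video scores.
import numpy as np

def summarize(scores):
    s = np.asarray(scores, dtype=float)   # geometric mean assumes strictly positive scores
    return {"geometric mean": float(np.exp(np.log(s).mean())),
            "arithmetic mean": float(s.mean()),
            "median": float(np.median(s))}

precision = [0.75, 0.60, 0.55, 0.80]      # fraction of relevant tags per video (made up)
reciprocal_rank = [1.0, 0.5, 1.0, 1.0]    # 1 / rank of the first relevant tag (made up)
print(summarize(precision))
print(summarize(reciprocal_rank))
```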
IV. CONCLUSIONS AND FUTURE WORK

We propose an analytical, end-to-end, and fully automatic approach to the problem of automatic video tagging. Our approach exploits a combination of various image similarity metrics to select key-frames containing dissimilar information from the input video. We then use a reverse image tagging engine to generate raw tags for the input video, and infer generic tags using a conceptual inference heuristic that leverages the semantic similarity among tags. We have evaluated our implementation on an open collection comprising 103 videos belonging to 13 domains derived from various YouTube categories. Our implementation has obtained 65.51% precision and 87% accuracy using reciprocal rank as a metric. Our approach is not video-domain specific, and it does not need any pre-tagged video dataset or training. This makes it practical and complementary to state-of-the-art approaches.

We would like to develop a deep neural network driven reverse image tagger to improve the accuracy of tag generation. We would also like to explore various natural language processing techniques to detect and eliminate non-relevant tags. For the conceptual inference heuristic, we would like to introduce a scoring mechanism to reason about the profitability of adding extra generic tags. We would also like to explore parameter tuning, which may have a positive impact on the existing accuracy. In addition, we would like to make ViTag run on a real-time multimedia video collection (such as www.YouTube.com). We think the current implementation stands as a good starting point to explore the above aspects.

REFERENCES

[1] Stefan Siersdorfer, Jose San Pedro, and Mark Sanderson, "Automatic video tagging using content redundancy," in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2009, pp. 395–402.

[2] Jose San Pedro, Stefan Siersdorfer, and Mark Sanderson, "Content redundancy in YouTube and its application to video tagging," ACM Transactions on Information Systems, vol. 29, no. 3, p. 13, 2011.

[3] Emily Moxley, Tao Mei, Xian-Sheng Hua, Wei-Ying Ma, and B. S. Manjunath, "Automatic video annotation through search and mining," in Proceedings of the IEEE International Conference on Multimedia and Expo. IEEE, 2008, pp. 685–688.

[4] George Toderici, Hrishikesh Aradhye, Marius Pasca, Luciano Sbaiz, and Jay Yagnik, "Finding meaning on YouTube: Tag recommendation and category discovery," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 3447–3454.

[5] Damian Borth, Adrian Ulges, Christian Schulze, and Thomas Breuel, "Keyframe extraction for video tagging & summarization," pp. 45–48, 2008.

[6] Ting Yao, Tao Mei, Chong-Wah Ngo, and Shipeng Li, "Annotation for free: Video tagging by mining user search behavior," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 977–986.

[7] Jialie Shen, Meng Wang, and Tat-Seng Chua, "Accurate online video tagging via probabilistic hybrid modeling," Multimedia Systems, vol. 22, no. 1, pp. 99–113, 2016.

[8] Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, and Sarah Tavel, "Visual search at Pinterest," 2015.

[9] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in Proceedings of the British Machine Vision Conference. BMVA Press, 2002, pp. 36.1–36.10, doi:10.5244/C.16.36.

[10] Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, and William T. Freeman, "Discovering objects and their location in images," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2005, vol. 1, pp. 370–377.

[11] David Nister and Henrik Stewenius, "Scalable recognition with a vocabulary tree," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2006, vol. 2, pp. 2161–2168.

[12] Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, "An inference approach to basic level of categorization," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 653–662.

[13] Edward Loper and Steven Bird, "NLTK: The natural language toolkit," in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 63–70.

[14] Google Inc., Google Reverse Image Search.

[15] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu, "Probase: A probabilistic taxonomy for text understanding," in Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 2012, pp. 481–492.

[16] ViTag automatic video tagger.

[17] ViTag evaluation: video collection.