TRECVID 2020 Ad-hoc Video Search: Task Overview - Retrieval Group
TRECVID 2020 Ad-hoc Video Search: Task Overview
Georges Quénot — Laboratoire d’Informatique de Grenoble, France
George Awad — Retrieval Group, Information Access Division, Information Technology Laboratory, NIST; Georgetown University
Outline
• Task Definition
• Dataset
• Topics (Queries)
• Participating Teams
• Evaluation & Results
• General Observations
NIST disclaimer: Certain commercial products or company names are identified here to describe our study adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the products or names identified are necessarily the best available for the purpose.
TRECVID 2020
Task Definition
• Goal: promote progress in content-based video retrieval based on end-user ad-hoc (generic) textual queries that include searching for persons, objects, locations, actions, and their combinations.
• Task: given a test collection, a query (surprise / fixed (progress)), and a master shot boundary reference, return a ranked list of at most 1000 shots (out of 1,082,657) which best satisfy the need.
• Testing data: 7,475 Vimeo Creative Commons videos (V3C1), 1000 total hours with a mean video duration of 8 min. Reflects a wide variety of content, styles, and source devices. Testing data fixed since 2019.
• Development data: ≈2000 hours of previous IACC.1-3 data used between 2010 and 2018, with concept and ad-hoc query annotations.
Vimeo Creative Commons Collection

Partition                  V3C1            V3C2            V3C3            Total
File Size                  2.4 TB          3.0 TB          3.3 TB          8.7 TB
Number of Videos           7,475           9,760           11,215          28,450
Combined Video Duration    1000 h 23 m 50 s  1300 h 52 m 48 s  1500 h 8 m 57 s  3801 h 25 m 35 s
Mean Video Duration        8 m 2 s         7 m 59 s        8 m 1 s         8 m 1 s
Number of Segments         1,082,659       1,425,454       1,635,580       4,143,693

The Vimeo Creative Commons Collection (V3C)* consists of ‘free’ video material sourced from the web video platform vimeo.com. It is designed to contain a wide range of content which is representative of what is found on the platform in general. All videos in the collection have been released by their creators under a Creative Commons License which allows for unrestricted redistribution.
* Rossetto, L., Schuldt, H., Awad, G., & Butt, A. (2019). V3C – a Research Video Collection. Proceedings of the 25th International Conference on MultiMedia Modeling.
AVS 2020 (20 main) Queries by complexity (facets checked: Person / Action / Object / Location)
• Find shots of a person paddling kayak in the water ✓✓✓✓
• Find shots of people dancing or singing while wearing costumes outdoors ✓✓✓✓
• Find shots of people or cars moving on a dirt road ✓✓✓✓
• Find shots of one or more persons exercising in a gym ✓✓✓
• Find shots of one or more persons standing in a body of water ✓✓✓
• Find shots of someone jumping while snowboarding ✓✓✓
• Find shots of one or more people drinking wine ✓✓✓
• Find shots of a person wearing a necklace ✓✓
• Find shots of a woman sitting on the floor ✓✓
• Find shots of one or more people skydiving ✓✓
• Find shots of a little boy smiling ✓✓
• Find shots of group of people clapping ✓✓
• Find shots of a woman with short hair indoors ✓✓
• Find shots of two or more people under a tree ✓✓
• Find shots showing an aerial view of buildings near water in the daytime ✓✓
• Find shots of sailboats in the water ✓✓
• Find shots of a man in blue jeans outdoors ✓✓✓
• Find shots of a church from the inside ✓✓
• Find shots of train tracks during the daytime ✓
• Find shots of a long-haired man ✓
2019–2021 (20 progress) Queries by complexity (facets checked: Person / Action / Object / Location)
• Find shots of a person holding an opened umbrella outdoors ✓✓✓✓
• Find shots of two people talking to each other inside a moving car ✓✓✓✓
• Find shots of people walking across (not down) a street in a city ✓✓✓
• Find shots of a shark swimming under the water ✓✓✓
• Find shots of a person reading a paper including newspaper ✓✓✓
• Find shots of fishermen fishing on a boat ✓✓✓
• Find shots of a person jumping with a motorcycle ✓✓✓
• Find shots of a person jumping with a bicycle ✓✓✓
• Find shots of one or more women models on a catwalk demonstrating clothes ✓✓
• Find shots of people doing yoga ✓✓
• Find shots of a person sleeping ✓✓
• Find shots of people hiking ✓✓
• Find shots of bride and groom kissing ✓✓
• Find shots of a person skateboarding ✓✓
• Find shots of people queuing ✓✓
• Find shots of two people kissing who are not bride and groom ✓✓
• Find shots of a man in a clothing store ✓✓
• Find shots of a person in a bedroom ✓✓
• Find shots of a person's shadow ✓
• Find shots showing electrical power lines ✓
Task Parameters

System types:
• Fully Automatic (F): system uses the official query directly
• Manually-Assisted (M): query built manually using the official query text
• Relevance-Feedback (R): allows judging the top-30 results for up to 3 iterations

Training data categories:
• A: only IACC training data
• D: other training data sources
• E: only training data collected automatically using the official query text
• F: only training data collected automatically using a query built manually from the official query text

• A Novelty (optional) run type was introduced to encourage retrieving non-common relevant shots not easily found across systems.
• Explainability of result items was allowed as extra optional information with the submitted shots.
TRECVID 2020
Teams – Main Task (39 runs; 9 finishers / 25)
Run counts by system type (M / F / R; N = novelty runs):
• VIdeoREtrievalGrOup (City University of Hong Kong): 4 M, 4 F, 1 N
• FIU_UM (Florida International University; University of Miami): 2 F, 1 N
• Kindai_ogu (Kindai University; Osaka Gakuin University): 1 F
• DVA_Researchers (Indian Institute of Space Science and Technology (IIST), Thiruvananthapuram; Development and Educational Communication Unit (DECU), Indian Space Research Organisation (ISRO)): 1 F
• ITI_CERTH (Information Technologies Institute, Centre for Research and Technology Hellas): 1 F
• RUC_AIM3 (Renmin University of China): 4 F
• RUCMM (Renmin University of China): 4 F
• WasedaMeiseiSoftbank (Waseda University; Meisei University; SoftBank Corporation): 4 M, 4 F
• ZY_BJLAB (XinHuaZhiYun Technology Co., Ltd.): 4 M, 4 F
TRECVID 2020
Teams – Progress Task (74 runs; 12 finishers)
Run counts listed under the system-type columns (M / F / R / N), in order:
• VIdeoREtrievalGrOup (City University of Hong Kong): 6, 8
• FIU_UM (Florida International University; University of Miami): 6
• Kindai_ogu (Kindai University; Osaka Gakuin University): 5
• SIRET (2019)* (Charles University): 4
• ATL (2019)* (Alibaba Group; ZheJiang University): 4
• Inf (2019)* (Carnegie Mellon University; Monash University; Renmin University; Shandong University): 4
• EURECOM (2019)* (EURECOM): 3
• ITI_CERTH (Information Technologies Institute, Centre for Research and Technology Hellas): 1
• RUC_AIM3 (Renmin University of China): 4
• RUCMM (Renmin University of China): 8
• WasedaMeiseiSoftbank (Waseda University; Meisei University; SoftBank Corporation): 8, 5
• ZY_BJLAB (XinHuaZhiYun Technology Co., Ltd.): 4, 4
*: Teams submitted only progress runs in 2019
TRECVID 2020
Evaluation Methodology
• NIST judged 100% of the top (ranks 1–250) pooled results from all submissions and sampled (11.1%) the rest of the pooled results (ranks 251–1000).
• Stats of sampled and judged clips from rank 251 to 1000 across all runs and topics: min = 10.0%, max = 88.5%, mean = 53.2%.
• One assessor per query; the assessor watched the complete shot while listening to the audio.
• Each query is assumed to be binary: absent or present for each master reference shot.
• Top submitted results were double-judged if at least 10 runs submitted them and the assessor judged them as false positives.
• Extended inferred average precision (xinfAP) was calculated over the judged and unjudged pool by the sample_eval tool [1].
• Runs were compared in terms of mean extended inferred average precision across all evaluated queries.
[1] https://www-nlpir.nist.gov/projects/trecvid/trecvid.tools/sample_eval/
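The two-stratum judging scheme above can be sketched as follows. This is an illustrative sketch, not NIST's actual pooling code: run and shot identifiers are hypothetical, and the exact sampling procedure inside sample_eval may differ.

```python
import random

def build_judgment_pool(runs, full_depth=250, max_depth=1000,
                        sample_rate=0.111, seed=0):
    """Two-stratum pooling: every shot that any run returns in its top
    `full_depth` ranks is judged; shots appearing only at ranks
    full_depth+1..max_depth are sampled at `sample_rate`."""
    top_pool, deep_pool = set(), set()
    for ranked_shots in runs:
        for rank, shot in enumerate(ranked_shots[:max_depth], start=1):
            if rank <= full_depth:
                top_pool.add(shot)
            else:
                deep_pool.add(shot)
    # A shot already judged in the fully-judged stratum is not resampled.
    deep_pool -= top_pool
    rng = random.Random(seed)
    k = round(len(deep_pool) * sample_rate)
    sampled = set(rng.sample(sorted(deep_pool), k))
    return top_pool, sampled
```

The judged fraction of the deep stratum then varies per run and topic (the 10.0%–88.5% spread reported above), because a run's deep-ranked shots may also appear in another run's top ranks.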
Human Judgments (20 main queries + 10 progress queries; 10 human assessors)
• Total judgments: 147,950
• Total hits: 22,895
  • Hits at ranks 1–100: 12,201
  • Hits at ranks 101–250: 7,969
  • Hits at ranks 251–1000: 2,725
TRECVID 2020
Main Task Results
Sorted Overall Scores – Automatic Runs
[Chart: sorted mean infAP of the 26 automatic runs across the 20 main queries; top runs from RUC_AIM3 and RUCMM, followed by VIdeoREtrievalGrOup, WasedaMeiseiSoftbank, ITI_CERTH, ZY_BJLAB, kindai_ogu, DVA_Researchers, and FIU_UM.]
Median = 0.197; 3 “E” runs.
Sorted Overall Scores – Manually-Assisted Runs
[Chart: sorted mean infAP of the 13 manually-assisted runs across the 20 main queries, from WasedaMeiseiSoftbank, VIdeoREtrievalGrOup, and ZY_BJLAB.]
Median = 0.183
Statistical Significance — top 10 automatic runs, randomization test (p < 0.05)
[Matrix of pairwise significance among RUC_AIM3, RUCMM, and VIdeoREtrievalGrOup runs.]
• No significant difference between RUC_AIM3 runs 1, 2 and 4.
• RUC_AIM3 run 2 is significantly better than run 3.
• No significant difference between RUCMM runs 1, 2, 3 and 4.
• VIdeoREtrievalGrOup run 3 is significantly better than run 1.
Statistical Significance — top 10 manually-assisted runs, randomization test (p < 0.05)
[Matrix of pairwise significance among WasedaMeiseiSoftbank, VIdeoREtrievalGrOup, and ZY_BJLAB runs.]
• No significant difference between WasedaMeiseiSoftbank runs 1, 2 and 3; all are better than run 4.
• VIdeoREtrievalGrOup run 3 is significantly better than runs 1 and 2.
• ZY_BJLAB run 1 is better than run 2.
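A paired randomization test of the kind referred to above can be sketched as follows. This is a generic sketch, not NIST's exact tooling: under the null hypothesis that two runs are exchangeable, each topic's pair of scores may be swapped, and the p-value estimates how often a difference at least as large as the observed one arises by chance.

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on per-topic scores
    (e.g. per-query infAP of two runs). Returns an estimated p-value."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap the pair with probability 1/2
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    return hits / trials
```

Two runs are declared significantly different when the returned p-value falls below the chosen threshold (0.05 on these slides).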
Hits Per Topic (Main Task)
[Chart: unique vs common true-positive shots per query (topics 1641–1660, up to ~2000 hits per topic). Annotated queries include “people or cars moving on a dirt road”, “someone jumping while snowboarding”, “a long-haired man”, “a person wearing a necklace”, “people dancing or singing while wearing costumes outdoors”, “a man in blue jeans outdoors”, “a woman with short hair indoors”, “one or more people drinking wine”, and “a church from the inside”.]
11.25% of all hits are unique.
Sorted Unique Hits by Team
[Chart: 1,727 unique shots from 8 teams in their F & M runs — VIdeoREtrievalGrOup, WasedaMeiseiSoftbank, RUCMM, RUC_AIM3, DVA_Researchers, ZY_BJLAB, FIU_UM, ITI_CERTH.]
Top-scoring teams are not necessarily contributing a lot of unique true shots, and vice versa.
Top runs per query (Main Task)
[Chart: infAP of the top 10 automatic runs per query (641–660). Annotated high-scoring queries: “person paddling kayak in the water”, “sailboats in the water”, “someone jumping while snowboarding”, “a woman with short hair indoors”. Annotated low-scoring queries: “one or more persons standing in a body of water”, “two or more people under a tree”.]
Top runs per query (Main Task)
[Chart: infAP of the top 10 manually-assisted runs per query (641–660). Annotated high-scoring queries: “person paddling kayak in the water”, “sailboats in the water”, “someone jumping while snowboarding”, “a woman with short hair indoors”. Annotated low-scoring queries: “people dancing or singing while wearing costumes outdoors”, “one or more persons standing in a body of water”.]
Novelty Scores
[Chart: mean unique-shot weights of the novelty runs vs the best common run from each team (WasedaMeiseiSoftbank, RUC_AIM3, VIdeoREtrievalGrOup, ZY_BJLAB, RUCMM, DVA_Researchers, ITI_CERTH, FIU_UM, kindai_ogu).]
A weight is given to each (topic, shot) pair in the ground truth such that the highest weight is given to unique shots.
TRECVID 2020
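The weighting idea can be sketched as follows. The slide does not give NIST's exact formula, so the inverse-frequency weight below is an illustrative assumption; only the principle (unique shots get the highest weight) comes from the slide.

```python
def unique_shot_weights(found_by):
    """found_by maps each relevant (topic, shot) pair to the number of
    runs that returned it. Fewer finders -> higher weight; a pair found
    by a single run (unique) gets the maximum weight 1.0.
    The 1/n form is an assumed stand-in for the official weighting."""
    return {pair: 1.0 / n for pair, n in found_by.items()}

def novelty_score(run_pairs, weights):
    """Mean weight of the relevant (topic, shot) pairs a run retrieved."""
    ws = [weights[p] for p in run_pairs if p in weights]
    return sum(ws) / len(ws) if ws else 0.0
```

Under any such scheme, a run that retrieves shots nobody else found scores higher on the novelty metric than one that retrieves only widely-found shots.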
Efficiency
[Scatter plots: processing time (seconds, log scale) vs infAP for automatic and manually-assisted systems, distinguishing “good and fast” from “good and slow” systems.]
TRECVID 2020
Progress Task
Submission and evaluation schedule (2019–2021):
• 2019: systems submit results for the 20 fixed progress queries.
• 2020: systems submit the 20 fixed progress queries; NIST evaluates 10 queries (set A).
• 2021: systems submit the 20 fixed progress queries; NIST evaluates 10 queries (set B).
Goals:
• Evaluate 10 common queries (set A) submitted in 2 years (2019, 2020).
• Evaluate 10 common queries (set B) submitted in 3 years (2019, 2020, 2021).
• Evaluate 20 common queries submitted in 3 years (2019, 2020, 2021).
Ground truth for the 20 common queries can be released only in 2021.
TRECVID 2020
Progress subtask results (2019–2020)
[Charts: max mean infAP per team on the 10 evaluated progress queries, 2019 vs 2020. Automatic systems: SIRET, ATL, Inf, EURECOM, kindai_ogu, VIdeoREtrievalGrOup, RUCMM, RUC_AIM3, FIU_UM, ITI_CERTH, ZY_BJLAB. Manually-assisted systems: VIdeoREtrievalGrOup, WasedaMeiseiSoftbank, ZY_BJLAB.]
Samples of (tricky/failed) results
Queries: “an aerial view of buildings near water in the daytime”, “people dancing or singing while wearing costumes outdoors”, “a woman sitting on the floor”, “one or more people skydiving”, “one or more persons standing in a body of water”, “one or more persons exercising in a gym”.
All images are from the V3C1 dataset (Creative Commons videos).
Easy vs Hard Queries
Each query is ranked within its group — easy (infAP ≥ 0.5) or hard (infAP < 0.5):
• Find shots of a person paddling kayak in the water — 1
• Find shots of people dancing or singing while wearing costumes outdoors — 1
• Find shots of people or cars moving on a dirt road — 13
• Find shots of one or more persons exercising in a gym — 7
• Find shots of one or more persons standing in a body of water — 11
• Find shots of someone jumping while snowboarding — 3
• Find shots of one or more people drinking wine — 10
• Find shots of a person wearing a necklace — 4
• Find shots of a woman sitting on the floor — 9
• Find shots of one or more people skydiving — 6
• Find shots of a little boy smiling — 5
• Find shots of group of people clapping — 7
• Find shots of a woman with short hair indoors — 8
• Find shots of two or more people under a tree — 12
• Find shots showing an aerial view of buildings near water in the daytime — 6
• Find shots of sailboats in the water — 2
• Find shots of a man in blue jeans outdoors — 2
• Find shots of a church from the inside — 3
• Find shots of train tracks during the daytime — 5
• Find shots of a long-haired man — 4
2020 Main Approaches
• Still “concept-based” and “concept-free” (visual-textual embedding spaces) approaches, but a clear trend toward the latter
• Clear advantage for “embedding space” approaches, especially for fully automatic search and even overall
• Concept bank often used as a complement
• Training data for semantic spaces: MSR and TRECVid VTT tasks, TGIF, IACC.3, Flickr8k, Flickr30k, MS COCO, Conceptual Captions, VATEX … Arms race?
2020 Main Approaches
• Renmin University of China “RUC_AIM3” (presentation to follow):
  • Fully automatic (0.359): two-branch framework with global (VSE++) and fine-grain matching with Hierarchical Graph Reasoning (HGR)
• Renmin University of China “RUCMM” (presentation to follow):
  • Fully automatic (0.269): “dual encoding network” with Word to Visual Word (W2VV++) and BERT as text encoder, similar to TRECVid 2019, plus Sentence Encoder Assembly (SEA) by multi-space multi-loss learning
• City University of Hong Kong (presentation to follow):
  • Fully automatic (0.229): dual-task model learns feature embedding and concept decoding simultaneously
  • Manually assisted (0.233): same, with the user screening the concept list and removing unrelated or unspecific concepts
2020 Main Approaches
• Waseda University; Meisei University; SoftBank Corporation:
  • Fully automatic (0.200): visual-semantic embedding (VSE++)
  • Manually assisted (0.252): concept-based retrieval similar to previous years’ concept bank approach, fused with VSE
• Centre for Research and Technology Hellas:
  • Fully automatic (0.202): attention-based cross-modal deep network inspired by the dual encoding approach
• State Key Laboratory of Media Convergence Production Technology and Systems (ZY_BJLAB):
  • Fully automatic (0.202): video retrieval using multi-modal video representations from collaborative experts
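As an illustration of the embedding-space family of approaches described above (not any team's actual model), a minimal sketch: shots and text queries are projected into a shared space and shots are ranked by cosine similarity. The random projection matrices and feature dimensions below are hypothetical stand-ins for trained visual and text encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features: 2048-d visual, 300-d text.
W_vis = rng.normal(size=(2048, 256))  # stand-in for a trained visual encoder
W_txt = rng.normal(size=(300, 256))   # stand-in for a trained text encoder

def embed(x, W):
    """Project into the joint space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def rank_shots(query_vec, shot_vecs, k=1000):
    """Rank shots by cosine similarity to the query in the joint space,
    returning the top-k indices and their similarities (AVS caps k at 1000)."""
    q = embed(query_vec, W_txt)
    s = embed(shot_vecs, W_vis)
    sims = s @ q                      # cosine similarity (unit vectors)
    order = np.argsort(-sims)[:k]
    return order, sims[order]
```

Concept-bank approaches differ mainly in replacing the learned joint space with per-concept detector scores aggregated over the concepts matched in the query; several teams fuse both signals.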
2020 Task Observations
• 2nd year of AVS using the V3C1 dataset (a sub-collection of the bigger V3C dataset).
• Continued the planned 2019–2021 progress subtask.
• 9 teams finished the main task and 12 (8 + 4) teams finished the progress task.
• 26 automatic and 13 manually-assisted systems submitted runs in the main task.
• 74 total systems (22 manually-assisted and 52 automatic) were submitted for the 2019–2020 progress subtask.
• Run training types are dominated by “D” (non-IACC.3 training data) runs. Only 3 “E” (no-annotation) runs and no “R” (relevance-feedback) systems were submitted.
• No teams submitted explainability results with their runs!
• Only 2 novelty systems were submitted. Common systems performed higher on the novelty metric.
• The majority of 2020 systems performed higher than their 2019 systems in the progress subtask.
• A few automatic systems are good and fast, while a few manually-assisted systems are good and slow.
• Automatic and manually-assisted runs show high similarity in query performance relative to each other.
• Among high-scoring topics, there is more room for improvement among systems.
• Among low-scoring topics, most systems’ scores collapse into a small, narrow range.
• The absolute number of hits is comparable to 2019. Overall performance is higher than 2019 (same dataset, different queries).
• Top-scoring teams didn’t necessarily report unique relevant shots (thus they are good at ranking relevant shots).
• Hard queries are the ones that ask for unusual combinations of facets (compared to well-known concepts).
• The task is still challenging!
Interactive Video Retrieval During the Video Browser Showdown (VBS)
At MMM 2021, the 27th International Conference on Multimedia Modeling, June 22–24, 2021, Prague, Czech Republic
• 10 Ad-hoc Video Search (AVS) topics: each AVS topic has several/many target shots that should be found.
• 10 Known-Item Search (KIS) tasks, selected completely at random on site. Each KIS task has only one single 20 s long target segment.
• Registration for the task is now closed.
Images borrowed from https://videobrowsershowdown.org/pics/
Agenda (EST)
• 7:30 – 7:50 AM — RUC_AIM3: “RUC_AIM3 at TRECVID 2020: Ad-hoc Video Search”
• 7:50 – 8:10 AM — RUCMM: “Sentence Encoder Assembly for Ad-hoc Video Search”
• 8:10 – 8:30 AM — VIdeoREtrievalGrOup: “Concept versus Embedding search”
• 8:30 – 9:00 AM — Break
• 9:00 – 9:20 AM — AVS Task Discussion