Human Perspective Scene Understanding via Multimodal Sensing - Chiori Hori Mitsubishi Electric Research Laboratories (MERL)
Introduction of MERL
• The North American arm of the Corporate R&D organization of Mitsubishi Electric
  – 30 years since 1991
  – 81 members (more than 50 PhDs pursuing research)
  – Mission: Advanced application-motivated basic R&D
  – Intellectual property: more than 700 patents
  – Target areas
    • Wired/wireless communications
    • Signal processing
    • Audio and video processing
    • Spoken language interfaces
    • Computer vision
    • Mechatronics
    • Fundamental algorithms
Target Knowledge to Talk
(Figure: static knowledge DB and dynamic knowledge DB, with example utterances from smartphones such as "I'm home.", "How do I get to MERL?" (Kendall), and "Dinner is ready.")
Smartphone apps + cloud services: control appliances through the internet
What are we talking about with machines?
(Robot speech bubbles): "I can hear!" "I can see!" "I can walk!" "I can speak!" "I can dance!" "Let's have a conversation freely."
Human Machine Interaction: HMI
"It's too high for you. Do you need help?" "Yes, please!!!! Take it down."
Essential Technologies for HMI
(Figure: human understanding vs. machine understanding of the same scene, e.g., "a living room with a big window and shelves" or "a road lined with palm trees" for humans, versus streams of binary sensor data for machines)
• Humans understand scenes using natural language
• Machines understand scenes using multimodal sensing information
• To interact with humans in a natural and intuitive manner, machines need to translate the sensing information into natural language
Machines Need to Understand Context Using Natural Language
(Figure: processing pipeline)
• Source signals: spoken language ("That's a cool robot.") and multimodal sensed information (video input)
• Multimodal scene understanding: semantic understanding and video description ("Two robots: one humanoid, and one round.")
• Response generation based on scene understanding: natural language text response ("Which one? The humanoid, or the round one?")
History of AI Research
Speech recognition and dialog management (the 1st, 2nd, and 3rd AI booms, 1960-2020):
• Early stages: dynamic programming (Bellman '57), cepstrum (Bogert '63), Viterbi algorithm ('67), DP matching (Sakoe '70), noisy-channel model (Jelinek '75), beam search (Lowerre '76), delta cepstrum (Furui '81)
• Golden age: maximum-likelihood HMM (Bahl '83), NN/TDNN (Waibel '87), minimum classification error (Katagiri '91), ANN-HMM hybrid (Renals '94), WFST (Riley, Mohri '97), DARPA ATIS, handcrafted dialog based on spoken language understanding (1990), WFST-based dialog management (2008), POMDP
• Breakthrough: DNN acoustic models (Mohamed '09), spontaneous speech recognition (Seide '11), LSTM acoustic models (Graves '13), sequence-to-sequence models for MT (Cho '14), end-to-end ASR (Graves '14), encoder-decoder ASR (Bahdanau '15), Transformer (Vaswani '17), end-to-end dialog systems
• Tasks evolved from phonemes and isolated words to continuous speech and LVCSR (newspaper reading speech, Broadcast News, CallHome, lecture speech), then to spontaneous and conversational speech in real environments (YouTube, chatbots on smartphones)
Computer vision:
• Early stages (1950s-60s): neurophysiology of the visual cortex in cats/monkeys (Hubel and Wiesel), Block World (Larry Roberts, MIT, 1963), the MIT summer vision project "What it saw" (1966), first face recognition (Takeo Kanade, Kyoto Univ., 1970), Neocognitron (Fukushima '79), extracting 3D information about solid objects from 2D line drawings (1980-1985)
• From static to variation: handwritten character recognition with CNNs (LeCun), EigenFace (Turk and Pentland, 1992), DARPA Autonomous Land Vehicle and "No hands across America" (Kanade, 1993), NIST FERET facial recognition program (1993) leading to FRVT, SIFT (Lowe, 1999), Viola-Jones real-time object detection (2001), "Video Google: A Text Retrieval Approach to Object Matching in Videos" (Sivic and Zisserman, 2003), HOG (2005), DARPA VSAM video surveillance and monitoring (1998-2001)
• Deep learning era: ImageNet (image alignment with WordNet) and the ILSVRC challenge (Fei-Fei, '12), CNN architectures (AlexNet, VGG16, ResNet, BN-Inception, etc.), MS-COCO semantic segmentation, image captioning ('15-), video captioning (MSVD/MSR-VTT) and video description ('16), human action recognition (HMDB51 (Kuehne '11), UCF101 (Shah '12), Charades (Gupta '16), Kinetics 400/600 (Zisserman '17-'18))
Hand-Crafted Scenario for Tour Guide
(Figure: a handcrafted WFST dialog scenario with states and transitions for making a list of recommended locations, recommending locations one by one, determining user preferences, and confirming user selections)
"Statistical dialog management applied to WFST-based dialog systems" (Hori+, ICASSP'09), NICT 2007
Statistical Dialog Technologies
• Statistical dialog systems have been developed to provide greater robustness and flexibility, but they still rely on a discrete dialog state graph to determine the next system response
  – Many states and state transitions, especially for large problems
(Figure: rule-based model (QA for tour guide), statistical model (guide action simulator), statistical model (hotel clerk simulator))
"Statistical dialog management applied to WFST-based dialog systems" (Hori+, ICASSP'09), NICT 2007
How to scale up training data
• Language data is available with different levels of labels (type of data & labels vs. data size in words):
  – Unlabeled documents (e.g., Wikipedia)
  – Knowledge graphs
  – Conversational data (e.g., CallHome)
  – Dialog with rich labels (e.g., Kyoto tour guide)
  – Application intention understanding
  – Application dialog data
• Strategy (sketched below):
  – Learn word/sentence embeddings from unlabeled data
  – Learn embeddings on smaller data + stronger labels, starting from embeddings learned on larger data + weaker labels
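As a rough illustration of this strategy, the PyTorch sketch below initializes an embedding layer from vectors learned on large unlabeled data and then fine-tunes a small classifier on the much smaller, richly labeled data. The dimensions, the `IntentClassifier` model, and the placeholder `pretrained_vectors` are illustrative assumptions, not the systems described in this talk.

```python
import torch
import torch.nn as nn

# Assume `pretrained_vectors` (vocab_size x embed_dim) was learned on large
# unlabeled text (e.g., Wikipedia); the labeled dialog data is much smaller
# but carries stronger labels (e.g., user intents).
vocab_size, embed_dim, num_intents = 30000, 300, 20

class IntentClassifier(nn.Module):
    def __init__(self, pretrained_vectors):
        super().__init__()
        # Initialize embeddings from the larger-data / weaker-label stage.
        self.embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.encoder = nn.LSTM(embed_dim, 128, batch_first=True)
        self.out = nn.Linear(128, num_intents)

    def forward(self, token_ids):
        _, (h, _) = self.encoder(self.embed(token_ids))
        return self.out(h[-1])          # intent logits

# Placeholder random vectors stand in for real pretrained embeddings here.
model = IntentClassifier(torch.randn(vocab_size, embed_dim))
logits = model(torch.randint(0, vocab_size, (4, 12)))
print(logits.shape)  # torch.Size([4, 20]); fine-tune with cross-entropy
```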
Transition to Deep Learning
• Spoken language understanding: from discrete state graphs to DNN models with a continuous state space
• End-to-end dialog systems generate system responses directly from user inputs
  – Learn a deep network model to generate system responses without annotating intermediate symbols
(Figure: network with an input layer, a projection layer, role-dependent LSTM layers with role gates and optional recurrent connections, and an output layer)
Neural Translation Models (Bahdanau+'14): Sequence-to-Sequence Models
(Figure: the input word sequence x1...x5 passes through a word embedding network and an LSTM encoder network producing hidden activation vectors h1...h5; an LSTM decoder network with states s0, s1, s2 generates the output word sequence y1, y2, ...)
Neural Conversation Models (Vinyals+'15)
• Trained on OpenSubtitles
• Pairs of sentences were trained without context
• Various movie characters are mixed in the system role
DSTC6: Sentence-to-Sentence Generation Task (MERL 2016)
Context-dependent response generation
Evaluation:
- Comparison with 10 answers written by humans
- Human rating
Context-Dependent System Response Generation
The same user input ("Where can I visit?") receives different responses depending on the dialog context:
• New phone: USER: I need to buy a new phone. AGENT: what phone are you looking for? USER: An android phone. AGENT: what phone do you have? USER: Where can I visit? AGENT: you can check out our new phones here:
• Bathroom renovation: USER: I want to renovate my bathroom. AGENT: we are the experts in bathroom remodeling. take a look! no obligation consult: USER: Where can I visit? AGENT: you can check out our remodeling services:
• Bath product: USER: I love the new bath bombs! AGENT: we're glad you like it! USER: Are the new flavors available yet? AGENT: yes! USER: Where can I visit? AGENT: you can find our store locator here:
Milestones for Scene-aware Interaction (2015-2022)
1. Neural text dialog: context-based response generation for customer service on Twitter (input: dialog history + user input); data-driven and annotation free
2. Video description: attentional multimodal fusion for YouTube videos (input: visual + audio features)
3. Audio-visual scene-aware dialog: a question answering system for action videos with joint student-teacher learning (input: dialog history + user input + audiovisual features)
Toward scene-aware interaction for HMI: scene-understanding-based navigation ("Follow the silver car turning right") and future car navigation
Scene-Aware Interaction Car Navigation Use Case
Scene-aware Interaction for Car Navigation
Annotation 1: Bounding-box based object recognition (2D map + GPS, point cloud view)
Annotation 2: Semantic region segmentation (2D map + GPS, point cloud view)
Scene-Aware Interaction: Future Car Navigation https://www.youtube.com/watch?v=t0izXoT_Aoc
• Landmark-based navigation: "Turn right at the corner?" "Yes! After passing the gas station."
• Tracking-based voice warning
(Views: 3D point cloud, camera view, top-down map)
Multimodal Feature Extraction 1 Original Video Feature Extraction 1: Bounding-box based object recognition
Multimodal Feature Extraction 2 Original Video Feature Extraction 2: Semantic region segmentation
Multimodal Feature Extraction 3 Original Video Feature Extraction 3: Bounding-box Tracking
Multimodal Feature Extraction 4: Prerecorded data from the Mobile Mapping System (MMS) provides object locations in a street view (http://www.mitsubishielectric.com/bu/mms/). Landmark localization: original video, point cloud view, top-down map.
Object recognition results change as the vehicle moves (Feature Extraction 1: Bounding-box based object recognition)
Scene-aware Interaction for Daily Life: Monitoring and Surveillance Camera Systems
Example: suspicious actions near the station
• User asks for a search target: "Find a small girl wearing a pink T-shirt."
• System narrows down: "Is she wearing a hat?"
• User answers to add more information: "Yes, she is wearing a straw hat."
(CNN: "After Boston: The pros and cons of surveillance cameras")
Scene understanding using video description:
• Visual features: object and event recognition
• Audio features: audio event recognition
• Scene understanding: video description
• Dialog history: context-based future prediction
• Response generation: sequence-to-sequence generation
https://www.merl.com/demos/video-description
Sequence-to-sequence models
• Neural networks that learn a mapping function between given input and output sequences in an end-to-end manner
  – LSTM encoder-decoder: conversation + MT [Vinyals+'15]
  – Attention-based encoder-decoder: MT [Bahdanau+'14]
  – Transformer: MT [Vaswani+'17]
• Widely used for various sequence-to-sequence tasks:
  Task                    Input                  Output
  Speech recognition      Speech signal          Sentence text
  Machine translation     Source language text   Target language text
  Language understanding  Sentence text          Semantic label sequence
  Dialog generation       User utterance         System response
  Video description       Image sequence         Sentence text
LSTM encoder-decoder [Vinyals+'15]
(Figure: an encoder LSTM reads the input x1...x4; its final state initializes the decoder LSTM, which generates the outputs y1, y2, y3. LSTM: Long Short-Term Memory.)
Pros: Simple recurrent architecture with LSTM cells, which can memorize relatively long contextual information compared to vanilla RNNs, whose contextual information decays exponentially.
Cons: Information from long input sequences may be lost when the sequence is summarized into a fixed-dimensional vector in the last encoder state.
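A minimal PyTorch sketch of a plain LSTM encoder-decoder of this kind; the vocabulary size, dimensions, and toy batch are illustrative assumptions, not the configuration of [Vinyals+'15].

```python
import torch
import torch.nn as nn

class Seq2SeqLSTM(nn.Module):
    """Plain LSTM encoder-decoder: the input sequence is summarized into the
    encoder's final (h, c) state, which initializes the decoder."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.embed(src_ids))      # fixed-size summary
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)                          # next-word logits

model = Seq2SeqLSTM()
logits = model(torch.randint(0, 10000, (2, 7)),           # input word IDs
               torch.randint(0, 10000, (2, 5)))           # shifted target IDs
print(logits.shape)  # torch.Size([2, 5, 10000])
```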
Attention-based encoder-decoder [Bahdanau+'14]
(Figure: a BLSTM encoder over x1...x4, an attention mechanism, and an LSTM decoder generating y1, y2, y3)
Pros:
• The BLSTM encoder captures bidirectional dependencies within the input sequence
• The attention mechanism allows the decoder to access the full encoder outputs
Cons:
• A BLSTM (or LSTM) can utilize only adjacent state information, which may be insufficient to capture long context dependencies over the input (or output) sequence
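A minimal sketch of Bahdanau-style additive attention in PyTorch, showing how each decoder step scores all encoder outputs and forms a weighted context vector; the dimensions and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Score each encoder output against the current decoder state and
    return the attention-weighted context vector."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_outputs, dec_state):
        # enc_outputs: (batch, src_len, enc_dim); dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(self.W_enc(enc_outputs)
                                   + self.W_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)            # (batch, src_len, 1)
        context = (weights * enc_outputs).sum(dim=1)      # (batch, enc_dim)
        return context, weights

attn = AdditiveAttention(enc_dim=512, dec_dim=512, attn_dim=256)
context, w = attn(torch.randn(2, 9, 512), torch.randn(2, 512))
print(context.shape, w.shape)  # torch.Size([2, 512]) torch.Size([2, 9, 1])
```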
Transformer [Vaswani+'17]
(Figure: an encoder with N stacked self-attention blocks over x1...x4, and a decoder with M stacked self-attention and source-attention blocks generating y1, y2, y3)
Pros:
• Feed-forward networks with residual connections enable learning very deep architectures, which significantly improves accuracy
• The self-attention mechanism allows the encoder and the decoder to utilize full-sequence context
• Currently the most successful model for many tasks
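A minimal sketch of scaled dot-product self-attention, the core operation of the Transformer block; residual connections, layer normalization, multi-head splitting, and the feed-forward sublayer are omitted. The last two lines show the complete encoder block that PyTorch already provides.

```python
import torch
import torch.nn as nn

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention: every position attends to every
    other position, giving full-sequence context in a single step."""
    q, k, v = w_q(x), w_k(x), w_v(x)                 # (batch, seq, d_model)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5       # all-pairs similarities
    return torch.softmax(scores, dim=-1) @ v          # context-mixed features

d_model = 64
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
out = self_attention(torch.randn(2, 10, d_model), w_q, w_k, w_v)
print(out.shape)  # torch.Size([2, 10, 64])

# In practice, PyTorch provides full encoder blocks (attention + FFN + norm):
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoded = layer(torch.randn(2, 10, d_model))
```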
Large-scale pretrained language models (LMs) for various sequence-to-sequence tasks in NLP
• BERT: a Transformer LM that predicts randomly masked words and whether one sentence follows another (next-sentence prediction) [Devlin+'19]
  – Sentence pairs are fed with separator symbols (e.g., QA pairs)
  – Needs task-specific labeled data for fine-tuning
  – Achieves state-of-the-art performance on various NLP tasks
• GPT-3: a Transformer LM that simply predicts next words [Brown+'20]
  – Can be applied to various NLP tasks without fine-tuning
  – Achieves state-of-the-art performance on several tasks just by providing a task-specifying sentence and a few input/output examples at inference time
  – The largest GPT-3 has 175 billion parameters, roughly 500x larger than the largest BERT model
  – Still difficult to generate natural long documents
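As a hedged usage example, assuming the Hugging Face transformers library and its public bert-base-uncased and gpt2 checkpoints (GPT-2 standing in for the much larger GPT-3), masked-word prediction and next-word generation can be tried in a few lines:

```python
from transformers import pipeline

# BERT-style masked-word prediction: returns ranked candidates for [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The robot can [MASK] the room."))

# GPT-style next-word generation: continues the prompt autoregressively.
generator = pipeline("text-generation", model="gpt2")
print(generator("Two robots: one humanoid,", max_length=20))
```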
Progress of AI platforms
• Hardware
  – GPU (Nvidia, AMD, ...), TPU, FPGA, ...; the number of cores, clock speeds, and memory keep increasing (e.g., the new Nvidia A40 has 10,752 cores, 48 GB memory, ...)
• GPU libraries
  – CUDA, CUDA Toolkit, OpenCL, ...: useful and efficient
• Deep learning toolkits
  – Caffe, Theano, Torch, CNTK, Chainer, MXNet, ..., TensorFlow, PyTorch
  – Easy implementation of complicated network architectures and training/testing procedures with Python scripting
    • Build computational graphs in advance (TensorFlow)
    • Define-by-run (PyTorch, originally from Chainer); see the sketch below
  – Publicly available code (e.g., GitHub) and models (e.g., model zoos)
• Nice ideas (+ computational resources and data) are important!
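A tiny PyTorch example of the define-by-run style: the computation graph is recorded as the Python code executes, so ordinary control flow can change the network per input. The toy tensors here are arbitrary.

```python
import torch

def forward(x, w):
    h = x @ w
    if h.sum() > 0:          # data-dependent branching, recorded on the fly
        h = torch.relu(h)
    else:
        h = torch.tanh(h)
    return h.mean()

w = torch.randn(4, 3, requires_grad=True)
loss = forward(torch.randn(2, 4), w)
loss.backward()              # gradients follow whichever branch actually ran
print(w.grad.shape)          # torch.Size([4, 3])
```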
Semantic Representation Using Audio, Video, and Text Features
(Figure: text, audio, and video features feed a network with input, projection, hidden, and output layers, supporting text understanding, speech recognition, and video description)
Encoder-Decoder LSTM for Video Description (2015)
(Work by the University of Texas at Austin; University of California, Berkeley; University of Massachusetts, Lowell; and the International Computer Science Institute, Berkeley)
Multimodal Scene Understanding
Objects and events in scenes are recognized using multimodal information:
1. Image features: objects, their localization, and their relations (girl, hill, airplane; "A girl is sitting on a hill.")
2. Spatiotemporal features: motion and action ("A girl is jumping.")
3. Audio features: sounds and speech ("A girl is looking at an airplane flying overhead.")
Multimodal Fusion
Longstanding area of research: how to combine information from multiple modalities for machine perception?
• Bayesian adaptation approaches: J. R. Movellan and P. Mineiro, "Robust sensor fusion: Analysis and application to audio visual speech recognition," Machine Learning, 1998
• Stream weights: G. Gravier, et al., "Maximum entropy and MCE based HMM stream weight estimation for audio-visual ASR," ICASSP, 2002
This work was the first to fuse multimodal information using attention between modalities in a neural network.
Naïve Fusion of Modalities vs. Attention-Based Multimodal Fusion (NEW!)
(Figure: in naïve fusion, each modality's frame features are temporally attended to produce context vectors c1,i and c2,i, which are projected (WC1, WC2) and simply combined into the decoder input gi. In the new attention-based fusion, additional attention weights over the modalities weight the projected context vectors before they are combined.)
• Context vector: a weighted sum of frame features, with each modality projected into a common space
• Attention weights are computed for each input modality and input time
• The decoder selectively attends to specific modalities
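The sketch below illustrates the modality-attention idea in PyTorch: per-modality context vectors are projected into a common space and combined with weights computed from the current decoder state. It is a simplified reconstruction with assumed dimensions, not the exact MERL model.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Attention over modalities: project each modality's context vector into
    a common space, score it against the decoder state, and fuse."""
    def __init__(self, modality_dims, common_dim, dec_dim):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, common_dim) for d in modality_dims])
        self.score = nn.ModuleList([nn.Linear(common_dim + dec_dim, 1)
                                    for _ in modality_dims])

    def forward(self, contexts, dec_state):
        # contexts: list of per-modality context vectors (batch, dim_k), each
        # already a temporal-attention-weighted sum of frame features.
        projected = [p(c) for p, c in zip(self.proj, contexts)]
        scores = torch.cat([s(torch.cat([d, dec_state], dim=-1))
                            for s, d in zip(self.score, projected)], dim=-1)
        beta = torch.softmax(scores, dim=-1)              # modality weights
        fused = sum(beta[:, k:k + 1] * projected[k] for k in range(len(projected)))
        return fused, beta

fusion = ModalityAttentionFusion([512, 2048], common_dim=256, dec_dim=512)
g, beta = fusion([torch.randn(2, 512), torch.randn(2, 2048)], torch.randn(2, 512))
print(g.shape, beta.shape)  # torch.Size([2, 256]) torch.Size([2, 2])
```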
Sample Videos with Automatic Description (Image: VGG, Motion: C3D, Audio: MFCC)
1. This video shows improvements due to the multimodal attention mechanism.
2. Our use of audio features enables identification of the peeling action.
3. Audio features make the description worse due to overdubbed music.
4. Our multimodal attention mechanism and audio features are complementary.
Unimodal < naïve multimodal fusion < attentional multimodal fusion
Table 4. A list of words with strong average attention weights for each modality, obtained on the MSR-VTT subset using our "Attentional Fusion (AV)" multimodal attention method.
  Image (VGG-16)       Motion (C3D)         Audio (MFCC)
  bowl      0.9701     track      0.9887    talking   0.3435
  pan       0.9426     motorcycle 0.9564    shown     0.3072
  recipe    0.9209     baseball   0.9378    playing   0.2599
  piece     0.9136     football   0.9275    singing   0.2465
  paper     0.9098     horse      0.9212    driving   0.2284
  kitchen   0.8827     soccer     0.9099    working   0.2004
  toy       0.8758     basketball 0.9096    walking   0.1999
  folding   0.8423     tennis     0.8958    riding    0.1900
  makeup    0.8326     player     0.8720    showing   0.1836
  guitar    0.7723     two        0.8345    dancing   0.1832
  applying  0.7691     video      0.8237    wrestling 0.1735
  food      0.7547     men        0.8198    running   0.1689
  making    0.7470     running    0.7680    applying  0.1664
Our multimodal attention enables us to see which words rely most on each modality:
• Image features: words describing generic object types
• Motion features: words describing scenes involving motion, such as sports and vehicles
• Audio features: action verbs associated with sound, such as talking, singing, and driving
Audio Visual Scene-aware Dialog
Huda Alamri, Jue Wang, Vincent Cartillier, Gordon Wichern, Raphael G. Lopes, Takaaki Hori, Abhishek Das, Anoop Cherian, Irfan Essa, Tim K. Marks, Dhruv Batra, Chiori Hori, Devi Parikh
VQA to Visual Dialog (Devi Parikh and Dhruv Batra)
• VQA: answering a question about an image
• Visual Dialog: a dialog discussing an image
Visual Dialog to Audio Visual Scene-Aware Dialog (AVSD)
• Visual Dialog: a dialog discussing an image
• AVSD: a dialog discussing a video
Audio Visual Scene-aware Dialog http://www.video-dialog.com/
A dataset with different modalities: video, audio, dialog history, new question → ground-truth answer
AVSD data set
Sentence Selection or Sentence Generation
• Sentence selection:
  - Information retrieval framework
  - Difficulty depends on how the multiple candidates are prepared
• Sentence generation:
  - Speech recognition / machine translation framework
  - Answer generation depends on language models
N-gram Distribution for Questions and Answers
(Figure: n-gram distributions of AVSD questions and AVSD answers, and AVSD sentence lengths)
Multimodal Dialog
Open Research Platform
2018: Special issue on the "7th Dialog System Technology Challenge" in Computer Speech and Language
2019: Special issue on the "Eighth Dialog System Technology Challenge" in IEEE/ACM TASLP
DSTC8 Best System: "Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog" by Zekang Li et al.
● Considering the similarity between the summary and the video caption, the summary and the caption are concatenated into one sequence.
● Let U = {Q_1, R_1, Q_2, R_2, ..., Q_N, R_N} denote the N turns of dialogue, where Q_n is the n-th question and R_n is the n-th response containing m words.
● The model estimates the probability of generating the response R_n for the given question Q_n, considering the video V, the audio A, and the dialogue history U.
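Written out, a standard autoregressive form of this probability is the following sketch (the exact conditioning used by Li et al. may differ in detail):

```latex
P(R_n \mid V, A, U_{<n}, Q_n) = \prod_{m=1}^{M_n} P\left(r_{n,m} \mid r_{n,<m},\, V,\, A,\, U_{<n},\, Q_n\right)
```

where r_{n,m} is the m-th word of R_n, M_n is the length of R_n, and U_{<n} denotes the dialogue turns preceding turn n.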
Multimodal Dialog
Essential Video Captioning Power
● Manual captions/summaries are not available in real applications.
● Answer generation models need video captioning power.
Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog
(Figure: a teacher network that sees the description, video, and dialog context + question, and a student network that sees only the video and dialog context + question; the student is trained to mimic the teacher's output distributions and multimodal context vectors.)
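A sketch of the two "mimic" objectives, assuming the usual distillation formulation: a KL divergence between the student's and teacher's output word distributions plus a regression loss on their context vectors. The temperature and equal loss weighting are assumptions, not the exact training setup of this work.

```python
import torch
import torch.nn.functional as F

def student_teacher_losses(student_logits, teacher_logits,
                           student_ctx, teacher_ctx, temperature=1.0):
    # Soft-label distillation on the answer-word distributions.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean")
    # Regression of the student's multimodal context vectors onto the teacher's.
    ctx = F.mse_loss(student_ctx, teacher_ctx)
    return kd, ctx

kd, ctx = student_teacher_losses(torch.randn(4, 10000), torch.randn(4, 10000),
                                 torch.randn(4, 512), torch.randn(4, 512))
loss = kd + ctx   # combined objective (equal weights assumed here)
```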
AVSD@DSTC10: 3rd Edition of the Audio Visual Scene-Aware Dialog Challenge https://github.com/dialogtekgeek/AVSD-DSTC10_Official
Task 1: Video QA dialog
Goal: answer generation without using manual descriptions for inference. You can train models using manual descriptions but CANNOT use them for testing; video description capability needs to be embedded within the answer generation models.
Task 2: Grounding video QA dialog
Goal: answer reasoning temporal localization. To support answers, evidence is required to be shown without using manual descriptions.
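For Task 2, predicted evidence segments are commonly compared with annotated ones by temporal intersection-over-union; the function below is a generic illustration of that measure, not the official DSTC10 scoring code.

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two time intervals given as (start, end)
    in seconds, e.g. a predicted evidence span vs. an annotated one."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. a predicted "boy is writing" span vs. an annotated 2.00-20.00 segment
print(temporal_iou((3.0, 18.0), (2.0, 20.0)))  # ~0.833
```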
Video Content (annotation interface)
A video player with a timeline (2.00, 10.00, 20.00, 25.00, 27.00, 30.00, 40.00): press the buttons to play and stop, or slide the bar to find the begin and end timing of reasoning events. Press the "Begin" and "End" buttons to extract the current time from the video player; press the "+" button to add more evidence slots or the "-" button to remove the last evidence. If the given answer is incorrect, check "False" and write a correct answer with a timestamp. For each piece of evidence, select visual or audio and explain the reason for extracting the event, to justify why the answer is correct. Find as much evidence as possible.
Example, Turn 1:
  Question (Q): What actions are taken by the boy?
  Answer (A): the boy is writing while singing and then drinks a glass of orange juice.
  Evidence: 2.00-20.00 The boy is writing in the notebook. 10.00-27.00 The boy is singing rock music. 30.00-40.00 The boy is drinking from the glass.
AVSD@DSTC Challenge Schedule
June 14th, 2021: Answer generation data release
June 30th, 2021: Answer reasoning temporal localization data and baseline release
Sep. 13th, 2021: Test data release
Sep. 21st, 2021: Test submission due
Nov. 1st, 2021: Challenge paper submission due
Jan. or Feb., 2022: Workshop
Join us!!!