AutoVideo: An Automated Video Action Recognition System
Daochen Zha†*, Zaid Pervaiz Bhat‡*, Yi-Wei Chen‡*, Yicheng Wang‡*, Sirui Ding‡*, Jiaben Chen†*, Kwei-Herng Lai†*, Anmoll Kumar Jain‡, Mohammad Qazim Bhat‡, Na Zou§, Xia Hu†
†Department of Computer Science, Rice University
‡Department of Computer Science and Engineering, Texas A&M University
§Department of Engineering Technology and Industrial Distribution, Texas A&M University
{zaid.bhat1234,yiwei chen,wangyc,siruiding,jain.anmollkumar,nzou1},
arXiv:2108.04212v3 [cs.CV] 12 Oct 2021

* These authors contributed equally to this project.

Abstract

Action recognition is a crucial task for video understanding. In this paper, we present AutoVideo, a Python system for automated video action recognition. It currently supports seven action recognition algorithms and various pre-processing modules. Unlike existing libraries that only provide model zoos, AutoVideo is built on the standard pipeline language. The basic building block is a primitive, which wraps a pre-processing module or an algorithm together with its hyperparameters. AutoVideo is highly modular and extensible, and it can be easily combined with AutoML searchers. The pipeline language is general enough that AutoVideo can be enriched with algorithms for other video-related tasks in the future. AutoVideo is released under the MIT license at

Introduction

Video-based action recognition aims at identifying different actions from video clips. It is a crucial task for video understanding with broad applications in various areas, such as security (Meng, Pears, and Bailey 2007), healthcare (Gao et al. 2018), and behavior analysis (Poppe 2010).

Recently, deep learning models have achieved promising performance in action recognition tasks (Tran et al. 2015; Wang et al. 2016; Carreira and Zisserman 2017; Zolfaghari, Singh, and Brox 2018). However, the success of these models relies heavily on human expertise. Given a new task beyond the benchmark datasets, a practitioner often still needs many laborious trials to test different models and tune the hyperparameters, which significantly impedes real-world applications.

To bridge this gap, we aim to design an easy-to-use toolkit that helps practitioners quickly develop prototypes for new video action recognition tasks. In particular, we present a highly modular and extensible system that wraps data loading, data processing, and state-of-the-art action recognition models with the standard pipeline language. Following the pipeline language, we introduce data-driven searchers that automatically tune the hyperparameters to reduce human effort. Although other open-source video understanding libraries exist (Yue Zhao 2019; Team 2021), they often focus only on providing a model zoo for users. In contrast, we wrap the models with a highly modular and extensible pipeline language and inherently support hyperparameter tuning to save users' effort. Unlike some other research-oriented AutoML studies for videos (Piergiovanni et al. 2019; Ryoo et al. 2019), we target an open-source toolkit with unified interfaces that saves practitioners the engineering effort of developing models.

In this paper, we present the first release of our system for Automated Video Action Recognition, namely AutoVideo. It is designed under the D3M infrastructure (Milutinovic et al. 2020; Lai et al. 2021), i.e., data-driven model discovery via automated machine learning. In AutoVideo, the data loading, data processing, and learning models are all wrapped as primitives, and their combinations form various machine learning pipelines. Following this design, AutoVideo supports data-driven searchers that automatically explore different pipelines and hyperparameters to identify the best model design. AutoVideo is highly modular and extensible since it follows the standard pipeline language. As such, practitioners can easily add new modules for action recognition or beyond, such as other video understanding tasks.

AutoVideo System

Figure 1 shows an overview of the AutoVideo system. It wraps the pre-processing and action recognition modules as primitives based on the standard pipeline language. A tuner can then be used to automatically discover the best pipelines. Our first release supports seven state-of-the-art action recognition models: TSN (Wang et al. 2016), TSM (Lin, Gan, and Han 2019), I3D (Carreira and Zisserman 2017), ECO (Zolfaghari, Singh, and Brox 2018), C3D (Tran et al. 2015), R2P1D (Hou et al. 2019), and R3D (Hara, Kataoka, and Satoh 2017). This section first introduces the standard pipeline language and then elaborates on the interface.

[Figure 1 diagram. Labels: CSV Reader; Pre-Processing (Frame Extraction, Frame Transformation); Action Recognition (TSN, TSM, ..., ECO); Automated Tuner.]

Figure 1: System overview. Each module in AutoVideo is wrapped as a primitive with some hyperparameters. A pipeline consists of a series of primitives from pre-processing to action recognition. AutoVideo is equipped with tuners to search models and hyperparameters.

Primitives and Pipelines

While there are many other AutoML systems (Jin, Song, and Hu 2019; Thornton et al. 2013; Li et al. 2020), the standard pipeline language used in the D3M program provides generic and extensible descriptions for modules and pipelines. The interface can be easily adapted to various tasks. As such, we build AutoVideo upon the D3M infrastructure.

A primitive is the basic building block. It is an implementation of a function with some hyperparameters. A pipeline is a Directed Acyclic Graph (DAG) consisting of several primitive steps. Data types, such as NumPy ndarrays, can be passed between steps in a pipeline. Each of the above concepts is associated with metadata to describe parameters, hyperparameters, etc. More details of the pipeline language can be found in Milutinovic et al. (2020). While our current scope focuses on action recognition, the pipeline language allows AutoVideo to be easily extended to support other video-related tasks, which will be our future work.

Interface Design

The interface is based upon Axolotl¹, our high-level abstraction of D3M. We further wrap and adapt the Axolotl interfaces to video-related functionalities. A minimal example of fitting an AutoVideo model is as follows.

    # build_pipeline is assumed to be exported by the autovideo package, like fit
    from autovideo import fit, build_pipeline, extract_frames

    # Extract frames from the videos; the file extension is taken
    # from the second column of the first row of the dataframe
    video_ext = train_dataset.iloc[0, 1].split('.')[-1]
    extract_frames(train_media_dir, video_ext)

    # Build the pipeline from a configuration dictionary
    pipeline = build_pipeline({
        "algorithm": args.alg,
        "load_pretrained": args.pretrained,
    })

    # Fit the pipeline on the training data
    fit(train_dataset, train_media_dir,
        target_index=target_index,
        pipeline=pipeline)

Example 1: Code snippet of fitting a model.

Here, train_dataset is the dataframe describing the video file names and labels, train_media_dir is the directory containing the folders with video files, and target_index specifies which column holds the label. Users can customize the configuration dictionary passed to the build_pipeline function to train different algorithms with different hyperparameters. The fit function returns the fitted pipeline, which can be pickled and saved for future use. Similar function calls can be executed on a testing dataset to make predictions.

In addition to fitting models with specified configurations, users can also use tuners to automatically search the pipelines (i.e., models and hyperparameters) in a data-driven manner. An example interface is shown below.

    from autovideo.searcher import RaySearcher

    # Create a searcher over the training and validation splits
    searcher = RaySearcher(train_dataset,
        train_media_dir,
        valid_dataset=valid_dataset,
        valid_media_dir=valid_media_dir)

    # Run the search; the method name `search` is assumed here
    best_config = searcher.search(
        search_space=search_space,
        config=config
    )

Example 2: Code snippet of automated tuning.

Here, search_space is a dictionary specifying which models and which hyperparameter ranges the tuner will cover, and config specifies some tuning configurations.

¹

Evaluation of the Automated Tuners

Our current system supports two types of tuners: random search and Hyperopt (Bergstra et al. 2013). We compare the two tuners, as well as the results obtained by manually tuning the hyperparameters, on HMDB-51 and on HMDB-6, a subset of the original dataset with six categories. Both tuners use the same configurations, such as the search space. The detailed experimental settings are provided in the Appendix. We observe that both tuners outperform the manually tuned pipeline on both datasets (Table 1), by more than 19 percentage points on HMDB-6 and more than 7 on HMDB-51, and Hyperopt achieves the best accuracy. The results verify the importance of hyperparameter tuning in action recognition tasks.

                    HMDB-6     HMDB-51
Manually-Tuned      70.35%     46.8%
Random Search       89.63%     53.92%
Hyperopt            92.6%      56.14%

Table 1: Accuracy of tuners in AutoVideo.
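
As a rough illustration of what a random search tuner does conceptually, the following minimal sketch samples configurations uniformly from a discrete search space and keeps the best-scoring one. It is not AutoVideo's implementation: the search space, the hyperparameter names other than "algorithm", and the functions toy_validation_accuracy, sample_config, and random_search are hypothetical, and the scoring function returns synthetic numbers in place of actually training and validating a pipeline.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]

# Hypothetical search space: one list of candidate values per hyperparameter.
SEARCH_SPACE: Dict[str, List[Any]] = {
    "algorithm": ["tsn", "tsm", "i3d"],
    "learning_rate": [0.0001, 0.001, 0.01],
    "momentum": [0.8, 0.9, 0.99],
}


def toy_validation_accuracy(config: Config) -> float:
    """Stand-in for fitting a pipeline and scoring it on validation data.

    The values are synthetic; a real tuner would train the model here.
    """
    base = {"tsn": 0.50, "tsm": 0.55, "i3d": 0.52}[config["algorithm"]]
    lr_penalty = abs(config["learning_rate"] - 0.001) * 10.0
    momentum_penalty = abs(config["momentum"] - 0.9)
    return base - lr_penalty - momentum_penalty


def sample_config(space: Dict[str, List[Any]], rng: random.Random) -> Config:
    """Draw one value uniformly at random for every hyperparameter."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(
    evaluate: Callable[[Config], float],
    space: Dict[str, List[Any]],
    num_trials: int,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Evaluate num_trials random configurations and return the best one."""
    if num_trials < 1:
        raise ValueError("num_trials must be at least 1")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    best_config: Config = {}
    best_score = float("-inf")
    for _ in range(num_trials):
        config = sample_config(space, rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    best_config, best_score = random_search(
        toy_validation_accuracy, SEARCH_SPACE, num_trials=20
    )
    print(best_config, round(best_score, 4))
```

Random search treats every trial independently, which makes it simple and easy to parallelize. A model-based tuner such as Hyperopt instead uses the scores of earlier trials to decide which configurations to try next.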
