AUTOVIDEO: AN AUTOMATED VIDEO ACTION RECOGNITION SYSTEM
Daochen Zha†*, Zaid Pervaiz Bhat‡*, Yi-Wei Chen‡*, Yicheng Wang‡*, Sirui Ding‡*, Jiaben Chen†*, Kwei-Herng Lai†*, Anmoll Kumar Jain‡, Mohammad Qazim Bhat‡, Na Zou§, Xia Hu†

†Department of Computer Science, Rice University
‡Department of Computer Science and Engineering, Texas A&M University
§Department of Engineering Technology and Industrial Distribution, Texas A&M University

{daochen.zha,khlai,xiahu}@rice.edu, {zaid.bhat1234,yiwei_chen,wangyc,siruiding,jain.anmollkumar,nzou1}@tamu.edu, mqazimbhat@gmail.com, chenjb1@shanghaitech.edu.cn

arXiv:2108.04212v3 [cs.CV] 12 Oct 2021

* These authors contributed equally to this project.

Abstract

Action recognition is a crucial task for video understanding. In this paper, we present AutoVideo, a Python system for automated video action recognition. It currently supports seven action recognition algorithms and various pre-processing modules. Unlike the existing libraries that only provide model zoos, AutoVideo is built with the standard pipeline language. The basic building block is the primitive, which wraps a pre-processing module or an algorithm with some hyperparameters. AutoVideo is highly modular and extendable, and it can be easily combined with AutoML searchers. The pipeline language is quite general, so we can easily enrich AutoVideo with algorithms for various other video-related tasks in the future. AutoVideo is released under the MIT license at https://github.com/datamllab/autovideo.

[Figure 1: System overview. Each module in AutoVideo is wrapped as a primitive with some hyperparameters. A pipeline consists of a series of primitives from pre-processing to action recognition (CSV Reader, Frame Extraction, Frame Transformation, and Action Recognition with algorithms such as TSN, TSM, ..., ECO), together with an Automated Tuner. AutoVideo is equipped with tuners to search models and hyperparameters.]

Introduction

Video-based action recognition aims at identifying different actions from video clips. It is a crucial task for video understanding with broad applications in various areas, such as security (Meng, Pears, and Bailey 2007), healthcare (Gao et al. 2018), and behavior analysis (Poppe 2010).

Recently, deep learning models have achieved promising performance in action recognition tasks (Tran et al. 2015; Wang et al. 2016; Carreira and Zisserman 2017; Zolfaghari, Singh, and Brox 2018). However, the success of these models heavily relies on human expertise. Given a new task beyond the benchmark datasets, a practitioner often still needs extensive, laborious trials to test different models and tune the hyperparameters, which significantly impedes real-world applications.

To bridge this gap, we aim to design an easy-to-use toolkit to help practitioners quickly develop prototypes for any new video action recognition task. In particular, we present a highly modular and extendable system that wraps the data loading, data processing, and state-of-the-art action recognition models with the standard pipeline language. Following the pipeline language, we introduce data-driven searchers that automatically tune the hyperparameters to reduce human effort. While there are some other open-sourced video understanding libraries (Yue Zhao 2019; Team 2021), they often only focus on providing a model zoo for users. In contrast, we wrap the models with a highly modular and extendable pipeline language and inherently support hyperparameter tuning to save users' effort. Unlike some other research-oriented AutoML studies for videos (Piergiovanni et al. 2019; Ryoo et al. 2019), we target an open-sourced toolkit with unified interfaces for practitioners, saving the engineering effort of developing models.

In this paper, we present the first release of our system for Automated Video Action Recognition, namely AutoVideo. It is designed under the D3M infrastructure (Milutinovic et al. 2020; Lai et al. 2021), i.e., data-driven model discovery via automated machine learning. In AutoVideo, the data loading, data processing, and learning models are all wrapped as primitives, and their combinations form various machine learning pipelines.
Following this design, AutoVideo supports data-driven searchers that automatically explore different pipelines and hyperparameters to identify the best model design. AutoVideo is highly modular and extensible since it follows the standard pipeline language. As such, practitioners can easily add new modules for action recognition or beyond, such as other video understanding tasks.

AutoVideo System

Figure 1 shows an overview of the AutoVideo system. It wraps the pre-processing and action recognition modules as primitives based on the standard pipeline language. A tuner can then be used to automatically discover the best pipelines. Our first release supports seven state-of-the-art action recognition models, including TSN (Wang et al. 2016), TSM (Lin, Gan, and Han 2019), I3D (Carreira and Zisserman 2017), ECO (Zolfaghari, Singh, and Brox 2018), C3D (Tran et al. 2015), R2P1D (Hou et al. 2019), and R3D (Hara, Kataoka, and Satoh 2017). This section first introduces the standard pipeline language and then elaborates on the interface.

Primitives and Pipelines

While there are many other AutoML systems (Jin, Song, and Hu 2019; Thornton et al. 2013; Li et al. 2020), the standard pipeline language used in the D3M program provides generic and extendable descriptions for modules and pipelines, and the interface can be easily adapted to various tasks. As such, we build AutoVideo upon the D3M infrastructure. A primitive is the basic building block: an implementation of a function with some hyperparameters. A pipeline is a Directed Acyclic Graph (DAG) consisting of several primitive steps. Data types, such as NumPy ndarrays, can be passed between steps in a pipeline. Each of the above concepts is associated with metadata that describes parameters, hyperparameters, etc. More details of the pipeline language can be found in (Milutinovic et al. 2020). While our current scope focuses on action recognition, the pipeline language allows AutoVideo to be easily extended to support other video-related tasks, which will be our future work.

Interface Design

The interface is based upon Axolotl (https://gitlab.com/axolotl1/axolotl), our high-level abstraction of D3M. We further wrap and adapt the Axolotl interfaces to video-related functionalities. A minimum example of fitting an AutoVideo model is as follows.

from autovideo import fit, extract_frames, build_pipeline

# Extract frames from the videos
video_ext = train_dataset.iloc[0, 1].split('.')[-1]
extract_frames(train_media_dir, video_ext)

# Build pipeline based on config
pipeline = build_pipeline({
    "algorithm": args.alg,
    "load_pretrained": args.pretrained,
})

# Fit
fit(train_dataset, train_media_dir,
    target_index=target_index,
    pipeline=pipeline)

Example 1: Code snippet of fitting a model.

Here, train_dataset is the dataframe describing the video file names and labels, train_media_dir is the directory of the folders with video files, and target_index specifies which column is the label. Users can customize the configuration dictionary passed to build_pipeline to train different algorithms with different hyperparameters. The fit function returns the fitted pipeline, which can be pickled and saved for future use. Similar function calls can be executed for a testing dataset to make predictions; a sketch of this workflow is given below.
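As a concrete illustration of this save-and-predict workflow, the following sketch pickles the fitted pipeline and later reloads it for inference on a testing dataset. It reuses the variables from Example 1; the produce call and its argument names are assumptions made for illustration, and the released package may expose a different prediction entry point.

import pickle

# Persist the fitted pipeline returned by fit(...)
fitted_pipeline = fit(train_dataset, train_media_dir,
                      target_index=target_index,
                      pipeline=pipeline)
with open("fitted_pipeline.pkl", "wb") as f:
    pickle.dump(fitted_pipeline, f)

# Later: reload the pipeline and predict on a testing dataset.
# `produce` is a hypothetical counterpart of `fit`; consult the
# package documentation for the actual prediction interface.
with open("fitted_pipeline.pkl", "rb") as f:
    fitted_pipeline = pickle.load(f)
predictions = produce(test_dataset, test_media_dir,
                      target_index=target_index,
                      fitted_pipeline=fitted_pipeline)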
In addition to fitting models with specified configurations, users can also use tuners to automatically search the pipelines (i.e., models and hyperparameters) in a data-driven manner. An example interface is as follows.

from autovideo.searcher import RaySearcher

searcher = RaySearcher(train_dataset,
                       train_media_dir,
                       valid_dataset=valid_dataset,
                       valid_media_dir=valid_media_dir)

best_config = searcher.search(
    search_space=search_space,
    config=config
)

Example 2: Code snippet of automated tuning.

Here, search_space is a dictionary specifying which models and which ranges of hyperparameters the tuner will cover, and config specifies some tuning configurations. A plausible shape of these two dictionaries is sketched below.
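The following sketch illustrates what such dictionaries might look like, using the search space reported in the Appendix (continuous ranges for learning rate, momentum, and weight decay, and a discrete choice for the number of segments). The key names and value formats are assumptions for illustration only; the released package defines its own schema.

# A hypothetical search space mirroring the ranges in the Appendix.
# Key names and value formats are illustrative, not the package's schema.
search_space = {
    "algorithm": ["tsn"],               # which models to consider
    "learning_rate": [0.0001, 0.001],   # continuous range
    "momentum": [0.9, 0.99],            # continuous range
    "weight_decay": [5e-4, 1e-3],       # continuous range
    "num_segments": [8, 16, 32],        # discrete choices
}

# Hypothetical tuning configurations, e.g., the search algorithm
# and the number of pipelines to sample during the search.
config = {
    "searching_algorithm": "hyperopt",  # or "random"
    "num_samples": 50,
}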
Evaluation of the Automated Tuners

Our current system supports two types of tuners: random search and Hyperopt (Bergstra et al. 2013). We compare the two tuners, as well as the results obtained by manually tuning the hyperparameters, on HMDB-51 and on a subset of the original dataset with six categories named HMDB-6. Both tuners use the same configurations, such as the search space; the detailed experimental settings are provided in the Appendix. As shown in Table 1, both tuners outperform the manually tuned pipeline by a fair margin, and Hyperopt achieves the best accuracy. The results verify the importance of hyperparameter tuning in action recognition tasks.

                    HMDB-6     HMDB-51
  Manually-Tuned    70.35%     46.8%
  Random Search     89.63%     53.92%
  Hyperopt          92.6%      56.14%

Table 1: Accuracy of tuners in AutoVideo.

Conclusions and Future Work

This paper presents AutoVideo, an automated video action recognition system. Unlike the existing libraries, AutoVideo is designed for AutoML, since it is based on the highly modular pipeline language. It also explicitly supports model selection and hyperparameter tuning. Our evaluation verifies the necessity of tuning in practical use. In the future, we will include more video understanding tasks in AutoVideo.

References

Bergstra, J.; Yamins, D.; Cox, D. D.; et al. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, volume 13, 20. Citeseer.

Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.

Gao, Y.; Xiang, X.; Xiong, N.; Huang, B.; Lee, H. J.; Alrifai, R.; Jiang, X.; and Fang, Z. 2018. Human action monitoring for healthcare based on deep learning. IEEE Access 6: 52277–52285.

Hara, K.; Kataoka, H.; and Satoh, Y. 2017. Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 3154–3160.

Hou, R.; Chen, C.; Sukthankar, R.; and Shah, M. 2019. An efficient 3D CNN for action/object segmentation in video. arXiv preprint arXiv:1907.08895.

Jin, H.; Song, Q.; and Hu, X. 2019. Auto-Keras: An efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1946–1956.

Lai, K.-H.; Zha, D.; Wang, G.; Xu, J.; Zhao, Y.; Kumar, D.; Chen, Y.; Zumkhawaka, P.; Wan, M.; Martinez, D.; et al. 2021. TODS: An Automated Time Series Outlier Detection System. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 16060–16062.

Li, Y.; Zha, D.; Venugopal, P.; Zou, N.; and Hu, X. 2020. PyODDS: An end-to-end outlier detection system with automated machine learning. In Companion Proceedings of the Web Conference 2020, 153–157.

Lin, J.; Gan, C.; and Han, S. 2019. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7083–7093.

Meng, H.; Pears, N.; and Bailey, C. 2007. A human action recognition system for embedded computer vision application. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–6. IEEE.

Milutinovic, M.; Schoenfeld, B.; Martinez-Garcia, D.; Ray, S.; Shah, S.; and Yan, D. 2020. On Evaluation of AutoML Systems. In ICML Workshop.

Piergiovanni, A.; Angelova, A.; Toshev, A.; and Ryoo, M. S. 2019. Evolving space-time neural architectures for videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1793–1802.

Poppe, R. 2010. A survey on vision-based human action recognition. Image and Vision Computing 28(6): 976–990.

Ryoo, M. S.; Piergiovanni, A.; Tan, M.; and Angelova, A. 2019. AssembleNet: Searching for multi-stream neural connectivity in video architectures. arXiv preprint arXiv:1905.13209.

Team, P. 2021. PyTorchVideo. https://github.com/facebookresearch/pytorchvideo.

Thornton, C.; Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847–855.

Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489–4497.

Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, 20–36. Springer.

Yue Zhao, Yuanjun Xiong, D. L. 2019. MMAction. https://github.com/open-mmlab/mmaction.

Zolfaghari, M.; Singh, K.; and Brox, T. 2018. ECO: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 695–712.
Appendix

To accelerate the evaluation, we sub-sample six categories from the original HMDB-51 dataset to obtain HMDB-6. The sampled categories are brush hair, cartwheel, catch, chew, clap, and climb.

The common hyperparameters are as follows: 50 training epochs, initial learning rate 0.001 (decayed by a factor of 10 at epochs 20 and 40), weight decay 5e-4, batch size 4, momentum 0.9, number of segments 16, RGB modality, and dropout 0.5. Since HMDB-51 is a small dataset that is highly prone to overfitting, we follow the widely accepted practice (Lin, Gan, and Han 2019) of fine-tuning weights pre-trained on the Kinetics dataset and freezing the Batch Normalization layers. A sketch of the learning-rate schedule appears at the end of this appendix.

For the tuners, the following search space was used: (continuous) learning rate [0.0001, 0.001], momentum [0.9, 0.99], weight decay [5e-4, 1e-3]; (discrete) number of segments {8, 16, 32}. We report the results on the validation set. 50 samples were drawn for HMDB-6 and 15 samples were drawn for HMDB-51. For HMDB-51, each sample drawn takes roughly 10 GPU hours to train.
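The learning-rate schedule described above (initial rate 0.001, decayed by a factor of 10 at epochs 20 and 40) can be reproduced with a standard step scheduler. The sketch below uses PyTorch's MultiStepLR as one way to express this schedule; it mirrors the stated settings but is not necessarily the exact code used in the AutoVideo repository, and the model here is only a placeholder.

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder module standing in for the action recognition network.
model = torch.nn.Linear(10, 6)

# Optimizer settings match the common hyperparameters listed above.
optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)

# Decay the learning rate by 10x at epochs 20 and 40.
scheduler = MultiStepLR(optimizer, milestones=[20, 40], gamma=0.1)

for epoch in range(50):
    # ... one training epoch over the dataset ...
    scheduler.step()  # lr: 1e-3 -> 1e-4 after epoch 20 -> 1e-5 after epoch 40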