OPA: Object Placement Assessment Dataset
Liu Liu∗, Bo Zhang∗, Jiangtong Li∗, Li Niu∗, Qingyang Liu†, Liqing Zhang∗
∗Shanghai Jiao Tong University  †Beijing Institute of Technology
arXiv:2107.01889v1 [cs.CV] 5 Jul 2021

Abstract

Image composition aims to generate a realistic composite image by inserting an object from one image into another background image, where the placement (e.g., location, size, occlusion) of the inserted object may be unreasonable, which would significantly degrade the quality of the composite image. Although some works have attempted to learn object placement to create realistic composite images, they did not focus on assessing the plausibility of object placement. In this paper, we focus on the object placement assessment task, which verifies whether a composite image is plausible in terms of object placement. To accomplish this task, we construct the first Object Placement Assessment (OPA) dataset, consisting of composite images and their rationality labels. The dataset is available at https://github.com/bcmi/Object-Placement-Assessment-Dataset-OPA.

1. Introduction

As a common image editing operation, image composition aims to generate a realistic-looking image by pasting the foreground object of one image onto another image. The resulting composites can be fantastic images that previously existed only in the imagination of artists. However, it is challenging to insert a foreground object into a background image so that the result satisfies the following requirements: 1) the foreground object has color and illumination compatible with the background image; 2) the inserted object may have an impact on the background image, such as reflection and shadow;
3) the foreground object should be placed at a reasonable location on the background considering location, size, occlusion, semantics, etc. To satisfy these requirements, image harmonization [18, 2], shadow generation [10, 14], and object placement [16, 12] have been proposed to improve the quality of composite images from the above aspects, respectively.

In this paper, we focus on the third issue, object placement, which aims to paste the foreground object onto the background with suitable location, size, occlusion, etc. As shown in Figure 1, the cases of unreasonable object placement [1] include but are not limited to: 1) the foreground object is too large or too small; 2) the foreground object has no supporting force (e.g., it hangs in the air); 3) the foreground object appears in a semantically unreasonable place (e.g., a boat on land); 4) unreasonable occlusion; 5) inconsistent perspectives between foreground and background. These unreasonable cases significantly degrade the realism of composite images. Considering the wide range of foreground objects and complicated scenarios, object placement remains a challenging task.

Some previous works attempted to learn reasonable object placement in order to generate realistic composite images. One group of methods [6, 15, 19, 5] relied on explicit rules to find a reasonable location for the foreground object: for example, the new background of the inserted foreground should be close to its original background [5], or the foreground should be placed on a flat plane [6]. However, such explicit rules are only applicable to limited scenarios. The other group of methods trained networks to learn reasonable object placement automatically, and can be further divided into supervised and unsupervised methods. Supervised methods [16, 4, 21, 20, 11] leveraged the size/location of the foreground object in the original image as ground truth, predicting the bounding box or transformation of the foreground object based on foreground and background features [16, 20]. Unsupervised methods like [17] did not use ground-truth sizes/locations; instead, they learned a reasonable transformation of the foreground object by pushing the generated composite images close to real images.
Figure 1: Some negative samples in our OPA dataset; the inserted foreground objects are marked with red outlines. From left to right: (a) objects with inappropriate size; (b) objects hanging in the air; (c) objects appearing in a semantically unreasonable place; (d) unreasonable occlusion; (e) inconsistent perspectives.

All the above works focus on generating reasonable composite images rather than assessing object placement. In other words, they cannot automatically assess the rationality of a composite image in terms of object placement. To evaluate the quality of generated composite images, these works usually adopt one of the following three approaches. 1) [16] scored the correlation between the distributions of predicted boxes and ground-truth boxes, and [20] calculated the Frechet Inception Distance (FID) [9] between composite and real images to measure placement plausibility; however, such distribution-level metrics cannot evaluate each individual composite image. 2) [17, 5] utilized the improvement of downstream tasks (e.g., object detection) whose training sets are augmented with the generated composite images. However, the evaluation cost is huge, and the improvement in downstream tasks may not reliably reflect the quality of composite images, because [7] revealed that randomly generated composite images can also boost the performance of downstream tasks. 3) Another common evaluation strategy is the user study, in which people are asked to score the rationality of placement [11, 16]. A user study complies with human perception, and each composite image can be evaluated individually. However, due to its subjectivity, the gauge in different papers may differ dramatically; there is no unified benchmark dataset, and the results in different papers cannot be directly compared.
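As a concrete illustration of the first, distribution-level approach, below is a minimal sketch of computing FID between sets of real and composite images using the torchmetrics library (our choice of implementation, not the one used in [20]); the random tensors merely stand in for batches of images:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # pip install torchmetrics[image]

fid = FrechetInceptionDistance(feature=2048)

# Stand-ins for batches of real and composite images:
# uint8 tensors of shape (N, 3, H, W) with values in [0, 255].
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
composite_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(composite_images, real=False)
print(fid.compute())  # a single score for the whole set, not per image
```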
In summary, as far as we are concerned, no previous work focuses on object placement assessment, and no suitable dataset is available for this task. In this work, we focus on the task of object placement assessment, that is, automatically assessing the rationality of a composite image in terms of object placement. We build an Object Placement Assessment (OPA) dataset for this task based on the COCO dataset [13]. First, we select unoccluded objects from multiple categories as candidate foreground objects. Then, we design a strategy to select compatible background images for each foreground object. The foreground objects are pasted on their compatible background images with random sizes and locations to form composite images, which are sent to human annotators for rationality labeling. Each image is labeled by four human annotators, and only the images with consistent labels are preserved in the dataset to ensure annotation quality. Finally, we split the collected dataset into a training set and a test set, such that the background images and foreground objects have no overlap between the two sets. More details about constructing the dataset are elaborated in Section 2.

With the constructed dataset, we regard the object placement assessment task as a binary classification problem, so any typical classification network can be applied to it. With the functionality of object placement assessment, our model can help obtain realistic composite images. Particularly, given automatically (e.g., [17, 20]) or manually (e.g., by users) created composite images, we can apply an object placement assessment model to select the composite images with high rationality scores.
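To make this formulation concrete, here is a minimal sketch of such a binary classifier in PyTorch. Feeding the foreground mask as a fourth input channel mirrors the mask input used for filtering in Section 2.2, but the backbone choice and the channel-widening initialization are our assumptions, not the authors' model:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class PlacementClassifier(nn.Module):
    """Composite image (RGB) + foreground mask -> rationality logits (0/1)."""

    def __init__(self):
        super().__init__()
        self.backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        # Widen the first conv from 3 to 4 input channels so the foreground
        # mask can be stacked onto the RGB composite; the new mask channel
        # is initialized with the mean of the pretrained RGB filters.
        old = self.backbone.conv1
        self.backbone.conv1 = nn.Conv2d(4, old.out_channels,
                                        kernel_size=old.kernel_size,
                                        stride=old.stride,
                                        padding=old.padding, bias=False)
        with torch.no_grad():
            self.backbone.conv1.weight[:, :3] = old.weight
            self.backbone.conv1.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 2)

    def forward(self, rgb, mask):
        # rgb: (N, 3, H, W), mask: (N, 1, H, W) binary foreground mask
        return self.backbone(torch.cat([rgb, mask], dim=1))

model = PlacementClassifier()
logits = model(torch.randn(2, 3, 224, 224), torch.rand(2, 1, 224, 224))
```

Trained with a standard cross-entropy loss on the rationality labels, any off-the-shelf backbone can serve as such a placement assessor.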
2. Dataset Construction

In this section, we describe the construction process of our Object Placement Assessment (OPA) dataset: we first generate composite images and then ask human annotators to label them w.r.t. the rationality of object placement.

2.1. Composite Image Generation

We select suitable foreground objects and background images from the Microsoft COCO dataset [13] and use them to generate composite images.

Foreground object selection: There are 80 object categories in COCO [13] with annotated instance segmentation masks. We only keep unoccluded foreground objects, because it is difficult to find reasonable placements for occluded objects. We delete some categories according to the following rules: 1) categories that usually appear at very specific locations, such as transportation-related categories (e.g., traffic light, stop sign) and human-centric categories (e.g., tie, snowboard); 2) categories of large objects that appear in crowded spaces, such as large furniture (e.g., refrigerator, bed); 3) categories with too few remaining objects after removing occluded and tiny foreground objects (e.g., toaster, hair drier); 4) categories whose rationality is hard to verify, such as flying objects (e.g., kite, frisbee). In summary, for these categories it is either hard to find reasonable placements or hard to verify the rationality of object placement. After filtering, 47 categories remain: airplane, apple, banana, bear, bench, bicycle, bird, boat, book, bottle, bowl, broccoli, bus, cake, car, cat, cellphone, chair, cow, cup, dog, donut, elephant, fire hydrant, fork, giraffe, horse, keyboard, knife, laptop, motorcycle, mouse, orange, person, pizza, potted plant, remote, sandwich, scissors, sheep, spoon, suitcase, toothbrush, truck, vase, wine glass, zebra. Using the annotated instance segmentation masks from COCO [13], we select 100 unoccluded foreground objects for each category.

Background image selection: For each foreground category, there should be a set of compatible background images. For example, airplanes do not appear indoors, and forks usually appear on tables. In this work, we eliminate the burden of selecting compatible background images for the object placement assessment task. We fine-tune a PlaceCNN [22] pretrained on Places365 [22] to select a set of compatible background images for each category. Specifically, for each category, we take the images containing objects of this category as positive samples and randomly sample an equal number of other images as negative samples. Then, we fine-tune PlaceCNN [22] on these positive and negative samples to learn a binary classifier. For each category, we apply the trained binary classifier to retrieve the top 100 images that do not contain objects of this category as its set of compatible background images.
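The retrieval step can be sketched as follows, assuming the fine-tuned classifier (here `scene_classifier`, with a two-way output head) is given and the candidate pool has already been filtered to exclude images containing the target category:

```python
import torch

def retrieve_backgrounds(scene_classifier, candidates, k=100, device="cpu"):
    """Rank candidate background images by the fine-tuned binary
    classifier's compatibility score and keep the top k.

    candidates: list of preprocessed (3, H, W) image tensors that do
    not contain objects of the target category."""
    scene_classifier.eval().to(device)
    scores = []
    with torch.no_grad():
        for image in candidates:
            logits = scene_classifier(image.unsqueeze(0).to(device))
            scores.append(torch.softmax(logits, dim=1)[0, 1].item())
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
```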
Composite image generation: We generate a composite image by pasting one foreground object onto a background image. To avoid excessive repetition, we limit the size and location of the foreground object according to prior knowledge. For each foreground category, we first calculate a reasonable range of its size ratio, defined as the ratio of the foreground object size to the size of its corresponding image. Given a foreground object and a compatible background image, we randomly sample 5 size ratios and 9 locations, leading to 45 composite images. For the size ratio, we divide the category's size-ratio range into five bins at the 20%, 40%, 60%, and 80% quantiles, and randomly sample one size ratio from each bin. For the location, we evenly divide the background image into 9 partitions and randomly sample one location from each partition. We resize the foreground object according to the sampled size ratio and place it at the sampled location, producing a composite image. Finally, we remove composite images with incomplete foreground objects, e.g., those where half of the foreground object lies outside the background image.
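A minimal sketch of this sampling procedure using PIL. The paper does not spell out every detail, so we assume the size ratio is an area ratio and that locations are sampled as object centers within each grid cell; the out-of-bounds check implements the removal of incomplete foregrounds:

```python
import random
from PIL import Image

def sample_composites(fg, fg_mask, bg, ratio_bins):
    """Generate up to 5 x 9 = 45 composites: one size ratio per quantile
    bin, one location per cell of a 3x3 grid over the background.

    fg / fg_mask: PIL images of the foreground and its binary mask.
    ratio_bins: five (low, high) intervals from the category's
    20/40/60/80% size-ratio quantiles."""
    W, H = bg.size
    composites = []
    for low, high in ratio_bins:
        ratio = random.uniform(low, high)
        # Interpret the size ratio as foreground area / background area.
        scale = (ratio * W * H / (fg.width * fg.height)) ** 0.5
        fw = max(1, round(fg.width * scale))
        fh = max(1, round(fg.height * scale))
        for gx in range(3):
            for gy in range(3):
                # Sample the object center inside grid cell (gx, gy).
                cx = random.uniform(gx, gx + 1) / 3 * W
                cy = random.uniform(gy, gy + 1) / 3 * H
                x, y = round(cx - fw / 2), round(cy - fh / 2)
                # Discard composites whose foreground sticks out of the image.
                if x < 0 or y < 0 or x + fw > W or y + fh > H:
                    continue
                comp = bg.copy()
                comp.paste(fg.resize((fw, fh)), (x, y), fg_mask.resize((fw, fh)))
                composites.append(comp)
    return composites
```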
2.2. Composite Image Labelling

Since the rationality of object placement is constrained by many complicated factors (e.g., location, size, occlusion, semantics), negative images significantly outnumber positive ones among the randomly generated composite images. To achieve a relatively balanced positive-negative ratio and save human labor, we first fine-tune a ResNet-50 [8] classifier pretrained on ImageNet [3] to remove the obviously unreasonable composite images. During fine-tuning, the real images are regarded as positive samples; as negative samples, we additionally generate composite images via random copy-and-paste, which have no overlap with the composite images in Section 2.1. Although these generated composite images contain both positive and negative samples, negative samples are dominant, so the learned binary classifier is still useful. To indicate the foreground object, we also feed the foreground mask into the ResNet-50 [8] classifier. We apply the fine-tuned classifier to the composite images from Section 2.1 and select the top 235,000 composite images with the highest scores for further labeling; the selected composite images are expected to have a relatively higher ratio of positive samples.

To acquire the binary rationality label (1 for reasonable object placement and 0 for unreasonable object placement), we ask four human annotators to label the rationality of each composite image. We focus purely on object placement issues and ignore other issues (e.g., inconsistent illumination between foreground and background, unnatural boundaries between foreground and background). Due to the subjectivity of this annotation task, we compile detailed annotation guidelines (e.g., the reasonable range of sizes for each foreground category) and train the human annotators for two weeks to make the annotations consistent across annotators. The detailed annotation guidelines are as follows:

• All foreground objects are considered real objects rather than models or toys.

• The foreground object placement must conform to the basic laws of physics. Except for flying objects (e.g., airplane), all objects should have reasonable supporting force.

• The foreground object should appear in a semantically reasonable place. We also make specific rules for ambiguous cases; for example, for container categories (e.g., bowl, bottle), we stipulate that they cannot be surrounded by a fried dish.

• If there is occlusion between the foreground object and a background object, the rationality of the occlusion should be considered.

• The size of the foreground object should be judged based on its location and its relative distance to other background objects.

• We provide a reasonable range of sizes for each category, and the estimated size of the foreground should be within the range of its category. For animal categories (e.g., dog, sheep), we treat the sizes of animals of all ages (from baby to adult) as reasonable.

• The perspective of the foreground object should look reasonable.

• Inharmonious illumination and color, as well as unreasonable reflections and shadows, are out of the scope of consideration.

Although some of the above rules may be arguable, depending on one's definition of rationality, our focus is on making the annotation criterion as explicit as possible and the annotations across different images as consistent as possible, so that the constructed dataset is qualified for scientific study. Besides, similar categories are labeled by the same group of human annotators to further mitigate inconsistency. Finally, we only keep the images on which all four human annotators reach agreement. From the remaining images, we construct a training set with 62,074 images and a test set with 11,396 images, whose foreground objects and background images have no overlap. We impose this constraint to better evaluate the generalization ability of different methods, because in real-world applications the foreground object and background image are generally out of the scope of the training set.
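The agreement filter amounts to a few lines; the vote layout below is hypothetical, since the paper does not specify how the annotations are stored:

```python
def consensus_label(votes):
    """Return the unanimous rationality label (0 or 1) of the four
    annotators, or None if they disagree (the image is then dropped)."""
    assert len(votes) == 4
    return votes[0] if len(set(votes)) == 1 else None

# Hypothetical vote records: image id -> four binary labels.
annotations = {"composite_0001": [1, 1, 1, 1], "composite_0002": [1, 0, 1, 1]}
kept = {}
for image_id, votes in annotations.items():
    label = consensus_label(votes)
    if label is not None:
        kept[image_id] = label  # only composite_0001 survives
```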
Figure 2: Some positive and negative samples in our OPA dataset; the inserted foreground objects are marked with red outlines. Top row: positive samples (a: apple, b: cup, c: sandwich, d: person, e: cat). Bottom rows: negative samples (f: laptop, g: cellphone, h: bear, i: cake, j: bottle, k: book, l: airplane, m: chair, n: boat, o: dog, p: truck, q: bird, r: bicycle, s: bench, t: car), including objects with inappropriate size (e.g., f, g, h), without supporting force (e.g., i, j, k), appearing in a semantically unreasonable place (e.g., l, m, n), with unreasonable occlusion (e.g., o, p, q), and with inconsistent perspectives (e.g., r, s, t).

2.3. Dataset Statistics

After composite image generation and labelling, our OPA dataset contains 24,917 positive samples and 48,554 negative samples, covering 4,137 unrepeated foreground objects and 1,389 unrepeated background images. We show some positive and negative examples in Figure 2 and present the number of images (positive and negative) per foreground category in Figure 3.

We divide our OPA dataset into 62,074 training images and 11,396 test images, in which the foregrounds/backgrounds of the training set and test set have no overlap. The training (resp., test) set contains 21,351 (resp., 3,566) positive samples and 40,724 (resp., 7,830) negative samples. Besides, the training (resp., test) set contains 2,701 (resp., 1,436) unrepeated foreground objects and 1,236 (resp., 153) unrepeated background images.

Figure 3: The number of images per foreground category in our OPA dataset.

3. Conclusion

In this work, we have focused on the object placement assessment task, which verifies the rationality of object placement in a composite image. To support this task, we have contributed an Object Placement Assessment (OPA) dataset. This dataset will facilitate research on automatic object placement, which aims to automatically forecast diverse and plausible placements of a foreground object on a background image.

References

[1] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. Compositional GAN: Learning image-conditional binary composition. International Journal of Computer Vision, 128(10):2570–2585, 2020.
[2] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. DoveNet: Deep image harmonization via domain verification. In CVPR, 2020.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[4] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling visual context is key to augmenting object detection datasets. In ECCV, 2018.
[5] Haoshu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yonglu Li, and Cewu Lu. InstaBoost: Boosting instance segmentation via probability map guided copy-pasting. In ICCV, 2019.
[6] Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, and Jana Kosecka. Synthesizing training data for object detection in indoor scenes. In Robotics: Science and Systems XIII, 2017.
[7] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. arXiv preprint arXiv:2012.07177, 2020.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[10] Eric Kee, James F. O'Brien, and Hany Farid. Exposing photo manipulation from shading and shadows. ACM Transactions on Graphics, 33(5):1–21, 2014.
[11] Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Context-aware synthesis and placement of object instances. In NeurIPS, 2018.
[12] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. ST-GAN: Spatial transformer generative adversarial networks for image compositing. In CVPR, 2018.
[13] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[14] Daquan Liu, Chengjiang Long, Hongpan Zhang, Hanning Yu, Xinzhi Dong, and Chunxia Xiao. ARShadowGAN: Shadow generative adversarial network for augmented reality in single light scenes. In CVPR, 2020.
[15] Tal Remez, Jonathan Huang, and Matthew Brown. Learning to segment via cut-and-paste. In ECCV, 2018.
[16] Fuwen Tan, Crispin Bernier, Benjamin Cohen, Vicente Ordonez, and Connelly Barnes. Where and who? Automatic semantic-aware person composition. In WACV, 2018.
[17] Shashank Tripathi, Siddhartha Chandra, Amit Agrawal, Ambrish Tyagi, James M. Rehg, and Visesh Chari. Learning to generate synthetic data via compositing. In CVPR, 2019.
[18] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In CVPR, 2017.
[19] Hao Wang, Qilong Wang, Fan Yang, Weiqi Zhang, and Wangmeng Zuo. Data augmentation for object detection via progressive and selective instance-switching. arXiv preprint arXiv:1906.00358, 2019.
[20] Lingzhi Zhang, Tarmily Wen, Jie Min, Jiancong Wang, David Han, and Jianbo Shi. Learning object placement by inpainting for compositional data augmentation. In ECCV, 2020.
[21] Song-Hai Zhang, Zhengping Zhou, Bin Liu, Xi Dong, and Peter Hall. What and where: A context-based recommendation system for object insertion. Computational Visual Media, 6(1):79–93, 2020.
[22] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.