OPA: Object Placement Assessment Dataset
Liu Liu∗, Bo Zhang∗, Jiangtong Li∗, Li Niu∗, Qingyang Liu†, Liqing Zhang∗
∗Shanghai Jiao Tong University  †Beijing Institute of Technology
arXiv:2107.01889v1 [cs.CV] 5 Jul 2021

Abstract

Image composition aims to generate a realistic composite image by inserting an object from one image into another background image, where the placement (e.g., location, size, occlusion) of the inserted object may be unreasonable, which would significantly degrade the quality of the composite image. Although some works have attempted to learn object placement to create realistic composite images, they did not focus on assessing the plausibility of object placement. In this paper, we focus on the object placement assessment task, which verifies whether a composite image is plausible in terms of object placement. To accomplish this task, we construct the first Object Placement Assessment (OPA) dataset, consisting of composite images and their rationality labels. The dataset is available at https://github.com/bcmi/Object-Placement-Assessment-Dataset-OPA.

1. Introduction

As a common image editing operation, image composition aims to generate a realistic-looking image by pasting the foreground object of one image onto another image. The resulting composites can be fantastic images that previously existed only in the imagination of artists. However, it is challenging to insert a foreground object into a background image so that the result satisfies the following requirements: 1) the foreground object has color and illumination compatible with the background image; 2) the inserted object may have an impact on the background image, such as reflection and shadow;
3) the foreground object should be placed at a reasonable location on the background considering location, size, occlusion, semantics, etc. To satisfy these requirements, image harmonization [18, 2], shadow generation [10, 14], and object placement [16, 12] have been proposed to improve the quality of composite images from the above aspects, respectively.

In this paper, we focus on the third issue, object placement, which aims to paste the foreground object onto the background with suitable location, size, occlusion, etc. As shown in Figure 1, the cases of unreasonable object placement [1] include but are not limited to: 1) the foreground object is too large or too small; 2) the foreground object has no supporting force (e.g., it hangs in the air); 3) the foreground object appears in a semantically unreasonable place (e.g., a boat on land); 4) unreasonable occlusion; 5) inconsistent perspectives between foreground and background. These unreasonable cases significantly degrade the realism of composite images. Considering the wide range of foreground objects and complicated scenarios, object placement remains a challenging task.

Some previous works attempted to learn reasonable object placement in order to generate realistic composite images. One group of methods [6, 15, 19, 5] relied on explicit rules to find a reasonable location for the foreground object: for example, the new background of the inserted foreground should be close to its original background [5], or the foreground should be placed on a flat plane [6]. However, such explicit rules are only applicable to limited scenarios. The other group of methods trained networks to learn reasonable object placement automatically, and can be further divided into supervised and unsupervised methods. Supervised methods [16, 4, 21, 20, 11] leveraged the size/location of the foreground object in the original image as ground truth, predicting the bounding box or transformation of the foreground object based on foreground and background features [16, 20]. Unsupervised methods like [17] did not use ground-truth sizes/locations; instead, they learned a reasonable transformation of the foreground object by pushing the generated composite images close to real images.
Figure 1: Some negative samples in our OPA dataset; the inserted foreground objects are marked with red outlines. From left to right: (a) objects with inappropriate size; (b) objects hanging in the air; (c) objects appearing in a semantically unreasonable place; (d) unreasonable occlusion; (e) inconsistent perspectives.

All the above works focus on generating reasonable composite images rather than assessing object placement. In other words, they cannot automatically assess the rationality of a composite image in terms of object placement. To evaluate the quality of generated composite images, these works usually adopt one of the following three approaches. 1) [16] scored the correlation between the distributions of predicted boxes and ground-truth boxes, and [20] calculated the Frechet Inception Distance (FID) [9] between composite and real images to measure placement plausibility; however, such distribution-level metrics cannot evaluate each individual composite image. 2) [17, 5] utilized the improvement of downstream tasks (e.g., object detection) whose training sets are augmented with the generated composite images. However, the evaluation cost is huge, and the improvement in downstream tasks may not reliably reflect the quality of composite images, because [7] revealed that randomly generated composite images can also boost the performance of downstream tasks. 3) Another common evaluation strategy is the user study, in which people are asked to score the rationality of placement [11, 16]. A user study complies with human perception, and each composite image can be evaluated individually. However, due to its subjectivity, the gauge in different papers may differ dramatically; there is no unified benchmark dataset, and the results in different papers cannot be directly compared.
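As a concrete illustration of the first, distribution-level approach, below is a minimal sketch of computing FID between sets of real and composite images using the torchmetrics library (our choice of implementation, not the one used in [20]); the random tensors merely stand in for batches of images:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # pip install torchmetrics[image]

fid = FrechetInceptionDistance(feature=2048)

# Stand-ins for batches of real and composite images:
# uint8 tensors of shape (N, 3, H, W) with values in [0, 255].
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
composite_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(composite_images, real=False)
print(fid.compute())  # a single score for the whole set, not per image
```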
In summary, as far as we are concerned, no previous work focuses on object placement assessment, and no suitable dataset is available for this task. In this work, we focus on the task of object placement assessment, that is, automatically assessing the rationality of a composite image in terms of object placement. We build an Object Placement Assessment (OPA) dataset for this task based on the COCO dataset [13]. First, we select unoccluded objects from multiple categories as candidate foreground objects. Then, we design a strategy to select compatible background images for each foreground object. The foreground objects are pasted on their compatible background images with random sizes and locations to form composite images, which are sent to human annotators for rationality labeling. Each image is labeled by four human annotators, and only the images with consistent labels are preserved in the dataset to ensure annotation quality. Finally, we split the collected dataset into a training set and a test set, such that the background images and foreground objects have no overlap between the two sets. More details about constructing the dataset are elaborated in Section 2.

With the constructed dataset, we regard the object placement assessment task as a binary classification problem, so any typical classification network can be applied to it. With the functionality of object placement assessment, our model can help obtain realistic composite images. Particularly, given automatically (e.g., [17, 20]) or manually (e.g., by users) created composite images, we can apply an object placement assessment model to select the composite images with high rationality scores.
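To make this formulation concrete, here is a minimal sketch of such a binary classifier in PyTorch. Feeding the foreground mask as a fourth input channel mirrors the mask input used for filtering in Section 2.2, but the backbone choice and the channel-widening initialization are our assumptions, not the authors' model:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class PlacementClassifier(nn.Module):
    """Composite image (RGB) + foreground mask -> rationality logits (0/1)."""

    def __init__(self):
        super().__init__()
        self.backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        # Widen the first conv from 3 to 4 input channels so the foreground
        # mask can be stacked onto the RGB composite; the new mask channel
        # is initialized with the mean of the pretrained RGB filters.
        old = self.backbone.conv1
        self.backbone.conv1 = nn.Conv2d(4, old.out_channels,
                                        kernel_size=old.kernel_size,
                                        stride=old.stride,
                                        padding=old.padding, bias=False)
        with torch.no_grad():
            self.backbone.conv1.weight[:, :3] = old.weight
            self.backbone.conv1.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 2)

    def forward(self, rgb, mask):
        # rgb: (N, 3, H, W), mask: (N, 1, H, W) binary foreground mask
        return self.backbone(torch.cat([rgb, mask], dim=1))

model = PlacementClassifier()
logits = model(torch.randn(2, 3, 224, 224), torch.rand(2, 1, 224, 224))
```

Trained with a standard cross-entropy loss on the rationality labels, any off-the-shelf backbone can serve as such a placement assessor.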
2. Dataset Construction

In this section, we describe the construction process of our Object Placement Assessment (OPA) dataset: we first generate composite images and then ask human annotators to label them w.r.t. the rationality of object placement.

2.1. Composite Image Generation

We select suitable foreground objects and background images from the Microsoft COCO dataset [13] and use them to generate composite images.

Foreground object selection: There are 80 object categories in COCO [13] with annotated instance segmentation masks. We only keep unoccluded foreground objects, because it is difficult to find reasonable placements for occluded objects. We delete some categories according to the following rules: 1) categories that usually appear at very specific locations, such as transportation-related categories (e.g., traffic light, stop sign) and human-centric categories (e.g., tie, snowboard); 2) categories of large objects that appear in crowded spaces, such as large furniture (e.g., refrigerator, bed); 3) categories with too few remaining objects after removing occluded and tiny foreground objects (e.g., toaster, hair drier); 4) categories whose rationality is hard to verify, such as flying objects (e.g., kite, frisbee). In summary, for these categories it is either hard to find reasonable placements or hard to verify the rationality of object placement. After filtering, 47 categories remain: airplane, apple, banana, bear, bench, bicycle, bird, boat, book, bottle, bowl, broccoli, bus, cake, car, cat, cellphone, chair, cow, cup, dog, donut, elephant, fire hydrant, fork, giraffe, horse, keyboard, knife, laptop, motorcycle, mouse, orange, person, pizza, potted plant, remote, sandwich, scissors, sheep, spoon, suitcase, toothbrush, truck, vase, wine glass, zebra. Using the annotated instance segmentation masks from COCO [13], we select 100 unoccluded foreground objects for each category.

Background image selection: For each foreground category, there should be a set of compatible background images. For example, airplanes do not appear indoors, and forks usually appear on tables. In this work, we eliminate the burden of selecting compatible background images for the object placement assessment task. We fine-tune a PlaceCNN [22] pretrained on Places365 [22] to select a set of compatible background images for each category. Specifically, for each category, we take the images containing objects of this category as positive samples and randomly sample an equal number of other images as negative samples. Then, we fine-tune PlaceCNN [22] on these positive and negative samples to learn a binary classifier. For each category, we apply the trained binary classifier to retrieve the top 100 images that do not contain objects of this category as its set of compatible background images.
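The retrieval step can be sketched as follows, assuming the fine-tuned classifier (here `scene_classifier`, with a two-way output head) is given and the candidate pool has already been filtered to exclude images containing the target category:

```python
import torch

def retrieve_backgrounds(scene_classifier, candidates, k=100, device="cpu"):
    """Rank candidate background images by the fine-tuned binary
    classifier's compatibility score and keep the top k.

    candidates: list of preprocessed (3, H, W) image tensors that do
    not contain objects of the target category."""
    scene_classifier.eval().to(device)
    scores = []
    with torch.no_grad():
        for image in candidates:
            logits = scene_classifier(image.unsqueeze(0).to(device))
            scores.append(torch.softmax(logits, dim=1)[0, 1].item())
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
```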
Composite image generation: We generate a composite image by pasting one foreground object onto a background image. To avoid excessive repetition, we limit the size and location of the foreground object according to prior knowledge. For each foreground category, we first calculate a reasonable range of its size ratio, defined as the ratio of the foreground object size to the size of its corresponding image. Given a foreground object and a compatible background image, we randomly sample 5 size ratios and 9 locations, leading to 45 composite images. For the size ratio, we divide the category's size-ratio range into five bins at the 20%, 40%, 60%, and 80% quantiles, and randomly sample one size ratio from each bin. For the location, we evenly divide the background image into 9 partitions and randomly sample one location from each partition. We resize the foreground object according to the sampled size ratio and place it at the sampled location, producing a composite image. Finally, we remove composite images with incomplete foreground objects, e.g., those where half of the foreground object lies outside the background image.
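A minimal sketch of this sampling procedure using PIL. The paper does not spell out every detail, so we assume the size ratio is an area ratio and that locations are sampled as object centers within each grid cell; the out-of-bounds check implements the removal of incomplete foregrounds:

```python
import random
from PIL import Image

def sample_composites(fg, fg_mask, bg, ratio_bins):
    """Generate up to 5 x 9 = 45 composites: one size ratio per quantile
    bin, one location per cell of a 3x3 grid over the background.

    fg / fg_mask: PIL images of the foreground and its binary mask.
    ratio_bins: five (low, high) intervals from the category's
    20/40/60/80% size-ratio quantiles."""
    W, H = bg.size
    composites = []
    for low, high in ratio_bins:
        ratio = random.uniform(low, high)
        # Interpret the size ratio as foreground area / background area.
        scale = (ratio * W * H / (fg.width * fg.height)) ** 0.5
        fw = max(1, round(fg.width * scale))
        fh = max(1, round(fg.height * scale))
        for gx in range(3):
            for gy in range(3):
                # Sample the object center inside grid cell (gx, gy).
                cx = random.uniform(gx, gx + 1) / 3 * W
                cy = random.uniform(gy, gy + 1) / 3 * H
                x, y = round(cx - fw / 2), round(cy - fh / 2)
                # Discard composites whose foreground sticks out of the image.
                if x < 0 or y < 0 or x + fw > W or y + fh > H:
                    continue
                comp = bg.copy()
                comp.paste(fg.resize((fw, fh)), (x, y), fg_mask.resize((fw, fh)))
                composites.append(comp)
    return composites
```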
2.2. Composite Image Labelling

Since the rationality of object placement is constrained by many complicated factors (e.g., location, size, occlusion, semantics), negative images significantly outnumber positive ones among the randomly generated composite images. To achieve a relatively balanced positive-negative ratio and save human labor, we first fine-tune a ResNet-50 [8] classifier pretrained on ImageNet [3] to remove the obviously unreasonable composite images. During fine-tuning, the real images are regarded as positive samples; as negative samples, we additionally generate composite images via random copy-and-paste, which have no overlap with the composite images in Section 2.1. Although these generated composite images contain both positive and negative samples, negative samples are dominant, so the learned binary classifier is still useful. To indicate the foreground object, we also feed the foreground mask into the ResNet-50 [8] classifier. We apply the fine-tuned classifier to the composite images from Section 2.1 and select the top 235,000 composite images with the highest scores for further labeling; the selected composite images are expected to have a relatively higher ratio of positive samples.

To acquire the binary rationality label (1 for reasonable object placement and 0 for unreasonable object placement), we ask four human annotators to label the rationality of each composite image. We focus purely on object placement issues and ignore other issues (e.g., inconsistent illumination between foreground and background, unnatural boundaries between foreground and background). Due to the subjectivity of this annotation task, we compile detailed annotation guidelines (e.g., the reasonable range of sizes for each foreground category) and train the human annotators for two weeks to make the annotations consistent across annotators. The detailed annotation guidelines are as follows:

• All foreground objects are considered real objects rather than models or toys.

• The foreground object placement must conform to the basic laws of physics. Except for flying objects (e.g., airplane), all objects should have reasonable supporting force.

• The foreground object should appear in a semantically reasonable place. We also make specific rules for ambiguous cases; for example, for container categories (e.g., bowl, bottle), we stipulate that they cannot be surrounded by a fried dish.

• If there is occlusion between the foreground object and a background object, the rationality of the occlusion should be considered.

• The size of the foreground object should be judged based on its location and its relative distance to other background objects.

• We provide a reasonable range of sizes for each category, and the estimated size of the foreground should be within the range of its category. For animal categories (e.g., dog, sheep), we treat the sizes of animals of all ages (from baby to adult) as reasonable.

• The perspective of the foreground object should look reasonable.

• Inharmonious illumination and color, as well as unreasonable reflections and shadows, are out of the scope of consideration.

Although some of the above rules may be arguable, depending on one's definition of rationality, our focus is on making the annotation criterion as explicit as possible and the annotations across different images as consistent as possible, so that the constructed dataset is qualified for scientific study. Besides, similar categories are labeled by the same group of human annotators to further mitigate inconsistency. Finally, we only keep the images on which all four human annotators reach agreement. From the remaining images, we construct a training set with 62,074 images and a test set with 11,396 images, whose foreground objects and background images have no overlap. We impose this constraint to better evaluate the generalization ability of different methods, because in real-world applications the foreground object and background image are generally out of the scope of the training set.
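The agreement filter amounts to a few lines; the vote layout below is hypothetical, since the paper does not specify how the annotations are stored:

```python
def consensus_label(votes):
    """Return the unanimous rationality label (0 or 1) of the four
    annotators, or None if they disagree (the image is then dropped)."""
    assert len(votes) == 4
    return votes[0] if len(set(votes)) == 1 else None

# Hypothetical vote records: image id -> four binary labels.
annotations = {"composite_0001": [1, 1, 1, 1], "composite_0002": [1, 0, 1, 1]}
kept = {}
for image_id, votes in annotations.items():
    label = consensus_label(votes)
    if label is not None:
        kept[image_id] = label  # only composite_0001 survives
```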
Figure 2: Some positive and negative samples in our OPA dataset; the inserted foreground objects are marked with red outlines. Top row: positive samples (a: apple, b: cup, c: sandwich, d: person, e: cat). Bottom rows: negative samples (f: laptop, g: cellphone, h: bear, i: cake, j: bottle, k: book, l: airplane, m: chair, n: boat, o: dog, p: truck, q: bird, r: bicycle, s: bench, t: car), including objects with inappropriate size (e.g., f, g, h), without supporting force (e.g., i, j, k), appearing in a semantically unreasonable place (e.g., l, m, n), with unreasonable occlusion (e.g., o, p, q), and with inconsistent perspectives (e.g., r, s, t).

2.3. Dataset Statistics

After composite image generation and labelling, our OPA dataset contains 24,917 positive samples and 48,554 negative samples, covering 4,137 unrepeated foreground objects and 1,389 unrepeated background images. We show some positive and negative examples in Figure 2 and present the number of images (positive and negative) per foreground category in Figure 3.

We divide our OPA dataset into 62,074 training images and 11,396 test images, in which the foregrounds/backgrounds of the training set and test set have no overlap. The training (resp., test) set contains 21,351 (resp., 3,566) positive samples and 40,724 (resp., 7,830) negative samples. Besides, the training (resp., test) set contains 2,701 (resp., 1,436) unrepeated foreground objects and 1,236 (resp., 153) unrepeated background images.

Figure 3: The number of images per foreground category in our OPA dataset.

3. Conclusion

In this work, we have focused on the object placement assessment task, which verifies the rationality of object placement in a composite image. To support this task, we have contributed an Object Placement Assessment (OPA) dataset. This dataset will facilitate research on automatic object placement, which aims to automatically forecast diverse and plausible placements of a foreground object on a background image.

References

[1] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. Compositional GAN: Learning image-conditional binary composition. International Journal of Computer Vision, 128(10):2570–2585, 2020.
[2] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. DoveNet: Deep image harmonization via domain verification. In CVPR, 2020.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[4] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling visual context is key to augmenting object detection datasets. In ECCV, 2018.
[5] Haoshu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yonglu Li, and Cewu Lu. InstaBoost: Boosting instance segmentation via probability map guided copy-pasting. In ICCV, 2019.
[6] Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, and Jana Kosecka. Synthesizing training data for object detection in indoor scenes. In Robotics: Science and Systems XIII, 2017.
[7] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. arXiv preprint arXiv:2012.07177, 2020.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[10] Eric Kee, James F. O'Brien, and Hany Farid. Exposing photo manipulation from shading and shadows. ACM Transactions on Graphics, 33(5):1–21, 2014.
[11] Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Context-aware synthesis and placement of object instances. In NeurIPS, 2018.
[12] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. ST-GAN: Spatial transformer generative adversarial networks for image compositing. In CVPR, 2018.
[13] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[14] Daquan Liu, Chengjiang Long, Hongpan Zhang, Hanning Yu, Xinzhi Dong, and Chunxia Xiao. ARShadowGAN: Shadow generative adversarial network for augmented reality in single light scenes. In CVPR, 2020.
[15] Tal Remez, Jonathan Huang, and Matthew Brown. Learning to segment via cut-and-paste. In ECCV, 2018.
[16] Fuwen Tan, Crispin Bernier, Benjamin Cohen, Vicente Ordonez, and Connelly Barnes. Where and who? Automatic semantic-aware person composition. In WACV, 2018.
[17] Shashank Tripathi, Siddhartha Chandra, Amit Agrawal, Ambrish Tyagi, James M. Rehg, and Visesh Chari. Learning to generate synthetic data via compositing. In CVPR, 2019.
[18] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In CVPR, 2017.
[19] Hao Wang, Qilong Wang, Fan Yang, Weiqi Zhang, and Wangmeng Zuo. Data augmentation for object detection via progressive and selective instance-switching. arXiv preprint arXiv:1906.00358, 2019.
[20] Lingzhi Zhang, Tarmily Wen, Jie Min, Jiancong Wang, David Han, and Jianbo Shi. Learning object placement by inpainting for compositional data augmentation. In ECCV, 2020.
[21] Song-Hai Zhang, Zhengping Zhou, Bin Liu, Xi Dong, and Peter Hall. What and where: A context-based recommendation system for object insertion. Computational Visual Media, 6(1):79–93, 2020.
[22] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.