Content Authentication for Neural Imaging Pipelines: End-to-end Optimization of Photo Provenance in Complex Distribution Channels
Pawel Korus
New York University and AGH University of Science and Technology
http://kt.agh.edu.pl/%7Ekorus

Nasir Memon
New York University

arXiv:1812.01516v1 [cs.CV] 4 Dec 2018

Abstract

Forensic analysis of digital photo provenance relies on intrinsic traces left in the photograph at the time of its acquisition. Such analysis becomes unreliable after heavy post-processing, such as down-sampling and re-compression applied upon distribution in the Web. This paper explores end-to-end optimization of the entire image acquisition and distribution workflow to facilitate reliable forensic analysis at the end of the distribution channel. We demonstrate that neural imaging pipelines can be trained to replace the internals of digital cameras, and jointly optimized for high-fidelity photo development and reliable provenance analysis. In our experiments, the proposed approach increased image manipulation detection accuracy from 55% to nearly 98%. The findings encourage further research towards building more reliable imaging pipelines with explicit provenance-guaranteeing properties.

1. Introduction

Ensuring integrity of digital images is one of the most challenging and important problems in multimedia communications. Photographs and videos are commonly used for documentation of important events, and as such, require efficient and reliable authentication protocols. Our current media acquisition and distribution workflows are built with entertainment in mind, and not only fail to provide explicit security features, but actually work against them. Image compression standards exploit heavy redundancy of visual signals to reduce communication payload, but optimize for human perception alone. Security extensions of popular standards lack adoption [19].

Two general approaches to assurance and verification of digital image integrity include [23, 29, 31]: (1) pro-active protection methods based on digital signatures or watermarking; (2) passive forensic analysis which exploits inherent statistical properties resulting from the photo acquisition pipeline. While the former provides superior performance and allows for advanced protection features (like precise tampering localization [37], or reconstruction of tampered content [24]), it failed to gain widespread adoption due to the necessity to generate protected versions of the photographs, and the lack of incentives for camera vendors to modify camera design to integrate such features [3].

Passive forensics, on the other hand, relies on our knowledge of the photo acquisition pipeline, and the statistical artifacts introduced by its successive steps. While this approach is well suited for analyzing potentially any digital photograph, it often falls short due to the complex nature of image post-processing and distribution channels. Digital images are not only heavily compressed, but also enhanced or even manipulated before, during or after dissemination. Popular images have many online incarnations, and tracing their distribution and evolution has spawned a new field of image phylogeny [12, 13], which relies on visual differences between multiple images to infer their relationships and editing history. However, phylogeny does not provide any tools to reason about the authenticity or history of individual images. Hence, reliable authentication of real-world online images remains intractable [38].

At the moment, forensic analysis often yields useful results in near-acquisition scenarios. Analysis of native images straight from the camera is more reliable, and even seemingly benign implementation details - like the rounding operators used in the camera's image signal processor [1] - can provide useful clues. Most forensic traces quickly become unreliable as the image undergoes further post-processing. One of the most reliable tools at our disposal involves analysis of the imaging sensor's artifacts (the photo response non-uniformity pattern), which can be used both for source attribution and content authentication problems [9].
Figure 1. Optimization of the image acquisition and distribution channel to facilitate photo provenance analysis. The neural imaging pipeline (NIP) is trained to develop images that both resemble the desired target images, but also retain meaningful forensic clues at the end of complex distribution channels. [Diagram: RAW measurements → NIP → developed image → post-processing / editing (postproc. A, B, ..., selected by a MUX) → distribution (down-scaling + JPEG compression) → FAN → classification labels (pristine, postproc. A, ...); an L2 loss ties the developed image to the desired output image, and a cross-entropy loss drives the classification.]

In the near future, the rapid progress in computational imaging will challenge digital image forensics even in near-acquisition authentication. In the pursuit of better image quality and convenience, digital cameras and smartphones employ sophisticated post-processing directly in the camera, and soon few photographs will resemble the original images captured by the sensor(s). Adoption of machine learning has recently challenged many long-standing limitations of digital photography, including: (1) high-quality low-light photography [8]; (2) single-shot HDR with overexposed content recovery [14]; (3) practical high-quality digital zoom from multiple shots [36]; (4) quality enhancement of smartphone-captured images with weak supervision from DSLR photos [17].

These remarkable results demonstrate tangible benefits of replacing the entire acquisition pipeline with neural networks. As a result, it will be necessary to investigate the impact of the emerging neural imaging pipelines on existing forensic protocols. While important, such evaluation can be seen as damage assessment and control rather than as a solution for the future. We believe it is imperative to consider novel possibilities for security-oriented design of our cameras and multimedia dissemination channels.

In this paper, we propose to optimize neural imaging pipelines to improve photo provenance in complex distribution channels. We exploit end-to-end optimization of the entire photo acquisition and distribution channel to ensure that reliable forensic decisions can be made even after complex post-processing, where classical forensics fails (Fig. 1). We believe that the imminent revolution in camera design creates a unique opportunity to address some of the long-standing limitations of the current technology. While adoption of digital watermarking in image authentication was limited by the necessity to modify camera hardware, our approach exploits the flexibility of neural networks to learn relevant integrity-preserving features within the expressive power of the model. We believe that with a solid security-oriented understanding of neural imaging pipelines, and with the rare opportunity of replacing the well-established and security-oblivious pipeline, we can significantly improve digital image authentication capabilities.

We aim to inspire discussion about novel camera designs that could improve photo provenance analysis capabilities. We demonstrate that it is possible to optimize an imaging pipeline to significantly improve detection of photo manipulation at the end of a complex real-world distribution channel, where state-of-the-art deep-learning techniques fail. The main contributions of our work include:

1. The first end-to-end optimization of the imaging pipeline with explicit photo provenance objectives;
2. The first security-oriented discussion of neural imaging pipelines and the inherent trade-offs;
3. Significant improvement of forensic analysis performance in challenging, heavily post-processed conditions;
4. A neural model of the entire photo acquisition and distribution channel with a fully differentiable approximation of the JPEG codec.

To facilitate further research in this direction, and enable reproduction of our results, we will publish our neural imaging toolbox at https://github.com/ upon paper acceptance.
2. Related Work

Trends in Pipeline Design. Learning individual steps of the imaging pipeline (e.g., demosaicing) has a long history [20], but regained momentum in recent years thanks to the adoption of novel deep learning techniques. Naturally, the research focused on the most difficult operations, i.e., demosaicing [16, 22, 33, 34] and denoising [5, 25, 39]. The newly developed techniques delivered not only improved performance, but also additional features. Gharbi et al. proposed a convolutional neural network (CNN) trained to perform both demosaicing and denoising [16]. A recent work by Syu et al. proposes to exploit CNNs for joint optimization of the color filter array and a corresponding demosaicing filter [33].

Optimization of digital camera design can go even further. The recently proposed L3 model by Jiang et al. replaces the entire imaging pipeline with a large collection of local linear filters [18]. The L3 model reproduces the entire photo development process, and aims to facilitate research and development efforts for non-standard camera designs. In the original paper, the model was used for learning imaging pipelines for RGBW (red-green-blue-white) and RGB-NIR (red-green-blue-near-infra-red) color filter arrays.

Replacing the entire imaging pipeline with a modern CNN can also overcome long-standing limitations of digital photography. Chen et al. trained a UNet model [30] to develop high-quality photographs in low-light conditions [8] by exposing it to paired examples of images taken with short and long exposure. The network learned to develop high-quality, well-exposed color photographs from underexposed raw input, and yielded better performance than traditional image post-processing based on brightness adjustment and denoising. Eilertsen et al. also trained a UNet model to develop high-dynamic-range images from a single shot [14]. The network not only learned to correctly perform tone mapping, but was also able to recover overexposed highlights. This significantly simplifies HDR photography by eliminating the need for bracketing and dealing with ghosting artifacts.

Trends in Forensics. The current research in forensic image analysis focuses on two main directions: (1) learning deep features relevant to low-level forensic analysis for problems like manipulation detection [2, 40], camera model identification [10], or detection of artificially generated content [26]; (2) adoption of high-level vision to automate manual analysis that exposes physical inconsistencies, such as reflections [32, 35], or shadows [21]. To the best of our knowledge, there are currently no efforts to either assess the consequences of the emerging neural imaging pipelines, or to exploit this opportunity to improve photo reliability.

3. End-to-end Optimization of Photo Provenance Analysis

Digital image forensics relies on intrinsic statistical artifacts introduced to photographs at the time of their acquisition. Such traces are later used for reasoning about the source, authenticity and processing history of individual photographs. The main problem is that contemporary media distribution channels employ heavy compression and post-processing which destroy the traces and inhibit forensic analysis.

The core of the proposed approach is to model the entire acquisition and distribution channel, and optimize the neural imaging pipeline (NIP) to facilitate photo provenance analysis after content distribution (Fig. 1). The analysis is performed by a forensic analysis network (FAN) which makes a decision about the authenticity / processing history of the analyzed photograph. In the presented example, the model is trained to perform manipulation detection, i.e., it aims to classify input images as either coming straight from the camera, or as being affected by a certain class of post-processing. The distribution channel in the middle mimics the behavior of modern photo sharing services and social networks which habitually down-sample and re-compress the photographs. As will be demonstrated later, forensic analysis in such conditions is severely inhibited.

The parameters of the NIP are updated to guarantee both faithful representation of a desired color photograph (L2 loss), and accurate decisions of forensic analysis at the end of the distribution channel (cross-entropy loss). Hence, the parameters of the NIP and FAN models are chosen as:

  θ*_nip = argmin_{θ_nip} Σ_n ‖ y_n − nip(x_n | θ_nip) ‖_2    (1a)
           − Σ_n Σ_c log fan_c( d_c( nip(x_n | θ_nip) ) | θ_fan )    (1b)

  θ*_fan = argmin_{θ_fan} − Σ_n Σ_c log fan_c( d_c( nip(x_n | θ_nip) ) | θ_fan )

where: θ_nip/fan are the parameters of the NIP and FAN networks, respectively; x_n are the raw sensor measurements for the n-th example patch; y_n is the desired target image; nip(x_n) is the color RGB image developed by the NIP from x_n; d_c() denotes a color image patch processed by manipulation c; and fan_c() is the probability that an image belongs to the c-th manipulation class, as estimated by the FAN model.
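For concreteness, the sketch below shows how the two terms of Eq. (1) might be wired together, assuming TensorFlow 1.x; nip, fan, and d are placeholder callables for the pipeline, the analysis network, and the manipulation-plus-channel operator. This is an illustration under those assumptions, not the published implementation.

    # Minimal sketch of the joint objective (Eq. 1), assuming TensorFlow 1.x.
    # `nip`, `fan`, and `d` are hypothetical callables standing in for the
    # pipeline, the analysis network, and manipulation c + distribution channel.
    import tensorflow as tf

    def joint_losses(x, y, labels, nip, fan, d, num_classes=5):
        """Return the fidelity (Eq. 1a) and forensic (Eq. 1b) terms."""
        rgb = nip(x)                                  # develop RAW -> RGB
        fidelity = tf.reduce_mean(                    # L2 distance to target
            tf.norm(tf.layers.flatten(y - rgb), axis=1))
        logits = fan(d(rgb))                          # analyze distributed patches
        forensic = tf.losses.softmax_cross_entropy(   # CE over manipulation classes
            onehot_labels=tf.one_hot(labels, num_classes), logits=logits)
        return fidelity, forensic

    # Sec. 4.4 uses two separate Adam optimizers: one minimizes the forensic
    # term (updating the FAN, and in the F+N mode also the NIP), the other
    # minimizes the fidelity term (updating the NIP only), each with a
    # var_list restricted to the respective network's variables.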
Figure 2. Adoption of a neural imaging pipeline to develop raw sensor measurements into color RGB images: (top) the standard imaging pipeline; (bottom) adoption of the NIP model. [Top: RAW (h, w, 1) → LUT mapping → normalization & calibration → white balancing → demosaicing → denoising → color space conversion → gamma correction → RGB (h, w, 3) in sRGB; bottom: packed RAW (½h, ½w, 4) → NIP → RGB (h, w, 3).]

3.1. The Neural Imaging Pipeline

We replace the entire imaging pipeline with a CNN model which develops raw sensor measurements into color RGB images (Fig. 2). Before feeding the images to the network, we pre-process them by reversing the nonlinear value mapping according to the camera's LUT, subtracting black levels from the edge of the sensor, normalizing by sensor saturation values, and applying white-balancing according to shot settings. We also standardize the inputs by reshaping the tensors to have feature maps with successive measured color channels. This ensures a well-formed input of shape (h/2, w/2, 4) with values normalized to [0, 1]. All of the remaining steps of the pipeline are replaced by a NIP. See Section 4.1 for details on the considered pipelines.

3.2. Approximation of JPEG Compression

To enable end-to-end optimization of the entire acquisition and distribution channel, we need to ensure that every processing step remains differentiable. In the considered scenario, the main problem is JPEG compression. We designed a JPEGNet model which approximates the operation of a standard JPEG codec, and expresses successive steps of the codec as matrix multiplications or convolution layers that can be implemented in TensorFlow (see supplementary materials for a detailed network definition):

• RGB to/from YCbCr color-space conversions are implemented as 1 × 1 convolutions.
• Isolation of 8 × 8 blocks for independent processing is implemented by a combination of space-to-depth and reshaping operations.
• Forward/backward 2D discrete cosine transforms are implemented by matrix multiplication according to D x Dᵀ, where x denotes an 8 × 8 input, and D denotes the transformation matrix.
• Division/multiplication of DCT coefficients by the corresponding quantization steps is implemented as element-wise operations with properly tiled and concatenated quantization matrices (for both the luminance and chrominance channels).
• The actual quantization is approximated by a continuous function ρ(x) (see details below).

The key problem in making JPEG fully differentiable lies in the rounding of DCT coefficients. We considered two possible approximations. Firstly, we used a Taylor series expansion, which can be made arbitrarily accurate by considering more terms. Finally, we decided to use a smoother, and simpler, sinusoidal approximation obtained by matching the phase with the sawtooth function:

  ρ(x) = x − sin(2πx) / (2π)    (2)

Both approximations are shown in Fig. 3a. We validated our JPEGNet by comparing the produced images with a reference codec from libJPEG. The results are shown in Fig. 3b-d for a standard rounding operation, and the two approximations, respectively. We used 5 terms for the harmonic rounding. The developed module produces equivalent compression results with standard rounding, and a good approximation for its differentiable variants. Fig. 3e-h show a visual comparison of an example image patch, and its libJPEG- and JPEGNet-compressed counterparts. In our distribution channel, we used quality level 50.

Figure 3. Implementation of JPEG compression as a fully differentiable JPEGNet module: (a) continuous approximations of the rounding function; (b)-(d) validation of the JPEGNet module against the standard libJPEG library with standard rounding, and the harmonic and sinusoidal approximations (scatter plots of PSNR for libJPEG vs. JPEGNet across quality levels Q10-Q100); (e) an example uncompressed image patch (96×96 px); (f) standard JPEG compression with quality 50, PSNR = 34.4 dB; (g)-(h) JPEGNet-compressed patches with the harmonic (PSNR = 35.5 dB) and sinusoidal (PSNR = 37.9 dB) approximations.
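Both rounding surrogates are easy to reproduce; a minimal NumPy sketch follows. The paper does not spell out the exact series used for the harmonic rounding, so the standard Fourier expansion of the sawtooth x − round(x) is assumed here purely for illustration.

    # Differentiable rounding surrogates for DCT coefficients (Sec. 3.2).
    # sin_round implements Eq. (2); harmonic_round is an assumed 5-term
    # sawtooth Fourier series, standing in for the paper's series variant.
    import numpy as np

    def sin_round(x):
        # Eq. (2): rho(x) = x - sin(2*pi*x) / (2*pi)
        x = np.asarray(x, dtype=float)
        return x - np.sin(2 * np.pi * x) / (2 * np.pi)

    def harmonic_round(x, terms=5):
        # Truncated Fourier series of the sawtooth x - round(x); adding
        # terms sharpens the step ("arbitrarily accurate with more terms").
        x = np.asarray(x, dtype=float)
        s = sum((-1.0) ** (k + 1) * np.sin(2 * np.pi * k * x) / k
                for k in range(1, terms + 1))
        return x - s / np.pi

    xs = np.linspace(-2, 2, 9)
    print(np.round(xs))        # hard (non-differentiable) rounding
    print(sin_round(xs))       # smooth surrogate used in the channel
    print(harmonic_round(xs))  # 5-term series, closer to the hard step

Both functions agree with hard rounding exactly at integers and deviate smoothly in between, which is what keeps the gradient flowing through the codec.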
3.3. The Forensic Analysis Network

The forensic analysis network (FAN) is implemented as a CNN following the most recent recommendations on the construction of neural networks for forensic analysis [2]. Bayar and Stamm have proposed a new layer type which constrains the learned filters to be valid residual filters [2]. Adoption of the layer helps ignore visual content and facilitates extraction of forensically-relevant low-level features. In summary, our network operates on 128 × 128 × 3 patches in the RGB color space and includes (see supplementary materials for the full network definition):

• A constrained convolutional layer learning 5 × 5 residual filters, with no activation function.
• Four 5 × 5 convolutional layers with a doubling number of output feature maps (starting from 32). The layers use leaky ReLU activation and are followed by 2 × 2 max pooling.
• A 1 × 1 convolutional layer mapping 256 features into 256 features.
• A global average pooling layer reducing the number of features to 256.
• Two fully connected layers with 512 and 128 nodes activated by leaky ReLU.
• A fully connected layer with N = 5 output nodes and softmax activation.

In total, the network has 1,341,990 parameters.
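The list above maps naturally onto a few lines of tf.keras; the sketch below is an illustrative reconstruction, not the authors' code. The number of constrained residual filters is an assumption, the Bayar-Stamm constraint itself (typically enforced by re-projecting the first layer's kernels after each gradient step) is omitted, and the exact parameter count will therefore differ from the 1,341,990 quoted above.

    # Illustrative FAN architecture following the bullet list in Sec. 3.3,
    # assuming tf.keras; the residual-filter constraint is not enforced here.
    import tensorflow as tf
    from tensorflow.keras import layers

    def build_fan(num_classes=5):
        inputs = tf.keras.Input(shape=(128, 128, 3))
        # "Constrained" 5x5 residual-filter layer; 3 output maps assumed.
        x = layers.Conv2D(3, 5, padding='same', use_bias=False)(inputs)
        filters = 32
        for _ in range(4):                          # four 5x5 conv blocks
            x = layers.Conv2D(filters, 5, padding='same')(x)
            x = layers.LeakyReLU()(x)
            x = layers.MaxPool2D(2)(x)
            filters *= 2                            # 32 -> 64 -> 128 -> 256
        x = layers.Conv2D(256, 1)(x)                # 1x1 mapping, 256 -> 256
        x = layers.GlobalAveragePooling2D()(x)      # 256-d feature vector
        x = layers.LeakyReLU()(layers.Dense(512)(x))
        x = layers.LeakyReLU()(layers.Dense(128)(x))
        outputs = layers.Dense(num_classes, activation='softmax')(x)
        return tf.keras.Model(inputs, outputs)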
4. Experimental Evaluation

We started our evaluation by using several NIP models to reproduce the output of a standard imaging pipeline (Sec. 4.1). Then, we used the FAN model to detect popular image manipulations (Sec. 4.2). Initially, we validated that the models work correctly by using them without a distribution channel (Sec. 4.3). Finally, we performed an extensive evaluation of the entire acquisition and distribution network (Sec. 4.4).

We collected a dataset with RAW images from 8 cameras (Table 1). The photographs come from two public datasets (Raise [11] and MIT-5k [6]) and from one private dataset. For each camera, we randomly selected 150 images with landscape orientation. These images are later divided into separate training/validation sets.

Table 1. Digital cameras used in our experiments

  Camera          Res.¹   #Images   Source    Bayer
  Canon EOS 5D    12      864 dng   MIT-5k    RGGB
  Canon EOS 40D   10      313 dng   MIT-5k    RGGB
  Nikon D5100     16      288 nef   Private   RGGB
  Nikon D700      12      590 dng   MIT-5k    RGGB
  Nikon D7000     16      >1k nef   Raise     RGGB
  Nikon D750      24      312 nef   Private   RGGB
  Nikon D810      36      205 nef   Private   RGGB
  Nikon D90       12      >1k nef   Raise     GBRG
  ¹ Sensor resolution in megapixels [Mpx]

4.1. Neural Imaging Pipelines

We considered three NIP models with various complexity and design principles (Table 2): INet - a simple convolutional network with layers corresponding to successive steps of the standard pipeline; UNet - the well-known UNet architecture [30] adapted from [8]; DNet - an adaptation of the model originally used for joint demosaicing and denoising [16]. A detailed specification of the networks' architectures is included in the supplementary materials.

We trained a separate model for each camera in our dataset. For training, we used 120 diverse full-resolution images. In each iteration, we extracted randomly located 128 × 128 patches and formed a batch with 20 examples (one patch per image). For validation, we used a fixed set of 512 × 512 px patches extracted from the remaining 30 images. The models were trained to reproduce color RGB images developed by a standard imaging pipeline. We used our own implementation based on Python and rawkit [7] wrappers over libRAW [27]. Demosaicing was performed using an adaptive algorithm by Menon et al. [28].

All NIPs successfully reproduced the target images with high fidelity. The resulting color photographs are visually indistinguishable from the targets. Objective fidelity measurements for the validation set are collected in Table 2 (averages for all 8 cameras). Interestingly, the trained models often revealed better denoising and demosaicing performance, despite the lack of a denoising step in the simulated pipeline, and the lack of explicit optimization objectives (see Fig. 4).

Figure 4. Examples of serendipitous image quality improvements obtained by neural imaging pipelines: (a) better demosaicing performance (Canon EOS 40D); (b) better denoising (Nikon D5100). [Patches compare the optimization target, bilinear demosaicing, and the converged pipeline.]

Table 2. Considered neural imaging pipelines

                        INet        UNet        DNet
  # Parameters          321         7,760,268   493,976
  PSNR [dB]             42.8        44.3        46.2
  SSIM                  0.989       0.990       0.995
  Train. speed [it/s]   8.80        1.75        0.66
  Train. time           17-26 min   2-4 h       12-22 h

Of all the considered models, INet was the easiest to train - not only due to its simplicity, but also because it could be initialized with meaningful parameters that already produced valid results and only needed fine-tuning. We initialized the demosaicing filters with bilinear interpolation, the color space conversion with a known multiplication matrix, and the gamma correction with a toy model separately trained to reproduce this non-linearity. The UNet model was initialized randomly, but improved rapidly thanks to skip connections. The DNet model took the longest and for a long time had problems with faithful color rendering. The typical training times are reported in Table 2. The measurements were collected on an Nvidia Tesla K80 GPU. The models were trained until the relative change of the average validation loss over the last 5 validation steps dropped below 10⁻⁴. The maximum number of epochs was 50,000. For the DNet model, we adjusted the stopping criterion to 10⁻⁵, since the training progress was slow and often terminated prematurely with incorrect color rendering.

4.2. Image Manipulation

Our experiment mirrors the standard setup for image manipulation detection [2, 4, 15]. We consider four mild post-processing operations: sharpening - implemented as an unsharp mask operator with the following kernel:

        ⎡ −1  −4  −1 ⎤
  1/6 · ⎢ −4  26  −4 ⎥    (3)
        ⎣ −1  −4  −1 ⎦

applied to the luminance channel in the HSV color space; resampling - implemented as successive 1:2 down-sampling and 2:1 up-sampling using bilinear interpolation; Gaussian filtering - implemented using a convolutional layer with a 5 × 5 filter and standard deviation equal to 4; and JPG compression - implemented using the JPEGNet module with the sinusoidal rounding approximation and quality level 80. Fig. 5 shows the post-processed variants of an example image patch: (a) just after manipulation; and (b) after the distribution channel (as seen by the FAN module). A short sketch of these operations follows this section.

Figure 5. An example image patch with all of the considered manipulation variants: (a) just after the manipulation; (b) after the distribution channel (down-sampling and JPEG compression). [Panels: native patch, sharpen, Gaussian filtering, JPEGNet, resampling.]
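The four manipulations are simple enough to sketch with SciPy, assuming an RGB float image in [0, 1]; the HSV-luminance detail of the sharpening step is simplified to per-channel filtering, so this is an approximation of the setup rather than a reimplementation.

    # Sketch of the Sec. 4.2 manipulations, assuming an RGB float image
    # in [0, 1]; per-channel sharpening replaces the paper's HSV variant.
    import numpy as np
    from scipy.ndimage import convolve, gaussian_filter, zoom

    UNSHARP = np.array([[-1, -4, -1],
                        [-4, 26, -4],
                        [-1, -4, -1]], dtype=float) / 6.0   # Eq. (3)

    def sharpen(img):
        return np.stack([convolve(img[..., c], UNSHARP)
                         for c in range(3)], axis=-1)

    def resample(img):
        small = zoom(img, (0.5, 0.5, 1), order=1)   # 1:2 bilinear down-sampling
        return zoom(small, (2.0, 2.0, 1), order=1)  # 2:1 bilinear up-sampling

    def gaussian(img):
        # sigma = 4; truncate = 0.5 limits the kernel to a 5 x 5 support
        return gaussian_filter(img, sigma=(4, 4, 0), truncate=0.5)

    # JPG compression at quality 80 is applied with the differentiable
    # JPEGNet module described in Sec. 3.2 (omitted here).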
4.3. FAN Model Validation

To validate our FAN model, we initially implemented a simple experiment where analysis is performed just after image manipulation (no distribution channel distortion, as in [2]). We used the UNet model to develop larger image patches to guarantee the same size of inputs for the FAN model (128 × 128 × 3 RGB images). In such conditions, the model works just as expected, and yields a classification accuracy of 99% [2].

4.4. Imaging Pipeline Optimization

In this experiment, we perform forensic analysis at the end of the distribution channel. We consider two optimization modes: (F) only the FAN network is optimized, given a fixed NIP model; (F+N) both the FAN and NIP models are optimized jointly. In both cases, the NIPs are pre-initialized with previously trained models (Section 4.1). The training was implemented with two separate Adam optimizers, where the first one updates the FAN (and in the F+N mode also the NIP), and the second one updates the NIP based on the image fidelity objective.

Similarly to previous experiments, we used 120 full-resolution images for training, and the remaining 30 images for validation. From the training images, in each iteration we randomly extract new patches. The validation set is fixed at the beginning and includes 100 random patches per image (3,000 patches in total) for classification accuracy assessment. To speed up image fidelity evaluation, we used 2 patches per image (60 patches in total). In order to prevent over-representation of empty patches, we bias the selection by outright rejection of patches with pixel variance < 0.01, and by a 50% chance of keeping patches with variance < 0.02. More diverse patches are always accepted.
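The variance-based selection bias can be written as a small rejection sampler; the sketch below assumes grayscale patches scaled to [0, 1], and the function names and exact control flow are illustrative rather than the published code.

    # Illustrative variance-biased patch sampling (Sec. 4.4), assuming
    # single-channel patches with values in [0, 1].
    import numpy as np

    def accept_patch(patch, rng):
        v = patch.var()                # pixel variance of the patch
        if v < 0.01:
            return False               # reject nearly flat patches outright
        if v < 0.02:
            return rng.rand() < 0.5    # keep borderline patches half the time
        return True                    # diverse patches are always accepted

    def sample_patches(image, n, size=128, rng=None):
        rng = rng or np.random.RandomState(0)
        h, w = image.shape[:2]
        patches = []
        while len(patches) < n:
            i = rng.randint(0, h - size)
            j = rng.randint(0, w - size)
            p = image[i:i + size, j:j + size]
            if accept_patch(p, rng):
                patches.append(p)
        return patches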
Due to computational constraints, we performed the experiment for 4 cameras (Canon EOS 40D and EOS 5D, and Nikon D7000 and D90; see Table 1) and for the INet and UNet models only. (Based on preliminary experiments, we excluded the DNet model, which rapidly lost and could not regain image representation fidelity.) We ran the optimization for 1,000 epochs, starting with a learning rate of 10⁻⁴ and systematically decreasing it by 15% every 100 epochs. Each run was repeated 10 times. The typical progression of validation metrics for the UNet model (classification accuracy and distortion PSNR) is shown in Fig. 6 (sampled every 50 epochs).

Figure 6. Typical progression of validation metrics (Nikon D90) for standalone FAN training (F) and joint optimization of the FAN and NIP models (F+N). [Plots of PSNR (dB) and FAN accuracy vs. training progress.]

The distribution channel significantly disrupts forensic analysis, and the classification accuracy drops to ≈ 55% for standalone FAN optimization (F). In particular, the FAN struggles to identify the three low-pass filtering operations (Gaussian filtering, JPEG compression, and re-sampling; see the confusion matrix in Tab. 3a), which bear a strong visual resemblance at the end of the distribution channel (Fig. 5b). Optimization of the imaging pipeline significantly increases classification accuracy (up to 98%) and makes all of the manipulation paths easily distinguishable (Tab. 3c).

Table 3. Typical confusion matrices (Nikon D90). Entries ≈ 0 are not shown; entries below ~3% are marked with (*).

  (a) standalone FAN optimization (UNet) → 55.2%
  True \ Predicted   nat.   sha.   gau.   jpg   res.
  native             73     24     *      *
  sharpen            7      88     *      *     *
  gaussian           *      *      45     35    18
  jpg                *      12     22     54    10
  resampled          *      *      37     44    16

  (b) joint FAN+NIP optimization (INet) → 67.4%
  True \ Predicted   nat.   sha.   gau.   jpg   res.
  native             77     15     *      *     4
  sharpen            5      82     5      5     *
  gaussian           *      *      62     22    13
  jpg                *      8      42     37    11
  resampled                 *      18     2     79

  (c) joint FAN+NIP optimization (UNet) → 97.8%
  True \ Predicted   nat.   sha.   gau.   jpg   res.
  native             97     *
  sharpen                   100
  gaussian           *             96     *
  jpg                *      *             96
  resampled                                     100

Fig. 7 collects the obtained results for both the UNet and INet models. UNet delivers a consistent and significant improvement in analysis accuracy, which increases up to 95-98%. INet is much simpler and yields only modest classification improvements - up to ≈ 65%. It also lacked consistency, and for one of the tested cameras yielded virtually no improvement. However, since the final decision is often taken by boosting results from many patches of a larger image, we believe the model could still be useful given its simplicity and minimal quality degradation.

Figure 7. Validation accuracy for image manipulation detection after the distribution channel: (F) denotes standalone FAN training given a fixed NIP; (F+N) denotes joint optimization of the FAN and NIP models. [Bar plots of classification accuracy (0.5-1.0) per camera for (a) the INet model and (b) the UNet model.]

The observed improvement in forensic classification accuracy comes at the cost of image distortion, and leads to artifacts in the developed photographs. In our experiments, photo development fidelity dropped to 36 dB (PSNR) / 0.96 (SSIM) for the UNet model, and 40 dB / 0.97 for the INet model. The complete results are collected in Tab. 4. A qualitative illustration of the observed artifacts is shown in Fig. 8. The figure shows diverse image patches developed by several variants of the UNet/INet models (different training runs). The artifacts vary in severity from disturbing to imperceptible, and tend to be well masked by image content. INet's limited flexibility resulted in virtually imperceptible distortions, which compensates for its modest improvements in FAN classification accuracy.

Table 4. The fidelity-accuracy trade-off for joint optimization of the FAN and NIP models

  Camera     PSNR [dB]      SSIM             Acc.
  UNet model
  EOS 40D    42.9 → 36.5    0.991 → 0.969    0.51 → 0.94
  EOS 5D     43.0 → 36.0    0.991 → 0.966    0.57 → 0.97
  D7000      43.7 → 35.4    0.992 → 0.963    0.55 → 0.97
  D90        43.3 → 35.8    0.991 → 0.965    0.56 → 0.98
  INet model
  EOS 40D    42.2 → 40.9    0.986 → 0.983    0.50 → 0.56
  EOS 5D     40.1 → 40.6    0.986 → 0.985    0.55 → 0.56
  D7000      41.5 → 39.7    0.987 → 0.981    0.54 → 0.65
  D90        43.8 → 40.8    0.992 → 0.986    0.56 → 0.65

Figure 8. Example image patches developed with joint NIP and FAN optimization (F+N). The UNet examples show the best and worst observed patches at the end of 1,000 training epochs; the INet examples show a typical output. [Rows: (F) UNet reference; (F+N) UNet worst; (F+N) UNet best; (F) INet reference; (F+N) INet typical.]

5. Discussion and Future Work

We replaced the photo acquisition pipeline with a neural network, and developed a fully-differentiable model of the entire photo acquisition and distribution channel. Such an approach allows for joint optimization of the photo analysis and acquisition models. Our results show that it is possible to optimize the imaging pipeline for better provenance analysis capabilities at the end of complex distribution channels. We observed a significant improvement in manipulation detection accuracy w.r.t. state-of-the-art classical forensics [2] (accuracy increased from ≈ 55% to nearly 98%).

Competing optimization objectives lead to imaging artifacts which tend to be well masked in textured areas, but are currently too visible in empty and flat regions. The limited artifacts and classification improvement for the simpler pipeline model indicate that better configurations should be achievable by learning to control the fidelity-accuracy trade-off. Further improvements may also be possible by exploiting explicit HVS (human visual system) modeling. In the future, we would like to consider joint optimization of additional components of the acquisition-distribution channel, e.g., Bayer filters in the camera, or lossy compression during content distribution.
References

[1] S. Agarwal and H. Farid. Photo forensics from JPEG dimples. In IEEE Workshop on Information Forensics and Security (WIFS), pages 1-6. IEEE, 2017.
[2] B. Bayar and M. Stamm. Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection. IEEE Transactions on Information Forensics and Security, 13(11):2691-2706, 2018.
[3] P. Blythe and J. Fridrich. Secure digital camera. In Digital Forensic Research Workshop, pages 11-13, 2004.
[4] M. Boroumand and J. Fridrich. Deep learning for detecting processing history of images. Electronic Imaging, 2018(7):1-9, 2018.
[5] H. Burger, C. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), pages 2392-2399. IEEE, 2012.
[6] V. Bychkovsky, S. Paris, E. Chan, and F. Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In IEEE Conf. on Computer Vision and Pattern Recognition, 2011.
[7] P. Cameron and S. Whited. Rawkit. https://rawkit.readthedocs.io/en/latest/. Accessed 5 Nov 2018.
[8] C. Chen, Q. Chen, J. Xu, and V. Koltun. Learning to see in the dark. arXiv, 2018. arXiv:1805.01934.
[9] M. Chen, J. Fridrich, M. Goljan, and J. Lukas. Determining image origin and integrity using sensor noise. IEEE Trans. Inf. Forensics Security, 3(1):74-90, 2008.
[10] D. Cozzolino and L. Verdoliva. Noiseprint: a CNN-based camera model fingerprint. arXiv preprint arXiv:1808.08396, 2018.
[11] D. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato. RAISE - a raw images dataset for digital image forensics. In Proc. of ACM Multimedia Systems, 2015.
[12] Z. Dias, S. Goldenstein, and A. Rocha. Large-scale image phylogeny: Tracing image ancestral relationships. IEEE Multimedia, 20(3):58-70, 2013.
[13] Z. Dias, S. Goldenstein, and A. Rocha. Toward image phylogeny forests: Automatically recovering semantically similar image relationships. Forensic Science International, 231(1):178-189, 2013.
[14] G. Eilertsen, J. Kronander, G. Denes, R. Mantiuk, and J. Unger. HDR image reconstruction from a single exposure using deep CNNs. ACM Transactions on Graphics, 36(6):1-15, Nov 2017.
[15] W. Fan, K. Wang, and F. Cayre. General-purpose image forensics using patch likelihood under image statistical models. In Proc. of IEEE Int. Workshop on Inf. Forensics and Security, 2015.
[16] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand. Deep joint demosaicking and denoising. ACM Transactions on Graphics, 35(6):1-12, Nov 2016.
[17] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool. WESPE: weakly supervised photo enhancer for digital cameras. arXiv preprint arXiv:1709.01118, 2017.
[18] H. Jiang, Q. Tian, J. Farrell, and B. A. Wandell. Learning the image processing pipeline. IEEE Transactions on Image Processing, 26(10):5032-5042, Oct 2017.
[19] Joint Photographic Experts Group. JPEG Privacy & Security. https://jpeg.org/items/20150910_privacy_security_summary.html. Accessed 24 Oct 2018.
[20] O. Kapah and H. Z. Hel-Or. Demosaicking using artificial neural networks. In Applications of Artificial Neural Networks in Image Processing V, volume 3962, pages 112-121. International Society for Optics and Photonics, 2000.
[21] E. Kee, J. O'Brien, and H. Farid. Exposing photo manipulation from shading and shadows. ACM Trans. Graph., 33(5), 2014. Article 165.
[22] F. Kokkinos and S. Lefkimmiatis. Deep image demosaicking using a cascade of convolutional residual denoising networks. arXiv, 2018. arXiv:1803.05215v4.
[23] P. Korus. Digital image integrity - a survey of protection and verification techniques. Digital Signal Processing, 71:1-26, 2017.
[24] P. Korus, J. Bialas, and A. Dziech. Towards practical self-embedding for JPEG-compressed digital images. IEEE Trans. Multimedia, 17(2):157-170, Feb 2015.
[25] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2Noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.
[26] H. Li, B. Li, S. Tan, and J. Huang. Detection of deep network generated images using disparities in color components. arXiv preprint arXiv:1808.07276, 2018.
[27] LibRaw LLC. libRAW. https://www.libraw.org/. Accessed 5 Nov 2018.
[28] D. Menon, S. Andriani, and G. Calvagno. Demosaicing with directional filtering and a posteriori decision. IEEE Transactions on Image Processing, 16(1):132-141, 2007.
[29] A. Piva. An overview on image forensics. ISRN Signal Processing, 2013, 2013. Article ID 496701.
[30] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Int. Conf. on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[31] M. Stamm, M. Wu, and K. Liu. Information forensics: An overview of the first decade. IEEE Access, 1:167-200, 2013.
[32] Z. H. Sun and A. Hoogs. Object insertion and removal in images with mirror reflection. In IEEE Workshop on Information Forensics and Security (WIFS), pages 1-6. IEEE, 2017.
[33] N.-S. Syu, Y.-S. Chen, and Y.-Y. Chuang. Learning deep convolutional networks for demosaicing. arXiv, 2018. arXiv:1802.03769.
[34] R. Tan, K. Zhang, W. Zuo, and L. Zhang. Color image demosaicking via deep residual learning. In IEEE Int. Conf. Multimedia and Expo (ICME), 2017.
[35] E. Wengrowski, Z. Sun, and A. Hoogs. Reflection correspondence for exposing photograph manipulation. In IEEE Int. Conf. Image Processing (ICIP), pages 4317-4321. IEEE, 2017.
[36] B. Wronski and P. Milanfar. See better and further with Super Res Zoom on the Pixel 3. https://ai.googleblog.com/2018/10/see-better-and-further-with-super-res.html. Accessed 22 Oct 2018.
[37] C. P. Yan and C. M. Pun. Multi-scale difference map fusion for tamper localization using binary ranking hashing. IEEE Transactions on Information Forensics and Security, 12(9):2144-2158, 2017.
[38] M. Zampoglou, S. Papadopoulos, and Y. Kompatsiaris. Large-scale evaluation of splicing localization algorithms for web images. Multimedia Tools and Applications, 76(4):4801-4834, 2017.
[39] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142-3155, 2017.
[40] P. Zhou, X. Han, V. Morariu, and L. S. Davis. Learning rich features for image manipulation detection. In IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), pages 1053-1061, 2018.