Label Embedded Dictionary Learning for Image Classification

Page created by Gordon Arnold

Uncategorized

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Label Embedded Dictionary Learning for Image Classification

Label Embedded Dictionary Learning for Image
                                                                Classification
                                           Shuai Shaoa , Yan-Jiang Wanga,∗, Bao-Di Liua,∗∗, Weifeng Liua , Rui Xua
                                            a
                                                College of Information and Control Engineering, China University of Petroleum,
arXiv:1903.03087v2 [cs.CV] 17 Apr 2019

                                                                           Qingdao 266580, China

                                         Abstract
                                         Recently, label consistent k-svd (LC-KSVD) algorithm has been successfully
                                         applied in image classification. The objective function of LC-KSVD is con-
                                         sisted of reconstruction error, classification error and discriminative sparse
                                         codes error with `0 -norm sparse regularization term. The `0 -norm, however,
                                         leads to NP-hard problem. Despite some methods such as orthogonal match-
                                         ing pursuit can help solve this problem to some extent, it is quite difficult to
                                         find the optimum sparse solution. To overcome this limitation, we propose
                                         a label embedded dictionary learning (LEDL) method to utilise the `1 -norm
                                         as the sparse regularization term so that we can avoid the hard-to-optimize
                                         problem by solving the convex optimization problem. Alternating direc-
                                         tion method of multipliers and blockwise coordinate descent algorithm are
                                         then exploited to optimize the corresponding objective function. Extensive
                                         experimental results on six benchmark datasets illustrate that the proposed
                                         algorithm has achieved superior performance compared to some conventional
                                         classification algorithms.
                                         Keywords: Dictionary learning, sparse representation, label embedded
                                         dictionary learning, image classification

                                            ∗
                                            PaperID:NEUCOM-S-19-01925 Corresponding author.
                                           ∗∗
                                            PaperID:NEUCOM-S-19-01925 Corresponding author.
                                            Email addresses: shuaishao@s.upc.edu.cn (Shuai Shao), yjwang@upc.edu.cn
                                         (Yan-Jiang Wang), thu.liubaodi@gmail.com (Bao-Di Liu), liuwf@upc.edu.cn
                                         (Weifeng Liu), xddxxr@126.com (Rui Xu)

                                         Preprint submitted to Neurocomputing                                      April 18, 2019

Figure 1:     The scheme of LEDL is on the right while the LC-KSVD is on the left.
The difference between the two methods is the sparse regularization term which LEDL
use the `1 -norm regularization term and LC-KSVD use the `0 -norm regularization term.
Compared with `0 -norm, the sparsity constraint factor of `1 -norm is unfixed so that the
basis vectors can be selected freely for linear fitting. Thus, our proposed LEDL method
can get smaller errors than LC-KSVD.

1. Introduction
    Recent years, image classification has been a classical issue in computer
vision. Many successful algorithms Yu et al. (2012b); Shi et al. (2018); Wright
et al. (2009); Yu et al. (2014); Song et al. (2018); Yu et al. (2013); Wang
et al. (2018); Yu et al. (2012a); Yang et al. (2009, 2017); Liu et al. (2014a,
2017); Jiang et al. (2013); Hao et al. (2017); Chan et al. (2015); Nakazawa
and Kulkarni (2018); Ji et al. (2014); Xu et al. (2019); Yuan et al. (2016)
have been proposed to solve the problem. In these algorithms, there is one
category that contributes a lot for image classification which is the sparse
representation based method.
    Sparse representation is capable of expressing the input sample features
as a linear combination of atoms in an overcomplete basis set. Wright et al.
(2009) proposed sparse representation based classification (SRC) algorithm

                                           2

which use the `1 -norm regularization term to achieve impressive performance.
SRC is the most representative one in the sparse representation based meth-
ods. However, in traditional sparse representation based methods, training
sample features are directly exploited without considering the discriminative
information which is crucial in real applications. That is to say, sparse rep-
resentation based methods can gain better performance if the discriminative
information is properly harnessed.
    To handle this problem, dictionary learning (DL) method is introduced
to preprocess the training sample features befor classification. DL is a gener-
ative model for sparse representation which the concept was firstly prposed
by Mallat and Zhang (1993). A few years later, Olshausen and Field (1996,
1997) proposed the application of DL on natural images and then it has been
widely used in many fields such as image denoising Chang et al. (2000); Li
et al. (2012, 2018), image superresolution Yang et al. (2010); Wang et al.
(2012b); Gao et al. (2018), and image classification Liu et al. (2017); Jiang
et al. (2013); Chang et al. (2016). A well learned dictionary can help to get
significant boost in classification accuracy. Therefore, DL based methods in
classification are more and more popular in recent years.
    Specificially, there are two strategies are proposed to successfully utilise
the discriminative information: i) class specific dictionary learning ii) class
shared dictionary learning. The first strategy is to learn specific dictionaries
for each class such as Wang et al. (2012a); Yang et al. (2014); Liu et al.
(2016). The second strategy is to learn a shared dictionary for all classes.
For example, Zhang and Li (2010) proposed discriminative K-SVD(D-KSVD)
algorithm to directly add the discriminative information into objective func-
tion. Furthermore, Jiang et al. (2013) proposed label consistence K-SVD
(LC-KSVD) method which add a label consistence term into the objective
function of D-KSVD. The motivation for adding this term is to encourage the
training samples from the same class to have similar sparse codes and those
from different classes to have dissimilar sparse codes. Thus, the discrimi-
native abilities of the learned dictionary is effectively improved. However,
the sparse regularization term in LC-KSVD is `0 -norm which leads to the
NP-hard Natarajan (1995) problem. Although some greedy methods such as
orthogonal matching pursuit (OMP) Tropp and Gilbert (2007) can help solve
this problem to some extent, it is usually to find the suboptimum sparse solu-
tion instead of the optimal sparse solution. More specifically, greedy method
solve the global optimal problems by finding basis vectors in order of recon-
struction errors from small to large until T (the sparsity constraint factor)

                                       3

times. Thus, the initialized values are crucial. To this end, `0 -norm based
sparse constraint is not conducive to finding a global minimum value to ob-
tain the optimal sparse solution.
    In this paper, we propose a novel dictionary learning algorithm named
label embedded dictionary learning (LEDL). This method introduces the `1 -
norm regularization term to replace the `0 -norm regularization of LC-KSVD.
Thus, we can freely select the basis vectors for linear fitting to get optimal
sparse solution. In addition, `1 -norm sparse representation is widely used
in many fields so that our proposed LEDL method can be extended and
applied easily. We show the difference between our proposed LEDL and LC-
KSVD in Figure 1. We adopt the alternating direction method of multipliers
(ADMM) Boyd et al. (2011) framework and blockwise coordinate descent
(BCD) Liu et al. (2014b) algorithm to optimize LEDL. Our work mainly
focuses on threefold.

   • We propose a novel dictionary learning algorithm named label embed-
     ded dictionary learning which introduces the `1 -norm regularization
     term as the sparse constraint. The `1 -norm sparse constraint is able to
     help easily find the optimal sparse solution.

   • We propose to utilize the alternating direction method of multipliers
     (ADMM) Boyd et al. (2011) algorithm and blockwise coordinate de-
     scent (BCD) Liu et al. (2014b) algorithm to optimize dictionary learn-
     ing task.

   • We verify the superior performance of our method on six benchmark
     datasets.

    The rest of the paper is organized as follows. Section 2 reviews two con-
ventional methods which are SRC and LC-KSVD. Section 3.1 presents LEDL
method for image classification. The optimization approach and the conver-
gence are elaborated in Section 3.2. Section 4 shows experimental results on
six well-known datasets. Finally, we conclude this paper in Section 5.

2. Related Work
   In this section, we overview two related algorithms, including sparse
representation based classification (SRC) and label consistent K-SVD (LC-
KSVD).

                                      4

Algorithm 1 Sparse representation based classification
Input: X ∈ RD×N , y ∈ RD×1 , α
Output: id(y)
1: Code y with nthe dictionary X via `o1 -minimization.
2: ŝ = arg mins ky − Xsk22 + 2αksk1
3:   for c = 1;c ≤ C;c++ do
4:       Compute the residual ec (y) = ky − xc ŝc k22
5:   end for
6:   id (y) = arg minc {ec }
7:   return id(y)

2.1. Sparse representation based classification (SRC)
    SRC was proposed by Wright et al. (2009). Assume that we have C classes
of training samples, denoted by Xc , c = 1, 2, · · · , C, where Xc is the training
sample matrix of class c. Each column of the matrix Xc is a training sample
feature from the cth class. The whole training sample matrix can be denoted
as X = [X1 , X2 , · · · XC ] ∈ RD×N , where D represents the dimensions of the
sample features and N is the number of training samples. Supposing that
y ∈ RD×1 is a testing sample vector, the sparse representation algorithm
aims to solve the following objective function:

                        ŝ = arg min ky − Xsk22 + 2αksk1
                                    
                                  s
                                                                               (1)

where, α is the regularization parameter to control the tradeoff between fit-
ting goodness and sparseness. The sparse representation based classification
is to find the minimum value of the residual error for each class.

                                    id (y) = arg min ky − xc ŝc k22          (2)
                                                         c

where id (y) represents the predictive label of y, ŝc is the sparse code of
cth class. The procedure of SRC is shown in Algorithm 1. Obviously, the
residual ec is associated with only a few images in class c.

2.2. Label Consistent K-SVD (LC-KSVD)
    Jiang et al. (2013) proposed LC-KSVD to encourage the similarity among
representations of samples belonging to the same class in D-KSVD. The au-
thors proposed to combine the discriminative sparse codes error with the re-
construction error and the classification error to form a unified objective func-
tion, which gave discriminative sparse codes matrix Q = [q1 , q2 , · · · , qN ] ∈

                                                         5

Algorithm 2 Label Consistent K-SVD
Input: X ∈ RD×N , H ∈ RC×N , Q ∈ RK×N , λ, ω, T , K
Output: B ∈ RD×K , W ∈ RC×K , A ∈ RK×K , S ∈ RK×N
1: Compute B0 by combining class-specific dictionary items for each class using K-SVD Aharon et al.
   (2006);
2: Compute S0 for X and B0 using sparse coding;
3: Compute A0 using A = QST SST + I −1 ;
                                          −1
4: Cpmpute W0 using    W = HST SST + I        ;
                                  
                       √B0
5: Solve Eq.(3); Use  √ ωA0  to initialize the dictionary.
                          λW0
6: Normalize
        n     B A W:                   o
   B ← kbb1k , kbb2k , · · · , kbbKk
        n   1 2    2 2            K  2 o

   A ← kba1k , kba2k , · · · , kbaKk
         n  1 2    2 2            K  2 o

   W ← kbw1k , kbw2k , · · · , kbwKk
            1 2    2 2        K 2
7: return B, W, A, S

RK×N , label matrix H = [h1 , h2 , · · · , hN ] ∈ RC×N and training sample ma-
trix X. The objective function is defined as follows:
            < B, W, A, S > = arg min kX − BSk2F + λ kH − WSk2F
                                     B,W,A,S

                                 + ω kQ − ASk2F
                                 s.t. ksi k0 < T (i = 1, 2 · · · , N )
                                                                                2          (3)
                                                √X               √  B
                                 = argmin  √ωQ  −  √ ωA  S
                                    B,W,A,S
                                                  λH               λW               F
                                 s.t. ksi k0 < T (i = 1, 2 · · · , N )
where T is the sparsity constraint factor, making sure that si has no more
than T nonzero entries. The dictionary B = [b1 , b2 , · · · , bK ] ∈ RD×K , where
K > D is the number of atoms in the dictionary, and S = [s1 , s2 , · · · , sN ] ∈
RK×D is the sparse codes of training sample matrix X. W = [w1 , w2 , · · · , wK ]
∈ RC×K is a classifier learned from the given label matrix H. We hope
the W can return the most probable class this sample belongs to. A =
[a1 , a2 , · · · , aK ] ∈ RK×K is a linear transformation relys on Q. λ and ω are
the regularization parameters balancing the discriminative sparse codes er-
ror and the classification contribution to the overall objective, respectively.
The algorithm is shown in Algorithm 2. Here, we denote m (m = 0, 1, 2, · · ·)
as the iteration number and (•)m means the value of matrix (•) after mth
iteration.

                                                6

While the LC-KSVD algorithm exploits the `0 -norm regularization term
to control the sparseness, it is difficult to find the optimal sparse solution to a
general image recognition. The reason is that LC-KSVD use OMP method to
optimise the objective function which usually obtain the suboptimal sparse
solution unless finding the perfect initialized values.

3. Methodology
    In this section, we first give our proposed label embedded dictionary
learning algorithm. Then we elaborate the optimization of the objective
function.

3.1. Proposed Label Embedded Dictionary Learning (LEDL)
     Motivated by that the optimal sparse solution can not be found easily with
`0 -norm regularization term, we propose a novel dictionary learning method
named label embedded dictionary learning (LEDL) for image classification.
This method introduces the `1 -norm regularization term to replace the `0 -
norm regularization of LC-KSVD. Thus, we can freely select the basis vectors
for linear fitting to get optimal sparse solution. The objection function is as
follows:

        < B, W, A, S >= arg min kX − BSk2F + λ kH − WSk2F
                            B,W,A,S

                            + ω kQ − ASk2F + 2εkSk`1                           (4)
         s.t. kB•k k22 ≤ 1, kW•k k22 ≤ 1, kA•k k22 ≤ 1 (k = 1, 2, · · · K)

where, (•)•k denotes the kth column vector of matrix (•). The `1 -norm reg-
ularization term is utilized to enforce sparsity and ε is the regularization
parameter which has the same function as α in Equation (1).

3.2. Optimization of Objective Function
    Consider the optimization problem (4) is not jointly convex in both S,
B, W and A, it is separately convex in either S (with B, W, A fixed),
B (with S, W, A fixed), W (with S, B, A fixed) or A (with S, B, W
fixed). To this end, the optimization problem can be recognised as four
optimization subproblems which are finding sparse codes (S) and learning
bases (B, W, A), respectively. Here, we employ the alternating direction

                                        7

Figure 2: The complete process of LEDL algorithm

method of multipliers (ADMM) Boyd et al. (2011) framework to solve the first
subproblem and the blockwise coordinate descent (BCD) Liu et al. (2014b)
algorithm for the rest subproblem. The complete process of LEDL is shown
in Figure 2.

3.2.1. ADMM for finding sparse codes
    While fixing B, W and A, we introduce an auxiliary variable Z and refor-
mulate the LEDL algorithm into a linear equality-constrained problem with
respect to each iteration has the closed-form solution. The objective function
is as follows:

          < B, W, A, C, Z >= arg min kX − BCk2F + 2εkZk`1
                                 B,W,A,C,Z

                                 + λ kH − WCk2F + ω kQ − ACk2F             (5)
                   s.t. C = Z,   kB•k k22   ≤ 1,   kW•k k22   ≤ 1,
                                 kA•k k22   ≤ 1(k = 1, 2 · · · K)
   While utilising the ADMM framework with fixed B, W and A, the la-
grangian function of the problem (5) is rewritten as:

  < C, Z, L >= arg min kX − BCk2F + λ kH − WCk2F + ω kQ − ACk2F
                  C,Z,L                                                    (6)
                                 T
                + 2εkZk`1 + 2L (C − Z) + ρ kC −           Zk2F

where L = [l1 , l2 , · · · , lN ] ∈ RK×N is the augmented lagrangian multiplier
and ρ > 0 is the penalty parameter. After fixing B, W and A, we initialize
the C0 , Z0 and L0 to be zero matrices. Equation (6) can be solved as follows:

                                       8

(1) Updating C while fixing Z, L, B, W and A:
                  Cm+1 =< Bm , Wm , Am , Cm , Zm , Lm >                       (7)
The closed form solution of C is
                                                     −1
           Cm+1 = Bm T Bm + λWm T Wm + ωAm T Am + ρI
                                                                              (8)
                × Bm T X + λWm T H + ωAm T Q + ρZm − Lm
                                                        

(2) Updating Z while fixing C, L, B, W and A

                 Zm+1 =< Bm , Wm , Am , Cm+1 , Zm , Lm >                      (9)
The closed form solution of Z is
                                                
                                        Lm ε
                    Zm+1   = max Cm+1 +    − I, 0
                                         ρ  ρ
                                                                           (10)
                                        Lm ε
                           + min Cm+1 +    + I, 0
                                         ρ  ρ
where I is the identity matrix and 0 is the zero matrix.
(3) Updating the Lagrangian multiplier L

                      Lm+1 = Lm + ρ (Cm+1 − Zm+1 )                           (11)
where the ρ in Equation (11) is the gradient of gradient descent (GD) method,
which has no relationship with the ρ in Equation (6). In order to make better
use of ADMM framework, the ρ in Equation (11) can be rewritten as θ.
                      Lm+1 = Lm + θ (Cm+1 − Zm+1 )                           (12)
3.2.2. BCD for learning bases
    Without consisdering the sparseness regulariation term in Equation (5),
the constrained minimization problem of (4) with respect to the single col-
umn has the closed-form solution which can be solved by BCD method. The
objective function can be rewritten as follows:

  < B, W, A > = arg min kX − BCk2F + λ kH − WCk2F + ω kQ − ACk2F
                   B,W,A

                + 2εkZk`1 + 2LT (C − Z) + ρ kC − Zk2F
           s.t. kB•k k22 ≤ 1, kW•k k22 ≤ 1, kA•k k22 ≤ 1(k = 1, 2 · · · K)
                                                                             (13)

                                       9

We initialize B0 , W0 and A0 to be random matrices and normalize them,
respectively. After that we use BCD method to update B, W and A.

(1) Updating B while fixing C, L, Z, W and A

                  Bm+1 =< Bm , Wm , Am , Cm+1 , Zm+1 , Lm+1 >                (14)
The closed-form solution of single column of B is
                                 T              T
                    X (Ck• )m+1 − B̃k Cm+1 (Ck• )m+1
                                            m
      (B•k ) m+1 =              T               T                      (15)
                    X (Ck• )m+1 − B̃k Cm+1 (Ck• )m+1
                                               m                     2
              
                  B•p , p 6= k
where B̃k =                    , (•)k• denotes the kth row vector of matrix (•).
                  0, p = k
(2) Updating W while fixing C, L, Z, B and A

              Wm+1 =< Bm+1 , Wm , Am , Cm+1 , Zm+1 , Lm+1 >                  (16)
The closed-form solution of single column of W is
                                 T              T
                    H (Ck• )m+1 − W̃k Cm+1 (Ck• )m+1
     (W•k ) m+1 =               T       m       T                      (17)
                    H (Ck• )m+1 − W̃k Cm+1 (Ck• )m+1
                                                m                        2
              
                  W•p , p 6= k
where W̃k =                    .
                  0, p = k
(3) Updating A while fixing C, L, Z, B and W

              Am+1 =< Bm+1 , Wm+1 , Am , Cm+1 , Zm+1 , Lm+1 >                (18)
The closed-form solution of single column of A is
                                 T  k                   T
                    Q (Ck• )m+1 − Ã          Cm+1 (Ck• )m+1
                                            m
      (A•k ) m+1 =              T                      T               (19)
                    Q (Ck• )m+1 − Ã     k   Cm+1 (Ck• )m+1
                                               m                     2
              
                  A•p , p 6= k
where Ãk =                    .
                  0, p = k

                                        10

Figure 3: Convergence curve of LEDL Algorithm on four datasets.

3.2.3. Convergence Analysis
    Assume that the result of the objective function after mth iteration is de-
fined as f (Cm , Zm , Lm , Bm , Wm , Am ). Since the minimum point is obtained
by ADMM and BCD methods, each method will monotonically decrease
the corresponding objective function after about 100 iterations. Consider-
ing that the objective function is obviously bounded below and satisfies the
Equation (20), it converges. Figure 3 shows the convergence curve of the
proposed LEDL algorithm by using four well-known datasets. The results
demonstrate that our proposed LEDL algorithm has fast convergence and
low complexity.

                 f (Cm , Zm , Lm , Bm , Wm , Am )
                ≥f (Cm+1 , Zm+1 , Lm+1 , Bm , Wm , Am )                    (20)
                ≥f (Cm+1 , Zm+1 , Lm+1 , Bm+1 , Wm+1 , Am+1 )

                                       11

3.2.4. Overall Algorithm
    The overall updating procedures of proposed LEDL algorithm is summa-
rized in Algorithm 3. Here, maxiter is the maximum number of iterations,
1 ∈ RK×K is a squre matrix with all elements 1 and indicates element dot
product. By iterating C, Z, L, B, W and A alternately, the sparse codes
are obtained, and the corresponding bases are learned.

Algorithm 3 Label Embedded Dictionary Learning
Input: X ∈ RD×N , H ∈ RC×N , Q ∈ RK×N , λ, ω, ε, ρ, θ, K
Output: B ∈ RD×K , W ∈ RC×K , A ∈ RK×K , C ∈ RK×N
1: C0 ← zeros (K, N ), Z0 ← zeros (K, N ), L0 ← zeros (K, N )
2: B0 ← rand (D, K), W0 ← rand (C, K), A0 ← rand (K, K)
3: B•k = kBB•kk , W•k = kW
              •k 2
                           W•k
                               k
                                 , A•k = kA
                               •k 2
                                            A•k
                                               k
                                                 , (k = 1, 2 · · · K)
                                                     •k 2
4: m = 0
5: while m ≤ max iter do
6:   m←m+1
7:   Update C:
                                                 −1
8:   Cm+1 = Bm T Bm + λWm T Wm + ωAm T Am + ρI
9:         × Bm T X + λWm T H + ωAm T Q + ρZm − Lm
                                                   
10:   Update Z: n                    o
11:   Zm+1 = max Cm+1 + Lρm − ρε I, 0
                   n                             o
                               Lm       ε
12:           + min Cm+1 +      ρ
                                    +   ρ
                                          I, 0
13:      Update L:
14:      Lm+1 = Lm + θ (Cm+1 − Zm+1 )
15:      Update B, W, A:
16:      Compute Dm+1 = Cm+1 Cm+1 T
                                           
                                             (1 − I)
17:      for k = 1;k ≤ K;k++ do
                                       T
                          X[(Ck• )m+1 ] −Bm (D•k )m+1
18:         (B•k ) m+1 =               T
                          X[(Ck• )m+1 ] −Bm (D•k )m+1
                                                      2
                                         T
                           H[(Ck• )m+1 ] −Wm (D•k )m+1
19:         (W•k ) m+1 =                T
                           H[(Ck• )m+1 ] −Wm (D•k )m+1
                                                        2
                                        T
                          Q[(Ck• )m+1 ] −Am (D•k )m+1
20:         (A•k ) m+1 =               T
                          Q[(Ck• )m+1 ] −Am (D•k )m+1
                                                      2
21:      end for
22:      Update the objective function:
23:      f = kX − BCk2F + λ kY − WCk2F + ω kQ − ACk2F + 2εkZk`1 + LT (C − Z) + ρ kC − Zk2F
24:   end while
25:   return B, W, A, C

    In testing stage, the constraint terms are based on `1 -norm sparse con-
straint. Here, we exploit the learned dictionary D to fit the testing sample y
to obtain the sparse codes s. Then, we use the trained classfier W to predict
the label of y by calculating max {Ws}.

                                                       12

Table 1: Classification rates (%) on different datasets

         Datasets\Methods   SRC     CRC       CSDL-SRC    LC-KSVD        LEDL
          Extended YaleB    79.1    79.2         80.2       73.5          81.3
             CMU PIE        73.7    73.3         77.4       67.1          77.7
            UC-Merced       80.4    80.7         80.5       79.4          80.7
                AID         71.6    72.6         71.6       70.2          72.9
             Caltech101     89.4    89.4         89.4       88.3          90.1
               USPS         78.4    77.9         78.8       71.1          81.1

4. Experimental results
    In this section, we utilize several datasets (Extended YaleB Georghiades
et al. (2001), CMU PIE Sim et al. (2002), UC Merced Land Use Yang and
Newsam (2010), AID Xia et al. (2017), Caltech101 Fei-Fei et al. (2007) and
USPS Hull (1994)) to evaluate the performance of our algorithm and compare
it with other state-of-the-art methods such as SRC Wright et al. (2009), LC-
KSVD Jiang et al. (2013), CRC Zhang et al. (2011) and CSDL-SRC Liu
et al. (2016). In the following subsection, we first give the experimental
settings. Then experiments on these six datasets are analyzed. Moreover,
some discussions are listed finally.

4.1. Experimental settings
    For all the datasets, in order to eliminate the randomness, we carry out
every experiment 8 times and the mean of the classification rates is reported.
And we randomly select 5 samples per class for training in all the experi-
ments. For Extended YaleB dataset and CMU PIE dataset, each image is
cropped to 32 × 32, pulled into column vector, and `2 normalized to form the
raw `2 normalized features. For UC Merced Land Use dataset, AID dataset,
we use resnet model He et al. (2016) to extract the features. Specifically,
the layer pool5 is utilized to extract 2048-dimensional vectors for them. For
Caltech101 dataset, we use the layer pool5 of resnet model and spatial pyra-
mid matching (SPM) with two layers (the second layer include five part,
such as left upper, right upper, left lower, right lower, center) to extract
12288-dimensional vectors. And finally, each of the images in USPS dataset
is resized into 16 × 16 vectors.
    For convenience, the dictionary size (K) is fixed to the twice the number
of training samples. In addition, we set ρ = 1 and initial θ = 0.5, then

                                         13

decrease the θ in each iteration. Moreover, there are other three parameters
(λ, ω and ε) need to be adjust to achieve the highst classification rates. The
details are showed in the following subsections.

4.2. Extended YaleB Dataset
    The Extended YaleB dataset contains 2,432 face images from 38 individ-
uals, each having 64 frontal images under varying illumination conditions.
Figure 4 shows some images of the dataset.

                Figure 4: Examples of the Extended YaleB dataset

    In addition, we set λ = 2−3 , ω = 2−11 , ε = 2−8 in our experiment. The ex-
perimental results are summarized in Table (1). We can see that our proposed
LEDL algorithm achieves superior performance to other classical classifica-
tion methods by an improvement of at least 1.1%. Compared with `0 -norm
sparsity constraint based dictionary learning algorithm LC-KSVD, our pro-
posed `1 -norm sparsity constraint based dictionary learning algorithm LEDL
algorithm exceeds it 7.8%. The reason of the high improvement between
LC-KSVD and LEDL is that `0 -norm sparsity constraint leads to NP-hard
problem which is not conductive to finding the optimal sparse solution for
the dictionary. In order to further illustrate the performance of our method,
we choose the first 20 classes samples as a subdataset and show the confu-
sion matrices in Figure 5. As can be seen that, our method achieves higher
classification rates in all the chosen 20 classes than LC-KSVD. Especially
in class1, class2, class3, class10, class16, LEDL can achieve at least 10.0%
performance gain than LC-KSVD.

                                      14

Figure 5: Confusion matrices on Extended YaleB dataset

4.3. CMU PIE Dataset
    The CMU PIE dataset consists of 41,368 images of 68 individuals with
43 different illumination conditions. Each human is under 13 different poses
and with 4 different expressions. In Figure 6, we list several samples from
this dataset.

                  Figure 6: Examples of the CMU PIE dataset

   The comparasion results are showed in Table 1, we can see that our
proposed LEDL algorithm outperforms over other well-known methods by an
improvement of at least 0.5%. To be attention, LEDL is capable of exceeding
LC-KSVD 10.6% in this dataset. The optimal parameters are λ = 2−3 ,
ω = 2−11 , ε = 2−8 .

                                      15

4.4. UC Merced Land Use Dataset
     The UC Merced Land Use dataset is widely used for aerial image clas-
sification. It consists of totally 2,100 land-use images of 21 classes. Some
samples are showed in Figure 7.

                  Figure 7: Examples of the UC Merced dataset

   In Table 1, we can see that our proposed LEDL algorithm is only similar
with CRC and still outperforms the other methods. Compared with LC-
KSVD, LEDL achieves the higher accuracy by an improvement of 1.3%.
Here, we set λ = 20 , ω = 2−9 , ε = 2−6 to get the optimal result. The
confusion matrices of the UC Merced Land Use dataset for all classes are
shown in Figure 8. We can see that, in all classes except the tennis, LEDL
almost achieve better results compared with LC-KSVD. In several classes
such as building, freeway, river, and sparse, our method achieves superior
performance to LC-KSVD by an improvement of at least 0.5%.

4.5. AID Dataset
    The AID dataset is a new large-scale aerial image dataset which can be
downloaded from Google Earth imagery. It contains 10,000 images from 30
aerial scene types. In Figure 9, we show several images of this dataset.
    Table 1 illustrates the effectiveness of LEDL for classifying images. We
adjust λ = 2−6 , ω = 2−14 , ε = 2−12 to achieve the highest accuracy by an
improvement of at least 0.3% in the five algorithms. While compared with
LC-KSVD, LEDL achieves an improvement of 2.7%.

                                      16

Figure 8: Confusion matrices on UCMerced dataset

                      Figure 9: Examples of the AID dataset

4.6. Caltech101 Dataset
    The caltech101 dataset includes 9,144 images of 102 classes in total, which
are consisted of cars, faces, flowers and so on. Each category have about 40
to 800 images and most of them have about 50 images. In figure 10, we show
several images of this dataset.
    As can be seen in Table 1, our proposed LEDL algorithm outperforms
all the competing approaches by setting λ = 2−4 , ω = 2−13 , ε = 2−14 and
achieves improvements of 1.8% and 0.7% over LC-KSVD and other methods,

                                       17

Figure 10: Examples of the Caltech101 dataset

respectively. Here, we also choose the first 20 classes to build the confusion
matrices. They are shown in Figure 11.

               Figure 11: Confusion matrices on Caltech101 dataset

4.7. USPS Dataset
    The USPS dataset contains 9,298 handwritten digit images from 0 to 9
which come from the U.S. Postal System. We list several samples from this
dataset in Figure 12.

                                       18

Figure 12: Examples of the USPS dataset

    Table 1 shows the comparasion results of five algorithms and it is easy
to find out that our proposed LEDL algorithm outperforms over other well-
known methods by an improvement of at least 2.3%. And our proposed
method achieves an improvement of 10.0% over LC-KSVD method. The
optimal parameters are λ = 2−4 , ω = 2−8 , ε = 2−5 .

4.8. Discussion
    From the experimental results on six datasets, we can obtain the following
conclusions.
    (1) All the above experimental results illustrate that, our proposed LEDL
algorithm is an effective and general classifier which can achieve superior
performacne to state-of-the-art methods on various datasets, especially on
Extended YaleB dataset, CMU PIE dataset and USPS dataset.
    (2) Our proposed LEDL method introduces the `1 -norm regularization
term to replace the `0 -norm regularization of LC-KSVD. However, compared
with LC-KSVD algorithm, LEDL method is always better than it on the six
datasets. Moreover, on the two face datasets and USPS dataset, our method
can exceed LC-KSVD nearly 10.0%.
    (3) Confusion matrices of LEDL and LC-KSVD on three datasets are
shown in Figure 5 8 and 11. They clearly illustrate the superiority of our
method. Specificially, for Extended YaleB dataset, our method achieve out-
standing performance in five classes (class1, class2, class3, class10, class16).
For UC Merced dataset, LEDL almost achieve better classification rates than
LC-KSVD in all classes except the tennis class. For Caltech101 dataset, our
proposed LEDL method perform much better than LC-KSVD method in

                                       19

some classes such as beaver, binocular, brontosaurus, cannon and ceiling fan.

5. Conclusion
   In this paper, we propose a Label Embedded Dictionary Learning (LEDL)
algorithm. Specifically, we introduce the `1 -norm regularization term to re-
place the `0 -norm regularization term of LC-KSVD which can help to avoid
the NP-hard problem and find optimal solution easily. Furthermore, we pro-
pose to adopt ADMM algorithm to solve `1 -norm optimization problem and
BCD algorithm to update the dictionary. Besides, extensive experiments
on six well-known benchmark datasets have proved the superiority of our
proposed LEDL algorithm.

6. Acknowledgment
    This research was funded by the National Natural Science Foundation of
China (Grant No. 61402535, No. 61671480), the Natural Science Founda-
tion for Youths of Shandong Province, China (Grant No. ZR2014FQ001),
the Natural Science Foundation of Shandong Province, China(Grant No.
ZR2018MF017), Qingdao Science and Technology Project (No. 17-1-1-8-jch),
the Fundamental Research Funds for the Central Universities, China Univer-
sity of Petroleum (East China) (Grant No. 16CX02060A, 17CX02027A),
and the Innovation Project for Graduate Students of China University of
Petroleum(East China) (No. YCX2018063).

References
Aharon, M., Elad, M., Bruckstein, A., et al., 2006. K-svd: An algorithm
 for designing overcomplete dictionaries for sparse representation. IEEE
 Transactions on signal processing 54 (11), 4311–4322. 6

Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., 2011. Distributed
  optimization and statistical learning via the alternating direction method
  of multipliers. Foundations and Trends R in Machine learning 3 (1), 1–122.
  4, 8

Chan, T.-H., Jia, K., Gao, S., Lu, J., Zeng, Z., Ma, Y., 2015. Pcanet: A
 simple deep learning baseline for image classification? IEEE Transactions
 on image processing 24 (12), 5017–5032. 2

                                     20

Chang, H., Yang, M., Yang, J., 2016. Learning a structure adaptive dic-
 tionary for sparse representation based classification. Neurocomputing
 190 (19), 124–131. 3

Chang, S. G., Yu, B., Vetterli, M., 2000. Adaptive wavelet thresholding for
 image denoising and compression. IEEE Transactions on image processing
 9 (9), 1532–1546. 3

Fei-Fei, L., Fergus, R., Perona, P., 2007. Learning generative visual models
  from few training examples: An incremental bayesian approach tested on
  101 object categories. Computer vision and Image understanding 106 (1),
  59–70. 13

Gao, D., Hu, Z., Ye, R., 2018. Self-dictionary regression for hyperspectral
 image super-resolution. Remote Sensing 10 (10), 1574–1596. 3

Georghiades, A. S., Belhumeur, P. N., Kriegman, D. J., 2001. From few to
 many: Illumination cone models for face recognition under variable lighting
 and pose. IEEE Transactions on pattern analysis and machine intelligence
 23 (6), 643–660. 13

Hao, S., Wang, W., Yan, Y., Bruzzone, L., 2017. Class-wise dictionary learn-
  ing for hyperspectral image classification. Neurocomputing 220 (12), 121–
  129. 2

He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image
  recognition. In: Computer Vision and Pattern Recognition (CVPR), 2016
  IEEE conference on. IEEE, pp. 770–778. 13

Hull, J. J., 1994. A database for handwritten text recognition research. IEEE
 Transactions on pattern analysis and machine intelligence 16 (5), 550–554.
 13

Ji, R., Gao, Y., Hong, R., Liu, Q., Tao, D., Li, X., 2014. Spectral-spatial
   constraint hyperspectral image classification. IEEE Transactions on geo-
   science and remote sensing 52 (3), 1811–1824. 2

Jiang, Z., Lin, Z., Davis, L. S., 2013. Label consistent k-svd: Learning a
  discriminative dictionary for recognition. IEEE Transactions on pattern
  analysis and machine intelligence 35 (11), 2651–2664. 2, 3, 5, 13

                                     21

Li, H., He, X., Tao, D., Tang, Y., Wang, R., 2018. Joint medical image fusion,
  denoising and enhancement via discriminative low-rank sparse dictionaries
  learning. Pattern Recognition 79, 130–146. 3

Li, S., Fang, L., Yin, H., 2012. An efficient dictionary learning algorithm
  and its application to 3-d medical image denoising. IEEE Transactions on
  biomedical engineering 59 (2), 417–427. 3

Liu, B.-D., Gui, L., Wang, Y., Wang, Y.-X., Shen, B., Li, X., Wang, Y.-
  J., 2017. Class specific centralized dictionary learning for face recognition.
  Multimedia Tools and Applications 76 (3), 4159–4177. 2, 3

Liu, B.-D., Shen, B., Gui, L., Wang, Y.-X., Li, X., Yan, F., Wang, Y.-J.,
  2016. Face recognition using class specific dictionary learning for sparse
  representation and collaborative representation. Neurocomputing 204 (5),
  198–210. 3, 13

Liu, B.-D., Shen, B., Wang, Y.-X., 2014a. Class specific dictionary learn-
  ing for face recognition. In: Security, Pattern Analysis, and Cybernetics
  (SPAC), 2014 IEEE International conference on. IEEE, pp. 229–234. 2

Liu, B.-D., Wang, Y.-X., Shen, B., Zhang, Y.-J., Wang, Y.-J., 2014b. Block-
  wise coordinate descent schemes for sparse representation. In: Acoustics,
  Speech and Signal Processing (ICASSP), 2014 IEEE International confer-
  ence on. IEEE, pp. 5267–5271. 4, 8

Mallat, S. G., Zhang, Z., 1993. Matching pursuit with time-frequency dictio-
 naries. IEEE Transactions on signal processing 41 (12), 3397–3415. 3

Nakazawa, T., Kulkarni, D. V., 2018. Wafer map defect pattern classification
  and image retrieval using convolutional neural network. IEEE Transactions
  on semiconductor manufacturing 31 (2), 309–314. 2

Natarajan, B. K., 1995. Sparse approximate solutions to linear systems.
  SIAM journal on computing 24 (2), 227–234. 3

Olshausen, B. A., Field, D. J., 1996. Emergence of simple-cell receptive field
  properties by learning a sparse code for natural images. Nature 381 (6583),
  607–609. 3

                                      22

Olshausen, B. A., Field, D. J., 1997. Sparse coding with an overcomplete
  basis set: a strategy employed by v1? Vision Research 37 (23), 3311–3325.
  3

Shi, H., Zhang, Y., Zhang, Z., Ma, N., Zhao, X., Gao, Y., Sun, J., 2018.
  Hypergraph-induced convolutional networks for visual classification. IEEE
  Transactions on neural networks and learning systems. 2

Sim, T., Baker, S., Bsat, M., 2002. The cmu pose, illumination, and expres-
  sion (pie) database. In: Automatic Face and Gesture Recognition (FG),
  2002 IEEE International conference on. IEEE, pp. 53–58. 13

Song, Y., Liu, Y., Gao, Q., Gao, X., Nie, F., Cui, R., 2018. Euler label consis-
  tent k-svd for image classification and action recognition. Neurocomputing
  310 (8), 277–286. 2

Tropp, J. A., Gilbert, A. C., 2007. Signal recovery from random measure-
  ments via orthogonal matching pursuit. IEEE Transactions on information
  theory 53 (12), 4655–4666. 3

Wang, H., Yuan, C., Hu, W., Sun, C., 2012a. Supervised class-specific dic-
 tionary learning for sparse modeling in action recognition. Pattern Recog-
 nition 45 (11), 3902–3911. 3

Wang, N., Zhao, X., Jiang, Y., Gao, Y., BNRist, K., 2018. Iterative metric
 learning for imbalance data classification. In: International Joint Confer-
 ence on Artificial Intelligence (IJCAI), 2018 Morgan Kaufmann conference
 on. Morgan Kaufmann, pp. 2805–2811. 2

Wang, S., Zhang, L., Liang, Y., Pan, Q., 2012b. Semi-coupled dictionary
 learning with applications to image super-resolution and photo-sketch syn-
 thesis. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE
 conference on. IEEE, pp. 2216–2223. 3

Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., Ma, Y., 2009. Robust
 face recognition via sparse representation. IEEE Transactions on pattern
 analysis and machine intelligence 31 (2), 210–227. 2, 5, 13

Xia, G.-S., Hu, J., Hu, F., Shi, B., Bai, X., Zhong, Y., Zhang, L., Lu, X.,
  2017. Aid: A benchmark data set for performance evaluation of aerial scene

                                      23

classification. IEEE Transactions on geoscience and remote sensing 55 (7),
  3965–3981. 13

Xu, J., An, W., Zhang, L., Zhang, D., 2019. Sparse, collaborative, or nonneg-
 ative representation: Which helps pattern classification? Pattern Recog-
 nition 88, 679–688. 2

Yang, J., Wright, J., Huang, T. S., Ma, Y., 2010. Image super-resolution
  via sparse representation. IEEE Transactions on image processing 19 (11),
  2861–2873. 3

Yang, J., Yu, K., Gong, Y., Huang, T., 2009. Linear spatial pyramid match-
  ing using sparse coding for image classification. In: Computer Vision and
  Pattern Recognition (CVPR), 2009 IEEE conference on. IEEE, pp. 1794–
  1801. 2

Yang, M., Chang, H., Luo, W., 2017. Discriminative analysis-synthesis dic-
  tionary learning for image classification. Neurocomputing 219 (5), 404–411.
  2

Yang, M., Zhang, L., Feng, X., Zhang, D., 2014. Sparse representation based
  fisher discrimination dictionary learning for image classification. Interna-
  tional Journal of Computer Vision 109 (3), 209–232. 3

Yang, Y., Newsam, S., 2010. Bag-of-visual-words and spatial extensions for
  land-use classification. In: Advances in Geographic Information Systems
  (GIS), 2010 ACM International conference on. ACM, pp. 270–279. 13

Yu, J., Feng, L., Seah, H. S., Li, C., Lin, Z., 2012a. Image classification by
  multimodal subspace learning. Pattern Recognition Letters 33 (9), 1196–
  1204. 2

Yu, J., Rui, Y., Tang, Y. Y., Tao, D., 2014. High-order distance-based mul-
  tiview stochastic learning in image classification. IEEE Transactions on
  cybernetics 44 (12), 2431–2442. 2

Yu, J., Tao, D., Rui, Y., Cheng, J., 2013. Pairwise constraints based mul-
  tiview features fusion for scene classification. Pattern Recognition 46 (2),
  483–496. 2

                                     24

Yu, J., Tao, D., Wang, M., 2012b. Adaptive hypergraph learning and its
  application in image classification. IEEE Transactions on image processing
  21 (7), 3262–3272. 2

Yuan, L., Liu, W., Li, Y., 2016. Non-negative dictionary based sparse repre-
  sentation classification for ear recognition with occlusion. Neurocomputing
  171 (1), 540–550. 2

Zhang, L., Yang, M., Feng, X., 2011. Sparse representation or collabora-
  tive representation: Which helps face recognition? In: Computer Vision
  (ICCV), 2011 IEEE International conference on. IEEE, pp. 471–478. 13

Zhang, Q., Li, B., 2010. Discriminative k-svd for dictionary learning in face
  recognition. In: Computer Vision and Pattern Recognition (CVPR), 2010
  IEEE conference on. IEEE, pp. 2691–2698. 3

                                     25

You can also read