Simplified Parallel Architecture for LTE-A Turbo Decoder Implemented on FPGA
CRISTIAN ANGHEL, CONSTANTIN PALEOLOGU
Telecommunications Department
University Politehnica of Bucharest
Iuliu Maniu, 1-3, Bucharest
ROMANIA
{canghel, pale}@comm.pub.ro

Abstract: - This paper describes a turbo decoder for the 3GPP Long Term Evolution Advanced (LTE-A) standard, using the Max Log MAP algorithm, implemented on a Field Programmable Gate Array (FPGA). Taking advantage of the quadratic permutation polynomial (QPP) interleaver properties and of some FPGA block memory characteristics, a simplified parallel decoding architecture is proposed. It is intended especially for large data blocks, where serial decoding introduces high latency. The parallelization factor N is usually a power of 2, the maximum considered value being 8. The obtained parallel decoding latency is N times lower than the serial decoding latency. At the cost of a small additional latency, the parallel decoding performance remains similar to that of serial decoding. The novelty of the proposed parallel architecture is that only one interleaver is used, independently of the value of N.

Key-Words: - LTE-A, turbo decoder, Max LOG MAP, parallel architecture, FPGA

1 Introduction
The discussions around channel coding theory have been intense in the last decades, but even more interest in this topic was added once turbo codes were discovered by Berrou, Glavieux, and Thitimajshima [1][2][3].
At the beginning of their existence, after their decoding performance had been proven, turbo codes were introduced in different standards as recommendations, while convolutional codes remained mandatory. The reason behind this decision was mainly the high complexity of a turbo decoder implementation. But turbo codes became more attractive once the supports for digital processing, such as the Digital Signal Processor (DSP) or the Field Programmable Gate Array (FPGA), grew in processing capacity. Nowadays chips include dedicated hardware accelerators for different types of turbo decoders, but this approach makes them standard dependent.
The Third-Generation Partnership Project (3GPP) [4] is an organization which adopted these advanced coding techniques early. Turbo codes were standardized from the first version of the Universal Mobile Telecommunications System (UMTS) technology, in 1999. The next UMTS releases (after High Speed Packet Access was introduced) added support for new and interesting features, while turbo coding remained unchanged. Some modifications were introduced by the Long Term Evolution (LTE) standard [5][6], not significant in volume, but important in concept. While keeping exactly the same coding structure as in UMTS, 3GPP proposed a new interleaver scheme for LTE.
A UMTS dedicated turbo decoding scheme is presented in [7]. Due to the new LTE/LTE-A interleaver, the decoding performances are improved compared with the ones corresponding to the UMTS standard. Moreover, the new LTE interleaver provides support for the parallelization of the decoding process inside the algorithm, taking advantage of the main principle introduced by turbo decoding, i.e., the usage of extrinsic values from one turbo iteration to another. Parallel decoding is one software adaptation requested by the high data rates, while additional hardware changes are also proposed [8].
Many parallel decoding architectures have been proposed in the literature in recent years. The obtained results are evaluated on two axes: the first one is the decoding performance degradation introduced by the parallel method compared with the serial decoding scheme, and the second one is the amount of resources needed for such a parallel implementation.
A first set of parallel architectures is described in [9]. Starting from the classical method of implementing the Maximum A Posteriori (MAP) algorithm, i.e., going through the trellis once to compute the Forward State Metrics (FSM) and then a second time to compute the Backward State Metrics (BSM) and the Log Likelihood Ratios (LLR), several solutions are introduced to reduce the decoding latency of 2K clock periods per semi-iteration, where K is the data block length. The first one halves the decoding time (to only K) by starting the BSM and FSM computations simultaneously. After half of these values have been computed, 2 LLR blocks start working in parallel, the interleaver block being doubled as well. Another proposed scheme eliminates the need for the second interleaver but increases the decoding time by K/2 compared with the previous one, for a total decoding latency of 3K/2 clock periods.
A second set of parallel architectures takes advantage of the algebraic-geometric properties of the Quadratic Permutation Polynomial (QPP) interleaver, as described in [10][11]. Efficient hardware implementations of the QPP interleaver are proposed there, but the parallelization factor N also represents the number of interleavers used in the proposed architectures.
A third approach consists in using a folded memory to store simultaneously all the values needed for parallel processing [12]. For this kind of implementation, the main challenge is to correctly distribute the data to each decoding unit once a memory location containing all N values has been read. More precisely, the N decoding units working in parallel write their data in a concatenated order to the same location, but when the interleaved reading takes place, these values do not go in the same order to the same decoding unit; instead, they must be redistributed. To solve this, an architecture based on 2 Batcher sorting networks is proposed. But in this approach too, N interleavers are needed to generate all the interleaved addresses that feed the master network.
In this paper, we also introduce a folded memory based approach, but the main difference compared with the existing solutions described above is that our proposal uses only one interleaver. Additionally, with some multiplexing and demultiplexing blocks, the parallel architecture remains close to the serial one, only the Soft Input Soft Output (SISO) decoding unit being instantiated N times. The number and dimensions of the block memories are unchanged between the two block schemes. In terms of decoding performances, at the cost of a small overhead, the performances of the serial and parallel decoding architectures are kept similar.
The paper is organized as follows. Section 2 describes the LTE coding scheme with the newly introduced QPP interleaver. Section 3 presents the decoding algorithm. Section 4 discusses the implementation solutions and the proposed decoding schemes, for both serial and parallel decoding. Section 5 presents throughput and speed results obtained when targeting a XC5VFX70T [13] chip on the Xilinx ML507 [14] board; it also provides simulation curves comparing the results obtained when using serial decoding, parallel decoding, and parallel decoding with overlap. Section 6 contains the conclusions of this work.
2 LTE Coding Scheme
The coding scheme presented in the 3GPP LTE specification is a classic turbo coding scheme, including two constituent encoders and one interleaver module. It is described in Fig. 1. At the input of the LTE turbo encoder one can observe the data block Ck. The K bits corresponding to this block are sent as systematic bits at the output, in the stream Xk. At the same time, the data block is processed by the first constituent encoder, resulting in the parity bits Zk, while the interleaved data block C'k is processed by the second constituent encoder, resulting in the parity bits Z'k. Combining the systematic bits and the two streams of parity bits, the following sequence is obtained at the output of the encoder: X1, Z1, Z'1, X2, Z2, Z'2, ..., XK, ZK, Z'K.
At the end of the coding process, in order to drive the constituent encoders back to the initial state, the switches from Fig. 1 are moved from position A to B. Since the final states of the two constituent encoders are different, depending on the input data block, this switching procedure generates tail bits for each encoder. These tail bits have to be transmitted together with the systematic and parity bits, resulting in the following final sequence: XK+1, ZK+1, XK+2, ZK+2, XK+3, ZK+3, X'K+1, Z'K+1, X'K+2, Z'K+2, X'K+3, Z'K+3.

Fig. 1. LTE CTC encoder.

As mentioned before, the novelty introduced by the LTE standard in terms of turbo coding is the interleaver module. The output bits are reorganized using

    C'i = Cπ(i),  i = 1, 2, ..., K,    (1)

where the interleaving function π applied over the output index i is defined as

    π(i) = (f1·i + f2·i²) mod K.    (2)

The input block length K and the parameters f1 and f2 are provided in Table 5.1.3-3 in [5].
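As a concrete illustration, the sketch below applies (1)-(2) in Python (the actual implementation is in VHDL, see Section 5). Indices are 0-based here, and the pair f1 = 3, f2 = 10 is the Table 5.1.3-3 entry for K = 40; the function names are ours, chosen for the example.

```python
# Minimal sketch of the QPP interleaver of (1)-(2).
def qpp_index(i, K, f1, f2):
    """Interleaved index: pi(i) = (f1*i + f2*i^2) mod K."""
    return (f1 * i + f2 * i * i) % K

def qpp_interleave(block, f1, f2):
    """Reorder a length-K block: C'_i = C_pi(i)."""
    K = len(block)
    return [block[qpp_index(i, K, f1, f2)] for i in range(K)]

K, f1, f2 = 40, 3, 10               # Table 5.1.3-3 entry for K = 40
data = list(range(K))               # stand-in for the data block Ck
interleaved = qpp_interleave(data, f1, f2)
assert sorted(interleaved) == data  # a QPP is a permutation of 0..K-1
```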
3 Decoding Algorithm
The LTE turbo decoding scheme is depicted in Fig. 2. The two Recursive Systematic Convolutional (RSC) decoders use, in theory, the MAP algorithm. This classic algorithm provides the best decoding performance, but it suffers from very high implementation complexity and it can lead to a large dynamic range for its variables. For these reasons, the MAP algorithm is used as a reference for the targeted decoding performance, while for a real implementation sub-optimal algorithms have been studied: Logarithmic MAP (Log MAP) [15], Maximum Log MAP (Max Log MAP), Constant Log MAP (Const Log MAP) [16], and Linear Log MAP (Lin Log MAP) [17].

Fig. 2. LTE turbo decoder.

For the proposed decoding scheme, the Max Log MAP algorithm is selected. This algorithm reduces the implementation complexity and controls the dynamic range problem at the cost of an acceptable performance degradation compared to the classic MAP algorithm. The Max Log MAP algorithm keeps only the first term of the Jacobi logarithm, i.e.,

    max*(x, y) = ln(eˣ + eʸ) = max(x, y) + ln(1 + e^(−|y−x|)) ≈ max(x, y).    (3)

The LTE turbo decoder trellis diagram contains 8 states, as depicted in Fig. 3. Each diagram state permits 2 inputs and 2 outputs. The branch metric between the states Si and Sj is

    γij = V(Xk)·X(i,j) + Λi(Zk)·Z(i,j),    (4)

where X(i,j) represents the data bit and Z(i,j) the parity bit, both associated with one branch. Also, Λi(Zk) is the LLR of the input parity bit. When the SISO 1 decoder is considered, this input LLR is Λi(Zk), while for SISO 2 it becomes Λi(Z'k); V(Xk) = V1(Xk) represents the sum of Λi(Xk) and W(Xk) for SISO 1, and V(Xk) = V2(X'k) represents the interleaved version of the difference between Λo1(Xk) and W(Xk) for SISO 2. In Fig. 2, W(Xk) is the extrinsic information, while Λo1(Xk) and Λo2(X'k) are the output LLRs generated by the two SISOs.
In the LTE turbo encoder case, there are 4 possible values for the branch metrics between 2 states in the trellis:

    γ0 = 0
    γ1 = V(Xk)
    γ2 = Λi(Zk)
    γ3 = V(Xk) + Λi(Zk).    (5)

The decoding process is based on going forward and backward through the trellis.
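A small numerical sketch of (3) and (5) follows; it contrasts the exact Jacobi logarithm with its Max Log MAP truncation, whose error is bounded by ln 2 (reached at x = y). The names are illustrative, not taken from the implementation.

```python
import math

def max_star_exact(x, y):
    """Jacobi logarithm (3): ln(e^x + e^y) = max(x,y) + ln(1 + e^-|y-x|)."""
    return max(x, y) + math.log1p(math.exp(-abs(y - x)))

def max_star_approx(x, y):
    """Max Log MAP keeps only the first term of (3)."""
    return max(x, y)

def branch_metrics(V_Xk, Li_Zk):
    """The four branch metrics of (5), keyed by the (data, parity)
    bits X(i,j) and Z(i,j) labeling a branch."""
    return {(0, 0): 0.0,             # gamma_0
            (1, 0): V_Xk,            # gamma_1
            (0, 1): Li_Zk,           # gamma_2
            (1, 1): V_Xk + Li_Zk}    # gamma_3

print(max_star_exact(1.0, 1.0) - max_star_approx(1.0, 1.0))  # ln(2) ~ 0.693
```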
Fig. 3. LTE turbo coder trellis.

3.1 Backward recursion
The trellis is covered backward and the computed metrics are stored in a normalized form at each node of the trellis. These stored values are used for the LLR computation during the forward recursion. The backward metric for state Si at the kth stage is βk(Si), where 2 ≤ k ≤ K+3 and 0 ≤ i ≤ 7. The backward recursion is initialized with βK+3(S0) = 0 and βK+3(Si) = 0, ∀i > 0. Starting from stage k = K+2 and continuing through the trellis until stage k = 2, the computed backward metrics are

    β̂k(Si) = max{(βk+1(Sj1) + γij1), (βk+1(Sj2) + γij2)},    (6)

where β̂k(Si) represents the un-normalized metric and Sj1 and Sj2 are the two states from stage k+1 connected to state Si from stage k. After the computation of the β̂k(S0) value, the rest of the backward metrics are normalized as

    βk(Si) = β̂k(Si) − β̂k(S0)    (7)

and then stored in the dedicated memory.

3.2 Forward recursion
During the forward recursion, the trellis is covered in the normal direction, this process being similar to the one specific to the Viterbi algorithm. In order to allow the computation of the current stage (k) metrics, only the forward metrics from the previous stage (k − 1) have to be stored. The forward metric for state Si at stage k is αk(Si), with 0 ≤ k ≤ K − 1 and 0 ≤ i ≤ 7. The forward recursion is initialized with α0(S0) = 0 and α0(Si) = 0, ∀i > 0. Starting from stage k = 1 and continuing through the trellis until the last stage k = K, the un-normalized forward metrics are given by

    α̂k(Sj) = max{(αk−1(Si1) + γi1j), (αk−1(Si2) + γi2j)},    (8)

where Si1 and Si2 are the two states from stage k − 1 connected to state Sj from stage k. After the computation of the α̂k(S0) value, the rest of the forward metrics are normalized as

    αk(Si) = α̂k(Si) − α̂k(S0).    (9)

Because the forward metrics α are computed for stage k, the decoding algorithm can obtain at the same time an LLR estimate for the data bit Xk. This LLR is found by first considering that the likelihood of the connection between state Si at stage k − 1 and state Sj at stage k is

    λk(i, j) = αk−1(Si) + γij + βk(Sj).    (10)

The likelihood of having a bit equal to 1 (or 0) is given by the Jacobi logarithm of all the branch likelihoods corresponding to 1 (or 0), and thus:

    Λo(Xk) = max{(Si→Sj): Xi=1} λk(i, j) − max{(Si→Sj): Xi=0} λk(i, j),    (11)

where the "max" operator is computed recursively over the branches which have at the input a bit of 1, {(Si→Sj): Xi = 1}, or a bit of 0, {(Si→Sj): Xi = 0}.
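The recursions (6)-(11) can be condensed into the behavioural sketch below. It deliberately simplifies the paper's formulation: the 8-state connectivity of Fig. 3 is passed in as data, stage indexing is zero-based, tail-bit handling is reduced to terminating in S0, the fixed-point details of Section 4 are ignored, and the non-initial states are pinned to a large negative constant (a common convention for a trellis with a known start state; the paper's normalized recursion initializes all metrics to zero instead). All names are illustrative.

```python
NEG = -1e9  # stands in for minus infinity in the max-only arithmetic

def max_log_map(gammas, trellis, num_states=8):
    """One Max Log MAP semi-iteration (sketch of (6)-(11)).

    gammas : list of K dicts; gammas[k][(i, j)] is the branch metric
             of (4)-(5) between state i (stage k) and state j (stage k+1).
    trellis: list of (i, j, x) branches with their input bit x;
             for LTE this is the 8-state diagram of Fig. 3.
    Returns the K output LLRs of (11).
    """
    K = len(gammas)

    # Backward recursion (6), normalized per (7) and stored per stage.
    beta = [[NEG] * num_states for _ in range(K + 1)]
    beta[K][0] = 0.0                    # trellis terminated in state S0
    for k in range(K - 1, -1, -1):
        for (i, j, _x) in trellis:
            cand = beta[k + 1][j] + gammas[k][(i, j)]
            beta[k][i] = max(beta[k][i], cand)
        norm = beta[k][0]               # subtract the state-0 metric
        beta[k] = [b - norm for b in beta[k]]

    # Forward recursion (8)-(9); LLRs (10)-(11) computed on the fly.
    alpha = [0.0] + [NEG] * (num_states - 1)   # encoder starts in S0
    llrs = []
    for k in range(K):
        lam1, lam0 = NEG, NEG
        nxt = [NEG] * num_states
        for (i, j, x) in trellis:
            g = gammas[k][(i, j)]
            lam = alpha[i] + g + beta[k + 1][j]    # (10)
            if x == 1:
                lam1 = max(lam1, lam)
            else:
                lam0 = max(lam0, lam)
            nxt[j] = max(nxt[j], alpha[i] + g)     # (8)
        llrs.append(lam1 - lam0)                   # (11)
        norm = nxt[0]
        alpha = [a - norm for a in nxt]            # (9)
    return llrs
```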
4 Proposed Decoding Scheme

4.1 Serial Decoder Block Scheme
From the theoretical decoding scheme depicted in Fig. 2 it can be noticed that the SISO 2 decoder starts working only after the SISO 1 decoder finishes its job and vice-versa, the usage of previously obtained extrinsic values being the main principle of turbo decoding. Also, all the processing is based on complete data blocks, since the interleaver or deinterleaver procedures should be applied in between. It results that the 2 SISOs decode data in non-overlapping time windows, so a single SISO unit can be used in a time-multiplexed manner, as one can observe in Fig. 4, where a serial decoder block scheme based on the previous work presented in [18] for a WiMAX CTC decoder is described.

Fig. 4. Proposed serial turbo decoder block scheme.

The memory blocks are used for storing data from one semi-iteration to another and from one iteration to another. The dotted-line memory blocks are virtual memories added only to ease the understanding of the introduced notations. Also, it should be mentioned that the Interleaver and Deinterleaver blocks are in fact the same, including a block memory called ILM (Interleaver Memory) and an interleaver. The ILM is the new approach introduced by the authors compared with the previous serial implementation presented in [19], and its goal is to prepare the architecture for parallel decoding as well. The memory is written with the interleaved addresses each time a new data block is received. The values are then used as read addresses (when the interleaver process is ongoing) or as write addresses (when the deinterleaver process is ongoing). This ILM, together with the 3 memories from the left side of the picture (for the input data), are switched-buffers, allowing new data to be written while the previous block is still under decoding.
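The dual role of the stored addresses can be shown in a few lines: reading through the ILM addresses interleaves a vector, while writing through the same addresses deinterleaves it, so a single address memory serves both blocks. A behavioural sketch, with illustrative names:

```python
def qpp_index(i, K, f1, f2):                 # as in Section 2
    return (f1 * i + f2 * i * i) % K

def interleave_by_read(mem, ilm):
    """Interleaver: ILM entries are read addresses, out[i] = mem[pi(i)]."""
    return [mem[a] for a in ilm]

def deinterleave_by_write(vals, ilm):
    """Deinterleaver: ILM entries are write addresses, out[pi(i)] = vals[i]."""
    out = [None] * len(vals)
    for i, a in enumerate(ilm):
        out[a] = vals[i]
    return out

K, f1, f2 = 40, 3, 10
ilm = [qpp_index(i, K, f1, f2) for i in range(K)]  # written once per block
data = list(range(K))
assert deinterleave_by_write(interleave_by_read(data, ilm), ilm) == data
```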
The scheme depicted in Fig. 4 works as follows. SISO 1 reads the memory locations corresponding to the V1(Xk) and Λi(Zk) vectors. The reading process is performed forward and backward and it serves the first semi-iteration. At the end of this process, SISO 2 reads forward and backward from the memory blocks corresponding to the V2(X'k) and Λi(Z'k) vectors in order to perform the second semi-iteration.
The vector V1(Xk) is obtained by adding the input vector Λi(Xk) to the extrinsic information vector W(Xk). While reading these 2 memories, SISO 1 starts the decoding process. At the output, the LLRs become available and, performing the subtraction between them and the delayed extrinsic values already read from the W(Xk) memory, the vector V2(Xk) is computed and then stored into its corresponding memory in normal order. The interleaving process is then started (the initially written ILM is now read in normal order, so that the interleaved read addresses for V2(Xk) are obtained) and the re-ordered LLRs V2(X'k) become available, the corresponding values for the 3 tail bits X'K+1, X'K+2, X'K+3 being added at the end of this sequence. The second semi-iteration is now ongoing. The same SISO unit is used, but this time it reads its data inputs from the other memory blocks.
As one can see from Fig. 4, two switching mechanisms are included in the scheme. When in position 1, the memory blocks for V1(Xk) and Λi(Zk) are used, while in position 2 the memory blocks for V2(X'k) and Λi(Z'k) become active.
At the output of the SISO unit, after each semi-iteration, K LLRs are obtained. The ones corresponding to the second semi-iteration are stored in the Λo2(X'k) memory (the ILM output, which was already available for the V2(Xk) interleaving process, is used as the writing address for the Λo2(X'k) memory, after a delay is added).
Reading the Λo2(X'k) memory and also the V2(Xk) memory in normal order provides the inputs for the W(Xk) memory and at the same time allows a new semi-iteration to start for SISO 1. So the W(Xk) memory update is made at the same time as a new semi-iteration starts.
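The whole schedule can be summarized at the control-flow level as below. Here siso() stands for the time-multiplexed Max Log MAP unit of Fig. 6; the tail bits, the switched-buffer timing, and the exact pipeline delays are left out, and the extrinsic update is the standard subtraction implied by Fig. 4. This is a sketch with illustrative names, not the RTL behaviour.

```python
def turbo_iteration(Li_X, Li_Z, Li_Zp, W, ilm, siso):
    """One iteration (two semi-iterations) of the serial scheme."""
    # First semi-iteration (switches in position 1).
    V1 = [x + w for x, w in zip(Li_X, W)]        # V1(Xk) = Li(Xk) + W(Xk)
    L1 = siso(V1, Li_Z)                          # output LLRs Lo1(Xk)
    V2 = [l - w for l, w in zip(L1, W)]          # V2(Xk) = Lo1(Xk) - W(Xk)

    # Interleave by reading through the ILM addresses.
    V2p = [V2[a] for a in ilm]                   # V2(X'k)

    # Second semi-iteration (switches in position 2), same SISO unit.
    L2p = siso(V2p, Li_Zp)                       # LLRs in interleaved order

    # Deinterleave by writing through the same addresses, then update
    # the extrinsic memory for the next iteration.
    L2 = [0.0] * len(L2p)
    for i, a in enumerate(ilm):
        L2[a] = L2p[i]
    W_new = [l - v for l, v in zip(L2, V2)]      # W(Xk) = Lo2(Xk) - V2(Xk)
    hard = [1 if l >= 0 else 0 for l in L2]      # hard decisions
    return W_new, hard
```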
Fig. 5 depicts a time diagram for the serial turbo decoding; the gray colored intervals describe the W(Xk) memory writing. One can observe that the upper 4 memories in the picture are switched-buffers, so they are written while the previous data block is still being processed. In the picture, R stands for Read, W represents Write, (K − 1:0) is the backward trellis run, (0:K − 1) is the forward trellis run, and IL means interleaved read (for the interleaver process) or write (for the deinterleaver process).

Fig. 5. Time diagram for serial turbo decoder.

In order to be able to handle all the data block dimensions, the used memory blocks have 6144 locations (this is the maximum data block length), except the ones storing the input data for the RSCs, which have 6144 + 3 locations, including here also the tail bits. Each memory location is 10 bits wide, the first bit being used for the sign, the next 6 bits representing the integer part, and the last 3 bits indicating the fractional part. This format was decided by studying the dynamic range of the variables (for the integer part) and the variations of the decoding performance (for the fractional part).
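The 10-bit word (1 sign bit, 6 integer bits, 3 fractional bits) can be sketched as a simple quantizer; the saturation at the range limits is our assumption, since the paper does not detail the overflow behaviour.

```python
SCALE = 1 << 3                 # 3 fractional bits: resolution of 0.125
Q_MAX = (1 << 9) - 1           # 511; two's-complement range -512..511

def to_fixed(x):
    """Quantize to the 10-bit sign/6/3 format, saturating at the limits."""
    q = int(round(x * SCALE))
    return max(-Q_MAX - 1, min(Q_MAX, q))

def from_fixed(q):
    return q / SCALE

print(from_fixed(to_fixed(5.37)))    # 5.375, the nearest 0.125 step
print(from_fixed(to_fixed(100.0)))   # 63.875, saturated at the maximum
```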
The constituent modules of the SISO block are presented in Fig. 6. One can notice the un-normalized metric computing blocks ALPHA (forward) and BETA (backward), and the transition metric computing block GAMMA, which in addition includes the normalization function (subtracting the metric of the first state from all the other metrics).
The L block computes the output LLRs, which are normalized by the NORM block. The MUX-MAX block selects the inputs corresponding to the forward or backward recursion and computes the maximum function. The MEM BETA block stores the backward metrics, which are computed before the forward metrics. The metric normalization is required to preserve the dynamic range.

Fig. 6. Proposed SISO block scheme.

Without normalization, the forward and backward metric width would have to be wider in order to avoid saturation, which means more memory blocks, more complex arithmetic (i.e., more used resources), and lower frequency (as an overall consequence). Hence, reducing the logic levels by eliminating the normalizing procedure does not increase the system performance.
The ALPHA, BETA, and GAMMA blocks are implemented in a dedicated way. Each metric corresponding to each state is computed separately, not by using the same function with different input parameters. In principle, 16 equations would be needed for the transition metric computation (2 possible transitions for each of the 8 states of a stage). In fact, only 4 equations are needed [as indicated in (5)]; moreover, one of these 4 equations leads to a zero value, so the computational effort is minimized for this implementation solution.
The interleaver module is used both for interleaving and deinterleaving. The interleaved index is obtained based on a modified form of (2), i.e.,

    π(i) = {[(f1 + f2·i) mod K]·i} mod K.    (12)

In order to obtain both functions, either the input data is stored in the memory in natural order and then read in interleaved order, or the input data is stored in interleaved order and then read in natural order.
The interleaved index computation is performed in three steps. First, the value of (f1 + f2·i) mod K is computed. This partial result is multiplied by the natural order index i, and then a new modulo K function is applied. In the first stage of this process, the observation that the inner term increases by f2 for consecutive values of the index i is used. This way, a register value is increased by f2 at each new index i. If the resulting value is bigger than K, the value of K is subtracted from the register value. This processing is one clock period long, which is the reason why data is generated in a continuous manner.
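A behavioural sketch of this generator follows. It assumes f2 < K, which holds for the parameter table of [5], so one conditional subtraction keeps the register in range; the remaining product and modulo correspond to the second and third steps described above.

```python
def qpp_indices_incremental(K, f1, f2):
    """Generate pi(i) per (12), one index per 'clock', without
    recomputing (f1 + f2*i) mod K from scratch at each step."""
    out = []
    r = f1 % K                     # register holding (f1 + f2*i) mod K
    for i in range(K):
        out.append((r * i) % K)    # pi(i) = (r * i) mod K
        r += f2                    # next inner term ...
        if r >= K:
            r -= K                 # ... reduced by one subtraction (f2 < K)
    return out

K, f1, f2 = 40, 3, 10
assert qpp_indices_incremental(K, f1, f2) == \
       [(f1 * i + f2 * i * i) % K for i in range(K)]
```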
4.2 Parallel Decoder Block Scheme
The proposed parallel architecture is similar to the serial one described in Fig. 4, except that the RSC SISO module is instantiated N times in the scheme. We propose an architecture that concatenates the N values from the N RSCs and always points at the same memory location, for all the memories in the scheme. So instead of having K locations with 10 bits per location, as in the serial architecture, in the parallel one each memory contains K/N locations with 10N bits per location.
The main advantage introduced by the proposed serial architecture is the fact that the interleaver block works only once, before the decoding itself takes place. The ILM memory is written when a new data block is received, while the previous one is still under decoding. This approach allows a simplified way of working for the parallel scheme. Knowing the parallelization factor N, the ILM memory can be prepared for the parallel processing that follows. More precisely, the ILM memory will have K/N locations, N values being written at each location. As mentioned in [20], a Virtex 5 block memory can be configured from (32k locations x 1 bit) to (512 locations x 72 bits). In the worst case scenario, when K=6144, depending on the value of N and keeping the stored values on 10 bits as previously mentioned, the parallel ILM memory can be (768 locations x 80 bits), (1536 locations x 40 bits), (3072 locations x 20 bits), or (6144 locations x 10 bits), so still only 2 BRAMs are used, as in the case of the serial ILM.
Fig. 7 describes the way the ILM works. As one can observe, during the writing procedure, each index i from 0 to K − 1 generates a corresponding interleaved value. These interleaved values are written in normal order in the ILM. The first K/N interleaved values occupy the first position of each memory location. The second K/N values are placed on the second position of each location, and so on. In order to perform this procedure, a true dual port BRAM is used. Each time a new position in location n is written, the content of location n+1 is also read from the memory, so that in the next clock period the next interleaved value can be added to the already existing content of that location.

Fig. 7. ILM memory writing procedure.

When the interleaver function is needed during a semi-iteration, the ILM is read in a normal way, so that the N interleaved values from one location represent the reading addresses for the V2(Xk) memory. The QPP properties guarantee that the N values that should be read in the interleaved way from the memory are placed at the same memory location; only their positions have to be re-arranged before being sent to the corresponding RSCs. To simplify the representation, the case of K=40 and N=8 is exemplified in Fig. 8. On the left, one can see the content of the V2(Xk) memory. Each column represents the outputs of one of the N RSC SISOs. On the right, the content of the ILM memory is described. The minimum value from each line of the ILM (grey colored circle in the figure) represents the line address for the V2(Xk) memory. Then, using a re-ordering module implemented with multiplexers and de-multiplexers, each position from the read line is sent to its corresponding SISO. For example, position b from the first read line (index 5) is sent to SISO f, while position b from the second read line (index 8) is sent to SISO d.

Fig. 8. Virtual parallel interleaver.
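The property invoked above is easy to check numerically: with a window length M = K/N, the N indices π(t), π(t+M), ..., π(t+(N−1)M) are all congruent modulo M, so they fall on one line of the folded memory and only their positions within the line differ. The sketch below verifies this for the K = 40, N = 8 case of Fig. 8 (f1 = 3, f2 = 10 per Table 5.1.3-3).

```python
K, N, f1, f2 = 40, 8, 3, 10        # the Fig. 8 example
M = K // N                         # window and folded-memory depth

def pi(i):
    return (f1 * i + f2 * i * i) % K

for t in range(M):
    idx = [pi(t + j * M) for j in range(N)]   # N reads of one local step
    assert len({a % M for a in idx}) == 1     # one shared line address
    print(t, sorted(idx))                     # same line, shuffled positions
```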
The same procedure applies also for the deinterleaver process, except that the write addresses are extracted from the ILM, while the read ones are in normal order.
From a timing point of view, Fig. 9 depicts the case when N=2 is used. The same comments as the ones for Fig. 5 apply.

Fig. 9. Time diagram for parallel turbo decoder (N=2).

Testing the parallel decoding performance, a certain level of degradation was observed, since the forward and backward metrics are altered at the data block extremities. In order to obtain results similar to the serial decoding case, a small overhead is accepted. If an overlap is added at each parallel block border, the metrics computation will have a training phase. The minimum overlap window may be as long as the minimum standard defined data block, in this case Kmin = 40 bits. Fig. 10 describes this situation for the N=2 case.

Fig. 10. a) non-overlapping split; b) overlapping split.

5 Implementation Results
In order to evaluate the performance, the hardware programming language used is the Very High Speed Integrated Circuit Hardware Description Language (VHDL). For the generation of the RAM/ROM memory blocks, Xilinx Core Generator 11.1 was used. The simulations were performed with ModelSim 6.5. The synthesis process was done using Xilinx XST from Xilinx ISE 11.1. Using these tools, the system frequency obtained when implementing the decoding structure on a Xilinx XC5VFX70T-FFG1136 chip is around 210 MHz.
The native latency for serial decoding is computed as follows: K clock periods are needed for the backward trellis run of the first semi-iteration, another K clock periods plus Delay for the forward trellis run and the LLR computation, and this is multiplied by 2 for the second semi-iteration. Considering L the number of executed iterations, the total latency in clock periods for the serial decoding of one block is

    Latency_s = (4K + 2·Delay)·L,    (13)

while for the parallel decoding the needed number of clock periods is

    Latency_p = (4K/N + 2·Delay)·L.    (14)

The corresponding latency for the overlapping case, considering N > 2, which leads to blocks with Kmin at both the left and right sides, is

    Latency_po = (4(K/N + 2·Kmin) + 2·Delay)·L.    (15)

From Figs. 5 and 9 it can be observed that the decoding latency is reduced in the case of parallel decoding by almost a factor of N. There is a certain Delay (11 clock periods in this implementation) added at each forward trellis run, when the LLRs are computed, so 2 such terms are introduced at each iteration.
The values included in Table 1 are computed based on (13), (14), and (15) for the N=8 case. It can be noticed that the overhead introduced by the overlapping split becomes less significant as the value of K increases, which represents the scenario where parallel decoding is usually used.

Table 1. Latency values for N=8, L=3 or 4, and K=1536, 4096, or 6144

  K     Latency_s [us]   Latency_p [us]   Latency_po [us]
         L=3     L=4      L=3     L=4      L=3     L=4
 1536   88.08   117.4    11.28   15.04    15.85   21.14
 4096   234.3   312.5    29.57   39.42    34.14   45.52
 6144   351.4   468.5    44.2    58.9     48.7    65.03
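Formulas (13)-(15) are straightforward to evaluate. The sketch below, assuming the reported 210 MHz clock, Delay = 11, and Kmin = 40, reproduces the Table 1 latencies and, taking the throughput as the K bits divided by the latency, the Table 2 figures that follow, up to rounding.

```python
F_CLK_MHZ = 210.0   # reported synthesis frequency on the XC5VFX70T
DELAY = 11          # clock periods per forward trellis run
K_MIN = 40          # minimum standard block, used as overlap window
N = 8

us = lambda cycles: cycles / F_CLK_MHZ    # cycles -> microseconds

for K in (1536, 4096, 6144):
    for L in (3, 4):
        lat_s  = (4 * K + 2 * DELAY) * L                      # (13)
        lat_p  = (4 * K // N + 2 * DELAY) * L                 # (14)
        lat_po = (4 * (K // N + 2 * K_MIN) + 2 * DELAY) * L   # (15)
        print(f"K={K} L={L}: s={us(lat_s):.2f}us "
              f"p={us(lat_p):.2f}us po={us(lat_po):.2f}us "
              f"tput_s={K / us(lat_s):.2f}Mbps")
```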
Table 2 provides the corresponding throughput rates when the values from Table 1 are used.

Table 2. Throughput values for N=8, L=3 or 4, and K=1536, 4096, or 6144

  K     Tput_s [Mbps]    Tput_p [Mbps]    Tput_po [Mbps]
         L=3     L=4      L=3     L=4      L=3     L=4
 1536   17.43   13.07    136.1   102.0    96.86   72.64
 4096   17.47   13.10    138.5   103.8    119.9   89.9
 6144   17.48   13.11    139     104.2    125.9   94.4

As one can observe from Table 2, the serial decoding performance is close to the theoretical one. Let us consider, for example, the case K=6144 and L=3. The native theoretical latency is 4KL clock periods, which leads to a theoretical throughput of 17.5 Mbps, while the result obtained for the proposed serial implementation is 17.48 Mbps. On the other hand, the parallel decoding without overlap introduces a certain level of degradation compared with the serial decoding, the loss in terms of performance being dependent on the value of N.
The following performance curves were obtained using a finite precision Matlab simulator. This approach was selected because the Matlab simulator produces exactly the same outputs as the ModelSim simulator, while the simulation time is smaller. All the simulation results use the Max Log MAP algorithm. All pictures describe the Bit Error Rate (BER) versus the Signal-to-Noise Ratio (SNR), expressed as the ratio between the energy per bit and the noise power spectral density.
Fig. 11 depicts the results obtained when a block of length K = 512 was decoded in a serial manner, in a parallel manner without overlapping, and in a parallel manner with overlapping. For this scenario N = 2, QPSK modulation was used, and L = 3. Fig. 12 presents the same type of results for the case of K = 1024 and N = 4.
As one can observe from Figs. 11 and 12, the parallel decoding with overlap produces the same results as the serial decoding.

Fig. 11. Comparative decoding results for QPSK, L = 3, K = 512, N = 2.

Fig. 12. Comparative decoding results for QPSK, L = 3, K = 1024, N = 4.

6 Conclusions
The most important aspects regarding the FPGA implementation of a turbo decoder for LTE-A systems were presented in this paper. The serial turbo decoder architecture was developed and implemented in an efficient manner, especially from the point of view of the interleaver/deinterleaver processes. The interleaver memory ILM was introduced so that the interleaver process effectively works only outside the decoding process itself. The ILM is written together with the input data, while the previous block is still under decoding. This approach allowed the transfer to the parallel architecture in a simplified way, using only concatenated values at the same memory locations. The parallel architecture uses the same number of block memories and only one interleaver, at the cost of some multiplexing/demultiplexing structures.
The parallel decoding performances were compared with the serial ones and a certain degradation was observed. To eliminate this degradation, a small overhead was accepted through the overlapping split applied to the parallel data blocks.
Acknowledgment
The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398.

References:
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, Near Shannon limit error-correcting coding and decoding: Turbo Codes, Proc. IEEE Int. Conf. on Communications, Geneva, Switzerland, May 1993, pp. 1064-1070.
[2] C. Berrou and A. Glavieux, Near optimum error correcting coding and decoding: Turbo-Codes, IEEE Trans. Communications, vol. 44, no. 10, pp. 1261-1271, Oct. 1996.
[3] C. Berrou and M. Jézéquel, Non binary convolutional codes for turbo coding, Electronics Letters, vol. 35, no. 1, pp. 39-40, Jan. 1999.
[4] Third Generation Partnership Project, 3GPP home page, www.3gpp.org.
[5] 3GPP TS 36.212 V8.7.0 (2009-05) Technical Specification, "3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding (Release 8)."
[6] F. Khan, LTE for 4G Mobile Broadband, Cambridge University Press, New York, 2009.
[7] M. C. Valenti and J. Sun, The UMTS turbo code and an efficient decoder implementation suitable for software-defined radios, International Journal of Wireless Information Networks, vol. 8, no. 4, Oct. 2001.
[8] M. Sanad and N. Hassan, Novel wideband MIMO antennas that can cover the whole LTE spectrum in handsets and portable computers, The Scientific World Journal, vol. 2014, art. ID 694805, 9 pages, 2014.
[9] Suchang Chae, A low complexity parallel architecture of turbo decoder based on QPP interleaver for 3GPP-LTE/LTE-A, http://www.design-reuse.com/articles/31907/turbo-decoder-architecture-qpp-interleaver-3gpp-lte-lte-a.html
[10] Y. Sun and J. R. Cavallaro, Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder, Integration, the VLSI Journal, vol. 44, issue 4, pp. 305-315, Sept. 2011.
[11] Di Wu, R. Asghar, Yulin Huang, and D. Liu, Implementation of a high-speed parallel turbo decoder for 3GPP LTE terminals, ASICON '09, IEEE 8th International Conference on ASIC, pp. 481-484, 2009.
[12] C. Studer, C. Benkeser, S. Belfanti, and Quiting Huang, Design and implementation of a parallel turbo-decoder ASIC for 3GPP-LTE, IEEE Journal of Solid-State Circuits, vol. 46, issue 1, pp. 8-17, Jan. 2011.
[13] "Xilinx Virtex 5 family user guide," www.xilinx.com.
[14] "Xilinx ML507 evaluation platform user guide," www.xilinx.com.
[15] P. Robertson, E. Villebrun, and P. Hoeher, A Comparison of Optimal and Sub-Optimal MAP Decoding Algorithms Operating in the Log Domain, Proc. IEEE International Conference on Communications (ICC '95), Seattle, pp. 1009-1013, June 1995.
[16] S. Papaharalabos, P. Sweeney, and B. G. Evans, Constant log-MAP decoding algorithm for duo-binary turbo codes, Electronics Letters, vol. 42, issue 12, pp. 709-710, June 2006.
[17] Jung-Fu Cheng and T. Ottosson, Linearly approximated log-MAP algorithms for turbo decoding, Proc. IEEE 51st Vehicular Technology Conference (VTC 2000-Spring), Tokyo, vol. 3, pp. 2252-2256, 2000.
[18] C. Anghel, A. A. Enescu, C. Paleologu, and S. Ciochina, CTC Turbo Decoding Architecture for H-ARQ Capable WiMAX Systems Implemented on FPGA, Ninth International Conference on Networks (ICN 2010), Menuires, France, April 2010.
[19] C. Anghel, V. Stanciu, C. Stanciu, and C. Paleologu, CTC Turbo Decoding Architecture for LTE Systems Implemented on FPGA, IARIA ICN 2012, Reunion, France, 2012.
[20] "Virtex 5 Family Overview," Feb. 2009, www.xilinx.com.