

electronics

Article

A Parallel Timing Synchronization Structure in Real-Time High Transmission Capacity Wireless Communication Systems

Xin Hao 1,2,*, Changxing Lin 1,2 and Qiuyu Wu 1,2,*

1 Microsystem & Terahertz Research Center, China Academy of Engineering Physics, Chengdu 610200, China; [email protected]
2 Institute of Electronic Engineering, China Academy of Engineering Physics, Mianyang 621900, China
* Correspondence: [email protected] (X.H.); [email protected] (Q.W.); Tel.: +86-028-65726039 (Q.W.)

Received: 15 February 2020; Accepted: 10 April 2020; Published: 16 April 2020

Abstract: In the past few years, parallel digital signal processing (PDSP) architectures have been intensively studied to meet the growing demand for channel capacity in coherent optical communication systems. However, to our knowledge, real-time timing synchronization in such architectures has not yet been implemented on a Field Programmable Gate Array (FPGA). In this article, a parallel timing synchronization architecture is proposed. The architecture adopts a parallel First In First Out (FIFO) structure based on an index-associated rearranging method and a dual feedback loop based on Gardner's algorithm. Thanks to the FIFO structure, 67% of the Look Up Tables (LUTs) are saved in comparison with earlier results, while the Numerically Controlled Oscillator (NCO) is improved to meet the FPGA timing requirements for real-time performance. MATLAB simulations are run to evaluate the Bit Error Rate (BER) deterioration of the architecture. The floating- and fixed-point simulation results show that the BER deteriorations are less than 0.5 dB and 1 dB, respectively. Further, the architecture is implemented on a Xilinx XC7VX485T FPGA chip. A 20 giga bit per second (Gbps) 16 Quadrature Amplitude Modulation (16QAM) real-time system is achieved at a system clock of 159.524 MHz. This work opens a new pathway to improve the transmission capacity of real-time wireless communication systems.

Keywords: parallel digital signal processing (PDSP); wireless communication; real-time; FPGA; timing synchronization

1. Introduction

In modern society, the demand for transmission capacity in wireless communication systems is continuously increasing. According to the Cisco Visual Networking Index (VNI) published in 2017, global mobile data traffic will rise to 49 exabytes per month by 2021, up from only 7.2 exabytes per month at the end of 2016 [1]. To keep up with this trend, high-speed parallel digital signal processing (PDSP) technologies [2] have become prevalent in wireless communication systems in recent years.

In 2013, the Institute of Electronics, Microelectronics and Nanotechnology (IEMN) demonstrated an offline photoelectronic system [3] with an 8.2 giga bit per second (Gbps) communication rate. In 2012, the Fraunhofer Institute for Applied Solid State Physics (IAF) realized an offline electronic system [4] with a 24 Gbps rate, and in 2014 the Nippon Telegraph and Telephone Public Corporation (NTT) of Japan achieved an offline bit rate of about 50 Gbps [5]. Despite their high complexity, PDSP architectures have attracted considerable research interest, so far mostly in offline systems. In optical fiber communication, recent publications have reported systems based on PDSP architectures [6,7]. However, to the best of our knowledge, only offline (non-real-time) PDSP architectures [8] have been realized until now.

Electronics 2020, 9, 652; doi:10.3390/electronics9040652 www.mdpi.com/journal/electronics


Offline systems can be applied in many areas, such as High Definition (HD) movies. However, many applications require real-time operation. For instance, an HD video call system has a large amount of data to process. If the baseband digital signal processing runs offline, the user on the receiver side has to wait for a rather long processing time before getting the information sent by the user on the transmitter side. To solve this problem, online (real-time) communication systems are becoming increasingly prevalent.

To enjoy the real-time features of communication systems, high order Quadrature Amplitude Modulation (QAM) [9,10] is thought to be the key to improving the spectrum efficiency. High order QAM communication systems always require quite complicated demodulator architectures. Hardware complexity and resource limitations are thought to be the main challenges in implementing such a large real-time architecture. Field Programmable Gate Array (FPGA)-based [2] PDSP architecture is considered the key to solving this problem, and novel architectures keep emerging. Baseband PDSPs [11–13] are necessary in these demodulator architectures, and parallel timing synchronization is essential in baseband PDSPs.

In this article, an improved two times oversampling parallel timing synchronization architecture aimed at real-time performance is proposed and then implemented on a Xilinx XC7VX485T FPGA chip. The key technology is PDSP on FPGA, which greatly reduces the system clock frequency and makes real-time performance feasible with existing hardware devices. Specifically, a parallel First In First Out (FIFO) structure based on an index-associated rearranging method and a dual feedback loop based on the Gardner algorithm are adopted in our parallel architecture.

The rest of this article is organized as follows. Section 2 describes several shortcomings of two existing parallel structures. The improved parallel structure and its FPGA implementation are presented in Section 3. Section 4 presents the simulation and implementation results. Finally, conclusions are drawn in Section 5.

2. Shortages of Existing Parallel Architectures

As discussed in Section 1, many offline communication systems with high transmission rates have been developed. Nonetheless, improving the communication rate of real-time systems remains a major open problem, especially in baseband digital signal processing.

Most parallel structures are derived from serial algorithms. The Gardner [14–16] and Oerder and Meyr (O&M) [17] algorithms are the two most commonly used serial timing synchronization algorithms. The authors of [8,18,19] developed a parallel Gardner architecture for a two times oversampling coherent optical system based on the Gardner algorithm. Nevertheless, this architecture could not achieve real-time performance, mainly because its Numerically Controlled Oscillator (NCO) structure could not meet FPGA timing requirements. Besides, the loop filter discards the information provided by some of the parallel error detectors, which leads to a rather large deviation in the recovered signals. To achieve real-time performance, Lin and his collaborators successfully implemented a parallel O&M structure on FPGA in [20]. However, its requirement of four times oversampling costs an extremely large amount of hardware resources and needs higher speed Analog to Digital Converters (ADCs) in parallel systems.

2.1. Non-Real Time

Zhou and her collaborators explained their parallel structure in [8]. However, their digital signal processing (DSP) structure was verified only offline on the MATLAB platform and was not implemented on any hardware. The main factor limiting its real-time performance is that information loss occurs in the structure once the timing error accumulates to a certain amount.

It is also very difficult to achieve real-time performance on Digital Signal Processors (DSPs). Take one of the highest performance DSPs, the Texas Instruments (TI) C66x series, for example. The highest frequency a C66x DSP can work at is only 1.2 GHz, which makes it impossible to implement a real-time communication system above a 20 Gbps rate.

2.2. High Hardware Resource Consumption

From [17], it is easy to see that a structure based on the O&M algorithm needs at least four times oversampling, because the O&M based parallel error detector in [20] needs four sampled points to obtain one timing error.

3. Improved Parallel Architecture

The architecture of the serial Gardner algorithm can be found in [14,15]. This section presents the improvements made in the parallel architecture. A parallel First In First Out (FIFO) structure based on an index-associated rearranging method and a dual feedback loop based on Gardner's algorithm are adopted in the proposed parallel architecture. The overall architecture is depicted in Figure 1.

Figure 1. 2m-Parallel Architecture of Proposed Algorithm.


To generate parallel digital signals, the I and Q analog signals are first sampled by two high-speed ADCs. The relation between the sampling frequency of the ADC and that of the parallel structure is given in Equation (1).

$$f_{adc} = 2m \cdot f_s, \qquad (1)$$

where $f_{adc}$ is the sampling frequency of the high-speed ADC, m is the number of parallel channels, and $f_s$ is the sampling frequency of the parallel structure. The digital signals are stored in the parallel FIFO and then rearranged to ensure the stability of the data flow. Afterwards, the interpolated signals from the interpolators are passed to the timing error detectors (TEDs) to obtain timing errors, which are then filtered by the loop filter. Ultimately, the timing errors are compensated in the interpolators using the fractional interval and basepoint index provided by the NCOs.

The FPGA implementation block diagram of the parallel architecture, with delays and bit widths, is depicted in Figure 2.

Figure 2. Field Programmable Gate Array (FPGA) block diagram of improved parallel timing architecture.

The stability of the feedback loop is quite sensitive to the timing error, so the modules involved in error processing require bit widths of more than 8 bits, even though the source and output are all 8-bit signals. The FPGA resources consumed and the delay caused by each module are discussed in detail in Sections 3.1 and 3.2.

Unless otherwise specified, the following descriptions are based on an m channel parallel module, and the delays introduced on the FPGA by an adder and a multiplier are 1 and 3, respectively.

3.1. Parallel Preprocess

Parallel preprocessing, composed of the parallel FIFO, data rearrangement and data select modules, is responsible for the stability of the data flow.

3.1.1. Parallel FIFO

The source signals need to be stored in the parallel FIFO before all other procedures to avoid any information loss. In an m parallel architecture, 2m FIFOs with 8 bits of write/read depth and 512 bits of write width for the I and Q signals are required on the FPGA.

3.1.2. Resource-Saving Data Rearrangement

Timing error is caused by timing frequency offset and phase offset. The error caused by phase offset is a constant value, but the timing error will increase or decrease without bound if a timing frequency offset exists. To restrict the error, source signals need to be deleted or kept once the error has accumulated to a certain amount. However, in a parallel structure, the parallel source signal sequence becomes disordered once a delete or keep operation occurs.

Data rearrangement is adopted to delete or keep the source signal and then put the disordered signals back into the correct order. In [10], the authors used a parallel FIFO based delete-keep method to make this adjustment. However, as shown in the first column of Table 1, the subscripts of the indexes are variable. As a result, a Look Up Table (LUT) is consumed in every parallel channel. More seriously, the LUT consumption increases exponentially with the number of parallel channels.

To solve this problem, an index-associated method is proposed. It is easy to see that the increment of the subscript equals the increment of the index value, starting from 1. Thus, the variables can be expressed as index(i) plus the corresponding subscript offset, as shown in the second column of Table 1. With the proposed index-associated method, only one LUT is consumed, because the other LUTs in [10] can be replaced by add operations. Taking an m parallel system as an example, a LUT is needed only for index(i), and the other m − 1 LUTs are replaced with m − 1 adders.

Table 1. Index Relation.

| In [10] | Associated Index |
|---|---|
| index(i) | index(i) |
| index(i+1) | index(i)+1 |
| · · · | · · · |
| index(i+m-1) | index(i)+m-1 |
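The relation in Table 1 can be illustrated with a toy model. This is only a sketch, not the FPGA design: the table contents, the function names and the list-based "LUT" are assumptions made for illustration. It relies on the property stated in the text, namely that consecutive index values differ by exactly 1.

```python
def indexes_independent(lut, i, m):
    """Baseline in the style of the first column of Table 1:
    one table lookup per parallel channel (m lookups)."""
    return [lut[i + k] for k in range(m)]


def indexes_associated(lut, i, m):
    """Index-associated variant (second column of Table 1):
    a single lookup for index(i), then m - 1 additions."""
    base = lut[i]
    return [base + k for k in range(m)]


# Hypothetical table whose entries increase by 1, as the text assumes.
toy_lut = [3 + j for j in range(16)]
print(indexes_independent(toy_lut, 2, 4))
print(indexes_associated(toy_lut, 2, 4))
```

Both calls return the same list under the stated assumption, which is why the extra lookups can be traded for adders.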

Our work aims at a 20 Gbps communication system, for which a 64-parallel architecture is a sensible choice, because the FPGA clock then runs at around 156.25 MHz, which is within the reach of current hardware. The synthesized FPGA resource utilization of the 64-parallel FIFO is compared in Table 2. It shows that the proposed method saves about 67% of the LUTs compared to the method in [10].

Table 2. Resource utilization comparison.

| Parameter | LUT | FF |
|---|---|---|
| Utilization, in [10] | 21,248 (7.00%) | 3175 (0.52%) |
| Utilization, proposed | 6881 (2.27%) | 2531 (0.42%) |
| Available | 303,600 | 607,200 |
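The percentages in Table 2 and the quoted saving of about 67% can be rechecked with a few lines of arithmetic. The numbers are taken from the table; the function and variable names are illustrative only.

```python
AVAILABLE_LUT = 303_600   # LUTs available on the device (Table 2)
LUT_REF10 = 21_248        # LUTs used by the method in [10]
LUT_PROPOSED = 6_881      # LUTs used by the proposed method


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


utilization_ref10 = percent(LUT_REF10, AVAILABLE_LUT)
utilization_proposed = percent(LUT_PROPOSED, AVAILABLE_LUT)
lut_saving = percent(LUT_REF10 - LUT_PROPOSED, LUT_REF10)

print(round(utilization_ref10, 2), round(utilization_proposed, 2), round(lut_saving, 1))
```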

3.1.3. Data Select

A data select module sends the rearranged source signals to the corresponding interpolators, and each interpolator needs four data. Specifically, in a parallel structure, four extra source signals from the beginning of the next clock cycle need to be attached to the end of the current rearranged queue. Otherwise, the last four interpolators would not have enough source data. A data select module needs 5m registers on the FPGA in a 2m parallel system.

3.2. Parallel Dual Feedback Loop

The improved parallel dual feedback loop is composed of Parallel Modules (PMs) and a loop filter. Each PM has two interpolators, one TED and one NCO.

3.2.1. Parallel Module

Every PM needs five source signals and each interpolator uses four of them; the three source signals in the middle are used by both interpolators.
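The data select step and the sharing of samples inside a PM can be sketched as follows. This is a behavioural illustration under assumptions that are not stated in the text: PM i is assumed to start at sample 2i of the rearranged queue, and the function name and list representation are hypothetical. With this particular alignment only three of the four appended samples are actually read.

```python
def pm_windows(current, next_cycle, extra=4):
    """Split 2m rearranged samples into m five-sample PM windows.

    current    -- the 2m samples of the current clock cycle
    next_cycle -- the samples of the next clock cycle
    extra      -- samples borrowed from the next cycle (four in the text)
    Returns one (interpolator_1_inputs, interpolator_2_inputs) pair per PM.
    """
    extended = list(current) + list(next_cycle[:extra])
    m = len(current) // 2
    windows = []
    for i in range(m):
        five = extended[2 * i: 2 * i + 5]      # assumed alignment of PM i
        windows.append((five[:4], five[1:]))   # middle three samples are shared
    return windows


for first, second in pm_windows(list(range(8)), list(range(8, 16))):
    print(first, second)
```

Without the samples borrowed from the next cycle, the last window in this sketch would run past the end of the current queue.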


1 Coefficient Multiplier-Free Interpolator

As summarized in [15], in serial systems a cubic interpolator consumes three multipliers/dividers every time the coefficients are updated, and a four-point piecewise-parabolic interpolator (α = 1/2) consumes two. The coefficients are shown in Equation (2).

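$$\begin{aligned}
C_{-2} &= \tfrac{1}{2} x^2 - \tfrac{1}{2} x, \\
C_{-1} &= -\tfrac{1}{2} x^2 + \tfrac{3}{2} x, \\
C_{0} &= -\tfrac{1}{2} x^2 - \tfrac{1}{2} x + 1, \\
C_{1} &= \tfrac{1}{2} x^2 - \tfrac{1}{2} x,
\end{aligned} \qquad (2)$$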

where x stands for the input signals. Specifically, the term 3/2 · x in Equation (2) is separated into 1/2 · x + x. All Farrow coefficients of the four-point piecewise-parabolic interpolator with α = 1/2 then become integer powers of 2 (1/2 or 1), so the multipliers/dividers can be implemented with shift operations on the FPGA [21]. In other words, the four-point piecewise-parabolic interpolator with α = 1/2 is the best choice for FPGA, as it balances maximum resource savings against minimal performance deterioration.
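The coefficient split can be sketched in floating point as below. This is a hedged illustration, not the fixed-point FPGA logic: the polynomial variable of Equation (2) is assumed here to be the fractional interval (called `mu` in the code), and the sample ordering follows the usual Farrow convention in which C_i weights the sample at `base - i`. Both choices, and all names, are assumptions.

```python
def farrow_coefficients(mu):
    """Coefficients of Equation (2), written so that every constant
    factor is 1/2 or 1 (a one-bit shift or a wire in hardware)."""
    half_mu = mu / 2          # 1/2 * mu
    half_mu2 = mu * mu / 2    # 1/2 * mu^2
    c_m2 = half_mu2 - half_mu
    c_m1 = -half_mu2 + half_mu + mu   # 3/2 * mu split into 1/2 * mu + mu
    c_0 = -half_mu2 - half_mu + 1
    c_p1 = half_mu2 - half_mu
    return c_m2, c_m1, c_0, c_p1


def interpolate(samples, base, mu):
    """Four-point interpolation around samples[base] (assumed convention)."""
    c_m2, c_m1, c_0, c_p1 = farrow_coefficients(mu)
    return (c_m2 * samples[base + 2] + c_m1 * samples[base + 1]
            + c_0 * samples[base] + c_p1 * samples[base - 1])


ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(sum(farrow_coefficients(0.25)))   # the coefficients sum to 1
print(interpolate(ramp, 2, 0.25))       # a linear ramp is reproduced
```

On a linear ramp the interpolant lands at `base + mu`, which is a quick sanity check that the split of 3/2 leaves the coefficients unchanged.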

The Farrow structure on FPGA is shown in Figure 3.

Figure 3. Farrow structure for piecewise-parabolic interpolator (α = 1/2).

In Figure 3, D stands for a hardware delay, and the symbols correspond to those in Equation (2). In particular, one extra delay is caused by the coefficient 3/2, which is separated into 1 + 1/2 as described above. Each adder introduces 1 delay and each multiplier introduces 3, so the total delay caused by the interpolator is 11.

2 Error Detector

Gardner's timing error detector (TED) [16] is shown in Equation (3).

$$e(n) = I(n - 1/2)\,[I(n) - I(n-1)] + Q(n - 1/2)\,[Q(n) - Q(n-1)], \qquad (3)$$

where e(n) is the timing error, I(n) and Q(n) are the real and imaginary parts of the interpolator's output, and n − 1, n − 1/2, n are three consecutive indexes.

The TED equation in a parallel system is shown in Equation (4) below.

$$e(n, i) = I_1(n, i)\,[I_2(n, i) - I_2(n, i-1)] + Q_1(n, i)\,[Q_2(n, i) - Q_2(n, i-1)], \qquad (4)$$


where e(n, i) is the timing error of the ith PM at time n. For the I signal, I1(n, i) is the first interpolator's output of the ith PM at time n, I2(n, i) is the second interpolator's output of the ith PM at time n, and I2(n, i − 1) is the second interpolator's output of the (i − 1)th PM at time n. When i = 1, I2(n, i − 1) stands for the second interpolator's output of the last PM at time n − 1. The same holds for the Q signal.
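Equation (4), including the wrap-around to the previous clock cycle for the first PM, can be sketched as follows. The function signature and the list-based representation are assumptions for illustration; the arithmetic follows Equation (4) term by term.

```python
def parallel_ted(i1, i2, q1, q2, last_i2, last_q2):
    """Timing errors of all PMs in one clock cycle, per Equation (4).

    i1[k], q1[k] -- first interpolator outputs of PM k (I and Q parts)
    i2[k], q2[k] -- second interpolator outputs of PM k (I and Q parts)
    last_i2, last_q2 -- second interpolator output of the last PM in the
                        previous clock cycle, used by the first PM
    """
    errors = []
    prev_i, prev_q = last_i2, last_q2
    for k in range(len(i1)):
        errors.append(i1[k] * (i2[k] - prev_i) + q1[k] * (q2[k] - prev_q))
        prev_i, prev_q = i2[k], q2[k]   # becomes I2(n, i-1) for the next PM
    return errors


# Two PMs, I branch only: the first error uses the value kept from time n-1.
print(parallel_ted([0.5, 0.0], [1.0, -1.0], [0.0, 0.0], [0.0, 0.0], -1.0, 0.0))
```

In hardware all errors are computed concurrently; the loop here only expresses the data dependencies between neighbouring PMs.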

For FPGA implementation, each error detector contains two adders and two multipliers. The hardware block diagram is depicted in Figure 4.

Figure 4. Implementation structure of the error detector.

In Figure 4, D stands for a hardware delay, and the symbols correspond to those in Equation (4). From Figure 4, it can be seen that the total delay is 5, with each adder introducing 1 delay and each multiplier introducing 3.

3 Simplified NCO

In [8], the overflow moments are obtained by comparators. Nevertheless, as discussed in Section 2, this comparison logic has difficulty meeting FPGA timing requirements. To achieve real-time performance, a parallel NCO structure based on direct calculation is proposed. In [14], Gardner gives Equation (5) below to calculate $m_k$:

$$m_k = \mathrm{int}[k T_i / T_s], \qquad (5)$$

where int[z] means the largest integer not exceeding z, $T_s$ is the sample period before synchronization, and $T_i$ is the synchronized sample period. Equation (5) can be rewritten as Equation (6) below:

$$m_{k+1} = m_k + \mathrm{fix}[R + W(n) + \mu_k], \qquad (6)$$

where fix[z] rounds z to an integer toward zero, R is half of the oversampling ratio, and W(n) is the control word of the NCO at time n. Thus, $m_k$ of each parallel module can be calculated directly and accurately, without the comparison logic.
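The step from Equation (5) to Equation (6) can be checked numerically. The sketch below makes a simplifying assumption that is not in the text: the sum R + W(n) is held constant and equal to the ratio Ti/Ts, whereas in the real loop W(n) is updated by the loop filter. Exact rational arithmetic is used to avoid rounding effects; all names are illustrative.

```python
from fractions import Fraction
from math import floor


def direct_basepoints(ratio, count):
    """Equation (5): m_k = int[k * Ti / Ts]; mu_k is the fractional rest."""
    result = []
    for k in range(count):
        t = k * ratio
        m_k = floor(t)
        result.append((m_k, t - m_k))
    return result


def recursive_basepoints(ratio, count):
    """Equation (6) with R + W(n) fixed to `ratio` (assumption)."""
    m_k, mu_k = 0, Fraction(0)
    result = []
    for _ in range(count):
        result.append((m_k, mu_k))
        step = ratio + mu_k
        increment = floor(step)   # step >= 0, so floor equals rounding toward zero
        m_k += increment
        mu_k = step - increment   # new fractional interval
    return result


ratio = Fraction(101, 100)   # hypothetical Ti/Ts, i.e. a 1% timing frequency offset
print(direct_basepoints(ratio, 5))
print(direct_basepoints(ratio, 300) == recursive_basepoints(ratio, 300))
```

Because each basepoint follows from the previous one by a single addition and truncation, no overflow comparison is needed, which is the point made in the text.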

When implemented on FPGA in a serial structure, the NCO only needs to locate the initial position of the interpolator with one 8-bit control signal. In a parallel structure, however, not only are the initial positions of all parallel modules required, but another 2-bit control signal is also needed for rearrangement. Fortunately, with the proposed direct-calculation NCO, this 2-bit signal consists of the first two bits of the 8-bit control signal in the mth NCO, so no additional hardware resources are required, as depicted in Figure 1.

3.2.2. High Precision Loop Filter

A proportional integral filter is employed in our structure. In [8], the information carried by all TEDs except the last one is discarded, and only the last TED's timing error serves as the input of the proportional element, which leads to a large deviation. To guarantee accuracy, the average value of all timing errors is employed as the input of the loop filter in our work. Simulation results confirmed that the performance is better when using the average value. The equations of the improved loop filter are given in Equations (7) to (9).

Proportional element

$$P_n = k_1 \times (err_{n,1} + err_{n,2} + \cdots + err_{n,32}), \qquad (7)$$

Integral element

$$I_n = I_{n-1} + k_2 \times (err_{n,1} + err_{n,2} + \cdots + err_{n,32}), \qquad (8)$$

Then the output of the loop filter is

$$W_n = P_n + I_n, \qquad (9)$$

where $k_1$ and $k_2$ stand for the coefficients of the proportional element and the integral element, respectively, $err_{n,i}$ is the error of the ith PM at time n, and $W_n$ is the output of the loop filter at time n. The structure is depicted in Figure 5.
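Equations (7) to (9) translate into a small stateful update. The sketch below follows the equations as written, summing the PM errors; the class name and the coefficient values in the usage lines are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
class LoopFilter:
    """Proportional integral loop filter following Equations (7)-(9)."""

    def __init__(self, k1, k2):
        self.k1 = k1            # proportional coefficient
        self.k2 = k2            # integral coefficient
        self.integral = 0.0     # I_{n-1}, starts at zero

    def update(self, errors):
        """Consume the timing errors of all PMs at time n and return W_n."""
        total = sum(errors)
        proportional = self.k1 * total       # Equation (7)
        self.integral += self.k2 * total     # Equation (8)
        return proportional + self.integral  # Equation (9)


# Hypothetical coefficients that are powers of 1/2, with 32 equal errors.
loop_filter = LoopFilter(k1=2 ** -4, k2=2 ** -8)
print(loop_filter.update([1.0] * 32))
print(loop_filter.update([1.0] * 32))
```

With a constant error the proportional part stays fixed while the integral part keeps growing, which is what lets the loop track a timing frequency offset.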

Figure 5. Implementation structure of loop Filter with error smoothing.

When implemented on FPGA, an error smoothing module is placed before the first adder, because the input of the loop filter is the average of the parallel timing errors, as mentioned above. An m parallel system contains m/2 TEDs, so the smoother can be realized with a single log2(m/2)-bit shift operation. In addition, the first adder is an m/2-input adder, so the delay introduced by the smoother in an m parallel system is log2(m/2). To save hardware resources, the multipliers are replaced by shift operations, as mentioned above: k1 and k2 in the loop filter are approximated to the nearest integer power of 1/2. These approximations change the closed loop bandwidth and the damping factor of the system, but since a multiplication on FPGA is itself realized by shift and add operations, the approximation is reasonable.

Therefore, the total number of adders consumed in the loop filter is 1 + m/2, and the corresponding delay is 2 + log2(m/2).
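The adder count, the delay and the shift-based smoother can be put into numbers. This is a sketch assuming that m is a power of two and that the errors are integers, as they would be in fixed point; the function names are illustrative.

```python
def loop_filter_cost(m):
    """Adders and delay of the loop filter in an m parallel system,
    using the expressions 1 + m/2 and 2 + log2(m/2) from the text."""
    teds = m // 2
    depth = teds.bit_length() - 1   # log2(m/2) when m/2 is a power of two
    return 1 + teds, 2 + depth


def shift_average(errors):
    """Average a power-of-two number of integer errors with a right shift.
    Python's >> on integers rounds toward minus infinity."""
    shift = len(errors).bit_length() - 1
    return sum(errors) >> shift


print(loop_filter_cost(64))      # adders and delay for the 64-parallel case
print(shift_average([3] * 32))   # 32 TED errors averaged with a 5-bit shift
```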

4. Simulation and FPGA Implementation

As mentioned before, our work aims at a 20 Gbps wireless communication system. A 64-parallel architecture lets the FPGA clock run at around 156.25 MHz. A 32-parallel system would require a 312.5 MHz FPGA clock, which makes it very difficult for the FPGA to run stably. A 128-parallel system needs the FPGA circuit to run at only 78.125 MHz, which is easy for today's FPGAs to sustain, but the system error grows drastically as the number of parallel channels increases. On the other hand, a number of parallel channels that is not an integer power of two wastes hardware resources on the FPGA chip during routing, because the FPGA is based on the binary system. So, a 64-parallel architecture is the best choice for a 20 Gbps system.
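The clock frequencies quoted above follow from the bit rate, the modulation order and the oversampling factor. The short calculation below reproduces them; the function name and its parameters are illustrative, while the input values (20 Gbps, 16QAM with 4 bits per symbol, two times oversampling) come from the text.

```python
def fpga_clock_mhz(bit_rate_gbps, bits_per_symbol, oversampling, parallel):
    """FPGA clock needed when the sample stream is split into
    `parallel` channels."""
    symbol_rate_ghz = bit_rate_gbps / bits_per_symbol   # 5 GBaud for 16QAM
    sample_rate_ghz = symbol_rate_ghz * oversampling    # 10 GS/s at 2x
    return sample_rate_ghz * 1000.0 / parallel


for channels in (32, 64, 128):
    print(channels, fpga_clock_mhz(20, 4, 2, channels))
```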

The proposed algorithm is verified in a baseband communication system. The modulation type is 16QAM, the bit rate is 20 Gbps, the roll-off factor is 0.4, the sampling frequency $f_s$ is 2 times the symbol rate $R_s$ ($f_s = 2R_s$), and the timing frequency offset and phase offset are 32 kHz and π, respectively. The parallel source signal is quantized to 8 bits. The BER performance of the improved parallel architecture simulated in MATLAB reveals its high efficiency. Furthermore, the architecture implemented on the Xilinx XC7VX485T FPGA shows perfect consistency with the simulation.

4.1. MATLAB Simulation

The constellation diagrams are shown in Figure 6, where Figure 6a,b are the constellation diagrams before and after timing synchronization, respectively. Here, the SNR is set to 20 dB. The converged constellation diagram shows that the timing module works correctly.

(a) Before Timing Synchronization. (b) After Timing Synchronization.

Figure 6. Constellation diagrams before and after timing synchronization.

The BER performance for the transmission of 100 frames (16,384 bits each) is shown in Figure 7. The blue curve is the theoretical BER, and '∗' and '∆' represent the MATLAB fixed- and floating-point simulations, respectively. The BER performance indicates that the algorithm works efficiently, with a deterioration of less than 0.5 dB and 1 dB in the floating- and fixed-point simulations, respectively.

Figure 7. Bit Error Rate (BER) performance of the improved architecture.


4.2. FPGA Implementation

The implementation is demonstrated on a Xilinx XC7VX485T FPGA chip. 128 parallel ROMs are employed to store the source data, serving as the equivalent of the 10 GHz ADCs. In order to evaluate the performance difference between simulation and FPGA implementation, a periodic source is embedded into the ROMs. The write and read clocks of the parallel FIFO module are set to 156.268 MHz and 159.524 MHz, respectively, by a Mixed-Mode Clock Manager (MMCM). The device utilization of the whole system is summarized in Table 3.

Table 3. Resource utilization comparison

| Resource | Utilization | Available | Utilization % |
|---|---|---|---|
| LUT | 35,228 | 303,600 | 11.60 |
| LUTRAM | 1375 | 130,800 | 1.05 |
| FF | 55,950 | 607,200 | 9.21 |
| BRAM | 143 | 1030 | 13.88 |
| DSP48e | 480 | 2800 | 17.14 |

The output of the fractional interval is depicted in Figure 8a, where '∗' and '∆' represent the MATLAB fixed-point simulation and the hardware behavior simulation, respectively. The two signals overlap completely, which indicates that the hardware design reproduces the simulation. To further verify the accuracy, the difference between the two signals is shown in Figure 8b. The constant zero shows that the NCO output in MATLAB is exactly the same as in the hardware behavior simulation. Moreover, not only the fractional interval but all the signals obtained in the behavior simulation are exactly the same as those in MATLAB, which confirms the correctness and effectiveness of the design.

(a) Output value.

(b) Difference Value.

Figure 8. Output and difference value of fractional interval with MATLAB fixed-point simulation and behavior simulation.

The constellation diagram obtained on the Xilinx XC7VX485T FPGA chip is displayed in Figure 9a. Figure 9b shows the difference (imaginary part) between the interpolator's output in the behavior simulation and in the FPGA implementation.


(a) Constellation diagram on FPGA. (b) Difference value (imaginary part).

Figure 9. Constellation diagram on FPGA and difference value (imaginary part) of interpolator's output with MATLAB fixed-point simulation and FPGA implementation.

The difference here cannot be guaranteed to always be zero, because the initial source signals cannot be precisely controlled on an FPGA chip. However, the difference is always less than 2 from about the 15,000th datum onward. This means that only a quantization error of less than 1.5% is introduced for an 8-bit signal, which further confirms that the proposed algorithm works effectively on FPGA.

5. Conclusions

In this paper, we have proposed an improved parallel timing synchronization architecture to address the urgent problem of enhancing transmission capacity in real-time wireless communication systems. We have demonstrated that the proposed architecture can be successfully implemented on FPGA. In addition, our work saves 67% of the LUT resources on FPGA compared with earlier results. Meanwhile, the NCO is improved to meet the FPGA timing requirements by using direct calculation instead of the comparator based structure in related work. The key technology is parallel signal processing, both in theory and in FPGA implementation. Parallelization over m channels reduces the system clock to 1/(2m) of that required in serial processing, which makes real-time performance feasible with existing hardware devices. Accordingly, a theoretical model of parallel digital timing synchronization was established. The simulation results for a 64-parallel, 20 Gbps, 16QAM system show high consistency with the theoretical model. The BER deterioration is less than 0.5 dB and 1 dB in the floating- and fixed-point simulations, respectively. The FPGA implementation also shows excellent agreement with the simulation. Furthermore, the proposed algorithm is not limited to 64 parallel channels; higher capacity can be achieved with faster clocks or more parallel channels. The proposed structure could be further optimized in future work on high capacity wireless communication.

Author Contributions: X.H.: Methodology, Software, Validation, Formal Analysis, Data Curation, Writing – Original Draft. C.L.: Conceptualization, Investigation, Writing – Review & Editing, Supervision, Project Administration, Funding Acquisition. Q.W.: Resources, Writing – Review & Editing, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by National Key R&D Program of China Grant (2018YFB18000500) and The President Funding of China Academy of Engineering Physics with No. YZJJLX2018009.

Acknowledgments: The authors would like to thank Zhu Zheng, Zheng Feng, Lei Zhao, Yang Yu, Jianfei An and Chengda Ren for helpful discussions.

Conflicts of Interest: The authors declare no conflict of interest.


References

1. Cisco. Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2017–2022; White Paper CiscoPublic: San Jose, CA, USA, 2019.

2. Parhi, K.K. VLSI Digital Signal Processing: Systems Design and Implementation; A Wiley-Interscience Publication:Hoboken, NJ, USA, 1999; pp. 255–311.

3. Blin, S.; Tohme, L.; Coquillat, D.; Horiguchi, S.; Minamikata, Y.; Hisatake, S.; Nouvel, P.; Cohen, T.; Pénarier, A.; Cano, F.; et al. Wireless Communication at 310 GHz using GaAs high-Electron-Mobility Transistors for Detection. J. Commun. Netw. 2013, 15, 559–568.

4. Antes, J.; König, S.; Leuther, A.; Massler, H.; Leuthold, J.; Ambacher, O.; Kallfass, I. 220 GHz wireless data transmission experiments up to 30 Gbit/s. In Proceedings of the 2012 IEEE/MTT-S International Microwave Symposium Digest, Montreal, QC, Canada, 17–22 June 2012; pp. 1–3.

5. Song, H.; Kim, J.; Ajito, K.; Kukutsu, N.; Yaita, M. 50-Gbps Direct Conversion QPSK Modulator and Demodulator MMICs for Terahertz Communications at 300 GHz. IEEE Trans. Microw. Theory Tech. 2014, 62, 600–609.

6. Yang, T.; Shi, C.; Chen, X.; Zhang, M.; Ji, Y.; Hua, F.; Chen, Y. Linewidth-tolerant and multi-format carrier phase estimation schemes for coherent optical m-QAM flexible transmission systems. Opt. Express 2018, 26, 10599–10615.

7. Gao, Z.; Zhou, M.; Reviriego, P.; Maestro, J.A. Efficient Fault-Tolerant Design for Parallel Matched Filters. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 366–370.

8. Zhou, X.; Chen, X. Parallel implementation of all-digital timing recovery for high-speed and real-time optical coherent receivers. Opt. Express 2011, 19, 9282–9295.

9. Wu, Q.; Lin, C.; Lu, B.; Miao, L.; Hao, X.; Wang, Z.; Jiang, Y.; Lei, W.; Den, X.; Chen, H.; et al. A 21 km 5 Gbps real time wireless communication system at 0.14 THz. In Proceedings of the 2017 42nd International Conference on Infrared, Millimeter, and Terahertz Waves (IRMMW-THz), Cancun, Mexico, 27 August–1 September 2017; pp. 1–2.

10. Lin, C.; Zhang, J.; Shao, B. A Multi-Gigabit Parallel Demodulator and Its FPGA Implementation. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2012, 95, 1412–1415.

11. Lin, C.; Shao, B.; Zhang, J. A high data rate parallel demodulator suited to FPGA implementation. In Proceedings of the 2010 International Symposium on Intelligent Signal Processing and Communication Systems, Chengdu, China, 6–8 December 2010; pp. 1–4.

12. Cheng, C.; Parhi, K.K. Hardware Efficient Fast Parallel FIR Filter Structures Based on Iterated Short Convolution. IEEE Trans. Circuits Syst. I 2004, 51, 1492–1500.

13. Cheng, C.; Parhi, K.K. Low Cost Parallel Adaptive Filter Structures. In Proceedings of the Conference Record of the 39th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 30 October–2 November 2005; pp. 354–358.

14. Gardner, F.M. Interpolation in digital modems. I. Fundamentals. IEEE Trans. Commun. 1993, 41, 501–507.

15. Erup, L.; Gardner, F.M.; Harris, R.A. Interpolation in digital modems. II. Implementation and performance. IEEE Trans. Commun. 1993, 41, 998–1008.

16. Gardner, F. A BPSK/QPSK Timing-Error Detector for Sampled Receivers. IEEE Trans. Commun. 1986, 34, 423–429.

17. Oerder, M.; Meyr, H. Digital filter and square timing recovery. IEEE Trans. Commun. 1988, 36, 605–612.

18. Zhou, X.; Chen, X.; Zhou, W.; Fan, Y.; Zhu, H.; Li, Z. All-Digital Timing Recovery and Adaptive Equalization for 112Gbit/s POLMUX-NRZ-DQPSK Optical Coherent Receivers. IEEE/OSA J. Opt. Commun. Netw. 2010, 2, 984–990.

19. Fan, Y.; Chen, X.; Zhou, W.; Zhou, X. Parallel processing clock synchronization-dispersion equalization combining loop in 112Gb/s optical coherent receivers. In Proceedings of the 19th Annual Wireless and Optical Communications Conference (WOCC 2010), Shanghai, China, 14–15 May 2010; pp. 1–4.

20. Lin, C.; Zhang, J.; Shao, B. A High Speed Parallel Timing Recovery Algorithm and Its FPGA Implementation. In Proceedings of the 2011 2nd International Symposium on Intelligence Information Processing and Trusted Computing, Hubei, China, 22–23 October 2011; pp. 63–66.

21. Yan, X.; Wang, Q.; Hao, X.; Qin, K. A High-Efficiency Multiplierless Symbol Synchronization Algorithm for IEEE802.11x WLANs. Wirel. Pers. Commun. 2017, 94, 1737–1749.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).