



ORIGINAL ARTICLE

A GPU based real-time GPS software receiver

Thomas Hobiger · Tadahiro Gotoh · Jun Amagai · Yasuhiro Koyama · Tetsuro Kondo

Received: 7 May 2009 / Accepted: 7 July 2009 / Published online: 8 August 2009

© Springer-Verlag 2009

Abstract Off-the-shelf graphics processing units provide low-cost massively parallel computing performance, which can be utilized for the implementation of a GPS software receiver. In order to realize a real-time capable system, the crucial stages of the receiver should be optimized to suit the requirements of a parallel processor. Moreover, the receiver should be capable of providing wider correlation functions and easy access to the spectral domain of the signals. Thus, the most suitable correlation algorithm, which forms the core part of each receiver, should be chosen and implemented on the graphics processor. Since the sampling rate of the received signal limits the real-time capabilities of the software radio, it is necessary to determine an optimum value, considering that the precision of the observables varies with sampling bandwidth. We discuss these details and present our single-frequency multi-channel implementation, which is capable of operating in real-time mode. Our implementation differs from other solutions by the width of the correlation function and allows simple handling of data in the spectral domain. Comparison with output from a commercial hardware receiver, which shares the antenna with the software radio, confirms the consistency and accuracy of our development.

Keywords GPU · Software receiver · Real-time · FFT

Introduction

Driven by the increase of CPU performance, GPS/GNSS software receivers have become more popular since they offer a flexible and extendible platform for developing and testing new applications (Chakravarthy et al. 2001). A software radio not only mimics the functionality of its hardware counterpart, but also allows the user to carry out the signal processing chain with unprecedented floating point precision. Since application-specific integrated circuits (ASICs) for GPS tracking can neither be easily adapted to new signals nor have their hard-wired logic replaced with new algorithms, receivers based on field-programmable gate arrays (FPGAs) have been developed in recent years, e.g. Mumford et al. (2006). Such a solution provides a good tradeoff between the flexibility of software radios and the speed of ASICs, but is still an expensive niche product for dedicated applications. Similar to the progress with real-time software receivers running on the CPU (Deng et al. 2009), graphics processing units (GPUs) are expected to be another implementation path which makes it possible to realize the GPS radio on a PC.

General purpose graphics processing units

Caused by this historical separation and driven by the requirements of the PC gaming industry, GPUs have evolved into massively parallel processing systems which have entered the area of non-graphics applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms classical processors when the application can be parallelized. Thus, GPUs have started to attract researchers from a variety of fields, and more and more applications are emerging which are not related to the purpose GPUs were originally designed for (Nguyen 2007; Owens et al. 2008). Moreover, Harris et al. (2008) demonstrate that a GPU can be successfully applied to radio astronomical signal processing, solving tasks which are similar to those of GNSS. Unlike CPUs, which directly access the PC's memory, a program running on the GPU requires the relevant data to be transferred from CPU memory to the onboard memory of the graphics card before it can be accessed. The same holds for the CPU, which cannot directly access a memory area on the GPU, but needs to copy the data back to RAM. Thus, data transfer between CPU and GPU can be a significant bottleneck for an application, and it should be checked in advance that the gain in computation performance on the GPU is not consumed by the overhead caused by data transfer. Although recent motherboards are equipped with PCI busses which allow data to be transferred at several Gb/s, this can still be an important factor to consider before starting to implement an application on the GPU.

T. Hobiger (✉) · T. Gotoh · J. Amagai · Y. Koyama · T. Kondo
Space-time Standards Group, National Institute of Information and Communications Technology, 4-2-1 Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan
e-mail: [email protected]
URL: http://www.nict.go.jp

GPS Solut (2010) 14:207–216
DOI 10.1007/s10291-009-0135-2

System overview

Software receivers require that the RF signals are down-converted and digitized by hardware components before they can be processed on the PC. Moreover, since the system developed in this study is not only dedicated to GNSS but also usable for time-transfer applications using PRN-code-like signals, flexible and robust hardware parts have been deployed. Several components originally developed for Very Long Baseline Interferometry (VLBI) are used in addition to other off-the-shelf components.

RF down-conversion and digitization

Figure 1 displays the hardware components which are utilized for the down-conversion and analog/digital (A/D) conversion. The RF signals, which are also processed by a commercial hardware receiver (JAVAD™), are received from a standard geodetic GPS antenna (Ashtech choke-ring antenna). Thereafter, the L1 and L2 signals are down-converted to two intermediate frequencies using a phase-locked oscillator operating at 1,380 MHz. These intermediate frequencies are fed to a video converter (Kiuchi et al. 1997) where they are mixed with the second local oscillator running at 193.42 MHz. After this stage the signals are digitized via a sampler which has been developed for VLBI (Kondo et al. 2006). Since the digital signals are output via a USB 2.0 interface, they can be directly handled by a PC. Although displayed in Fig. 1, processing of L2 signals is currently turned off and will be implemented in the near future as discussed later. Thus, in the following only the usage of the L1 C/A code is considered. A dual-frequency receiver can be realized from the following description by adding a second GPU which is dedicated to the processing of L2C code signals.

CPU, GPU and programming utilities

In order to demonstrate that a GPS software receiver can be implemented on the GPU, a test PC has been set up using the hard- and software components listed in Table 1. Basically, only off-the-shelf components have been utilized. The GPU code has been compiled with the help of NVIDIA's CUDA environment (http://developer.download.nvidia.com/), whereas the host code running on the CPU is compiled by a GNU C compiler. Although the GPU used would support double precision floating point numbers, no use of this option was made because it slightly slows down computational efficiency. Tests with single and double precision numbers have revealed identical results, corroborating our approach of using single precision numbers only. NVIDIA's FFT library, named CUFFT, is utilized for the Fourier transforms, and other functions of the CUDA toolkit turned out to be very useful for debugging the code. Since NVIDIA provides a profiler for programs running on the GPU, this tool has been applied, too, in order to detect bottlenecks within the code.

Fig. 1 Schematics of the hardware components for down-conversion and digitization of GPS signals (power divider PD, phase locked oscillator PLO, frequency distributor Dist., low noise amplifier LNA)

GPS software receiver implementation on a GPU

Conventional hardware receivers as well as several software radios implement the "classical" early/prompt/late scheme for the code tracking loop (Tsui 2000). We have decided to follow another strategy, computing a wider and finer correlation function, which has several advantages concerning multipath mitigation and tracking of weak signals. Moreover, as discussed below, the computation of the correlation function in the time domain is not a straightforward algorithm for a parallel processor and requires a special strategy when computing the correlation amplitude via summation over all chips.

Basically, two different strategies exist for the realization of a multi-lag approach, i.e. the computation of a wider or even the complete correlation function. The first approach, named XF, performs the cross-correlation in the time domain and obtains the cross-spectrum via Fourier transform. The second approach, called FX, transforms the received and the replica signal into the frequency domain first, and then obtains the cross-spectrum via multiplication.
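The equivalence of the two strategies can be sketched in a few lines. The following pure-Python illustration is not the paper's CUDA code: all function names are ours, and a naive O(N²) DFT stands in for the FFT, but it computes the same circular cross-correlation both ways.

```python
# Sketch only: XF (time-domain) and FX (frequency-domain) strategies
# produce the same circular cross-correlation. Naive DFT for brevity.
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT, adequate for a small demonstration."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inverse else out

def xcorr_xf(sig, rep):
    """XF: correlate in the time domain (circular), lag by lag."""
    n = len(sig)
    return [sum(sig[(k + lag) % n] * rep[k] for k in range(n))
            for lag in range(n)]

def xcorr_fx(sig, rep):
    """FX: multiply signal spectrum with conjugated replica spectrum,
    then transform back to the lag domain."""
    n = len(sig)
    s, r = dft(sig), dft(rep)
    cross = [s[j] * r[j].conjugate() for j in range(n)]
    return [v.real for v in dft(cross, inverse=True)]
```

In the FX path the replica spectrum `dft(rep)` would be computed once per PRN and re-used for every cross-spectrum, which is the property exploited later in the text.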

Correlation engine: XF- versus FX-type

Both approaches lead to the same results but differ in their performance when carried out on the GPU. At first glance the XF strategy seems to have the advantage that the correlation function can be computed within a narrow search space around the expected peak position using a limited number of lags, whereas the FX approach will automatically compute the complete correlation function. A disadvantage of the XF strategy arises from the fact that an efficient computation of correlation functions requires coalescent memory access when the cross-product is summed up. Since this is not provided on the GPU, a work-around called parallel reduction is necessary for the computation (Fernando 2005). Although fast shared memory on the GPU can be utilized, the size which is accessible across different threads is limited. Even more difficulties arise when several channels have to be correlated in parallel. Thus, in order to evaluate the performance of the XF architecture on the GPU, a single-channel test running only the X part, i.e. the cross-correlation, for different lag and data sizes has been carried out (Fig. 2).

The time for the F part, i.e. the Fourier transformation, is negligible compared to the computation time of the correlation function and can be ignored. Nevertheless, it can already be predicted that the XF strategy does not have the potential to realize a multi-channel real-time GNSS receiver. In order to obtain precise geodetic observables, signals will be recorded with sampling rates equal to or larger than 4 Msps (see "Real-time requirements"). Thus the equivalent data size of the C/A code (1 ms) will be at least 4,000 points. One second of data requires 1,000 calls of the XF engine, which would take about 0.33 s if a 16-lag correlation function is computed for a single channel. The 32- and 64-lag XF implementations, which take about 0.66 and 1.31 s for the same sampling rate, make clear that it is not feasible to utilize this architecture for a multi-channel receiver on the GPU. Moreover, another caveat of the XF approach is caused by the pre-filtering of the recorded signals. Although FIR filters do not require coalescent memory access, they are expected to contribute at least another 20% to the overall computation time of the XF engine per channel. Since the results displayed in Fig. 2 show only the pure time taken by the XF engine (neglecting the FFT), the data transfer between CPU and GPU and other processing stages such as phase wipe-off, peak search, and phase adjustment, which can also be done on the GPU, have to be considered too. Therefore, even an optimistic approximation of the XF architecture restricts the software correlator to 1–5 channels which can be processed in real-time on the GPU.

Table 1 Hard- and software components used for this study

                  CPU                    GPU
Model             Intel Core 2 Q9450     NVIDIA Geforce GTX 280
Cores             4                      240
Processor clock   2,660 MHz              1,296 MHz
Memory            4 GB                   1 GB
Compiler          gcc 4.3                nvcc 2.1
Misc.             1.5 TB (SATA RAID 0)   CUDA, CUFFT
Operating system  Fedora 9 (64 bit)

Fig. 2 Performance measures of a single-channel XF cross-correlation implementation on the GPU using different lag and data sizes. Measurements represent mean values from 100,000 runs and do not include data transfer between CPU and GPU. The ordinate is given in milliseconds

The second strategy, i.e. the FX approach, has the advantage that it does not require a pre-filter stage since the real and imaginary parts of the replica spectrum can be set to zero for frequencies outside the band-pass. Thus the cross-spectrum will automatically be filtered each time signal and replica are multiplied in the frequency domain. Moreover, the spectra for each PRN need to be computed only once and can be re-used each time the cross-spectra are computed. Thus, only the speed of the FFT engines will be the dominating factor for the performance of the FX architecture. Many GPU vendors provide users with sophisticated FFT libraries (Moreland and Angel 2003) which are usually based on the FFTW development (Frigo and Johnson 2005). In order to evaluate the performance of the FX architecture on the GPU, a test scenario was set up which carries out the FFT on the incoming signal, does the multiplication with the replica spectra and obtains the cross-correlation function via inverse FFT. Figure 3 displays the time taken as a function of data length for single and multi-channel runs.

For very short data sizes, i.e. 1,024 FFT points or less, the parallel performance allows the processing of 1, 2, 4, 8 or even 16 channels in the same time. When the data size grows, this feature remains available only for a reduced number of channels, since the parallel FFT algorithms cannot allocate enough shared memory for all data streams. If the data size becomes 16 K (which would be required for 16 Msps recording) or larger, the Fourier transforms are executed serially for each channel. Nevertheless, even eight channels can be processed in real-time with the FX engine for data rates of 16 Msps (16 K FFT points), and there seems to be enough headroom for a sophisticated multi-channel implementation when using data rates of 8 Msps (8,192 FFT points) or less. Therefore, the FX architecture was selected to be embedded within the other processing stages to realize the GPS receiver.

The software receiver

Based on the above conclusion that the FX architecture has the potential to support real-time applications, a GPS software receiver has been designed and implemented with the help of CUDA, which provides a convenient interface for developing and porting programs to the GPU. Figure 4 shows the schematics of the complete multi-channel architecture, including the delay and Doppler tracking loops.

For high sampling rates data can be read from a hard disc, but for moderate sampling rates it is possible to run the receiver in real-time mode, reading the data stream via a ring-buffer. The delay and Doppler tracking loops run at 4 Hz (256 ms), and each update cycle is used to transfer the results, i.e. delays, phases and amplitudes, back to the CPU and to copy new sampled data to GPU memory. The value of 4 Hz emerged as a compromise between the requirements of precise Doppler tracking and the overhead caused by data transfer via the PCI bus.

The A/D sampler used for this study provides quantization levels of 1, 2, 4 and 8 bits and sampling rates up to 128 Msps. One- and two-bit quantization leads to a significant decrease in the precision of the obtained observables. On the other hand, the four- and eight-bit representations do not differ significantly (Van Vleck and Middleton 1966). Therefore, the four-bit representation seems to be the best trade-off between data size and quality of the analog signal representation. Decoding of the bit-stream, which is transmitted via the USB bus, can be done efficiently on the CPU with the help of a look-up table, and the unpacked signed integer values can be filled into the ring-buffer where they wait to be transmitted to the GPU. If data is to be processed off-line, the incoming bit-stream is first recorded to hard disc and decoded directly before it is sent to the GPU.
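The look-up-table decoding can be sketched as follows. The bit layout here is an illustrative assumption, not the documented format of the VLBI sampler: two 4-bit two's-complement samples are taken to be packed per byte, low nibble first.

```python
# Illustrative LUT-based unpacking of a 4-bit sample stream.
# Assumed layout (not from the paper): two 4-bit two's-complement
# samples per byte, low nibble first.

def _sign4(nib):
    """Interpret a 4-bit nibble as a signed value in [-8, 7]."""
    return nib - 16 if nib >= 8 else nib

# Build the table once: byte value -> pair of signed integers.
LUT = [(_sign4(b & 0x0F), _sign4(b >> 4)) for b in range(256)]

def unpack(raw: bytes):
    """Expand a packed byte stream into signed sample values."""
    out = []
    for b in raw:
        lo, hi = LUT[b]
        out.append(lo)
        out.append(hi)
    return out
```

The single table lookup per byte is what makes this step cheap enough to run on the CPU while the GPU does the correlation.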

Parallelization

Besides the FX engine, which appears to be well suited for a parallel implementation, the other steps within the processing chain need to be checked for their scalability. The first stage, i.e. the bit-shifter, can be implemented by changing the read-pointer of the data stream. Since the delays of different PRNs can be pro- or retrograde, it is necessary to provide a data-buffer overhead of a few milliseconds to ensure that multi-channel tracking can be carried out smoothly. The second stage, which realizes the numerically controlled oscillator, can be combined with the bit-shifter just by utilizing adequate data-stream pointers. Moreover, since this stage does not lead to coalescent memory access on the GPU, it can be implemented very efficiently even for parallel channels, leading to only a small overhead. After this stage, which provides 1-millisecond data blocks to the FX correlation engines as described in the prior section, the correlation peak must be searched. Since this step requires coalescent memory access, a dedicated algorithm similar to the parallel reduction scheme (Fernando 2005) has been developed. The same holds for the normalization of the correlation function, which can also only be done via parallel reduction. Once the peak position has been found, the code delays can be computed for each channel. After that, the cross-spectra need to be aligned properly with the updated delay information. This step can also be done easily in parallel. The derivation of the phases again requires a parallel reduction scheme, since the corrected cross-spectra need to be summed up. After this stage the delays, carrier phases and amplitudes are available for each channel.

Fig. 3 Parallel FFT performance for different data sizes using the CUFFT library. Measurements represent mean values from 100,000 runs and do not include data transfer between CPU and GPU
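The parallel reduction pattern used for the summations and the peak search can be mimicked sequentially. The sketch below is illustrative (power-of-two length assumed, names ours): it combines elements in log2(N) passes, each pass corresponding to one synchronized step across GPU threads.

```python
# Illustrative tree reduction: combines N values in log2(N) passes,
# the way a GPU kernel would merge partial results across threads.
def tree_reduce(values, op=lambda a, b: a + b):
    n = len(values)
    assert n & (n - 1) == 0, "power-of-two length assumed"
    buf = list(values)
    stride = n // 2
    while stride > 0:
        # One "kernel launch": each of the first `stride` threads
        # combines its element with the one `stride` positions away.
        for i in range(stride):
            buf[i] = op(buf[i], buf[i + stride])
        stride //= 2
    return buf[0]
```

Passing `op=max` over (value, index) pairs turns the same access pattern into the peak search mentioned in the text.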

Every 256 ms, i.e. at 4 Hz, the Doppler frequency for each PRN is updated using the obtained phases and amplitudes. A-priori delays for each channel are computed in the same cycle, and the corresponding bit-positions within the data stream are passed to the bit-shifter. Also, results (delays, phases, amplitudes and Doppler frequencies) are transferred back to the CPU and the next data block is copied to GPU memory, considering the overlap which is required for continuous bit-shifting. The results on the CPU are stored in NetCDF files (Rew and Davis 1990), and can be accessed by an independent thread for extraction of the navigation message as well as for real-time post-processing applications. During the development of the software receiver, CUDA's profiler helped in finding bottlenecks and drawing conclusions about which stages of the receiver can be improved. Figure 5 shows the profiling results from a test with 10 parallel channels using 16 Msps data, which is read from hard disc in offline mode.

The Fourier transformations take nearly half of the computation time, although the FFT libraries are already optimized. The second largest contributor to the total computation time is the peak-search algorithm, which does not scale very well on the GPU and has to be implemented similarly to a parallel reduction scheme. The same holds for all occurrences of summations, which also take nearly one eighth of the total time on the GPU. The remaining contributors are already optimized for a parallel scheme, but add to the total budget due to the use of multi-cycle math operations (e.g. phase adjustment) on the GPU. The overhead for data transfer between GPU and CPU varies between 5 and 10% depending on the sampling rate and is not considered in Fig. 5.

Real-time requirements

In order to have an objective criterion for whether the software radio will run in real-time or not, we introduce the processing factor κ, which relates the processing time (including data transfer) to the time-span of the data taken. Values of κ smaller than one indicate that real-time processing is possible, whereas values larger than one reflect configurations which can only be handled in off-line mode. The real-time capacity of the receiver described in the prior section is mainly limited by the sampling rate of the incoming data stream. Thus, in order to find out how many channels can be processed in real-time on the GPU, tests with varying sampling rates have been carried out. Figure 6 depicts how κ depends on the number of channels which are processed in parallel for sampling rates of 4, 8, 16 and 32 Msps.

Fig. 4 Schematics of the multi-channel real-time software receiver running on the GPU. The numerically controlled oscillator (NCO) is realized with single precision floating point trigonometric functions and is updated via the Doppler-tracking loop, ensuring a continuous tracking of the carrier phase. Delay tracking is performed via proper variation of the read-pointer which feeds the FX engine with data

Fig. 5 Relative time in percent for the main parts of the software receiver on the GPU. Values are obtained from a 1 min run with ten channels in parallel using 16 Msps data
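The criterion can be written down directly. A trivial sketch follows; the example numbers are taken from the offline run described later in "Results", where 12 h of data were processed in 10 h.

```python
# Processing factor kappa: processing time (including data transfer)
# divided by the time span of the processed data. kappa < 1 means the
# configuration is real-time capable.
def processing_factor(processing_s: float, data_span_s: float) -> float:
    return processing_s / data_span_s

def realtime_capable(processing_s: float, data_span_s: float) -> bool:
    return processing_factor(processing_s, data_span_s) < 1.0
```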

Based on these results we can conclude that real-time processing for a realistic number of visible satellites is possible only for 4 and 8 Msps. Sixteen Msps would allow the processing of up to five satellites in online mode, which is basically enough to obtain unambiguous positioning estimates, but would require a selection of satellites to be tracked. Sampling rates larger than 16 Msps, or more parallel channels, would cause the receiver to lag behind the rate at which the ring-buffers are filled when operated in online mode. Therefore, the question remains how the sampling rate (and the resulting effective bandwidth) impacts the precision of the observables. In order to answer this question, we start with the 32 Msps data-set, which allows us to utilize a bandwidth of 16 MHz, i.e. ±8 MHz around the center frequency. In the following discussion, we will relate the precision of the observables to the sampling rate (SR) in Msps, assuming that the total available bandwidth, i.e. half the sampling rate, is utilized. By narrowing the bandwidth using the filter of the replica spectrum inside the FX engine we can simulate other sampling rates based on the same data-set. Therefore, the root mean square (RMS) values for de-trended delay and phase results over short time-spans (30 s) can be computed. Based on these measures it is possible to deduce a simple relationship between sampling rate and the precision of the observables. Since the RMS measures vary between the satellites due to different elevation angles, we introduce relative measures of scatter, related to the results obtained from the 32 Msps data-set. Thus, we introduce the factors

a_τ(SR) = RMS_τ(SR) / RMS_τ(32 Msps)  and  a_φ(SR) = RMS_φ(SR) / RMS_φ(32 Msps)

which reveal how much the RMS of delays (τ) and phases (φ) grows or shrinks when the sampling rate is changed. Figure 7 depicts the results from such a test.
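The scatter factors above can be computed as sketched below. The helper names are illustrative, and a least-squares straight-line removal stands in for the de-trending, whose exact form the text does not specify.

```python
# Illustrative computation of the relative scatter factors a(SR):
# RMS of de-trended observables at one sampling rate, normalized by
# the 32 Msps reference series. De-trending here removes a
# least-squares line a + b*t (an assumption, not the paper's recipe).
import math

def rms(series):
    return math.sqrt(sum(v * v for v in series) / len(series))

def detrend(series):
    """Remove a least-squares straight line from the series."""
    n = len(series)
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    denom = sum((t - t_mean) ** 2 for t in range(n))
    b = sum((t - t_mean) * (y - y_mean)
            for t, y in enumerate(series)) / denom
    a = y_mean - b * t_mean
    return [y - (a + b * t) for t, y in enumerate(series)]

def alpha(series_sr, series_ref):
    """a(SR) = RMS(SR) / RMS(32 Msps reference)."""
    return rms(detrend(series_sr)) / rms(detrend(series_ref))
```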

It can be seen that delay precision primarily depends on the sampling rate and roughly follows a 1/√SR rule, i.e. the RMS doubles if the sampling rate is reduced by a factor of 4. On the other hand, carrier phase estimates are less affected by narrowing the bandwidth, which can be explained by the fact that cross-spectral phases show larger scattering when going from the band-center towards the Nyquist frequency. Sampling rates below 4 Msps will lead to a noticeable increase of phase scatter since the corresponding bandwidth drops below the first null of the C/A code located at about ±1 MHz. Therefore, considering that carrier phases are utilized as the main observable for precise positioning applications, it becomes feasible to run the software receiver on the GPU in real-time using sampling rates of 4 or 8 Msps, favoring the latter in case of lower elevation cut-off angles. Higher cut-off angles would reduce the number of visible satellites and even enable processing of 16 Msps, for which a processing factor of less than one can be realized (Fig. 5). As discussed in the prior section, the FFT performance drops significantly for larger data blocks. Using 8 instead of 4 Msps leads to an increase of the processing factor by roughly 30%, but using 16 instead of 8 Msps already doubles the computation time and makes real-time processing difficult for a larger number of channels. Besides the performance restrictions caused by hardware specifications, the knowledge of a-priori values of delays and Doppler shifts, together with information about satellites above the local horizon, are other crucial points for real-time applications. Thus, without external information (i.e. orbit information) the software receiver requires some time to search for available satellites and extract the necessary almanacs by decoding the information which is modulated onto the carrier phases. In order to avoid such a time-consuming "cold start", the software has been designed to handle external almanac information, to allow for an immediate start of tracking using the maximum number of visible PRNs.

Fig. 6 Processing factor κ as a function of the number of parallel channels and different sampling rates. The solid line represents the limit under which real-time processing is possible

Fig. 7 RMS of de-trended delays (upper plot) and phases (lower plot) for different sampling rates using 30 s of output. Results are based on a 32 Msps data-set by changing the width of the band-pass filter according to the aimed sampling rate

Navigation message decoding

The navigation message bits are transmitted by modulation of the carrier phase and lead to jumps of the obtained phase values by ±180° every 20 ms if a bit differs from the prior one. Thus, given that the signal of the concerned PRN is strong enough, it is possible to extract the navigation message by converting these jumps into a binary data stream. Since the obtained phases are stored in NetCDF files, an independent thread running on the CPU can handle the decoding. Once the preamble (IS-GPS-200D 2006) is detected, sub-frames 1–3 are decoded and the output is written to a text file, following the RINEX conventions for navigation messages (Gurtner 1994). Moreover, sub-frames 4 and 5, which hold information about the complete constellation, are extracted and can be used to update/replace the almanac information which is used to run the software receiver. Since all these steps are carried out on the CPU, which has enough free computing capacity, the GPU performance is not affected by the extraction of the navigation message. Comparisons with the IGS broadcast ephemeris (Dow et al. 2005) have revealed that the navigation message is decoded correctly even for signals from satellites at low elevations (i.e. low signal-to-noise ratios).
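The conversion from phase jumps to navigation bits can be sketched as follows. This is illustrative and not the receiver's actual code: the 90° decision threshold and the function names are our assumptions, while the 8-bit telemetry preamble 10001011 is the one defined in IS-GPS-200D.

```python
# Illustrative decoding sketch: turn ~180-degree carrier-phase jumps at
# 20 ms bit boundaries into a navigation bit stream, then look for the
# TLM preamble. The 90-degree threshold is an assumed decision rule.
PREAMBLE = [1, 0, 0, 0, 1, 0, 1, 1]  # 0x8B, per IS-GPS-200D

def phases_to_bits(phases_deg, first_bit=0):
    """One phase value per 20 ms bit interval; a jump close to 180 deg
    (mod 360) toggles the current bit value (differential decoding)."""
    bits, current = [first_bit], first_bit
    for prev, cur in zip(phases_deg, phases_deg[1:]):
        # minimal angular difference, mapped into [-180, 180)
        jump = abs((cur - prev + 180.0) % 360.0 - 180.0)
        if jump > 90.0:  # closer to a 180-degree flip than to no flip
            current ^= 1
        bits.append(current)
    return bits

def find_preamble(bits):
    """Candidate sub-frame start indices (normal or inverted polarity,
    since the absolute bit sense is unknown after differential decoding)."""
    inv = [1 - b for b in PREAMBLE]
    return [i for i in range(len(bits) - 7)
            if bits[i:i + 8] in (PREAMBLE, inv)]
```

A real decoder would additionally confirm a candidate by checking the parity of the following words, which this sketch omits.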

Post-processing and geodetic analysis

Similar to the extraction of the navigation message, it is possible to post-process the obtained observables on the CPU without interfering with the GPU computations. The raw delays, which are available at a sampling rate of 1,000 Hz, need to be averaged in order to provide useful input for analysis programs. Therefore, the delays are down-sampled to 1 Hz and output to a RINEX file. Carrier phases are treated in a similar way, after applying the LO offset and connecting/unwrapping. Additionally, correlation amplitudes are converted to C/N0 values in dB-Hz under consideration of the utilized bandwidth. The generated RINEX file can be compared with the results obtained from the hardware receiver, which tracks the same signals (Fig. 1). Since cable lengths and internal delays differ between the systems, one has to remove a constant offset between the hardware and software results before computing statistics.
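The two post-processing steps above can be sketched in a few lines. The exact averaging and C/N0 conventions used by the authors are not given, so the following uses straightforward block averaging and the standard approximation C/N0 = 10·log10(SNR·B); both function names are ours:

```python
import numpy as np

def downsample_to_1hz(delays_1khz):
    """Average blocks of 1,000 raw delays (1 kHz output) down to
    one value per second for RINEX output."""
    n_sec = len(delays_1khz) // 1000
    return delays_1khz[:n_sec * 1000].reshape(n_sec, 1000).mean(axis=1)

def cn0_dbhz(snr_linear, bandwidth_hz):
    """Convert a linear post-correlation SNR to C/N0 in dB-Hz,
    taking the utilized bandwidth into account."""
    return 10.0 * np.log10(snr_linear * bandwidth_hz)
```

For example, a unit SNR in a 1,000 Hz bandwidth corresponds to 30 dB-Hz.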

Results

A continuous tracking test was carried out on 25 March 2009, between 6 and 18 h UT, with 8 Msps, using one of the antennas located at Koganei, Japan, close to IGS station KGN1. Although this sampling rate would allow real-time tracking of all visible satellites using the ring-buffer implementation, the data was recorded to hard-disk and processed offline, taking 10 h, which equals a processing factor of 0.83. This demonstrates that real-time processing is feasible and can be performed on the GPU. Figure 8 depicts the delays and carrier phases obtained and displays the corresponding elevation angles for each visible PRN, using a cut-off angle of 20°. Figure 9 displays the RMS values of delays and phases when averaging the software receiver output to 1 Hz RINEX data. As expected, the precision of the observables increases at higher elevation angles, yielding formal errors of 3 m (delays) and 2 mm (carrier phases) in the zenith direction.

Comparison with output from a hardware receiver

The signals received from the GPS antenna are not only fed to the software receiver, but are also directed to a Javad™ hardware receiver which outputs RINEX observation files at a 0.1 Hz sampling rate. Thus, results from the software receiver can be verified against this output, after considering delay/timing offsets caused by different cable lengths and system-specific stability characteristics. The 1,000 Hz raw-data output from the software receiver needs to be interpolated to meet identical epochs for the comparison. Additionally, it has to be taken into consideration that the software radio provides output in the UTC time-frame, since it is directly clocked from UTC(NICT), whereas the hardware receiver outputs results in GPS time. After consideration of all these effects, the delays obtained from the software radio using the C/A code measurements can be verified (see Fig. 10), revealing a standard deviation of ±3.57 m, which is well within the formal error (Fig. 9) of this type of observable.
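The comparison steps just described (time-frame alignment, interpolation to common epochs, removal of the constant hardware offset) can be sketched as follows. Linear interpolation stands in for whatever scheme the authors actually used, and the function name is ours; the default GPS−UTC offset of 15 s is the value that applied in March 2009:

```python
import numpy as np

def delay_difference_std(t_sw_utc, delay_sw, t_hw_gps, delay_hw,
                         gps_minus_utc=15.0):
    """Compare software-receiver delays (dense series, UTC epochs) with
    hardware-receiver delays (0.1 Hz, GPS-time epochs)."""
    # shift software epochs from UTC(NICT) into the GPS time-frame
    t_sw_gps = t_sw_utc + gps_minus_utc
    # interpolate the dense software series onto the hardware epochs
    sw_on_hw = np.interp(t_hw_gps, t_sw_gps, delay_sw)
    diff = sw_on_hw - delay_hw
    # remove the constant offset caused by cable lengths/internal delays
    diff -= diff.mean()
    return diff.std()
```

The returned standard deviation corresponds to the ±3.57 m quoted above for the C/A code delays.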

For verification of the obtained L1 phases, some pre-processing is necessary in order to remove periods when either of the two receivers loses lock or when cycle slips occur. The carrier phase differences obtained are depicted in Fig. 11, verifying that the output from the software radio agrees to well below one cycle (i.e. approximately 19 cm at L1), yielding a standard deviation of about 6 mm. Looking at the time-dependent characteristics of the differences shows that a small random-walk-like drift can be seen for all PRNs. This feature explains why the histogram is skewed and not perfectly centered at zero. Nevertheless, given that receiver clocks are modeled as a random-walk process within the geodetic analysis software, these differences are likely to be absorbed by the estimated clock parameters.

Fig. 8 Obtained delays (upper plot) and phases (middle plot) as well as the corresponding elevation angles for a 12 h test starting on 25 March 2009, 6:00 UT, using a cut-off angle of 20°

Fig. 9 RMS of the delays (left) and carrier phases (right) with respect to elevation angle for the obtained observables displayed in Fig. 8

Conclusions

It has been demonstrated that a GPS/GNSS real-time software radio can be implemented on a GPU, yielding results similar to those obtained from a hardware receiver. Real-time mode is possible for moderate sampling rates (i.e. up to 16 Msps), whereas higher sampling rates should be recorded to hard-disk and processed offline. Instead of the classical early/prompt/late correlation engines, which are usually implemented by software radios on CPUs, a wider and finer correlation function is considered. The FX strategy seems to be more suitable than its counterpart, the XF architecture, since it takes advantage of the fast parallel FFT implementation available for the GPU. Thus, the performance of the FFT engines strongly determines the real-time capability of the receiver and restricts the number of channels when data-rates exceed 16 Msps.

The GPU seems to be a suitable candidate for the realization of a GNSS software radio because it offers huge parallel processing power and outperforms the CPU in terms of cost/performance. In order to realize a real-time dual-frequency GNSS receiver, one could equip a PC with two GPU cards, which would allow separate processing of the L1 and L2C signals. Although the signal structure of the latter requires larger FFTs, the transforms are called less often due to the increased PRN-code length. Unlike in other GPU applications, the available bandwidth of the PCI bus does not appear to be an additional bottleneck, even if data-rates of up to 64 Msps were sent between the CPU and the GPU.

Like any other software receiver, the implementation realized here is very flexible and can be adapted to new signals or applications without major modifications. New algorithms can be tested or innovative applications created within very short development times. For example, the GPU software receiver has been successfully applied to time- and frequency-transfer experiments, which operate with PRN-code signals similar to those of GNSS. Other applications, like multi-path mitigation and ionosphere monitoring, are under development. Additionally, the software radio can easily be modified into an open-loop receiver, which is perfectly supported by the FX architecture and its wide correlation function.

Looking at the impressive development of GPU computing power in recent years, and assuming that Moore's law (Moore 1965) might hold for the next 2 or 3 years, gives rise to the hope that even larger data-rates and multi-frequency GNSS systems can be tracked by a single off-the-shelf GPU. Improvements in development software, compilers, as well as FFT libraries will help to realize a GNSS software radio which fulfills all the requirements of real-time applications.

Acknowledgements We thank the anonymous reviewers and the editor in charge (Prof. Leick) for their valuable comments, which helped to improve the paper. The authors acknowledge the International GNSS Service as well as the US Coast Guard for providing orbital information.

References

Chakravarthy V, Tsui J, Lin D, Schamus J (2001) Software GPS receiver. GPS Solut 5(2):63–70

Deng J, Chen R, Wang J (2009) An enhanced bit-wise parallel algorithm for real-time GPS software receiver. GPS Solut. doi:10.1007/s10291-009-0125-4

Dow JM, Neilan RE, Gendt G (2005) The international GPS service (IGS): celebrating the 10th anniversary and looking to the next decade. Adv Space Res 36(3):320–326. doi:10.1016/j.asr.2005.05.125

Fig. 10 Differences between the delays obtained from the GPU receiver and the Javad™ hardware receiver, which shares the antenna with the software radio (delay differences [m] vs. relative frequency)

Fig. 11 Differences between the L1 carrier phases obtained from the GPU receiver and the Javad™ hardware receiver (phase differences [m] vs. relative frequency). Epochs when either of the receivers lost lock or when cycle slips occurred were excluded from the comparison

Fernando R (2005) GPU gems: programming techniques, tips, and tricks for real-time graphics. Pearson Education

Frigo M, Johnson SG (2005) The design and implementation of FFTW3. Proc IEEE 93(2):216–231

Gurtner W (1994) RINEX—the receiver-independent exchange format. GPS World 5(7):48–52

Harris C, Haines K, Staveley-Smith L (2008) GPU accelerated radio astronomy signal convolution. Exp Astron 22(1–2):129–141. doi:10.1007/s10686-008-9114-9

IS-GPS-200-D (2006) Interface specification IS-GPS-200, Revision D, Interface Revision Notice (IRN)-200D-001, 7 March 2006, Navstar GPS Space Segment/Navigation User Interfaces

Kiuchi H, Amagai J, Hama S, Imae M (1997) K-4 VLBI data-acquisition system. Publ Astron Soc Jpn 49:699–708

Kondo T, Koyama Y, Takeuchi H, Kimura M (2006) Development of a new VLBI sampler unit (K5/VSSP32) equipped with a USB 2.0 interface. In: Behrend D, Baver K (eds) International VLBI Service for Geodesy and Astrometry 2006 general meeting proceedings. NASA/CP-2006-214140, pp 195–199

Moore GE (1965) Cramming more components onto integrated circuits. Electronics 38:114–117

Moreland K, Angel E (2003) The FFT on a GPU. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on graphics hardware, pp 112–119

Mumford PJ, Parkinson K, Dempster AG (2006) The Namuru open GNSS research receiver. In: Proceedings of the 19th international technical meeting of the Satellite Division of the US Institute of Navigation, Fort Worth, Texas, 26–29 September 2006, pp 2847–2855

Nguyen H (2007) GPU gems 3: programming techniques for high-performance graphics and general-purpose computation. Addison-Wesley Professional

Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC (2008) GPU computing. Proc IEEE 96(5):879–899. doi:10.1109/JPROC.2008.917757

Rew RK, Davis GP (1990) NetCDF: an interface for scientific data access. IEEE Comput Graphics Appl 10(4):76–82

Tsui JB-Y (2000) Fundamentals of global positioning system receivers: a software approach. Wiley, New York

Van Vleck JH, Middleton D (1966) The spectrum of clipped noise. Proc IEEE 54:2–19

Author Biographies

Thomas Hobiger received his M.Sc. and Ph.D. degrees in geodesy and geophysics from the Vienna University of Technology, Austria, in 2002 and 2005, respectively. From October 2006 until September 2008 he worked at the Kashima Space Research Center, National Institute of Information and Communications Technology (NICT), Japan, as a JSPS fellow. Since October 2008 he has been with NICT, working as an expert researcher. His research interests include troposphere and ionosphere modeling, GNSS, Very Long Baseline Interferometry (VLBI), adjustment theory and high-performance computing.

Tadahiro Gotoh received his degree from the University of Electro-Communications, Tokyo, Japan, in 1988. In 1985, he joined the Radio Research Laboratory (now the National Institute of Information and Communications Technology), Koganei, Japan.

Jun Amagai received his degree in natural sciences from Tsukuba University, Ibaraki, Japan, in 1981. He joined the Radio Research Laboratory (currently the National Institute of Information and Communications Technology), Tokyo, Japan, in 1981.

Yasuhiro Koyama received the Ph.D. degree in astronomy from the Graduate University for Advanced Studies, Japan, in 2003. He has been involved in the research and development of VLBI since he joined the research group of the Radio Research Laboratory (now NICT) in 1988.

Tetsuro Kondo received the Ph.D. degree in geophysics from Tohoku University, Sendai, Japan, in 1982. He joined the staff of the Kashima Space Research Center, Communications Research Laboratory (CRL) in 1981 and worked on the development of VLBI technology and the analysis of space geodetic techniques. Since April 2008 he has been a research professor at Ajou University, Korea.
