ORIGINAL ARTICLE
A GPU based real-time GPS software receiver
Thomas Hobiger · Tadahiro Gotoh · Jun Amagai · Yasuhiro Koyama · Tetsuro Kondo
Received: 7 May 2009 / Accepted: 7 July 2009 / Published online: 8 August 2009
© Springer-Verlag 2009
Abstract Off-the-shelf graphics processing units (GPUs) provide low-cost massively parallel computing performance, which can be utilized for the implementation of a GPS software receiver. In order to realize a real-time capable system, the crucial stages of the receiver should be optimized to suit the requirements of a parallel processor. Moreover, the receiver should be capable of providing wider correlation functions and easy access to the spectral domain of the signals. Thus, the most suitable correlation algorithm, which forms the core of each receiver, should be chosen and implemented on the graphics processor. Since the sampling rate of the received signal limits the real-time capabilities of the software radio, it is necessary to determine an optimum value, considering that the precision of the observables varies with sampling bandwidth. We discuss details and present our single-frequency multi-channel implementation, which is capable of operating in real-time mode. Our implementation differs from other solutions by the wideness of the correlation function and allows simple handling of data in the spectral domain. Comparison with output from a commercial hardware receiver, which shares the antenna with the software radio, confirms the consistency and accuracy of our development.
Keywords GPU · Software receiver · Real-time · FFT
Introduction
Driven by the increase of CPU performance GPS/GNSS
software receivers have become more popular since they
offer a flexible and extendible platform for developing and
testing new applications (Chakravarthy et al. 2001). A software radio can not only mimic the functionality of its hardware counterpart, but also allows the user to carry out the signal processing chain with unprecedented floating point precision. Since application-specific integrated circuits (ASICs) for GPS tracking can neither be easily adapted to new signals nor have their hard-wired logic replaced with new algorithms, receivers based on field-programmable gate arrays (FPGAs) have been developed in recent years, e.g. Mumford et al. (2006). Such a solution provides a good tradeoff between the flexibility of software radios and the speed of ASICs, but is still an expensive niche product for dedicated applications. Similar to the progress with real-time software receivers running on the CPU (Deng et al. 2009), graphics processing units (GPUs) are expected to be another implementation platform which allows the GPS radio to be realized on a standard PC.
General purpose graphics processing units
Driven by the historical separation of graphics and general-purpose processing and by the requirements of the PC gaming industry, GPUs have evolved into massively parallel processing systems which have entered the area of non-graphics applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms classical processors when the application can be parallelized. Thus, GPUs have started to
T. Hobiger (corresponding author) · T. Gotoh · J. Amagai · Y. Koyama · T. Kondo
Space-time Standards Group, National Institute of Information and Communications Technology, 4-2-1 Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan
e-mail: [email protected]
URL: http://www.nict.go.jp

GPS Solut (2010) 14:207–216
DOI 10.1007/s10291-009-0135-2
attract researchers from a variety of fields, and more and
more applications are emerging which are not related to the
purpose they have been originally designed for (Nguyen
2007; Owens et al. 2008). Moreover, Harris et al. (2008)
demonstrate that a GPU can be successfully applied to
radio astronomical signal processing, solving tasks which
are similar to those of GNSS. Other than CPUs which
directly access the PC’s memory, it is necessary to transfer
the relevant data from the CPU memory to the onboard-
memory of the graphic card, before it can be accessed by a
program running on the GPU. The same holds for the CPU,
which cannot directly access a memory area on the GPU,
but needs to copy the data back to the RAM. Thus, data-
transfer between CPU and GPU can be a significant bot-
tleneck for an application and it should be checked in
advance if the gain of computation performance on the
GPU is not consumed by the overhead caused by data-
transfer. Although, recent motherboards are equipped with
PCI busses which allow data to be transferred at several
Gb/s it could be still an important factor to consider before
starting implementing an application on the GPU.
System overview
Software receivers require that the RF signals are down-
converted and digitized by hardware components before
they can be processed on the PC. Moreover, since the system developed in this study is not only dedicated to GNSS but also usable for time-transfer applications using PRN-code-like signals, flexible and robust hardware parts
have been deployed. Several components which have been
originally developed for Very Long Baseline Interferome-
try (VLBI) are used in addition to other off-the-shelf
components.
RF down-conversion and digitization
Figure 1 displays the hardware components which are
utilized for the down- and analog/digital (A/D) conversion.
The RF signals, which are also processed by a commercial hardware receiver (JAVAD™), are received from a standard geodetic GPS antenna (Ashtech choke-ring
antenna). Thereafter L1 and L2 signals are down-con-
verted to two intermediate frequencies using a phase
locked oscillator operating at 1,380 MHz. These inter-
mediate frequencies are fed to a video converter (Kiuchi
et al. 1997) where they are mixed with the second local
oscillator running at 193.42 MHz. After this stage the
signals are digitized via a sampler, which has been
developed for VLBI (Kondo et al. 2006). Since the digital
signals are output via a USB 2.0 interface, they can be directly handled by a PC. Although displayed in Fig. 1, processing of L2 signals is currently turned off and will be implemented in the near future as discussed later. Thus, in the following only the usage of the L1 C/A code is considered. A dual-frequency receiver can be realized from the following description by adding a second GPU which is dedicated to the processing of L2C code signals.
CPU, GPU and programming utilities
In order to demonstrate that a GPS software receiver can be
implemented on the GPU a test PC has been set up, using
the hard- and software components listed in Table 1.
Basically, only off-the-shelf components have been
utilized. The GPU code has been compiled with the help of NVIDIA's CUDA environment (http://developer.download.nvidia.com/), whereas the host code running on the CPU is compiled by a GNU C compiler. Although the GPU used supports double precision floating point numbers, this option was not used because it slightly reduces computational throughput. Tests with single and double precision numbers have revealed identical results, corroborating our approach of using single precision numbers only. NVIDIA's FFT library, named CUFFT, is utilized for the Fourier transforms, and other functions of the CUDA toolkit turned out to be very useful for debugging the code. Since NVIDIA provides a profiler for programs running on the
Fig. 1 Schematics of the hardware components for down-conversion and digitization of GPS signals (power divider PD, phase locked oscillator PLO, frequency distributor Dist., low noise amplifier LNA).
GPU, this tool has also been applied in order to detect bottlenecks within the code.
GPS software receiver implementation on a GPU
Conventional hardware receivers as well as several software radios implement the "classical" early/prompt/late scheme for the code tracking loop (Tsui 2000). We have
decided to follow another strategy, computing a wider and
finer correlation function, which has several advantages
concerning multi-path mitigation and tracking of weak
signals. Moreover, as discussed below, the computation of
the correlation function in time-domain is not a straight-
forward algorithm for a parallel-processor and requires a
special strategy when computing the correlation amplitude
via summation over all chips.
Basically, two different strategies exist for the realization
of a multi-lag approach, i.e. the computation of a wider or
even the complete correlation function. The first approach,
named XF performs the cross-correlation in the time-
domain and obtains the cross-spectrum via Fourier trans-
form. The second approach, called FX, transforms the
received and the replica signal into the frequency domain
first, and then obtains the cross-spectrum via multiplication.
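Both flows can be sketched in a few lines (an illustrative sketch, not the authors' CUDA code): a naive O(n²) DFT stands in for the GPU's parallel FFT, and the 8-chip code and its delay are invented toy values. The FX result matches the lag-by-lag XF result, with the correlation peak at the simulated code delay.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT; the GPU implementation would use a parallel FFT."""
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(s * 2j * cmath.pi * j * k / n) for k in range(n))
           for j in range(n)]
    return [v / n for v in out] if inverse else out

def xcorr_xf(sig, rep):
    """XF: circular cross-correlation computed lag by lag in the time domain."""
    n = len(sig)
    return [sum(sig[(i + lag) % n] * rep[i] for i in range(n)) for lag in range(n)]

def xcorr_fx(sig, rep):
    """FX: transform both, multiply signal spectrum with conjugate replica."""
    S, R = dft(sig), dft(rep)
    cross = [a * b.conjugate() for a, b in zip(S, R)]
    return [v.real for v in dft(cross, inverse=True)]

code = [1, -1, 1, 1, -1, -1, 1, -1]        # toy 8-chip PRN code (invented)
sig = code[-3:] + code[:-3]                # replica delayed by 3 samples
xf, fx = xcorr_xf(sig, code), xcorr_fx(sig, code)
assert all(abs(a - b) < 1e-9 for a, b in zip(xf, fx))   # both strategies agree
print(max(range(len(fx)), key=lambda i: fx[i]))         # prints 3, the delay
```

On the GPU the `dft` calls would be replaced by batched CUFFT transforms, and the replica spectrum would be computed once per PRN and re-used for every data block, as the text notes.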
Correlation engine: XF- versus FX-type
Both approaches lead to the same results but differ in their performance when being carried out on the GPU. At first glance the XF strategy seems to have the
advantage in that the correlation function can be computed
within a narrow search space around the expected peak
position using a limited number of lags, whereas the FX
approach will automatically compute the complete corre-
lation function. A disadvantage of the XF strategy arises
from the fact that an efficient computation of correlation
functions requires that coalescent memory access is avail-
able when the cross-product is summed up. Since this is not
provided on the GPU a work-around called parallel
reduction is necessary for the computation (Fernando
2005). Although fast shared memory on the GPU can be
utilized, it is limited in its size which is accessible from
across different threads. Even more difficulties arise when
several channels have to be correlated in parallel. Thus, in
order to evaluate the performance of the XF architecture on
the GPU a single channel test running only the X part, i.e.
the cross-correlation, for different lag and data-sizes has
been carried out (Fig. 2).
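The parallel-reduction work-around mentioned above can be illustrated with a serial sketch (illustration only; Fernando 2005 describes the actual GPU kernels): the summation is folded pairwise, so each folding step corresponds to one batch of independent threads.

```python
# Tree-style reduction: each while-iteration models one parallel step on the
# GPU, where all additions inside the step are independent of each other.
def reduce_sum(values):
    data = list(values)
    n = len(data)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n - half):          # these additions have no mutual
            data[i] += data[i + half]      # dependency, i.e. one thread each
        n = half
    return data[0]

samples = [float(i) for i in range(1000)]
assert reduce_sum(samples) == sum(samples)
```

The peak search and the normalization discussed later follow the same pattern, with a maximum (and its index) in place of the addition.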
The time for the F-part, i.e. the Fourier transformation, is negligible compared to the computation time of the correlation function and can be ignored. Nevertheless, it can
already be predicted that the XF strategy does not have the
potential to realize a multi-channel real-time GNSS recei-
ver. In order to obtain precise geodetic observables, signals
will be recorded with sampling rates equal to or larger than 4 Msps (see "Real-time requirements"). Thus the equivalent data size of the C/A code (1 ms) will be at least 4,000 points. One second of data requires 1,000 calls of the XF engine, which would take about 0.33 s if a 16 lag correlation function is computed for a single channel. The 32 and 64 lag XF implementations, which take about 0.66 and 1.31 s for the same sampling rate, make clear that it is not feasible to utilize this architecture for a multi-channel
receiver on the GPU. Moreover, another caveat of the XF
approach is caused by the pre-filter process of the recorded
signals. Although FIR filters do not require coalesced memory access, they are expected to contribute at least another 20% to the overall computation time of the XF
engine per channel. Since the results displayed in Fig. 2
show only the pure time taken by the XF engine
(neglecting the FFT) the data-transfer between CPU and
GPU and other processing stages such as phase wipe-off,
peak search, and phase adjustment which can also be done
on the GPU, have to be considered too. Therefore, even an
optimistic approximation of the XF architecture restricts
Table 1 Hard- and software components used for this study

                   CPU                    GPU
Model              Intel Core 2 Q9450     NVIDIA GeForce GTX 280
Cores              4                      240
Processor clock    2,660 MHz              1,296 MHz
Memory             4 GB                   1 GB
Compiler           gcc 4.3                nvcc 2.1
Misc.              1.5 TB (SATA RAID 0)   CUDA, CUFFT
Operating system   Fedora 9 (64 bit)
Fig. 2 Performance measures of a single channel XF cross-correlation implementation on the GPU using different lag sizes (16, 32 and 64 lags) and data sizes. Measurements represent mean values from 100,000 runs and do not include data-transfer between CPU and GPU. The units of the ordinate are in milliseconds.
the software correlator to 1–5 channels which can be pro-
cessed in real-time on the GPU.
The second strategy, i.e. the FX approach, has the
advantage that it does not require a pre-filter stage since
real and imaginary part of the replica spectrum can be set
to zero for frequencies outside the band-pass. Thus the
cross-spectrum will automatically be filtered, each time
when signal and replica are multiplied in the frequency
domain. Moreover, the spectra for each PRN need to be
computed only once and can be re-used each time the
cross-spectra are computed. Thus, only the speed of the
FFT engines will be the dominating factor for the perfor-
mance of the FX architecture. Many GPU vendors provide
users with sophisticated FFT libraries (Moreland and Angel
2003) which are usually based on the FFTW development
(Frigo and Johnson 2005). In order to evaluate the per-
formance of the FX architecture on the GPU a test scenario
was set up, which carries out the FFT on the incoming
signal, does the multiplication with the replica spectra and
obtains the cross-correlation function via inverse FFT.
Figure 3 displays the time taken as a function of data
lengths for single and multi-channel runs.
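The band-pass-for-free property described above, i.e. zeroing replica-spectrum bins outside the pass-band so that every signal-times-replica product is automatically band-limited, can be sketched as follows. The bin layout (DC first, negative frequencies in the upper half of the array) and all numbers are illustrative assumptions, not the receiver's actual configuration.

```python
# Zero all replica-spectrum bins whose |frequency| exceeds cutoff_hz; this is
# done once per PRN, so each subsequent cross-spectrum is filtered for free.
def bandlimit(replica_spectrum, sample_rate, cutoff_hz):
    n = len(replica_spectrum)
    out = []
    for k, v in enumerate(replica_spectrum):
        # standard DFT bin ordering: 0..n/2 positive, upper half negative
        freq = k * sample_rate / n if k <= n // 2 else (k - n) * sample_rate / n
        out.append(v if abs(freq) <= cutoff_hz else 0.0)
    return out

spec = [1.0] * 8                               # flat toy spectrum, 8 bins
narrow = bandlimit(spec, sample_rate=8e6, cutoff_hz=2e6)
print(narrow)   # bins beyond +/-2 MHz are zeroed
```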
For very short data-sizes, i.e. 1,024 FFT points or less,
the parallel performance allows the processing of 1, 2, 4, 8
or even 16 channels in the same time. When the data-size
grows, this feature becomes only available for a reduced
number of channels, since the parallel FFT algorithms
cannot allocate enough shared memory for all data-streams.
If the data-size becomes 16 K (which would be required for 16 Msps recording) or larger, the Fourier transforms are executed serially for each channel. Nevertheless, even eight channels can be processed in real-time with the FX engine for data-rates of 16 Msps (16 K FFT points), and there seems to be enough headroom for a sophisticated multi-channel implementation when using data-rates of 8 Msps (8,192 FFT points) or less. Therefore,
the FX architecture was selected to be embedded within the
other processing stages to realize the GPS receiver.
The software receiver
Based on the above conclusion that the FX architecture has
the potential to support real-time applications, a GPS
software receiver has been designed and implemented with the help of CUDA, which provides a convenient interface for developing and porting programs to the GPU. Figure 4
shows the schematics of the complete multi-channel
architecture, including the delay and Doppler tracking
loop.
For high-sampling rates data can be read from a hard-
disc, but for moderate sampling rates it is possible to run
the receiver in real-time mode, reading the data-stream via
a ring-buffer. The delay- and Doppler tracking loops are
running at 4 Hz (256 ms), and each update cycle is used to
transfer the results, i.e. delays, phases and amplitudes, back
to the CPU and to copy new sampled data to GPU memory.
The value of 4 Hz appeared as a compromise between the
requirements of precise Doppler-tracking and the overhead
caused by data-transfer via the PCI bus.
The A/D sampler used for this study provides quanti-
zation levels of 1, 2, 4 and 8 bits and sampling rates up to
128 Msps. One and two bit quantizations lead to a signif-
icant decrease of precision of the obtained observables. On
the other hand, the four and eight bit representations do not
differ significantly (Van Vleck and Middleton 1966).
Therefore, four bit representation seems to be the best
trade-off between data-size and quality of the analog signal
representation. Decoding of the bit-stream, which is
transmitted via the USB bus, can be done efficiently on the
CPU with the help of a look-up table, and the unpacked
signed integer values can be filled into the ring-buffer
where they are waiting to be transmitted to the GPU. If data
is expected to be processed off-line, the incoming bit-
stream will be recorded to hard-disc at first, and decoded
directly before it is sent to the GPU.
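The look-up-table decoding can be sketched as below for the four bit case, under the assumption that each byte packs two samples as two's-complement nibbles, high nibble first; the actual bit layout of the VLBI sampler is not specified here.

```python
# 4-bit two's-complement look-up table: nibble value -> signed sample
LUT = [n if n < 8 else n - 16 for n in range(16)]

def unpack(byte_stream):
    """Expand packed bytes into signed integer samples via the table."""
    samples = []
    for b in byte_stream:
        samples.append(LUT[(b >> 4) & 0xF])   # high nibble first (assumption)
        samples.append(LUT[b & 0xF])          # low nibble second
    return samples

print(unpack(bytes([0x7F, 0x08])))   # -> [7, -1, 0, -8]
```

A production version would precompute a 256-entry table mapping every byte directly to its two samples, avoiding the per-nibble shifts in the inner loop.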
Parallelization
Beside the FX engine, which appears to be well suited for a
parallel implementation, the other steps within the pro-
cessing chain need to be checked for their scalability. The
first stage, i.e. the bit-shifter, can be implemented by
changing the read-pointer of the data-stream. Since the
delays of different PRNs can be pro- or retrograde it is
necessary to provide a data-buffer overhead of a few mil-
liseconds to ensure that multi-channel tracking can be
carried out smoothly. The second stage, which realizes the
numerically controlled oscillator, can be combined with the
bit-shifter, just by utilizing adequate data-stream pointers.
Moreover, since this stage does not suffer from non-coalesced memory access on the GPU, it can be implemented very efficiently even for parallel channels, leading to only a
Fig. 3 Parallel FFT performance for different data-sizes and for 1, 2, 4, 8 and 16 channels using the CUFFT library. Measurements represent mean values from 100,000 runs and do not include data-transfer between CPU and GPU.
small overhead. After this stage, which provides 1 milli-
second data-blocks to the FX correlation engines as
described in the prior section, the correlation peak must be
searched. Since this step involves non-coalesced memory access, a dedicated algorithm similar to the parallel reduction scheme (Fernando 2005) has been developed.
The same holds for the normalization of the correlation function, which can likewise only be done via parallel reduction. Once the peak position has been found, the code
delays can be computed for each channel. After that, the
cross-spectra need to be aligned properly with the updated
delay information. This step can also be done easily in
parallel. The derivation of the phases again requires a parallel reduction scheme, since the corrected cross-spectra need to be summed. After this stage the delays, carrier phases and amplitudes are available for each channel. Every 256 ms, i.e. at 4 Hz, the Doppler frequency for each PRN is updated using the obtained phases and
amplitudes. A-priori delays for each channel are computed
at the same cycle, and the corresponding bit-positions
within the data-stream are parsed to the bit-shifter. Also
results (delays, phases, amplitudes and Doppler frequen-
cies) are transferred back to the CPU and the next data-
block is copied to GPU memory, considering the overlap
which is required for continuous bit-shifting. The results on
the CPU are stored in NetCDF files (Rew and Davis 1990),
and can be accessed by an independent thread for extrac-
tion of the navigation message as well as for real-time post-
processing applications. During the development of the
software receiver the CUDA’s profiler helped in finding
bottlenecks and drawing conclusions about which stage of
the receiver can be improved. Figure 5 shows the profiling
results from a test with 10 parallel channels using 16 Msps
data, which is read from hard-disc in offline mode.
The Fourier transformations take nearly half of the
computation time, although the FFT libraries are already
optimized. The second largest contributors to the total computation time are the peak-search algorithms, which do not scale very well on the GPU and have to be implemented similarly to a parallel reduction scheme. The same
holds for all occurrences of summations, which also take
nearly one eighth of total time on the GPU. The remaining
contributors are already optimized for a parallel scheme,
but add to the total budget due to the use of multi-cycle
math operations (e.g. phase adjustment) on the GPU. The
overhead for data-transfer between GPU and CPU varies
between 5 and 10% depending on the sampling rate and is
not considered in Fig. 5.
Real-time requirements
In order to have an objective criterion for whether the software radio runs in real-time or not, we introduce the processing factor κ, which relates the processing time (including data-transfer) to the time-span of the data taken. Values of κ smaller than one will indicate that real-time
Fig. 4 Schematics of the multi-channel real-time software receiver running on the GPU. The numerically controlled oscillator (NCO) is realized with single precision floating point trigonometric functions and is updated via the Doppler-tracking loop, ensuring continuous tracking of the carrier phase. Delay tracking is performed via proper variation of the read-pointer which feeds the FX engine with data.
Fig. 5 Relative time in percent for the main parts of the software
receiver on the GPU. Values are obtained from a 1 min run with ten
channels in parallel using 16 Msps data
GPS Solut (2010) 14:207–216 211
123
processing is possible, whereas values larger than one
reflect configurations which can only be handled in off-line
mode. The real-time capacity of the receiver described in
the prior section is mainly limited by the sampling rate of
the incoming data-stream. Thus, in order to find out how
many channels can be processed in real-time on the GPU,
tests with varying sampling rates have been carried out.
Figure 6 depicts how κ depends on the number of channels which are processed in parallel for sampling rates of 4, 8, 16 and 32 Msps.
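The processing factor defined above amounts to a one-line computation; the 10 h / 12 h figures below anticipate the offline test reported in the Results section.

```python
# kappa = processing time (including data-transfer) / time-span of the data;
# kappa < 1 means the receiver keeps up with the incoming data-stream.
def processing_factor(processing_seconds, data_span_seconds):
    return processing_seconds / data_span_seconds

# 12 h of data processed in 10 h gives kappa = 0.83 (rounded)
kappa = processing_factor(10 * 3600.0, 12 * 3600.0)
assert kappa < 1.0        # real-time capable configuration
```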
Based on these results we can conclude that real-time processing for a realistic number of visible satellites is possible only for 4 and 8 Msps. Sixteen Msps would allow the processing of up to five satellites in online mode, which is basically enough to obtain unambiguous positioning estimates, but would require a selection of the satellites to be tracked. Sampling rates larger than 16 Msps, or more parallel channels, would cause the receiver to lag behind the rate at which the ring-buffers are filled when being operated in online mode. Therefore, the question remains how
the sampling rate (and the resulting effective bandwidth)
impacts the precision of the observables. In order to answer
this question, we start with the 32 Msps data-set, which
allows us to utilize a bandwidth of 16 MHz, i.e. ±8 MHz
around the center frequency. In the following discussion,
we will relate the precision of the observables to the
sampling rate (SR) in Msps, assuming that the total
available bandwidth, i.e. half the sampling rate, is utilized.
By narrowing the bandwidth using the filter of the replica
spectrum inside the FX engine we can simulate other
sampling rates, based on the same data-set. Therefore, the
root mean square (RMS) values for de-trended delay and
phase results over short time-spans (30 s) can be computed.
Based on these measures it will be possible to deduce a
simple relationship between sampling rate and the preci-
sion of the observables. Since the RMS measures are
varying between the satellites due to different elevation
angles we introduce relative measures of scatter, related to
the results obtained from the 32 Msps data-set. Thus, we
introduce the factors
α_τ(SR) = RMS_τ(SR) / RMS_τ(32 Msps)   and   α_φ(SR) = RMS_φ(SR) / RMS_φ(32 Msps)

which reveal how much the RMS of the delays (τ) and phases (φ) grows or shrinks when the sampling rate is changed.
Figure 7 depicts the results from such a test.
It can be seen that delay precision primarily depends on the sampling rate and roughly follows a 1/√SR rule, i.e. the
RMS doubles if sampling rate is reduced by a factor of 4. On
the other hand, carrier phase estimates are less affected by
narrowing the bandwidth, which can be explained by the fact
that cross-spectral phases show larger scattering when going
from the band-center towards the Nyquist frequency. Sampling rates below 4 Msps will lead to a noticeable increase of phase scatter since the corresponding bandwidth falls below the first null of the C/A code, located at about ±1 MHz.
Therefore, considering that carrier phases are utilized as the main observable for precise positioning applications, it becomes feasible to run the software receiver on the GPU in real-time using sampling rates of 4 or 8 Msps, favoring the latter in case of lower elevation cut-off angles. Higher
cut-off angles would reduce the number of visible satellites
and even enable processing of 16 Msps for which a
Fig. 6 Processing factor κ as a function of the number of parallel channels for sampling rates of 4, 8, 16 and 32 Msps. The solid line represents the limit under which real-time processing is possible.
Fig. 7 RMS of de-trended delays (upper plot) and phases (lower plot) for different sampling rates using 30 s of output. Results are based on a 32 Msps data-set, changing the width of the band-pass filter according to the aimed sampling rate.
processing factor of less than one can be realized (Fig. 6). As
discussed in the prior section, the FFT performance drops
significantly for larger data-blocks. Using 8 instead of 4
Msps leads to an increase of the processing factor by roughly
30%, but using 16 instead of 8 Msps already doubles the
computation time and makes real-time processing difficult
for a larger number of channels. Beside the performance
restrictions caused by hardware specifications, the knowl-
edge of a-priori values of delays and Doppler shifts together
with information about satellites above the local horizon are
other crucial points for real-time applications. Thus, without
external information (i.e. orbit information) the software
receiver requires some time to search for available satellites and to extract the necessary almanacs by decoding the information which is modulated onto the carrier phases. In order to
avoid such a time-consuming ‘‘cold-start’’ the software has
been designed to handle external almanac information, to
allow for an immediate start of tracking using the maximum
number of visible PRNs.
Navigation message decoding
The navigation message bits are transmitted by modulation
of the carrier phase and lead to jumps of the obtained phase values by ±180° every 20 ms if a bit differs from the prior
one. Thus, given that the signal of the concerned PRN is
strong enough, it is possible to extract the navigation mes-
sage by converting these jumps into a binary data-stream.
Since the obtained phases are written to NetCDF files, an independent thread that runs on the CPU can handle the decoding. Once the preamble (IS-GPS-200D
2006) is detected, the sub-frames 1–3 are decoded and the output is written to a text file, following the RINEX conventions for navigation messages (Gurtner 1994). Moreover, the sub-frames 4 and 5, which hold information about
the complete constellation are extracted and can be used to
update/replace the almanac information which is used to run
the software receiver. Since all these steps are carried out on
the CPU, which has enough free computing capacity, the
GPU performance is not affected by the extraction of the
navigation message. Comparisons with the IGS broadcast
ephemeris (Dow et al. 2005) have revealed that the navi-
gation message is decoded correctly even for signals from
satellites at low elevations (i.e. low signal-to-noise ratios).
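A minimal sketch of the bit extraction (illustrative threshold and simulated phase values; the real decoder must additionally resolve the sign ambiguity of the first bit and locate the 20 ms bit boundaries):

```python
# One phase sample per 20 ms interval; a jump near +/-180 deg between
# successive intervals toggles the current navigation bit.
def phase_to_bits(phases_deg):
    bits, current = [], 0
    prev = phases_deg[0]
    for p in phases_deg[1:]:
        jump = abs((p - prev + 180.0) % 360.0 - 180.0)   # wrapped difference
        if jump > 90.0:        # closer to a 180 deg flip than to no flip
            current ^= 1
        bits.append(current)
        prev = p
    return bits

phases = [0.0, 1.0, 182.0, 181.0, -1.0]   # two flips hidden in noisy phases
print(phase_to_bits(phases))   # -> [0, 1, 1, 0]
```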
Post-processing and geodetic analysis
Similar to the extraction of the navigation message, it is
possible to post-process the obtained observables on the
CPU, without interfering with the GPU computations. The
raw delays, which are available at a rate of 1,000 Hz, need to be averaged in order to provide useful input for analysis programs. Therefore, the delays are down-sampled to 1 Hz and output to a RINEX file. Carrier
phases are treated in a similar way, after applying the LO
offset and connecting/unwrapping. Additionally, correla-
tion amplitudes are converted to C/N0 values in dB–Hz
under consideration of the utilized bandwidth. The gener-
ated RINEX file can be compared with the results obtained
from the hardware receiver, which tracks the same signals
(Fig. 1). Since cable length and internal delays are different
between the systems, one has to remove a constant offset
between the hard- and software results before computing
statistics.
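The 1,000 Hz to 1 Hz down-sampling step can be sketched as a block mean (an assumption; the text does not specify the exact averaging filter), with invented delay values:

```python
# Average consecutive blocks of `factor` raw delays into one 1 Hz output.
def downsample_mean(raw, factor=1000):
    return [sum(raw[i:i + factor]) / factor
            for i in range(0, len(raw) - factor + 1, factor)]

raw_delays = [0.07 + 1e-9 * (i % 2) for i in range(3000)]   # toy 3 s stream
out = downsample_mean(raw_delays)
assert len(out) == 3    # three 1 Hz epochs from 3 s of 1,000 Hz data
```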
Results
A continuous tracking test has been carried out on 25 March 2009, between 6 and 18 h UT, with 8 Msps using one of the antennas located at Koganei, Japan, close to IGS station KGN1. Although this sampling rate would allow real-time tracking of all visible satellites using the ring-buffer implementation, the data was recorded to hard-disc and processed offline, taking 10 h, which equals a processing factor of 0.83. This demonstrates that real-time processing is feasible and can be performed on the GPU. Figure 8
depicts the delays and carrier phases obtained and displays
the corresponding elevation angles for each visible PRN,
using a cut-off angle of 20°. Figure 9 displays the RMS
values of delays and phases when averaging the software
receiver output to 1 Hz RINEX data.
As expected, the precision of the observables increases for higher elevation angles, yielding formal errors of 3 m (delays) and 2 mm (carrier phases) in zenith direction.
Comparison with output from a hardware receiver
The signals received from the GPS antenna are not only fed to the software receiver, but are also directed to a JAVAD™ hardware receiver which outputs RINEX observation files
at a 0.1 Hz sampling rate. Thus, results from the software
receiver can be verified with such output, after considering
delay/timing offsets caused by different cable lengths and
system specific stability characteristics. The 1,000 Hz raw-
data output from the software receiver needs to be inter-
polated to meet the identical epochs for comparison.
Additionally it has to be taken into consideration that the
software radio provides output in the UTC time-frame
since it is directly clocked from UTC(NICT), whereas the
hardware receiver outputs results in GPS time. After con-
sideration of all these effects, the delays obtained from the
software radio using the C/A code measurements can be
verified (see Fig. 10) revealing a standard deviation of
±3.57 m which is well within the formal error (Fig. 9) of
this type of observable.
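The comparison procedure can be sketched with synthetic numbers (illustration only; the GPS-to-UTC conversion and the per-system stability characteristics of the real comparison are omitted): interpolate the software-receiver delays to the hardware epochs, remove the constant inter-system offset, and compute the standard deviation of the residuals.

```python
import math

def interp(t, ts, vs):
    """Piecewise-linear interpolation of (ts, vs) at epoch t (ts ascending)."""
    for i in range(len(ts) - 1):
        if ts[i] <= t <= ts[i + 1]:
            w = (t - ts[i]) / (ts[i + 1] - ts[i])
            return vs[i] * (1.0 - w) + vs[i + 1] * w
    raise ValueError("epoch outside the recorded span")

# Synthetic data: software receiver at 1,000 Hz, hardware epochs at 1 Hz,
# with a constant 5 m cable/internal-delay offset between the two systems.
sw_t = [i / 1000.0 for i in range(5001)]
sw_d = [20.0 + 0.01 * t for t in sw_t]
hw_t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
hw_d = [25.0 + 0.01 * t for t in hw_t]

diffs = [interp(t, sw_t, sw_d) - d for t, d in zip(hw_t, hw_d)]
offset = sum(diffs) / len(diffs)              # constant inter-system offset
resid = [d - offset for d in diffs]
sigma = math.sqrt(sum(r * r for r in resid) / len(resid))
assert abs(offset + 5.0) < 1e-9 and sigma < 1e-9
```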
For verification of the obtained L1 phases, some pre-processing is necessary in order to remove periods when either of the two receivers loses lock or when cycle slips occur. The carrier phase differences obtained are depicted in Fig. 11, verifying that the output from the software radio agrees to well below one cycle (i.e. approximately 0.19 m at L1), yielding a standard deviation of about 6 mm.
Looking at time-dependent characteristic of the differ-
ences shows that a small random-walk like drift can be
seen for all PRNs. This feature explains why the histogram
is skewed and not perfectly centered at zero. Nevertheless,
given that receiver clocks are modeled as a random-walk process within the geodetic analysis software, these
differences are likely to be absorbed within the estimated
clock parameters.
Fig. 8 Obtained delays (upper plot) and phases (middle plot) as well as the corresponding elevation angles for a 12 h test starting on 25 March 2009, 6:00 UT, using a cut-off angle of 20°.
Fig. 9 RMS of the delays (left) and carrier phases (right) with respect to elevation angle for the observables displayed in Fig. 8.
Conclusions
It has been demonstrated that a GPS/GNSS real-time software radio can be implemented on a GPU, yielding results similar to those obtained from a hardware receiver. Real-time mode is possible for moderate sampling rates (i.e. up
to 16 Msps), whereas higher sampling rates should be
recorded to hard-disk and processed offline. Instead of the
classical early/prompt/late correlation engines, which are
usually implemented by software radios on CPUs, a wider
and finer correlation function is considered. The FX strat-
egy seems to be more suitable than its counterpart, the XF
architecture, since it takes advantage of the fast parallel
FFT implementation which is available for the GPU. Thus,
the performance of the FFT engines strongly determines
the real-time capability of the receiver and restricts the
number of channels when data-rates are exceeding 16
Msps.
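The core idea of the FX strategy, computing the full correlation function in the frequency domain rather than a few early/prompt/late samples, can be sketched as follows. This is a generic NumPy illustration of FFT-based circular correlation, not the authors' CUDA implementation:

```python
import numpy as np

def fx_correlate(signal, replica):
    """Circular cross-correlation via the frequency domain (FX strategy):
    transform both series, multiply by the conjugate replica spectrum and
    transform back.  This yields the complete, finely sampled correlation
    function in one pass instead of isolated early/prompt/late taps."""
    S = np.fft.fft(signal)
    R = np.fft.fft(replica)
    return np.fft.ifft(S * np.conj(R)).real

# Illustrative example with a short +/-1 pseudo-random sequence
rng = np.random.default_rng(1)
code = rng.choice([-1.0, 1.0], size=1023)
received = np.roll(code, 200)        # replica delayed by 200 samples
corr = fx_correlate(received, code)
print(int(np.argmax(corr)))  # -> 200: the peak position marks the code delay
```

Because the whole correlation function is available, the peak neighborhood can be inspected directly, which is what enables the wide correlation window and open-loop operation mentioned below.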
The GPU seems to be a suitable candidate for the realization of a GNSS software radio because it offers huge parallel processing power and outperforms the CPU in terms of cost/performance. In order to realize a real-time dual-frequency GNSS receiver, one could equip a PC with two GPU cards, which would allow separate processing of the L1 and L2C signals. Although the signal structure of the latter requires larger FFTs, the transforms are called less often due to the longer PRN code. Unlike in other GPU applications, the available bandwidth of the PCI bus does not appear to be an additional bottleneck, even if data-rates of up to 64 Msps were sent between the CPU and the GPU.
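A back-of-the-envelope check supports the claim that the bus is not the bottleneck. The figures assumed here (one byte per unpacked low-bit sample, a nominal 4 GB/s for a PCIe 1.x x16 link) are illustrative, not measured values from the paper:

```python
# Rough host-to-GPU transfer load at the highest data-rate considered.
sample_rate = 64e6        # samples per second
bytes_per_sample = 1      # one byte per unpacked low-bit sample (assumed)
pcie_bandwidth = 4e9      # bytes per second, nominal PCIe 1.x x16 (assumed)

host_to_gpu = sample_rate * bytes_per_sample   # bytes per second to transfer
print(host_to_gpu / pcie_bandwidth)  # -> 0.016: under 2 % of the link capacity
```

Even with generous margins for protocol overhead and bidirectional traffic, the sample stream occupies only a small fraction of the bus capacity.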
Like any other software receiver, the implementation realized here is very flexible and can be adapted to new signals or applications without major modifications. New algorithms can be tested or innovative applications created within very short development times. For example, the GPU software receiver has been successfully applied to time- and frequency-transfer experiments, which operate with PRN-code signals similar to those of GNSS. Other applications such as multi-path mitigation and ionosphere monitoring are under development. Additionally, the software radio can easily be modified into an open-loop receiver, which is perfectly supported by the FX architecture and its wide correlation function.
Looking at the impressive development of GPU computation power in recent years, and assuming that Moore's law (Moore 1965) holds for the next 2 or 3 years, gives rise to the hope that even larger data-rates and multi-frequency GNSS systems can be tracked by a single off-the-shelf GPU. Improvements in development software, compilers as well as FFT libraries will help to realize a GNSS software radio which fulfills all the requirements of real-time applications.
Acknowledgments We thank the anonymous reviewers and the editor in charge (Prof. Leick) for their valuable comments, which helped to improve the paper. The authors acknowledge the International GNSS Service as well as the US Coast Guard for providing orbital information.
References
Chakravarthy V, Tsui J, Lin D, Schamus J (2001) Software GPS
receiver. GPS Solut 5(2):63–70
Deng J, Chen R, Wang J (2009) An enhanced bit-wise parallel
algorithm for real-time GPS software receiver, GPS Solut. doi:
10.1007/s10291-009-0125-4
Dow JM, Neilan RE, Gendt G (2005) The international GPS service (IGS): celebrating the 10th anniversary and looking to the next decade. Adv Space Res 36(3):320–326. doi:10.1016/j.asr.2005.05.125
Fig. 10 Differences between the delays obtained from the GPU receiver and the Javad™ hardware receiver which shares the antenna with the software radio (histogram: relative frequency vs. delay differences [m])
Fig. 11 Differences between the L1 carrier phases obtained from the GPU receiver and the Javad™ hardware receiver (histogram: relative frequency vs. phase differences [m]). Epochs when either of the receivers lost lock or when cycle slips occurred were excluded from the comparison
Fernando R (2005) GPU Gems: programming techniques, tips, and tricks for real-time graphics. Pearson Education, Inc
Frigo M, Johnson SG (2005) The design and implementation of
FFTW3. Proc IEEE 93(2):216–231
Gurtner W (1994) RINEX—the receiver-independent exchange
format. GPS World 5(7):48–52
Harris C, Haines K, Staveley-Smith L (2008) GPU accelerated radio
astronomy signal convolution. Exp Astron 22(1–2):129–141.
doi:10.1007/s10686-008-9114-9
IS-GPS-200-D (2006) Interface specification IS-GPS-200, Revision
D, Interface Revision Notice (IRN)-200D-001, 7 March 2006,
Navstar GPS Space Segment/Navigation User Interface
Kiuchi H, Amagai J, Hama S, Imae M (1997) K-4 VLBI data-
acquisition system. Publ Astron Soc Jpn 49:699–708
Kondo T, Koyama Y, Takeuchi H, Kimura M (2006) Development of
a New VLBI Sampler Unit (K5/VSSP32) equipped with a USB
2.0 Interface. In: Behrend, Dirk, Baver, Karen (eds) International
VLBI Service for Geodesy and Astrometry 2006 General
Meeting Proceedings. NASA/CP-2006-214140, pp 195–199
Moore GE (1965) Cramming more components into integrated
circuits. Electronics 38:114–117
Moreland K, Angel E (2003) The FFT on a GPU, Proceedings of the
ACM SIGGRAPH/EUROGRAPHICS conference on Graphics
hardware session: simulation and computation, pp 112–119
Mumford PJ, Parkinson K, Dempster AG (2006) The Namuru Open
GNSS Research Receiver. In: Proceedings of 19th international
technical meeting of the Satellite Division of the US, Inst. of
Navigation, Fort Worth, Texas, 26–29 September 2006, pp
2847–2855
Nguyen H (2007) GPU Gems 3: programming techniques for high-performance graphics and general-purpose computation. Addison-Wesley Professional
Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC (2008) GPU computing. Proc IEEE 96(5):879–899. doi:10.1109/JPROC.2008.917757
Rew RK, Davis GP (1990) NetCDF: an interface for scientific data
access. IEEE Comput Graphics Appl 10(4):76–82
Tsui JB-Y (2000) Fundamentals of global positioning system
receivers: a software approach. Wiley, New York
Van Vleck JH, Middleton D (1966) The spectrum of clipped noise.
Proc IEEE 54:2–19
Author Biographies
Thomas Hobiger received his
M.Sc. and Ph.D. degrees in
geodesy and geophysics from
the Vienna University of Tech-
nology, Austria in 2002 and
2005, respectively. From Octo-
ber 2006 until September 2008
he worked at Kashima Space
Research Center, National
Institute of Information and
Communications Technology
(NICT), Japan as a JSPS fellow.
Since October 2008, he has been with NICT, working as an expert researcher. His research interests include
troposphere and ionosphere modeling, GNSS, Very Long Baseline
Interferometry (VLBI), adjustment theory and high performance
computing.
Tadahiro Gotoh received his
degree from the University of
Electro-Communications,
Tokyo, Japan in 1988. In 1985,
he joined the Radio Research
Laboratory (now the National
Institute of Information and
Communications Technology),
Koganei, Japan.
Jun Amagai received the
degree in natural sciences from
Tsukuba University, Ibaraki,
Japan in 1981. He joined the
Radio Research Laboratory
(currently the National Institute
of Information and Communi-
cations Technology), Tokyo,
Japan in 1981.
Yasuhiro Koyama received the
Ph.D. degree in Astronomy
from the Graduate University
for Advanced Studies, Japan in
2003. He has been involved in
the research and development of
VLBI since he joined the
research group of the Radio
Research Laboratory (now
NICT) in 1988.
Tetsuro Kondo received the
Ph.D. degree in geophysics
from Tohoku University, Sen-
dai, Japan in 1982. He joined
the staff of the Kashima Space
Research Center, Communica-
tions Research Laboratory
(CRL) in 1981 and worked on
the development of VLBI tech-
nology and analysis of space
geodetic techniques. Since April 2008, he has been a research professor at Ajou University, Korea.