Comparison of neural networks and regression-based methods for temperature retrievals

Comparison of neural networksand regression-based methods fortemperature retrievals

Howard E. Motteler, L. Larrabee Strow, Larry McMillin, and J. Anthony Gualtieri

5390 APPLIED OPTICS @ V

Two methods for performing clear-air temperature retrievals from simulated radiances for the Atmo-spheric Infrared Sounder are investigated. Neural networks are compared with a well-known linearmethod in which regression is performed after a change of bases. With large channel sets, both methodscan rapidly perform clear-air retrievals over a variety of climactic conditions with an overall RMS error ofless than 1 K. The Jacobian of the neural network is compared with the Jacobian 1the regressioncoefficients2 of the linear method, revealing a more fine-scale variation than expected from the underlyingphysics, particularly for the neural net. Some pragmatic information concerning the application ofneural nets to retrieval problems is also included.Key words: Temperature sounding, remote sensing, neural networks.

1. Introduction

We investigate two methods for performing clear-airtemperature retrievals from simulated radiances forthe Atmospheric Infrared Sounder 1AIRS2. An ap-proach using neural networks, similar to that de-scribed by Escobar-Munoz et al.,1 is compared with awell-known linear method in which the regression isperformed after a change of bases, as described bySmith and Woolf.2 The linear method was usedoperationally at the National Oceanic and Atmo-spheric Administration until 1988, for TIROS Opera-tional Vertical Sounder@High Resolution Infrared Ra-diation Sounder3 retrievals.There are a number of similarities between the

regression and neural net approaches: both arestatistical or nonphysical retrieval methods, i.e., theydo not make explicit use of a radiative transfercomputation; both are well suited to take advantage

H. E. Motteler is with the Department of Computer Science andL. L. Strow is with the Department of Physics, University ofMaryland Baltimore County, Baltimore,Maryland 21228; L.McMil-lin is with the National Oceanic and Atmospheric Admin-istration@National Environmental Satellite, Data, and InformationService, 5200 Auth Road, Camp Springs, Maryland 20746; J. A.Gualtieri is with Hughs STX, NASA Goddard Space Flight Center,Code 902.2, Greenbelt, Maryland 20771.Received 17 November 1993; revised manuscript received 6

February 1995.0003-6935@95@245390-08$06.00@0.

r 1995 Optical Society of America.

ol. 34, No. 24 @ 20 August 1995

of the large AIRS channel set; both methods canperform retrievals very rapidly, although the initialtraining needed for the neural network is muchslower than the process of finding the eigenvaluebasis and doing the regression needed for the linear-ized method. An advantage of both methods for anoperational system is that they can be used indepen-dently of any radiative transfer model. Measuredbrightness temperatures can be co-located with radio-sonde measurements to build sets of training 1orregression2 data directly.We feel it is worthwhile to compare these methods

carefully. We are aware of the limitations of clear airtests, and tests with unit emissivity, but they doprovide a useful context for comparisons, and clear-air algorithms are still central to many retrievalsystems. One of our goals is to verify the goodresults reported with the use of neural nets for AIRStemperature retrievals by Escobar-Munoz et al.1Further, the comparison of a nonlinear functionalapproximation such as the neural net with a goodlinear method can show the degree of inherent nonlin-earity in theAIRS temperature retrieval problem.Overall, we found that the performance of the two

methods was very similar. When 728 temperature-sensitive channels are used, the overall RMS retrievalerror with added noise is below 1 K for both methods.Although the neural net is slightly more accurate, theerror plots for both methods are quite close for agiven channel set; for the 728- and 130-channel setsthey are within 0.1 K overall and within 0.3 K in the

troposphere. The Jacobians of the two methods doexhibit some differences, which suggests that on afine scale the regression solution mimics the underly-ing physics more closely than does the neural net.In addition to analyzing neural net performance

and comparing neural nets with regression, we haveincluded an appendix containing some pragmaticinformation concerning the application of neural netsto retrieval problems. This includes informationabout algorithms, and working ranges for networktraining parameters that have been determined bytrial and error. This may be of use to anyone wishingto apply backpropagation neural networks to similarretrieval problems.

A. Atmospheric Infrared Sounder Retrieval Problem

Instruments on Earth-viewing satellites, such asHIRS3 on TIROS-N and the proposed AIRS4 on theEarth Observing System 1EOS2, observe upwellingradiances in the infrared spectrum at a set of discretefrequencies. The retrieval or inversion problem is totake these measured radiances and obtain informa-tion on the atmospheric state, including temperature,humidity, composition, and surface emissivity andtemperature.The inversion problem is difficult and has been

studied for over 30 years.5,6 The problem is ill posedin that a given radiance vector may correspond tomany different temperature profiles. Constraintsmust be added to guarantee a unique inverse. Theproblem is also ill conditioned in that standard nu-meric methods are unstable. Present retrieval sys-tems are based on linear regression or on nonlineariterative methods; recently neural networks have alsobeen applied to this problem.1,7TheAIRS, currently under development for NASA’s

EOS satellite series, will provide higher accuracy andbetter vertical resolution than the present opera-tional sounders. The AIRS instrument will have,2500 channels at a much higher spectral resolution1n@Dn < 12002 than the currently operational 20-channel HIRS instrument. The data available areincreased by 2 orders of magnitude, and existingtechniques may not be able to make effective use ofthis information. Nonlinear inversion techniquesthat require several calculations of the channel radi-ances are computationally intensive, and they mayprove difficult to apply in real time in an operationalsounding system.

B. Data Sets

The data sets we use for training 1or regression2 andfor most of our tests are generated from the set of1761 TOVS Initial Gross Retrieval 1TIGR2 tempera-ture, water, and ozone profiles.8 These are clear-airprofiles, with a surface emissivity assumed to be 1,and they are derived from radiosonde measurementstaken at all latitudes. Themean and standard devia-tion for the TIGR profiles are shown in Fig. 1. TheTIGR profiles are interpolated from the original 40levels to the 64 TOVS pressure levels used in the

fast-transmittance algorithm. The retrieved quanti-ties are air temperature at 64 pressure levels, andskin temperature.Corresponding brightness temperatures for se-

lected subsets of the AIRS channel set are computedby the use of the fast-transmittance algorithm of J.Susskind,9 which includes both radiative transfer andinstrument broadening of the observed spectra.Three channel sets were used: one with 59, one with130, and one with 728 channels. These channelswere selected by hand for temperature sensitivity,and the smaller sets are subsets of the larger ones.For testing retrieval algorithms, we partition the

TIGR set into a training 1or regression2 subset and atesting subset; typically this was chosen to be even-numbered profiles for training 1or regression2 andodd-numbered profiles for testing. For testing pur-poses, we also use the set of 100 write_test profiles,with unit emissivity, and two tracks of the flat_test1one with unit and one with varying emissivity2, asprovided by NASA’s Jet Propulsion Laboratory 1JPL2for the AIRS science team. 1Those data are availablefor anonymous file transfer program access from JPL,at airs1.jpl.nasa.gov.2Noise is added to brightness temperatures for both

training and testing purposes. At the time of therelease of the write_test data 119922, the nominalAIRS hardware specification for the standard devia-tion for noise at wave number n and temperature Twas

NEDT 5 NEDT0

exp12cn@T02T2

exp12cn@T2T02,

where c 5 6.63 3 10234 3 2.9979 3 [email protected] 3 10223,T0 5 250 K, and NEDT0 5 0.1 K. This value as-sumes the averaging of nine fields of view and hasbeen refined since the write_test release to reflect theinstrument specifications more closely.

Fig. 1. Mean and mean plus or minus the standard deviation forthe 1761 TIGR temperature profiles.

20 August 1995 @ Vol. 34, No. 24 @ APPLIED OPTICS 5391

2. Neural Networks

The neural networks we use to perform retrievals arethree-layer feed-forward networks, trained with amodified form of backpropagation.10,11 Such net-works can be represented as a vector-valued function,

F1x2 : 5 F31W3F21W2F11W1x 1 b12 1 b22 1 b32,

where F1, F2, and F3 map vectors to vectors byapplying a transfer function to each vector component.Vector x is the input to the network, Wi are weightmatrices, and bi are bias vectors. The mapping Fi isoften referred to as a layer, with the weight matricesrepresenting connections between layers. We usethe hyperbolic tangent as a transfer function in thefirst two layers, and as a linear function in the third.Backpropagation training is a variation of gradient

descent, in which weight and bias vectors are ad-justed incrementally in an attempt to match networkoutput with a training set, a sequence of input–outputvector pairs. A single presentation of all the trainingdata, with a corresponding weight adjustment, iscalled an epoch. Training consists of a sequence ofepochs and typically continues until the differencebetween training data and net output is acceptable, oralternatively until the net’s performance on a sepa-rate set of testing data is acceptable. Training is acomputationally intensive process for nontrivial net-works. In contrast, applying a trained net is quitefast, with the run time being dominated by the timefor the three matrix–vector multiplies.The networks we use for temperature retrievals

have one input component for each selected instru-ment channel, and one output component for eachpressure level; inputs are normalized brightness tem-peratures, and the output is a normalized tempera-ture profile. The networks we use here have one ofthe following three structures.

inputs hidden hidden outputs728 108 72 65130 90 60 6559 60 60 65

In each case the hidden functions 1F1 and F22 arehyperbolic tangents, and the outputs 1F32 are linearcombinations of the preceding layer. A typical train-ing set uses 880 input–output vector pairs 1i.e., halfthe TIGR data set2.Training is stopped when the rms error on testing

data stops showing significant improvement; thistypically happens after 20,000–100,000 epochs, de-pending on network size and amount of noise added tothe training set. Once network parameters 1adap-tive learning parameters, sizes of hidden layers, andinitial distributions2 are fixed in a useful range,different sets of random initial weights typically havea small effect on the final RMS error. When the fullset of TIGR profiles is divided into training andextrapolation sets of approximately equal size 1withrepresentatives from all latitudes in both sets2, then

5392 APPLIED OPTICS @ Vol. 34, No. 24 @ 20 August 1995

the exchange of training and extrapolation subsetsdoes not have a significant effect.As with more traditional methods of interpolation,

neural networks can both underfit and overfit. Ahigh training error or the inability to converge on thetraining set is a sign of underfitting, whereas poorperformance on new data is a sign of overfitting. Therelative smoothness of retrieved profiles and the closecorrespondence between training and extrapolationerrors suggest that the size of our hidden layers is nottoo large, and that we are not overfitting. 1Also, inpreliminary tests with smaller networks, the RMSerror was considerably greater.2 For example, for the728-channel network, the RMS average error over all880 training profiles and 64 pressure levels plus skintemperature is 0.85 K, the corresponding error overall 881 test profiles is 0.94 K, and the correspondingerror over all test profiles with added noise is 1.04 K.Table 1 and Figs. 2, 3, and 4 give a summary of the

RMS errors for both neural nets and regression, forseveral channel sets. Training 1or regression2 isperformed on even-numbered TIGR profiles. TheTIGR error is error retrieving odd-numbered TIGRprofiles. Thewrite_test error is for the set of 100 JPLwrite_test profiles, and the flat_test error is for two ofthe JPL flat_test tracks, as supplied for the AIRSscience team. Surface emissivity is fixed at 1 intrack F4D and varies in track F5D. 1The same728-channel neural net and regression transformwere used for tracks F4D and F5D as were used forthe TIGR and write_test tests; because training andregression were done in all cases with fixed emissiv-ity, it is not surprising that we see somewhat greatererrors in the retrieval of skin temperature in trackF5D.2 Input noise as determined by instrumentspecifications is added to brightness temperatures for

Table 1. Summary of RMS Retrieval Errors Adegrees Kelvin B for BothNeural Net and Regression Methods, for Three Channel Sets a

Method

TIGR Test Set JPL write_test

Overall1000–100mbars Overall

1000–100mbars

728-channelneural net 0.88 0.91 1.25 1.01regression 0.95 1.11 1.35 1.28



JPL flat_testTrack F4D

1fixed emissivity2

JPL flat_testTrack F5D

1varying emissivity2728-channelregression 0.83 0.98 0.90 1.12neural net 0.88 0.95 0.91 1.00

aAll retrievals are with AIRS instrument noise added, and theTIGR error is for odd-numbered profiles.

the retrieval tests. The RMS error is computed byinterpolation of the 64 retrieved levels to 30 slabs 1ofapproximately 1-km thickness, in the troposphere2and then computation of the RMS error for each slab.

3. Linear Retrievals

Here we compare the neural net retrievals with alinear regression method involving a change of basisfor both temperature profiles and brightness tempera-ture spectra, as proposed by Smith and Woolf.2 Thechange of basis gives a large compression of radiancedata and makes the regression much less sensitive tonoise.The regression and neural net approaches have a

number of features in common. These include theability to make efficient use of large channel sets, thefast computation of retrievals, a similar error perfor-

Fig. 2. RMS error retrieving odd-numbered TIGR profiles andwrite_test profiles for a neural net and for regression, using 728temperature channels. mb, mbar.

Fig. 3. RMS error retrieving odd-numbered TIGR profiles andwrite_test profiles for a neural net and for regression, using 130temperature channels.

mance, and some similarities in the Jacobians. Aninteresting difference between the methods is thatthe change of bases described below, which is sohelpful for regression, does not seem to work withneural nets. This is somewhat disappointing, as ouroriginal motivation for investigating the change-of-basis approach was to improve neural net retrievals.It initially seemed at least plausible that if a changeof basis could improve regression, it might also im-prove the performance of a neural network.The essential idea in the regression method is to

find distinct minimal orthogonal bases for both arepresentative set of temperature profiles and for thecorresponding brightness temperature spectra, for aselected set of channels. Suppose T is an n 3 k arrayof k temperature profiles at n pressure levels, oneprofile per column, with k . n. Let V be the n 3 nmatrix whose columns are eigenvectors of TTT, or-dered by the eigenvalues. 1Matrix V can be obtainedeither by a conventional eigenvector algorithm or bythe performance of a singular value decomposition ofT.2 Although V is an n-dimensional basis for T, itmay be the case that T has an effective basis with asmaller dimension.The basis B we want is the first m # n columns of

V, where the value of m is estimated from theeigenvalues and can be checked by a test of theaccuracy of mappings into and back out of the m-di-mensional space. If B is orthonormal, then we canuse BT as a linear transform from profiles into them-dimensional basis, and B as a transform from them-dimensional basis back to profiles. For example,if T is the entire set of 1761 TIGR profiles, then T canbe represented quite accurately with m 5 35; in thiscase we have rms1T 2 BBTT2 , 1023 K. Becauseupper levels of the 40-level TIGR set are extrapola-tions, it is not surprising that there is such anaccurate 35-dimensional representation. 1Note thatthe mapping in and out of the new basis is not aninterpolation, and that components of profiles in the

Fig. 4. Two RMS error retrieving two tracks of the flat_test, for aneural net and for regression, using 728 temperature channels.


35-dimensional space do not correspond to pressurelevels.2The same change of basis can be performed on the

brightness temperature spectra, Tb, corresponding toT, for various channel sets. The set of 728 tempera-ture-sensitive channels can be represented with a35-element basis B1, such that RMS1Tb 2 B1B1

TTb2 ,0.02 K.Figure 5 shows the first three basis vectors. The

first eigenvector in temperature basis set B is veryclose to the mean TIGR temperature profile, up to aconstant factor, and the first brightness temperaturebasis vector is very close to the brightness tempera-tures for the mean TIGR profile, again up to aconstant factor. The physical interpretation for theremaining basis vectors is not as clear, but thesuccessive temperature basis vectors have an increas-ingly fine structure.Suppose B1 is a basis for Tb and B2 is a basis for T.

Let Tb8 5 B1TTb and T8 5 B2

TT, and let C be theleast-squares solution to CTb8 5 T8. Then D 5B2CB1

T is the desired linear transform for retrievals.Aslight improvement performing retrievals with noisyinput 1,0.1 K2 can be obtained by the addition of noiseafter bases B1 and B2 have been obtained, and beforethe regression. If Tb is Tbwith added noise, than letTb8 5 B1

TTb, let C be the least-squares solution to CTˆ b8 5 T8, and letD 5 B2CB1

T.Table 1 and Figs. 2, 3, and 4 summarize RMS

testing error for the regression method and compareregression results with neural nets. As with theneural nets, the eigenvalue bases are determinedfrom and the regression is performed on even-numbered TIGR profiles, whereas the error shown isfor retrievals of the odd-numbered TIGR profiles.Input noise as determined by instrument specifica-tions 1with NEDT250 5 0.1 K2 is added to brightnesstemperatures for the retrieval tests.As noted, our original intent in investigating a

Fig. 5. First three basis vectors for the 64-level even-numberedTIGR profiles 1lower row2 and for the corresponding the 728-channel set 1upper row2.


change of basis was to improve neural net retrievals.However, so far we have been unable to get a neuralnet to work as well on the transformed space as thesimple least-squares fit. This appears to be due tomany deep local minima in the search space 1the set ofpossible weight and bias values2. In every test wedid, the net converges within a few hundred epochs toan overall rms error of ,2 K and then shows nofurther improvement.In a related test, the weight distribution of a neural

net was initialized to act as the regression transform,C. In training, this net gradually diverged to anoverall error of ,2 K and stayed there. Furthertests showed that a network initialized in this waywas quite sensitive to any perturbation of weightscoefficients; adding noise to the weights with a stan-dard deviation of as little as 1024 caused the overallretrieval error to double. Weights are typically ini-tialized with uniform noise over the interval 321, 14.

4. Jacobians

The Jacobians of retrieval functions are interestingfor a number of reasons. Among other things, theyshow to what degree fine-scale retrieval behaviorcorresponds to the physical phenomena being mod-eled, and they can indicate which channels are mostuseful. The Jacobian ≠T@≠Tb of a trained network isan array whose rows are pressure levels and whosecolumns are channels, where element 1i, j2 is ≠Ti@≠Tbj.We use differences to compute the Jacobians of thefast-transmittance algorithm ≠Tb@≠T 1not shown here2,and we use analytic differentiation to compute theJacobian of trained neural networks ≠T@≠Tb. Theregression Jacobian is simply matrixD.Figure 6 shows the Jacobian of a trained neural

network 1upper plot2 evaluated at the mean TIGRprofile and the Jacobian of regression 1lower plot2 forthe 130-channel set. Both Jacobians are plotted to

Fig. 6. Jacobians for a trained neural net 1upper plot2 andregression 1lower plot2 for the 130-channel set. Channels areordered by weighting-function peaks. Dark areas are low values,and light areas are high values.

the same gray scale. At the start of training, theneural net’s Jacobian has no discernible structure; astraining progresses it begins to resemble the regres-sion Jacobian.Both Jacobians show significant structure, includ-

ing multiple peaks and more variation in individualchannels than was expected physically; this is moremarked with the neural net. This is in contrast tothe Jacobian of the fast-transmittance code 1and theunderlying physics2, which is much smoother andgenerally has only one peak per channel. The oscilla-tions in the retrieval Jacobians raises the issue of howclosely the fine-scale retrieval behavior correspondsto the physical phenomena being modeled, especiallyfor the neural net. It is possible that the largeroscillations in the neural net’s Jacobian are due tooverfitting, or training too long, as the oscillations 1aswell as any other structure2 are less marked earlier in

Fig. 7. Jacobian 1regression matrix C2 for the 35-dimensionaltransform.

Fig. 8. RMS error in degrees Kelvin in the retrieval of TIGRtemperature profiles, and RMS percentage error in the retrieval ofTIGR water profiles, for a neural net and for regression, using 20AMSU channels.

the training process. However, as noted above, train-ing is carried out only as long as the extrapolationerror is decreasing, indicating that there may be sometrade-off between Jacobian smoothness and retrievalaccuracy.One use of the Jacobians is in a form of sensitivity

analysis, in which sensitivity refers to a channel’scontribution to the overall retrieval process. Jacobi-ans are computed for a representative set of profiles,the individual elements of these Jacobians aresquared, and the RMS average of the columns ofsquares is taken 1thus averaging over pressure levels2,giving a single sensitivity value for each channel.Channels can then be ranked on the basis of thissensitivity, and these rankings correspond reasonablywell with a by-hand ordering of channel sensitivity.In the 35-dimensional representation, the Jacobian

1Fig. 72, which is simply regression matrix C, showscorrespondences between temperature profile andbrightness spectra basis vectors, with lighter shadesindicating a positive and darker shades indicating anegative correspondence. This plot also indicatesthat only 20 or so basis vectors were actually neededto represent the temperature profiles.

5. Conclusions

Overall, we found that for temperature retrievals, theperformances of neural nets and regression weresimilar. In all the tests except the flat_test, theneural net is slightly more accurate than regression,in particular in the troposphere 11000–100 mbars2,where the difference is ,0.2 K for each of the threechannel sets. Regression performed slightly betterthan the neural net on the somewhat easier flat_test.In comparing themethods, onemust weigh the neuralnet’s overall modest advantage in accuracy againstthe greater 1one-time2 difficulty of the training ascompared with doing the regression, and against thecomforting nature of linear extrapolation. The morecomplicated fine-scale structure of the neural net’sJacobian may also be cause for concern. It is impor-tant to note again that these results are for clear-airretrievals with unit emissivity 1except for the secondflat_test2, and that these results depend to someextent on the assumption that noise is independent,which may not be the case under cloudy conditions.The accuracy of the neural nets we have used for

retrievals is similar to that reported by Escobar-Munoz et al.1 A more precise comparison is compli-cated by differences in pressure levels, and becausewe are reporting errors up to 1 mbar. The errorvalues we see for neural networks are remarkablypersistent across a wide range of training runs,interpolations, initial conditions, training with vary-ing amounts of noise, and particular training subsets.The AIRS temperature retrieval problem appears

close to linear, except possibly in the troposphere, andthe question naturally arises as to how a neural netwould compare with regression for water or otherretrievals. We have done some initial tests of water


and temperature retrievals using the 20AMSU 1AIRSmicrowave sounder2 channels; we have merged theAMSU-A and AMSU-B channels into a single set forthese tests, and we are using all 20 channels for bothtemperature andwater retrievals. Themuch smallerchannel set permits correspondingly smaller net-works, typically with 20 inputs, 20 transfer functionsin each of two hidden layers, and 30–64 outputs; thesesmaller nets have much shorter training times.Emissivity was fixed at 1 for the temperature testsand 0.5 for the water tests, and noise with a standarddeviation of 0.5 K was added to the simulated bright-ness temperatures. Figure 8 shows RMS error re-trieving 880 TIGR temperature profiles at 64 TOVSpressure levels, and RMS percentage error retrieving880 TIGR water profiles at the lowest 30 TOVS pres-sure levels. The overall retrieval errors were asfollows.

method temperature water20-channel neural net 2.16 K 25.1%20-channel regression 2.22 K 34.2%

In this preliminary test, as with the AIRS retrievals,the performances of the neural net and regression arevery similar for temperature retrievals, with theneural net doing slightly better. For the water re-trievals, in contrast, the net has a significant advan-tage. We emphasize again that this is only a roughcomparison 1we are retrieving 30 water levels and 64temperature levels, and different emissivities wereused in the two tests2, but it does suggest that neuralnetworks may be more appropriate than regressionfor water or other less linear retrieval problems.A significant limitation of both regression and

neural networks 1and one shared by any statisticalmethod2 is that both methods depend very heavily onhaving a good set of training 1or regression2 profiles.For example, consider the decreased accuracy inretrieving the write_test profiles at high altitudes; thelarge errors above 20 mbars may be due to the twodata sets using different methods for the extrapola-tion of radiosonde measurements. The accuracy ofboth neural networks and regression is subject tosimilar fundamental limitations. In general, for anygiven channel set, several distinct profiles may giverise to the same or very similar radiances. In thiscase both neural networks and regression will tend toreturn an average of these profiles.

Appendix A. Implementation and Training Details

Neural networks are probably not yet as familiar asregression-based methods, so we present some fur-ther details of the network training methods andparameters. Data manipulation, analysis, and dis-play are all done in MATLAB, and we used the MATLABNEURAL NETWORK TOOLBOX for our initial networktests. Unfortunately, the TOOLBOX was not fastenough, even on a fast workstation, for the networksand training sets of the size we are dealing with.Because of this, key TOOLBOX routines were ported to


the 16,384-processor MasPar MP-1 at the GoddardSpace Flight Center 1GSFC2, making extensive use ofthe MasPar linear algebra library. This code runs ata rate of approximately 100 million weight updatesper second 1number of weights 3 number ofepochs 3 length of training set@CPU time2 and450 Mflops, on typical datasets. This has reducedtraining time to a few hours, even for the 728-inputnetwork.One normalizes the data presented to the net by

subtracting the mean and dividing by the standarddeviation of the training set. For most tests, weightand bias distributions were uniform, except for aNguyen–Widrow distribution used in the first layer1i.e., for W12. A number of experiments were done inwhich initial and final weight distributions werecompared. By choosing initial weight distributionssimilar to the final distributions of a successfullytrained net, we somewhat speeded up the subsequenttraining, with no significant change in final accuracy.In general, we obtained the best results by choosinginitial values for weights that kept transfer functionsin, or not too far from, their linear region 1approxi-mately 60.52 on typical inputs. After training, wechecked the number of transfer functions remainingin the linear region, for the mean TIGR profile. Forthe 728-channel network described here, 66% of thetransfer functions were in the linear region in the firstlayer, and 90% of the transfer functions were in thelinear region in the second layer. At most one or twotransfer functions were saturated 1beyond 622 in anysingle layer. We observed that if training was al-lowed to continue beyond any significant improve-ment in extrapolative behavior, the number of satu-rated transfer functions began to increase.The large number of transfer functions remaining

in the linear region might suggest that a smaller netwould be sufficient, and we experimented with netsize and number of layers. Networks that weresignificantly smaller or that had only one hiddenlayer gave consistently worse results, with overallRMS errors over 2 K. Nets that were close in size tothe net used for the results reported here for the728-channel set, with 108 nodes in the first hiddenlayer and 72 in the second, were also tested. Asmaller network was tried, with 80 nodes in the firsthidden layer and 60 nodes in the second; this had anoverall RMS error that was ,0.2 K greater than the108 3 72 net. A larger network was also tried, with120 nodes in the first hidden layer and 90 in thesecond; this trained more slowly and gave essentiallythe same RMS error results as the 108 3 72 net. Ofcourse, this does not prove that a smaller net couldnot 1in theory2 work well; it merely means that thebackpropagation algorithm we used could not find theweight values for such a net.There are a number of parameters associated with

backpropagation training: initial distributions ofweights, net size, type of transfer functions, learningrate, and momentum. In addition, the adaptivelearning rate variation of backpropagation that we

used has several parameters: learning rate incre-ment, decrement, and error threshold. Parametersfor the adaptive learning algorithm, as used to trainthe 728-input net 1run 4102 described in Fig. 2 andTable 1, are as follows.

parameter run 410 useful rangemomentum 0.98 0.95–0.99initial learning rate 5.0e-4learning rate increments 1.03 1.01–1.04learning rate decrements 0.7 0.65–0.8error threshold 1.03 1.02–1.04

The useful ranges are approximate and were deter-mined by a number of tests with short training runs of5000 epochs, in which each parameter was variedindividually.

We thank the National Research Council for sup-portingH.Motteler’s researchwith a Research Associ-ateship at NASA@GSFC, and we thank M. Halem forserving as the National Research Council ResearchAdvisor, for many helpful discussions, and for provid-ing access to NASA@GSFC facilities. We also thankA. Chedin for sharing data and for being among thefirst to apply neural networks to retrieval problems,and J. Susskind for helpful discussions and for provid-ing the temperature-sensitive 59- and 130-channelsets.

References1. J. Escobar-Munoz, A. Chedin, F. Cheruy, and N. Scott, ‘‘Re-

seaux de neurones multicouches pour la restitution de vari-ables thermodynamiques atmospheriques a l’aide de son-duers verticaux satellitaires,’’ Technical Report 1Laboratoirede Meteorologie Dynamique du Centre National de La Recher-

che Scientifique, Ecole Polytechnique, 91128 Palaiseau Cedex,France, 19932.

2. W. L. Smith and H. M. Woolf, ‘‘The use of eigenvectors ofstatistical covariancematrices for interpreting satellite sound-ing radiometer observations,’’ J. Atmos. Sci. 33, 1127–1140119762.

3. J. Susskind, J. Rosenfield, D. Reuter, and M. T. Chahine,‘‘Remote sensing of weather and climate parameters fromHIRS2@MSU on TIROS-N,’’ J. Geophys. Res. 89, 4677–4697119842.

4. ‘‘Atmospheric Infrared Sounder: science and measurementrequirements,’’ Tech. Rep. D6665 Rev. 1 1Jet Propulsion Labo-ratory, Pasadena, Calif., 19912.

5. C. Rodgers, ‘‘Retrieval of atmospheric temperature and compo-sition from remote measurements of thermal radiation,’’ Rev.Geophys. Space Phys. 14, 609–624 119762.

6. A. Deepak, H. E. Fleming, and J. S. Theon, RSRM ’87:Advances in Remote Sensing Retrieval Methods, 1Deepak,Hampton, Va., 19882.

7. B. Kamgar-Parsi and J. A. Gualtieri, ‘‘Solving inversion prob-lems with neural networks,’’ in Proceedings of the Interna-tional Joint Conference on Neural Networks 1InternationalNeural Net Society and Institute of Electrical and ElectronicsEngineers, Piscataway, N.J., 19902, Vol. III, pp. 955–960.

8. R. Hecht-Nielsen,Neurocomputing 1Addison-Wesley, NewYork,19902.

9. P. K. Simpson, Artificial Neural Systems 1Pergamon, Elmsford,NewYork, 19902.

10. A. Chedin, N. A. Scott, C. Wahiche, and P. Moulinier, ‘‘Theimproved initialization inversion method: a high resolutionphysical method for temperature retrievals from satellites ofthe TIROS-N series,’’ J. Climate Appl. Meteorol. 24, 128–143119852.

11. J. Susskind, J. Rosenfield, and D. Reuter, ‘‘An accurate radia-tive transfer model for use in the direct physical inversion ofHIRS2 and MSU temperature sounding data,’’ J. Geophys.Res. 88, 8550–8586 119832.


Documents

Comparison of neural networks and regression-based methods for temperature retrievals