

Slide 1/29

Informed Prefetching in ROOT

Leandro Franco

23 June 2006

ROOT Team Meeting

CERN


Slide 2/29

Roadmap
● Description of the problem.
● Definition of a possible solution.
● Limitations of that solution.
● Implementation.
● Tests and comments.
● Future work.
● Conclusion.


Slide 3/29

Problem
● While processing a large (remote) file, the data must be transferred in small chunks.
● Working with such a file can be seen as:
  ● while ( NOT EOF )
    ● Read Buffer
    ● Process Data
    ● Go to Next Position
● The time spent waiting for the data will be a considerable part of the total time.
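The loop above can be sketched with a mock remote file ( hypothetical types, not the ROOT API ); each readBuffer() call stands for one network round trip, which is exactly what accumulates into the waiting time:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Mock remote file: every read costs one full round trip (2*latency + response).
struct MockRemoteFile {
    std::vector<char> data;
    std::size_t pos = 0;
    int roundTrips = 0;                       // one network round trip per read

    bool eof() const { return pos >= data.size(); }

    // Read up to `len` bytes from the current position.
    std::size_t readBuffer(char* buf, std::size_t len) {
        ++roundTrips;                         // each call pays the latency anew
        std::size_t n = std::min(len, data.size() - pos);
        for (std::size_t i = 0; i < n; ++i) buf[i] = data[pos + i];
        pos += n;
        return n;
    }
};

// The loop from the slide: read, process, advance, until EOF.
int processFile(MockRemoteFile& f, std::size_t chunk) {
    std::vector<char> buf(chunk);
    int processed = 0;
    while (!f.eof()) {
        std::size_t n = f.readBuffer(buf.data(), chunk);
        processed += static_cast<int>(n);     // stand-in for "Process Data"
    }
    return processed;
}
```

The roundTrips counter makes the cost model explicit: the number of reads, not the amount of data, is what multiplies the latency.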


Slide 4/29

Problem Description
● Timing diagram ( client / server ): each read pays one round trip plus the client processing time.
● Round Trip Time ( RTT ) = 2 * Latency + Response Time
● For three reads:
Total Time = 3 * [ Client Process Time ( CPT ) ] + 3 * [ Round Trip Time ( RTT ) ]
Total Time = 3 * ( CPT ) + 3 * ( Response Time ) + 3 * ( 2 * Latency )


Slide 5/29

Problem Description
● Depending on the conditions of the transmission, the latency could be greater than the time needed to process the data ( the normal case ).
● The time for a given job is directly proportional to the latency; if latency >> CPT or latency >> response time, the total time will be mostly unused waiting time.
● In that case the best idea would be to eliminate the latency time altogether.


Slide 6/29

Evolution of the problem
● Total time is directly proportional to the latency.
● The real time needed by the job is very small in comparison ( obtained when latency = 0 ).
● The number of reads gives us the exact way in which they are related ( the slope of the line ).


Slide 7/29


Slide 8/29

Idea ( diagram )
● Perform one big request instead of many small requests ( only possible if the future reads are known!! ).
● Timing diagram ( client / server ): the latency is now paid only once.
Total Time = 3 * ( CPT ) + 3 * ( Response Time ) + ( 2 * Latency )
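The two totals can be checked numerically; a minimal sketch of the slides' formulas ( illustrative values, times in arbitrary units ):

```cpp
// Sequential reads (Slide 4/29): every read pays the full round trip.
double sequentialTotal(int n, double cpt, double resp, double latency) {
    return n * cpt + n * resp + n * 2.0 * latency;
}

// Batched request (Slide 8/29): the latency is paid only once.
double batchedTotal(int n, double cpt, double resp, double latency) {
    return n * cpt + n * resp + 2.0 * latency;
}
```

The difference between the two is (n - 1) * 2 * latency, which is why the gain grows with both the number of reads and the latency.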


Slide 9/29

Idea ( performance gain )
● Such a method would allow us to (almost) eliminate the dependence on the latency, adding it only as a constant.
● That constant is imperceptible compared to the original total time ( but the latency is still there ).


Slide 10/29

Idea ( limitations - xrootd )
● Transferring all the data in a single request is not realistic. The best we can do is transfer blocks big enough to improve performance.
● Say our small blocks are usually 2.5KB: with a buffer of 256KB we can perform about 100 requests in a single transfer.
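The arithmetic can be made explicit ( a trivial sketch; 2.5KB = 2560 bytes, so the exact figure is 102, which the slide rounds to 100 ):

```cpp
#include <cstddef>

// Number of fixed-size requests that fit in one prefetch buffer.
int requestsPerTransfer(std::size_t bufferBytes, std::size_t requestBytes) {
    return static_cast<int>(bufferBytes / requestBytes);
}
```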


Slide 11/29

Idea ( limitations - TCP )
● But that's still unrealistic: the transfer size ultimately depends on the network ( and the operating system ). If the TCP window size has not been tuned, the default will probably be very small.
● A typical value is 64KB... which divides the performance of the last graph by 4.


Slide 12/29

How can we get there?
● We need a class that can take many small requests, put them in a list, sort them and try to get them all at once.
● For that we have the class TFilePrefetch, created by Rene:
– Prefetch(Long64_t pos, Int_t len) : puts a request in the list.
– ReadBuffer(char *buf, Long64_t pos, Int_t len) : reads a buffer ( if it's the first call, it sorts the list and tries to get everything from the underlying mechanism ).
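A minimal stand-in for the behaviour just described ( NOT the real ROOT class; the request struct and method names besides Prefetch are hypothetical ): Prefetch() only queues, and the list is sorted once on first use:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for TFilePrefetch's bookkeeping: queue requests, sort them by
// position on first use so one vectored read can fetch them in file order.
struct PrefetchList {
    struct Req { int64_t pos; int len; };
    std::vector<Req> reqs;
    bool sorted = false;

    void Prefetch(int64_t pos, int len) { reqs.push_back({pos, len}); }

    // Returns the sorted request list the underlying ReadBuffers() would get.
    const std::vector<Req>& Sorted() {
        if (!sorted) {
            std::sort(reqs.begin(), reqs.end(),
                      [](const Req& a, const Req& b) { return a.pos < b.pos; });
            sorted = true;
        }
        return reqs;
    }
};
```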


Slide 13/29

What underlying mechanism?
● If we don't implement one, we have to read all the requests ( one by one ) and return them to TFilePrefetch:
– TFile::ReadBuffers(char *buf, Long64_t *pos, Int_t *len, Int_t nbuf) : reads every element of the list and puts it in the buffer.
– Note that this alone already gains performance, since we avoid random seeks.
● If we want to provide the service, every descendant of TFile has to overload ReadBuffers() with a specialized version. For the moment, changes have been made to support http ( Fons ), rootd and xrootd.
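The default behaviour can be sketched as follows ( a stand-in over an in-memory file, not the actual TFile code ): each (pos, len) pair is read in list order into one contiguous output buffer, so with a sorted list the seeks are monotonic:

```cpp
#include <algorithm>
#include <vector>

// Read each (pos, len) pair in list order into one contiguous output buffer.
// Because TFilePrefetch sorted the list first, the seeks advance monotonically,
// which is where the gain over random seeks comes from.
bool ReadBuffersSketch(const std::vector<char>& file, char* out,
                       const long long* pos, const int* len, int nbuf) {
    long long off = 0;
    for (int i = 0; i < nbuf; ++i) {
        if (pos[i] < 0 || pos[i] + len[i] > (long long)file.size())
            return false;                                  // request out of range
        std::copy(file.begin() + pos[i], file.begin() + pos[i] + len[i], out + off);
        off += len[i];
    }
    return true;
}
```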


Slide 14/29

How do we know what requests we must pass to TFilePrefetch?
● Fortunately, that is possible when processing ROOT trees.
● This is done with a specialization of TFilePrefetch called TTreeFilePrefetch:
– At the beginning it enters a “learning phase”, adding to a list the branches where the requested events can be found.
– After a given number of requests ( say 100, for example ) it stops registering branches and prefetches only the ones already specified.
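The learning phase can be sketched like this ( a hypothetical stand-in, not the real TTreeFilePrefetch; integer branch ids stand in for branch pointers ):

```cpp
#include <set>

// Register branches for the first `limit` requests, then freeze the list
// and prefetch only branches that were seen while learning.
struct LearningPrefetch {
    std::set<int> branches;   // branch ids registered during the learning phase
    int seen = 0;
    int limit;                // e.g. 100 requests, per the slide
    explicit LearningPrefetch(int l) : limit(l) {}

    // Returns true if this branch will be prefetched.
    bool Request(int branchId) {
        if (seen < limit) {                        // learning: register everything
            ++seen;
            branches.insert(branchId);
            return true;
        }
        return branches.count(branchId) > 0;       // frozen: only known branches
    }
};
```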


Slide 15/29

Does it work?


Slide 16/29

Example ( h2fast ) - Simulated latency ( xrootd )


Slide 17/29

Example ( h2fast ) - Simulated latency ( xrootd )


Slide 18/29

The same test on rootd instead of xrootd


Slide 19/29

Details about the test
● 4802 calls without prefetch.
● 57 calls with a big buffer ( although it's limited by a 256KB limit on the server side ).
● 97 calls with a buffer of 64KB, which should be similar to the TCP window size ( probably the most realistic case ).
● The average size per call is around 1.3KB.
● The latency is simulated with a system sleep, which is not accurate below 10ms.


Slide 20/29

Comments about the test
● After the implementation we see the improvement is as big as predicted.
● But as we saw, there are restrictions on the block size:
– Client: limited to 4095 requests in one call ( if every request is 1KB, that's around 4MB ).
– Server: the response will be sent in 256KB chunks.
– Network: TCP window size limitation ( 64KB should be a conservative assumption ).
● Therefore, we will be limited by the smallest of the three.


Slide 21/29

Comments
● In addition to avoiding network latency, TFilePrefetch can be a big improvement on the server side, since the calls are ordered.
● This is very useful if there are many clients, especially if we can guarantee the atomicity of the vectored read.
● i.e. reducing disk latency when switching contexts... average latency around 5ms?


Slide 22/29

What about a 'real' test? ( using http )

Real time and CPU time by TFilePrefetch buffer size ( 0 = no TFilePrefetch ):

                               0           10000       100000      1'000000
LAN    -  0.4 ms   Real:   25.013 s      4.766 s      4.266 s      4.452 s
                   CPU :    5.420 s      3.420 s      3.130 s      3.430 s
Orsay  - 10 ms     Real:  124.672 s     12.433 s      9.002 s      9.059 s
                   CPU :    6.640 s      3.810 s      3.340 s      3.710 s
Nikhef - 20 ms     Real:  230.904 s     15.560 s     10.495 s      8.045 s
                   CPU :    5.860 s      1.890 s      1.790 s      1.920 s
ADSL   - 70 ms     Real:  743.667 s     61.400 s     42.205 s     28.162 s
                   CPU :    5.530 s      3.200 s      3.040 s      2.970 s

done with cp in 3 seconds... could we get there?


Slide 23/29

Future work ( client side )
● We can try a parallel transfer ( multiple threads asking for different chunks of the same buffer ) to avoid latency ( protocol specific ). Recalling the first graphs, we would be dividing the slope by the number of threads.
● We can implement a client-side read-ahead mechanism ( also multithreaded ) to ask the server for future chunks ( in parallel if possible, but it could be seen as another thread transferring data while the main thread does something else ).
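The parallel-transfer idea can be sketched as follows ( hypothetical code, not the ROOT implementation ): one big request is split into per-thread chunks fetched concurrently, so the per-request latency is paid once per thread instead of once per chunk:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split one buffer into nThreads disjoint chunks and fetch them concurrently.
// The per-element copy stands in for a network read of that chunk.
void parallelFetch(const std::vector<char>& remote, std::vector<char>& local,
                   int nThreads) {
    local.resize(remote.size());
    std::size_t chunk = (remote.size() + nThreads - 1) / nThreads;
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, remote.size());
            for (std::size_t i = begin; i < end; ++i)
                local[i] = remote[i];          // stand-in for a network read
        });
    }
    for (auto& w : workers) w.join();          // wait for all chunks
}
```

The chunks are disjoint, so the threads never write the same bytes and no locking is needed.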


Slide 24/29

Future work ( server side )
● We could use the pre-read mechanism specified in the xrootd protocol, for example ( to avoid the disk latency ), but this doesn't help much with the network latency.
– Although this is implemented in the server, modifications must be made in the client ( we have to tell the server which buffers we want to pre-read ).


Slide 25/29

Future work ( different issue )
● After getting the buffer with all the requests, create a thread to decompress the chunks that will be used, avoiding the latency of the decompression and reducing the memory footprint, since right now the data is copied twice before being unzipped.
● This is not really related to the other subject but could be interesting ;) .


Slide 26/29

Conclusion
● TFilePrefetch
– State: implemented.
– Potential improvement: critical in high-latency networks ( the gain can reach 2 orders of magnitude ).
● Pre-reads on the xrootd server
– State: already implemented on the server; modifications on the client side are easy.
– Potential improvement: reduced disk latency.
● Parallel reading
– State: working on it, beginning with one additional thread and moving to a pool.
– Potential improvement: avoid the block-size limitation of the xrootd server ( new latency = old latency / number of threads ).


Slide 27/29

Conclusion
● Read-ahead on the client side
– State: implemented independently of TFilePrefetch ( integration pending ).
– Potential improvement: use the CPU time to transfer data at the same time ( in a different thread ).
● Unzipping ahead?
– State: idea.
– Potential improvement: the application won't need to wait, since the data has been unzipped in advance ( by another thread ). This could result in a gain of a factor of 2.


Slide 28/29

Questions ??

or comments ?


Slide 29/29

Thank you !