Reliability of Parallel Build Systems Derrick Coetzee, George Necula UC Berkeley Creative Commons Zero Waiver: To the extent possible under law, the author, Derrick Coetzee, waives all copyright and related or neighboring rights to this work.


Reliability of Parallel Build Systems

Derrick Coetzee, George Necula
UC Berkeley

Creative Commons Zero Waiver: To the extent possible under law, the author, Derrick Coetzee, waives all copyright and related or neighboring rights to this work.


Why parallelize builds?

• Developer cycle time
  – Faster builds = developers get more work done, higher morale
• Continuous integration
  – Faster builds = tests run more often
• Check-in verification systems
  – Faster builds = more throughput on check-in queue


Parallel build systems today

• Job scheduling
• Typical example: make -j <n>
  – Find n build steps that have no unbuilt dependencies and run them
  – Whenever one exits, start the next one
• Depends on the dependency graph being correct and complete
• Coarse-grained task parallelism
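
The job scheduling described above can be sketched in a few lines. This is a simplified model, not how make is implemented: it assumes every build step takes one unit of time, so steps are started in rounds of at most n. The `deps` mapping and the step names are hypothetical.

```python
def schedule(deps, n):
    """Simulate `make -j n` style scheduling in unit-time rounds.

    deps maps each build step to the set of steps it depends on
    (a hypothetical input format). Returns a list of rounds; each
    round is the list of steps started together.
    """
    remaining = {step: set(d) for step, d in deps.items()}
    done = set()
    rounds = []
    while remaining:
        # A step is ready once all of its declared dependencies are built.
        ready = sorted(s for s, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        batch = ready[:n]  # run at most n steps at once
        rounds.append(batch)
        for s in batch:
            del remaining[s]
        done.update(batch)
    return rounds


# Hypothetical project: three object files, then one link step.
deps = {
    "a.o": set(),
    "b.o": set(),
    "c.o": set(),
    "app": {"a.o", "b.o", "c.o"},
}
print(schedule(deps, 2))
```

The scheduler only ever consults `deps`. If a real dependency is missing from that mapping, two steps can land in the same round even though one needs the other's output.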


What could go wrong?

• Incomplete dependency information
  – Serial builds → leads to incorrect incremental builds
  – Parallel builds → leads to nondeterministic builds, build breaks, incorrect builds
  – Developer changes can introduce or remove dependencies at any time
    • #include "yy.lex.h"


Example of missing dependencies

• gcc test.c -o test
  – What files does it read/write/test existence of?

• Actual: 5 processes, 119 files/directories

/usr/bin/gcc        /etc/ld.so.hwcap  /tmp
/usr/lib/gcc/…/cc1  /lib/libc.so.6    /tmp/ccdCCHK0.s
/usr/bin/as         /proc/meminfo     /tmp/ccKs1ykU.c
/usr/bin/ld         test.c.gch        /tmp/cc0YtTuE.o
/usr/bin/nm         /usr/lib/crt1.o   /tmp/ccGGL3Eo.ld
/usr/bin/strip      /usr/…/lib/specs  /tmp/ccG4c608.le
…                   …                 …


Parallel builds are error-prone

• Missing dependencies cause errors
• Nondeterministic builds make errors difficult to reproduce
• Unnecessary dependencies limit scalability
• An alternative:
  – Developer specifies serial build (easier!)
  – Serial build is automatically parallelized
  – Nondeterminism is eliminated


Build transactions

• Each build step’s file operations are monitored using system call interception

• A transaction manager inserts locks before accessing each file (may suspend processes)

• Ensure that parallel build behaves in same way as the serial build
  – Use concurrency control techniques from databases
  – Schedule is conflict-equivalent to the user’s serial schedule


Build transactions example

• (1) Compile test.c to test.o, then (2) link:

tid  Lock/unlock  Lock type  Path              Result
1    LOCK         READ       /etc/ld.so.cache  OK
2    LOCK         READ       /etc/ld.so.cache  OK
…    …            …          …                 …
1    LOCK         CREATE     test.o            OK
…    …            …          …                 …
2    LOCK         TEST       test.o            BLOCKED
…    …            …          …                 …
1    UNLOCK       CREATE     test.o            OK
2    LOCK         TEST       test.o            OK


Build transactions example

• What if transaction 2 takes the lock first?

tid  Lock/unlock  Lock type  Path              Result
1    LOCK         READ       /etc/ld.so.cache  OK
2    LOCK         READ       /etc/ld.so.cache  OK
…    …            …          …                 …
2    LOCK         TEST       test.o            OK
…    …            …          …                 …
1    LOCK         CREATE     test.o            ROLLBACK 2
…    …            …          …                 …
2    LOCK         TEST       test.o            BLOCKED
…    …            …          …                 …
1    UNLOCK       CREATE     test.o            OK
2    LOCK         TEST       test.o            OK
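
The two traces above can be reproduced with a small lock-table model. This is a minimal sketch under stated assumptions, not the authors' transaction manager: the class name `LockManager`, the rule that READ and TEST locks are compatible with each other, and the rule that CREATE conflicts with everything are all assumptions inferred from the traces. Transaction ids double as positions in the serial schedule.

```python
SHARED = {"READ", "TEST"}  # assumed to be compatible with each other


def conflicts(a, b):
    """Two lock types conflict unless both are shared (assumption)."""
    return not (a in SHARED and b in SHARED)


class LockManager:
    def __init__(self):
        self.held = {}  # path -> list of (tid, lock_type)

    def lock(self, tid, lock_type, path):
        holders = [(t, k) for t, k in self.held.get(path, [])
                   if t != tid and conflicts(k, lock_type)]
        # An earlier transaction holds a conflicting lock: wait for it.
        if any(t < tid for t, _ in holders):
            return "BLOCKED"
        # Later transactions got there first and saw stale state: undo them.
        victims = sorted({t for t, _ in holders})
        for v in victims:
            self.rollback(v)
        self.held.setdefault(path, []).append((tid, lock_type))
        if victims:
            return "ROLLBACK " + " ".join(map(str, victims))
        return "OK"

    def unlock(self, tid, lock_type, path):
        self.held[path].remove((tid, lock_type))
        return "OK"

    def rollback(self, tid):
        # Drop every lock the rolled-back transaction holds.
        for path in self.held:
            self.held[path] = [(t, k) for t, k in self.held[path] if t != tid]


# Second trace: transaction 2 tests for test.o before transaction 1 creates it.
m = LockManager()
trace = [
    m.lock(1, "READ", "/etc/ld.so.cache"),
    m.lock(2, "READ", "/etc/ld.so.cache"),
    m.lock(2, "TEST", "test.o"),
    m.lock(1, "CREATE", "test.o"),
    m.lock(2, "TEST", "test.o"),
    m.unlock(1, "CREATE", "test.o"),
    m.lock(2, "TEST", "test.o"),
]
print(trace)
```

A real implementation would also have to undo the rolled-back step's file writes and restart its process. The sketch models only the lock table.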


Avoiding cascading rollback

• To ensure conflict-equivalence to the serial schedule, transactions must commit in order
  – Strict two-phase locking is too strict

• Instead, take advantage of the fact that the dependency graph – and lock set – changes very little from build to build

• Predicted locks
  – Derived from set of possible conflicts during previous run
  – Never block
  – Give no privilege to access data
  – Block conflicting lock attempts by transactions with larger timestamps


Build transactions example

• Compile step followed by a link step:

tid  Lock/unlock     Lock type  Path              Result
1    PREDICTED LOCK  CREATE     test.o            OK
1    LOCK            READ       /etc/ld.so.cache  OK
2    LOCK            READ       /etc/ld.so.cache  OK
…    …               …          …                 …
2    LOCK            TEST       test.o            BLOCKED
…    …               …          …                 …
1    LOCK            CREATE     test.o            OK
…    …               …          …                 …
1    UNLOCK          CREATE     test.o            OK
2    LOCK            TEST       test.o            OK
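
The trace above can be modeled by adding a table of predicted locks next to the table of held locks. This is a sketch under assumptions rather than the authors' implementation: the class name `PredictedLockManager`, the READ/TEST compatibility rule, and clearing a prediction when the owner unlocks are all hypothetical choices. Rollback is left out to keep the example short.

```python
SHARED = {"READ", "TEST"}  # assumed to be compatible with each other


def conflicts(a, b):
    """Two lock types conflict unless both are shared (assumption)."""
    return not (a in SHARED and b in SHARED)


class PredictedLockManager:
    def __init__(self):
        self.predicted = {}  # path -> set of (tid, lock_type)
        self.held = {}       # path -> set of (tid, lock_type)

    def predict(self, tid, lock_type, path):
        # Predicted locks never block and grant no access to the file.
        self.predicted.setdefault(path, set()).add((tid, lock_type))
        return "OK"

    def lock(self, tid, lock_type, path):
        claims = self.predicted.get(path, set()) | self.held.get(path, set())
        # Held or predicted locks of earlier transactions block later ones.
        if any(t < tid and conflicts(k, lock_type) for t, k in claims):
            return "BLOCKED"
        self.held.setdefault(path, set()).add((tid, lock_type))
        return "OK"

    def unlock(self, tid, lock_type, path):
        self.held[path].discard((tid, lock_type))
        # Assumption: the prediction is spent once the real lock is released.
        self.predicted.get(path, set()).discard((tid, lock_type))
        return "OK"


m = PredictedLockManager()
trace = [
    m.predict(1, "CREATE", "test.o"),
    m.lock(1, "READ", "/etc/ld.so.cache"),
    m.lock(2, "READ", "/etc/ld.so.cache"),
    m.lock(2, "TEST", "test.o"),
    m.lock(1, "CREATE", "test.o"),
    m.unlock(1, "CREATE", "test.o"),
    m.lock(2, "TEST", "test.o"),
]
print(trace)
```

Because the prediction is registered before either step touches test.o, transaction 2 blocks instead of reading stale state, and no rollback is needed.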


Preliminary results - Linux kernel build

[Figure: speedup (y-axis, 1 to 3.5) against number of concurrent processes (x-axis, 0 to 40), with two series: Speedup (apmake) and Speedup (make -j)]


Preliminary results - Linux kernel build

• Statistics:
  – Number of transactions/build steps: 2,949
  – Parallel build time: 3m9s
  – Total lock requests: 1,859,172
  – Lock requests blocked due to conflict: 1,697

[Figure: histogram of time waiting on lock (sec, x-axis, 0.00 to 2.40) against frequency (y-axis, 0 to 180)]
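
Two ratios follow directly from the statistics above and help put the numbers in scale. The snippet only divides the reported figures. The variable names are made up for illustration.

```python
transactions = 2_949        # build steps
lock_requests = 1_859_172   # total lock requests
blocked = 1_697             # requests blocked due to conflict

blocked_fraction = blocked / lock_requests
requests_per_step = lock_requests / transactions

print(f"blocked: {blocked_fraction:.4%}")
print(f"lock requests per build step: {requests_per_step:.1f}")
```

Roughly 0.09% of lock requests blocked, and each build step issued about 630 lock requests on average.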


Future work: Unimplemented stuff

• Haven’t yet implemented rollback
  – Needed for “unexpected dependencies”
• Fast cross-platform system call interception
  – ptrace, binary translation, custom filesystem?
• Multiversion timestamping
  – Useful for builds that read/write the same file multiple times
• Append-only files
  – Log files, standard out


Future work: Diagnosing make build bugs

• If two build steps experience a conflict, but neither depends on the other directly or indirectly…
  – This proves the make build is nondeterministic
  – Isolates most important missing dependencies
• Filter dependency graph by “files in my source repository”
  – Finds other interesting dependencies (e.g. headers)

• Easy bug-finding tool for existing projects
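
The check described above amounts to a reachability test on the declared dependency graph. A minimal sketch, assuming conflicts are recorded as (step, step, path) triples; the step names, the `deps` mapping, and both function names are hypothetical.

```python
def reachable(deps, src, dst):
    """True if step src depends on dst directly or indirectly."""
    stack, seen = [src], set()
    while stack:
        step = stack.pop()
        for d in deps.get(step, ()):
            if d == dst:
                return True
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return False


def missing_dependencies(deps, conflicts):
    """Keep conflicts between steps that are unordered in the graph."""
    return [
        (a, b, path)
        for a, b, path in conflicts
        if not reachable(deps, a, b) and not reachable(deps, b, a)
    ]


# Declared dependencies: parser.o forgot that it needs the generated header.
deps = {
    "gen-lexer": set(),
    "parser.o": set(),
    "app": {"parser.o", "gen-lexer"},
}
# Conflicts observed while monitoring file accesses (illustrative data).
observed = [
    ("gen-lexer", "parser.o", "yy.lex.h"),  # unordered: missing dependency
    ("parser.o", "app", "parser.o"),        # ordered: app depends on parser.o
]
print(missing_dependencies(deps, observed))
```

Only the first conflict is reported, since nothing in the declared graph orders gen-lexer and parser.o.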


Future work: Process hierarchies

• Long-running process spawning many short-lived processes (e.g. make)
• Rolling back make would be very bad
• Solution is virtualization:
  – Lie to make (your children have completed)
  – Predict outputs of children based on previous build – block make if it tries to access these
  – Rolling back make (if necessary) isn’t so bad now


Future work: Intra-build step parallelism

• Efficient parallel parsing for compilation
  – Ref Par Lab Browser’s work (Seth Fowler)
• Efficient parallel optimization
  – Unexplored?
• Efficient parallel linking
  – Ref Google’s gold linker


Questions?


Future work: Validated incremental builds

• Observation: most build steps produce same output files as in previous build

• Go ahead and use the old versions – if they’re wrong, we’ll find out when that file is rebuilt

• Eliminates blocking for a faster parallel build, at the cost of more rollbacks
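
The validation step could look roughly like the following. This is a sketch of one possible approach, not something the slides specify: comparing SHA-256 digests of old and new outputs is an assumption, and the function names and input mappings are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def steps_to_roll_back(old_outputs, new_outputs, speculative_readers):
    """Return the steps that consumed an old output that turned out wrong.

    old_outputs / new_outputs: path -> file contents (bytes)
    speculative_readers: path -> set of steps that read the old version
    """
    to_roll_back = set()
    for path, new_data in new_outputs.items():
        old_data = old_outputs.get(path)
        # The guess was wrong if the file is new or its contents changed.
        if old_data is None or digest(old_data) != digest(new_data):
            to_roll_back |= speculative_readers.get(path, set())
    return to_roll_back


old = {"a.o": b"object-v1", "b.o": b"object-v1"}
new = {"a.o": b"object-v1", "b.o": b"object-v2"}  # only b.o changed
readers = {"a.o": {"link"}, "b.o": {"link", "archive"}}
print(sorted(steps_to_roll_back(old, new, readers)))
```

Steps that read only unchanged outputs keep their results. Steps that read b.o before it was rebuilt must be rerun.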


Future work: Distributed parallel builds

• How to automatically partition builds between machines based on dependency graph?

• How to efficiently handle unexpected dependencies?