Flaky Tests and Bugs in Apache Software (e.g. Hadoop)

Akihiro Suda <[email protected]>
NTT Software Innovation Center
ApacheCon Core North America (May 12, 2016, at Vancouver)

Copyright© 2016 NTT Corp. All Rights Reserved.

Who am I

• Software Engineer at NTT Corporation
  • NTT: the largest telecom in Japan
• Engaged in improving the reliability of distributed systems
• Some contributions to ZooKeeper / Hadoop, including critical bug fixes (non-committer)
• github: https://github.com/AkihiroSuda

Agenda

• Current "flakiness" in Apache software
• Why do flaky tests matter?
• What causes a flaky test?
• How can we find, reproduce, and fix a flaky test?
• Existing work in Apache communities
• Our work: Namazu (鯰, catfish)
  https://github.com/osrg/namazu

Good News: Apache software is well tested!

Software    Production code (LOC)   Test code (LOC)
MapReduce    95K                     87K
YARN        178K                    121K
HDFS        152K                    150K
ZooKeeper    33K                     27K
HBase       571K                    222K
Spark       167K                    128K
Flume        46K                     34K
Cassandra   168K                     78K

Data measured on 14/01/2016, using CLOC.

Bad News: https://builds.apache.org/job/%s-trunk/

[Figure: Jenkins build-time trend charts for MapReduce, YARN, HDFS, ZooKeeper, and HBase. Axes: build vs. build time. Blue = success, red = failure. Data captured on 14/01/2016.]

I've never seen a fully successful Hadoop build, even on my local machine...

Bad News: JIRA QL: project = ? AND text ~ "test fail*"

Software    #Matched issues   #All issues
MapReduce   2,441 (38%)        6,373
YARN        2,290 (63%)        4,756
HDFS        5,141 (53%)        9,672
ZooKeeper     828 (35%)        2,384
HBase       6,595 (42%)       15,542
Spark         794 ( 6%)       14,047
Flume         342 (12%)        2,882
Cassandra   1,656 (15%)       11,430

Data captured on 4/4/2016 (just for approximation).

Roughly speaking, half of Hadoop development is dedicated to debugging test failures.
Interestingly, the flakiness does not seem to be uniform across projects (discussed later).

Not all test failures are critical for production..

97% of unit test failures in Apache software are said to be harmless for production ("false alarms")
• Information source: "An Empirical Study of Bugs in Test Code" (A. Vahabzadeh et al., ICSME'15)

So flaky tests don't matter, as they don't affect production?

They still matter!

For developers..
It's a barrier to the adoption of CI
• If many tests are flaky, developers tend to ignore CI failures and so overlook real bugs
It's also a psychological barrier to contribution
• A developer may be blamed for a test failure

For users..
It's a barrier to risk assessment for production
• No one can tell flaky tests from real bugs

So flaky tests don't matter, as they don't affect production?

SemaphoreCI suggests a "No broken windows" strategy for flaky tests
https://semaphoreci.com/community/tutorials/how-to-deal-with-and-eliminate-flaky-tests

image: http://guides.lib.jjay.cuny.edu/nypd/brokenwindows

Basic cause: async operation

• A typical flaky test is caused by a malformed async operation like this
  (A. Vahabzadeh et al., ICSME'15 / Q. Luo et al., ACM FSE'14 / YARN-4478)

    invokeAsyncOperation();
    // some tests lack even this sleep
    sleep(certainHardcodedTimeout);
    assertTrue(checkSomethingGoodHasHappened());

• Basically it can be fixed by increasing the timeout and retries
• But it's not easy to find a reasonable timeout value (e.g. YARN-{4804, 4807, 4929...})
  • A long timeout is expensive

Testbed (e.g. CI) can cause test failures as well

• Host configuration
• Host performance
• Docker is great! But it still has some issues

CI host configuration can cause test failures

• HADOOP-12687
  • Many YARN tests fail when /etc/hosts has multiple loopback entries
• ZOOKEEPER-2252
  • Test: nslookup("a") should fail
  • It does not fail when a host named "a" actually exists
• INFRA-11811
  • JDK was not set up properly on a Jenkins slave
• Such a test can fail only when the job is assigned to a specific buildbot, so it looks like a flaky test

CI host performance: they're not made equal

• Hadoop's buildbot https://builds.apache.org/computer/

Data captured on 25/04/2016

CI host performance: they're not made equal

• Spark's buildbot https://amplab.cs.berkeley.edu/jenkins/computer/

CI host performance: they're not made equal

• Significant difference in the response time!
• Maybe related to the fact that Spark has only a small number of test-related issues
  (e.g. YARN 63% vs Spark 6% in the JIRA query results above)

Target   Average   Max      Min
Hadoop   1163ms    1482ms   30ms
Spark    3ms       6ms      0ms

Docker issues

Docker is great for testing!
• Some Apache projects use Docker on their CI (via Apache Yetus)
• Apache Bigtop also uses Docker for provisioning Hadoop
• People also love Docker for setting up test beds on their workstations and laptops
  • Me too, of course

Docker #18180: Java VM unkillable zombie

• Mentioned in several Apache-related issue tickets:
  • jupyter/docker-stacks#75: Spark hanging
  • docker-library/cassandra#43, #46
  • docker-solr/docker-solr#4
  • ALLURA-8039
  • AMBARI-14706
  • IGNITE-2377
  • YETUS-229 …
• Fortunately, Apache Buildbot (Yetus) didn't hit the bug, but the bug made people's local testbeds flaky in a weird way.
• Fixed in recent kernels (so, strictly speaking, it's not Docker's issue)

Other potential Docker-related issues

AUFS: fcntl(F_SETFL, O_APPEND) was not supported (#20199)
• Can cause data corruption (Dovecot is known to be affected)
• Fixed in recent AUFS

Overlay: You should not open O_RDWR and O_RDONLY simultaneously (#10180)
• Can cause data corruption (RPM is known to be affected)
• Expected behavior, won't get fixed

More information: https://github.com/AkihiroSuda/docker-issues

Flaky tests are not limited to xUnit in CI..

• Some issues can occur only in a deployed environment rather than in CI
  • e.g. TCP packet corruption
  • Very flaky and critical

TCP packet corruption

https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/
• The TCP checksum was ignored in some IPsec configurations
• ZooKeeper intermittently misbehaved due to corrupted TCP packets

https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.gq8chzply
• The TCP checksum was ignored in some veth configurations
• Mesos and Kubernetes are affected

TCP packet corruption

• It's very hard to notice (and reproduce) flaky TCP packet corruption...
• Should distributed systems be tolerant of TCP corruption...?
  • The probability is very low in regular environments, but it is not zero
    (32-bit Ethernet CRC + 16-bit TCP checksum)
• JIRA issues: ZOOKEEPER-2175, HDFS-8161…
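To see why a 16-bit checksum leaves room for undetected corruption, note that the Internet checksum is a one's-complement sum of 16-bit words, so it does not depend on the order of those words. The sketch below is a simplified illustration: it checksums a bare payload only (real TCP also covers a pseudo-header and the TCP header), and `internetChecksum` is a name chosen here, not an API from any of the projects mentioned.

```go
package main

import "fmt"

// internetChecksum computes the 16-bit one's-complement checksum
// (in the style of RFC 1071) over data.
func internetChecksum(data []byte) uint16 {
	var sum uint32
	// Add up the data as big-endian 16-bit words.
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(data[i])<<8 | uint32(data[i+1])
	}
	// An odd trailing byte is padded with a zero byte.
	if len(data)%2 == 1 {
		sum += uint32(data[len(data)-1]) << 8
	}
	// Fold the carries back into the low 16 bits.
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

func main() {
	original := []byte{0x12, 0x34, 0xab, 0xcd}
	// Same 16-bit words in swapped order: different data, same checksum.
	corrupted := []byte{0xab, 0xcd, 0x12, 0x34}

	fmt.Printf("original:  %#04x\n", internetChecksum(original))
	fmt.Printf("corrupted: %#04x\n", internetChecksum(corrupted))
}
```

Both payloads produce the same checksum, so this kind of corruption would pass the TCP check and has to be caught by another layer, such as the Ethernet CRC or an application-level checksum.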

Efforts to find/reproduce a flaky test

• determine-flaky-tests-hadoop.py
• Apache Kudu's CI (dist_test)
• Google's TAP
• Our work: Namazu
  https://github.com/osrg/Namazu
• and similar great tools

determine-flaky-tests-hadoop.py

• Picks up failed tests using the Jenkins API
• Included in hadoop.git/dev-support (HADOOP-11045)

    $ determine-flaky-tests-hadoop.py --job Hadoop-YARN-trunk
    ****Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-YARN-trunk
    ...
    Among 15 runs examined, all failed tests <#failedRuns: testName>:
        7: TestContainerManagerRecovery.testApplicationRecovery
    ...

determine-flaky-tests-hadoop.py

• Great tool, but it doesn't support running a specific test repeatedly
• There is also a Maven dependency issue (YARN-4478)
  • B depends on A
  • TestB is never executed if TestA fails
  • So if TestA is flaky, we can't evaluate the flakiness of TestB!

Kudu's CI: flaky test dashboard

http://dist-test.cloudera.org:8080/ (Apr 25)

Recently open-sourced and introduced at Apache: Big Data (Monday)

https://github.com/cloudera/dist_test

Kudu's CI: flaky test dashboard

• Tests are run repeatedly on CI to find flaky tests
  • KUDU_FLAKY_TEST_ATTEMPTS
  • KUDU_FLAKY_TEST_LIST

(From https://github.com/apache/incubator-kudu/commit/1a24338a)

    Fix flakiness of client_failover-itest

    The reason this test was flaky is that there is a race between....

    Looped 100x and they all passed:
    http://dist-test.cloudera.org/job?job_id=mpercy.1454486819.10566

    Author: Mike Percy, Jan 29, 2016 8:01 AM
    Committer: Todd Lipcon, Feb 4, 2016 2:14 PM
    Commit: 1a24338ad60a8842d1ae5e227f8f03e58faea8c0

Google's TAP

• Google's internal CI
  • 1.6M test failures per day
  • 73K (4.5%) are flaky
• Repeats a failing test 10 times to label flaky tests
• Information source: "An Empirical Analysis of Flaky Tests" (Q. Luo et al., ACM FSE'14)
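The rerun-based labeling can be sketched as follows. This is only an illustration of the idea, not TAP's actual logic: the rule "a failing test that passes on any rerun is flaky" is an assumption, and `label` is a hypothetical helper.

```go
package main

import "fmt"

// label reruns a test that has just failed, up to reruns times.
// Assumed rule: if any rerun passes, the earlier failure is labeled
// flaky; otherwise the test is labeled consistently failing.
func label(test func() bool, reruns int) string {
	for i := 0; i < reruns; i++ {
		if test() {
			return "flaky"
		}
	}
	return "consistently failing"
}

func main() {
	calls := 0
	// Hypothetical test that passes only on its third rerun.
	sometimes := func() bool {
		calls++
		return calls == 3
	}
	never := func() bool { return false }

	fmt.Println(label(sometimes, 10))
	fmt.Println(label(never, 10))
}
```

A scheme like this can only label failures that already happened; it does nothing to make a rare failure more likely to appear in the first place.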

Challenge: poor non-determinism

• Modern CIs run jobs repeatedly to find / reproduce flaky tests
• But they don't control non-determinism
  • They may overlook a flaky test
  • They may be unable to reproduce a failure, and therefore unable to analyze it
• Our suggestion: increase non-determinism for finding and reproducing flaky tests

NAMAZU: PROGRAMMABLE FUZZY SCHEDULER

https://github.com/osrg/namazu

NOTE: Namazu was formerly named "Earthquake"

Namazu: programmable fuzzy scheduler
https://github.com/osrg/namazu

Increases non-determinism for finding and reproducing flaky tests

[Diagram: Event / Fuzzed (Randomized) Schedule. Targets: Filesystem, Packet, Java, Linux threads, Go [planned]]

鯰 (namazu) means a catfish in Japanese

Namazu: programmable fuzzy scheduler
https://github.com/osrg/namazu

Target          Technique
Filesystem      FUSE
Packet          Netfilter, OpenFlow
Java            Byteman, AspectJ
Linux threads   sched_setattr(2)
Go [planned]    AspectGo [wip] (https://github.com/AkihiroSuda/golang-exp-aspectgo)

Namazu uses non-invasive techniques
• can be easily applied to any environment
• can avoid false positives

Namazu targets

• xUnit tests
  • 😃 Easy to get started; just run `mvn`
  • 😃 Can reproduce test failures observed in CI
  • 😞 Limited testable scope
• Integration tests on a distributed cluster
  • 😃 Can test everything
  • 😞 Need to write a script to set up the cluster
    • But Docker helps us a lot!

Namazu targets

We support both scenarios
• Single-node mode (for xUnit tests): $ mvn test
• Distributed mode (for integration tests): Orchestrator, RPC

NAMAZU + XUNIT TESTS

$ mvn test

Namazu + xUnit tests

• Namazu is a comprehensive framework...
• Quick start: "renice" threads for xUnit tests
  • POSIX.1 requires that threads share a single nice (priority) value, but the actual Linux implementation (NPTL) does not do so.
  • Not always effective, but it's generic and easy to get started

Namazu + xUnit tests

    $ PID=$(docker inspect $(docker ps -q -f ancestor=hadoop-build-ubuntu) | jq .[0].State.Pid)
    $ sudo nmz inspectors proc -pid $PID

    $ cd hadoop; ./start-build-env.sh
    [container]$ mvn test -Dtest=TestFoo#testBar

Namazu periodically sets random nice values for all the child processes and the threads under $PID.
It also uses non-default kernel schedulers (e.g. SCHED_BATCH).

Namazu + xUnit tests: Reproducibility

Test case                                  Traditional   Namazu
YARN-4548 RM/TestCapacityScheduler         11%           82%
YARN-4556 RM/TestFifoScheduler             2%            44%
ZOOKEEPER-2137 ReconfigTest                2%            16%
YARN-4168 NM/TestLogAggregationService     1%            8%
YARN-1978 NM/TestLogAggregationService     0%            4%
YARN-4543 NM/TestNodeStatusUpdater         0%            1%

• More information: osrg/namazu#125

Namazu + xUnit tests: Reproducibility

Test case                              Traditional   Namazu
ZOOKEEPER-2080 ReconfigRecoveryTest    14.0%         61.9%

• "Renicing" is not always effective...
• But even when renicing is ineffective, you can sometimes reproduce the flaky test by injecting delays or reordering packets

    $ sudo iptables ... -j NFQUEUE --queue-num 42
    $ sudo nmz inspectors ethernet -nfq-number 42

NAMAZU + INTEGRATION TESTS

Namazu + Integration tests

• ZooKeeper: distributed coordination service
  • used in Hadoop, Spark, Mesos, Kafka..
• ZooKeeper 3.5 (alpha) introduced dynamic reconfiguration
• We performed an integration test to evaluate the reliability of the reconfiguration
• We found a flaky bug!

Namazu + Integration tests

• We permuted some specific Ethernet packets in random order using Namazu
  • TCP retransmissions are eliminated to reduce the possible state space

[Diagram: ZooKeeper cluster connected through Open vSwitch + Ryu SDN Framework + Namazu]

Found ZOOKEEPER-2212

• Bug: a new node cannot join the ZK cluster properly
  • As a result, the new node cannot become a leader of the ZK cluster
    (More technically, it remains an "observer")
• Cause: distributed race (ZAB packet vs FLE packet)
  • ZAB: atomic broadcast protocol for data
  • FLE: leader election protocol for the ZK cluster itself

[Diagram: the leader of the ZK cluster and the new ZK node exchange ZAB [2888/tcp] and FLE [3888/tcp] messages. The two protocols use different TCP connections, so the packet order is non-deterministic.]

Found ZOOKEEPER-2212

Data captured on 22/01/2016

Found ZOOKEEPER-2212

• Expected: the ZK cluster works even when fewer than N/2 nodes have crashed
• Real: a single node failure can terminate the 3-node ensemble

[Diagram: the new node is not participating properly (it keeps being an "observer")]

How hard is it to reproduce?

• Reproducibility: 0.0% (traditional) vs 21.8% (Namazu), tested 1,000 times
• We could not reproduce the bug with traditional testing even after 5,000 runs (60 hours!)
• It is also reproducible by "renicing" threads, but then the reproducibility is just 0.7%

Why can we hit the bug?

We define the distributed execution pattern based on code coverage:

$$P = \begin{pmatrix} p_{1,1} & \cdots & p_{1,N} \\ \vdots & \ddots & \vdots \\ p_{L,1} & \cdots & p_{L,N} \end{pmatrix}$$

• $L$: LOC
• $N$: number of nodes (3 in this case)
• $p_{i,j}$: 1 if node $j$ covers the branch in line $i$, otherwise 0
• We used JaCoCo: Java Code Coverage Library (patch: ZOOKEEPER-2266)

Namazu achieves faster pattern growth. That's why we can hit the bug.
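One way to read "pattern growth" is as the number of distinct coverage matrices $P$ observed as test runs accumulate. The sketch below counts distinct patterns under that reading; the interpretation, the `Pattern` type, and the helper names are assumptions for illustration, not the measurement code used for the talk.

```go
package main

import "fmt"

// Pattern is an L x N coverage matrix: p[i][j] is true if node j
// covered the branch in line i during one test run.
type Pattern [][]bool

// key serializes a pattern so that it can be used as a map key.
func key(p Pattern) string {
	buf := make([]byte, 0, len(p)*4)
	for _, row := range p {
		for _, covered := range row {
			if covered {
				buf = append(buf, '1')
			} else {
				buf = append(buf, '0')
			}
		}
		buf = append(buf, '\n')
	}
	return string(buf)
}

// countDistinct returns how many different patterns appear among runs.
func countDistinct(runs []Pattern) int {
	seen := map[string]struct{}{}
	for _, p := range runs {
		seen[key(p)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Two lines, three nodes; runs a and b differ in one entry.
	a := Pattern{{true, true, false}, {false, true, true}}
	b := Pattern{{true, true, false}, {true, true, true}}

	fmt.Println(countDistinct([]Pattern{a, b, a}))
}
```

Plotting this count against the number of runs, with and without the fuzzed schedule, would show how quickly each approach explores new execution patterns.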

HOW TO USE NAMAZU?

How to use Namazu?

Easy to install

    $ sudo apt-get install lib{netfilter-queue,zmq3}-dev
    $ go get github.com/osrg/namazu/nmz

Easy to get started

    $ sudo nmz container run -it -v /foo:/foo ubuntu
    [container]$ cd /foo && mvn test

• Provides a Docker-like CLI
• No code instrumentation needed
• No configuration needed (default: just renice threads)

How to use Namazu?

For threads ("renicing")

    $ sudo nmz inspectors proc -pid $TARGET_PID

For filesystem

    $ sudo nmz inspectors fs -mount-point /nmzfs

For network packets

    $ sudo iptables ... -j NFQUEUE --queue-num 42
    $ sudo nmz inspectors ethernet -nfq-number 42

Need distributed mode? (for integration testing)
Just add `--orchestrator-url http://foobar:10080/api/v3` to the CLI.

Namazu API (Go)

    type ExplorePolicy interface {
        QueueEvent(Event)
        ActionChan() chan Action
    }

    func (p *MyPolicy) QueueEvent(event Event) {
        // An event can contain Ethernet packet bytes.
        // You can also inject fault actions here.
        action := event.DefaultAction()
        // The action is fired at a random time in [10ms, 30ms].
        p.timeBoundedQ.Enqueue(action, 10*Millisecond, 30*Millisecond)
    }

    func (p *MyPolicy) ActionChan() chan Action {
        return p.timeBoundedQ.DequeueChan
    }

Namazu defines a REST API, so you can also use other languages.

API use case: found YARN-4301

• We found a bug: YARN cannot detect disk failures in which mkdir()/rmdir() blocks
• While reading the code, we noticed that the bug could theoretically occur, and then we actually triggered it using Namazu
• The point at which to inject the fault was known in advance, so we manually wrote a concrete scenario using the Namazu API
  • Much more realistic than JUnit + mocking

[Diagram: two cases. In one, mkdir() returns EIO explicitly; in the other, mkdir() blocks.]

API use case: found YARN-4301

    func (p *MyPolicy) signalHandler() {
        signal.Notify(sigChan, syscall.SIGUSR1)
        for {
            <-sigChan
            p.sleep = 10 * time.Minute // fault: blocks for 10 minutes
        }
    }
    go p.signalHandler()
    func (p *MyPolicy) QueueEvent(event Event) {..}
    func (p *MyPolicy) ActionChan() chan Action {..}

    $ go run mypolicy.go inspectors fs -mount-point /nmzfs

Set "yarn.nodemanager.local-dirs" to "/nmzfs/nm-local-dir".
Send SIGUSR1 to Namazu when you (and YARN) are ready.

Interactive testing is often easier than writing a JUnit test case.
We use SIGUSR1 here, but it would also be interesting to implement a human-friendly CLI or GUI for interactive testing.

Another API use case: "semi"-deterministic replay

• If you know the protocol, you can compute a hash for each packet
  • Note that you have to exclude time-dependent and random bytes when you hash the packet
• Using the hash and the Namazu API, you can "semi"-deterministically replay the scenario
  • Not fully deterministic; it is best-effort
  • Record-less! You just need to remember the "seed" for replaying
• PoC: ZOOKEEPER-2212: up to 65% reproducibility
  • More information: osrg/namazu#137
• See also (for Go): https://github.com/AkihiroSuda/go-replay

SIMILAR GREAT TOOLS

Similar great tool: Jepsen

• Network partitioner + linearizability tester
• Famous for the "Call Me Maybe" blog: http://jepsen.io/
  • "Call Me Maybe" by Carly Rae Jepsen (vevo): https://www.youtube.com/watch?v=fWNaR-rxAic
• Randomly injects network partitions using iptables
• "Linearizability" ∈ "Strong consistency"
• Integration test on a flaky network rather than a flaky xUnit test

Similar great tool: Jepsen

• Has been used to test several Apache projects
  • Cassandra: 9851, 10001, 10068, 10231, 10413, 10674
    • http://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
  • HBase
  • Kafka
  • Solr: 6530, 6583, 6610
    • http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks
  • ZooKeeper

Namazu + Jepsen?

• Namazu is much more generalized
  • The bugs we found/reproduced are basically beyond the scope of Jepsen (Threads, Disks..)
• Namazu can also be combined with Jepsen! It will be our next work..

Jepsen: causes network partitions; tests linearizability
Namazu: increases non-determinism; injects filesystem faults; ...

Similar great tool: CharybdeFS

• Makes the filesystem flaky using FUSE
• Used in testing ScyllaDB (Apache Cassandra's clone)
• https://github.com/scylladb/charybdefs
• Similar to Namazu FS
  • Both provide an API
• Also similar to PetardFS (not active since 2007)
• CharybdeFS can also be combined with Namazu
  • CharybdeFS is specialized in FS; Namazu is much more comprehensive.

Similar great tool: DEMi (appeared in NSDI'16)

https://github.com/NetSys/demi
• Found some akka-raft bugs and reproduced a few Spark bugs
• Challenge: reducing false positives related to instrumentation
• DEMi and Namazu are complementary to each other
  • DEMi is powerful, but has some limitations
  • Namazu is comprehensive and easy to get started with

                          Namazu                                      DEMi
Target                    Generic (Network, Filesystem, Thread..)     Akka
Getting started           Easy                                        Need to write AspectJ code
Deterministic replay?     No                                          Yes
Bug cause minimization?   No                                          Yes

SO... HOW CAN WE FIX FLAKY TESTS?

How can we fix flaky tests?

• Namazu finds/reproduces flaky tests, but it doesn't automatically fix them 😞
• Basic approach for async-related flakiness: adjust the values for sleep() and retries in the test code

    invokeAsyncOperation();
    // some tests lack even this sleep
    sleep(certainHardcodedTimeout);
    assertTrue(checkSomethingGoodHasHappened());

How can we fix flaky tests?

• Suggestion: the timeout (and retries) should be a configurable parameter rather than a hard-coded value

Timeout value   Cost (time)   Risk (timeout)   Appropriate for
Long            High          Low              Slow machine (e.g. CI); conservative person
Short           Low           High             Fast machine; risk-tolerant person

CONCLUSION

Conclusion

• Apache software is well tested
  • But its tests are flaky
• Let's improve them
  • Improve asynchronous code
  • Repeat tests
• Our tool can control non-determinism so as to reproduce flaky tests
  https://github.com/osrg/namazu