
From Symbolic Server Models to Interactive Tests
Specifying and Testing Networked Servers with Internal Nondeterminism

Yishuai Li Li-yao Xia Benjamin C. Pierce Steve Zdancewic

Abstract

We present a rigorous framework for automatically testing networked servers with internal nondeterminism. The framework uses executable server models defined in Gallina, the language of the Coq interactive theorem prover, supporting a smooth path to formal verification using the same models as specifications. The key innovation is a domain-specific embedded language for writing symbolic models that represent and manipulate the results of nondeterministic choices as symbolic constants.

Given such a symbolic model, we show how to automatically derive a test harness, comprising (1) a test client that generates message sequences designed to probe the server for unexpected behaviors, and (2) a compliance checker that validates the server's interactions with the test client. Both client and checker correctly account for potential nondeterminism in the symbolic model.

We demonstrate the effectiveness of this approach by using the framework to specify and test a simple HTTP-based key-value server.

1 Introduction

Cybersecurity demands flawless network servers, guaranteed by rigorous testing against formal specifications. Model-based testing [8] is a natural technique for addressing this challenge. In this approach, one first constructs a server model that acts as a specification defining the acceptable behaviors of the system, and then tests interactions with the server implementation to look for behaviors that don't comply with the model.

A critical challenge in model-based testing is dealing with servers' internal nondeterminism. Some servers are allowed to generate data nondeterministically (e.g. HTTP entity tags [13]) or are configured with different parameters (e.g. TCP retransmission policies). These internal choices might affect the server's observed behavior, and the tester needs to know about the choices to determine whether the server complies with the protocol. When it is costly, if not impossible, for the tester to access the server other than over the network, the tester must reason about the server's internal choices from the external behavior it observes.

Server models for testing purposes are mainly written in two styles: (a) operational semantics [2] and (b) reference implementations [19]. These specifications have been applied successfully to testing nondeterministic network servers, for low-level protocols like TCP and for high-level tools like Dropbox. However, current tools in either style have limitations: an operational-semantics-style specification cannot be executed directly as a networked server, nor have we seen one automatically derived into a tester client that can interact with servers. The reference-implementation approach addresses these problems for prototyping and test generation, but current tools require nondeterministic values to be chosen from a small finite set, which is not feasible for HTTP ETags, which can be arbitrary strings.

We propose a specification style that can be used for prototyping, testing, and verifying network protocols with generic nondeterminism, where internal values can be chosen from an unbounded range. Our specification is a high-level reference implementation, which we call a model implementation. It inherits the advantages of existing reference-implementation-style specifications for test generation and for generating executable programs, and it addresses the problem of testing nondeterministic protocols by symbolic evaluation.

In this style, nondeterministic choices in the model are represented as symbolic variables. To check a server's validity, the tester extracts a trace of observable behavior and determines whether some choices could lead the model to produce the observed behavior. The choices lead to different behavior in two ways, depending on where the symbolic variables appear in the model: either (1) in the responses sent by the server, which we call data flow nondeterminism, or (2) in the conditions of if-else branches, which we call control flow nondeterminism. We handle them by (1) symbolic evaluation and (2) backtracking search.

From this symbolic model implementation we can automatically derive (i) a reference server that can interact with networked clients and whose behavior includes that of any valid server, and (ii) a testing client that can interact with networked servers and check their conformance using the aforementioned checking methods. In previous work, we used the same specification language to verify a similar web server written in C [16], but that server didn't involve nondeterminism.

To validate our method, we specified a simple (single-client) key-value server speaking a simple HTTP-like protocol, supporting GET, PUT, and Compare-And-Swap operations. We derived a testing client to check networked servers with or without bugs. Experiments show that this testing framework accepts a valid server while quickly rejecting a number of "mutants" with manually inserted bugs.

In summary, our main contributions are:

• We propose a methodology by which the same specification can be used for prototyping, testing, and verifying networked servers with internal nondeterminism. The key innovation is to represent nondeterminism with symbolic expressions and to interpret a high-level "symbolic model implementation" in multiple ways, so that an executable prototype server and a validity checker for sample traces from other (possibly buggy) servers can be derived automatically.

• We demonstrate the effectiveness of this methodology by specifying and testing an HTTP-like protocol; our automatically derived tester is able to detect a variety of intentionally inserted bugs in the implementation. A similar server has previously been formally verified against a specification in the same style [16].

The paper is organized as follows. We describe the challenges of testing nondeterministic protocols in Section 2. Our specification language is described in Section 3, and we show how to derive a tester from a minimal model in Section 4. We then expand the model to handle control flow nondeterminism in Section 5. The complete derivation framework is evaluated in Section 6. We then compare related work with ours in Section 7.

2 Challenge: testing nondeterministic network protocols

To illustrate the challenge introduced by nondeterminism, we first demonstrate a toy query-response protocol (let's call it "Q"). Here we show that even minimal nondeterminism in Q requires nontrivial effort to write a tester for its functional correctness.

The Q protocol is informally described as follows: "The server has a state of some constant value. Clients may query the value. The server should respond 1 if the request matches the stored value and respond 0 otherwise." An implementation of a Q server is shown in Figure 1.

We can translate this server-side specification into the clients' view: responses observed from a valid Q server should be well-formed and reflect that the server has processed the request correctly. More specifically, if the server responds 1 to some query x, then it must respond 1 to all queries of the same x and respond 0 to all other queries; if the server responds 0 to some query y, then it must never respond 1 to any query of the same y.
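
To make this concrete, here are two small hypothetical traces, written as OCaml-style lists of the Send/Recv events used by the checker below; the values are our own illustration, not taken from the paper's experiments:

  (* Illustrative trace type: a query sent by the client and the
     one-character response received from the server. *)
  type event = Send of int | Recv of char

  (* Consistent with a Q server whose hidden value is 7. *)
  let consistent_trace = [Send 7; Recv '1'; Send 3; Recv '0'; Send 7; Recv '1']

  (* No stored value can explain answering both '1' and '0' to query 5,
     so a correct tester must reject this trace. *)
  let inconsistent_trace = [Send 5; Recv '1'; Send 5; Recv '0']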

A hand-written implementation of the client-side logic for testing Q is shown in Figure 2. The tester keeps track of the server's responses to different queries. A query that receives a 1 response is stored in a variable is (initially unknown), indicating that we know data is equal to the query. Queries that receive response 0 are added to a set is_not (initially empty), indicating that we know data is_not equal to the query.

The testing program in Figure 2 first extracts a trace by executing a client that interacts with the server, and then checks whether the trace is valid. The trace is a linked list constructed by [] (nil) and :: (cons).


int main() {
  init();
  const int data = rand();
  while (true) {
    int request = recv_request(&connection);
    char response = (request == data) ? '1' : '0';
    send_response(connection, response);
  }
}

Figure 1: Q server: The variable data is initialized to some random value. Upon receiving each request, we compare it with data, send back response, and loop back.

is := ∅; is_not := ∅;
let check(_trace) =
  match _trace with
  | Send query :: Recv '1' :: remainder ⇒
      if (query ∈ is_not ∨ (∃ data, is == {data} ∧ data != query))
      then reject("Wrong Answer")
      else is := {query}; check(remainder)
  | Send query :: Recv '0' :: remainder ⇒
      if (is == {query})
      then reject("Wrong Answer")
      else is_not := {query} ∪ is_not; check(remainder)
  | Send _ :: Recv other :: _ ⇒ reject("Bad Response")
  | Recv _ :: _ ⇒ reject("Unexpected Response")
  | Send _ :: [] ⇒ accept("All responses were valid")
  | [] ⇒ accept("All interactions checked")
  end
in
trace := execute(client);
check(trace)

Figure 2: Tester for the Q protocol (in pseudocode)


let loop(_data : int) =
  (connection, request) := recv();
  send(connection, request == _data);
  loop(_data)
in
data := choose();
loop(data)

Figure 3: Concrete model implementation (CMI) for Q server. The main loop receives a request on a connection, sends back the response, and loops. The value data is initialized nondeterministically.

The checker first pattern matches on the trace, expecting a Send event that contains a query, followed by a Recv event containing a response, which should be 0 or 1; it rejects the trace if it doesn't follow this pattern. For every pair of query and response, the checker determines whether they are consistent with its existing knowledge, and rejects the trace upon a violation. If the transaction is consistent, the checker updates its knowledge accordingly and keeps checking the remainder of the trace recursively, accepting the whole trace when it reaches the end.

Even for a very simple protocol like Q, the tester is already nontrivial. To scale up to real-world network protocols, where nondeterminism affects the server's behavior in various ways, a handwritten tester needs to carefully translate observed behavior into constraints on internal values. In a previous experiment, we did develop a tester for HTTP that handles a subset of conditional requests, but it cost us more effort to debug the tester itself than to find bugs in HTTP servers with it. This experience suggests that we should instead derive the tester automatically from the protocol specification, with a generic way to handle nondeterminism in any protocol specified in our language.

3 Specification language: symbolic model implementation

Our specification is written as a model implementation, which is a program in a higher-level language that defines the expected behavior of valid implementations. An implementation satisfies the protocol if and only if its behavior could have been performed by the model implementation.

Figure 3 shows a model implementation for the Q protocol. The code is similar to the C implementation in Figure 1, except that the model implementation is written as an "interaction tree" [22], a data structure in the Coq proof assistant that can represent interactive programs for prototyping, testing, and verification purposes [16]. This model involves three kinds of events: it can send and receive (recv) messages over the network, and choose an arbitrary value for data.

The nondeterministic nature of the model implementation poses a notable challenge for testing: how do we evaluate the model after a choose event, without knowing the value it returns?

Instead of enumerating all possible values returned by choose, we introduce an abstraction over the chosen value: instead of assigning data a concrete value, we represent it as a symbolic expression. When testing the server, the symbolic expressions are unified against the concrete behavior observed on the client side. Figure 5 shows the abstracted model: we substitute choose with a symbolic new event that returns an expression datax. Since the server's response value depends on this datax, we change send into a symbolic sendx whose argument is an expression. We call this abstraction a "symbolic model implementation" (SMI), as opposed to the "concrete model implementation" (CMI) in Figure 3 that generates concrete values.

To specify a networked program with a symbolic model implementation, a specification developer needs to: (1) represent the server's internal nondeterminism as impure events that produce symbolic expressions; (2) define the server's external interactions, where arguments that depend on nondeterministic values are represented as expressions composed of the values' representing symbols; and (3) describe a prototype server's behavior as a model program that performs the aforementioned internal and external events.
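
As a rough sketch of these three ingredients for the Q protocol, the event vocabularies of the concrete and symbolic models might be rendered as follows in OCaml; the paper's actual models are interaction trees in Gallina, so the types and names below are only an illustrative analogy:

  (* Symbolic expressions, following the grammar of Figure 4. *)
  type expression =
    | Exp_Var of string                 (* symbolic variable *)
    | Exp_Int of int                    (* integer literal   *)
    | Exp_Eqb of int * expression       (* equality test     *)

  (* Events of the concrete model (CMI): choices and messages are integers. *)
  type concrete_event =
    | Choose                            (* returns an arbitrary integer          *)
    | Recv                              (* returns a connection and a request    *)
    | Send of int                       (* sends a concrete response             *)

  (* Events of the symbolic model (SMI): internal choices become expressions. *)
  type symbolic_event =
    | New                               (* returns a fresh symbolic expression   *)
    | Recvx                             (* returns a connection and a request    *)
    | Sendx of expression               (* sends a response given as expression  *)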

A significant challenge for this method is that a server's branch condition might depend on nondeterministic values it previously generated. We will return to this kind of branching nondeterminism in Section 5.


e ::=                      expressions
  | Exp_Var x              variable
  | Exp_Int i              integer
  | Exp_Eqb (i, e)         equivalence

Figure 4: Symbolic expressions for specifying the Q protocol. To specify protocols that involve more arithmetic than equivalence, the expression language should be extended accordingly, within the capability of SMT solvers.

let loop(_datax : expression) =
  (connection, request) := recv();
  sendx(connection, Exp_Eqb(request, _datax));
  loop(_datax)
in
datax := new();
loop(datax)

Figure 5: Symbolic model implementation (SMI) for Q protocol. This program looks similar to the CMI in Figure 3, except that new generates a symbolic expression datax, and sendx sends an expression composed of the generated datax. Throughout this paper, we use the name foox for the symbolic variant of foo: datax is a symbolic expression that represents some data, and sendx sends a symbolic message.

4 Derivation: from model implementation to testing program

A sketch of our testing framework appears in Figure 6. Starting from the top left, we transform the SMI in two ways. On the left-hand side of the figure, we extract a testing client from the SMI; this client generates messages to send to some real server implementation, producing a trace of client–server interactions. On the right-hand side of the figure, we transform the server model into a dynamic tester (DT), which determines whether traces produced by running the client together with a server under test could also have been produced by the specification, i.e., the SMI.

The main innovation is how we determine whether an observed behavior could have been produced by the model implementation. Instead of searching for concrete values of the internally generated data, we check whether such values exist by solving constraints on them. The constraints are constructed by unifying the symbolic expressions sent by the server model against the actual traces observed from the server under test; for example, observing the server answer 1 to query 3 yields the constraint that Exp_Eqb (3, datax) must evaluate to true. In this paper we use a simple solver to handle equivalence, but the method is extensible to commodity SMT solvers.

4.1 Interpreting model implementations

The fundamental method in our testing framework is "interpretation." To interpret a model is to map its events into specific computations, preserving the original control flow structure. A model can be interpreted in multiple ways, by assigning different semantics to the model's events. This interpretation mechanism was introduced by interaction trees [22].

Using the semantic rules defined in Figure 7, we can interpret the SMI in Figure 5 into a CMI that is equivalent to Figure 3, so the framework does not require writing both an SMI and a CMI. By interpreting the SMI and CMI in various ways, as shown in Figure 6, we obtain a complete testing program that interacts with real-world servers.
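
Continuing the OCaml analogy from Section 3 (and reusing the hypothetical expression and event types sketched there), the interpretation of Figure 7 could be written roughly as follows; Random.int merely stands in for nondeterministic choice, and the bound is only for illustration:

  (* Evaluate a closed expression.  After this interpretation, every New is
     answered with a concrete Exp_Int, so Exp_Var should never occur here. *)
  let rec eval (e : expression) : int =
    match e with
    | Exp_Var _       -> assert false
    | Exp_Int i       -> i
    | Exp_Eqb (i, e') -> if i = eval e' then 1 else 0

  (* Map each symbolic event to the concrete computation of Figure 7. *)
  let concretize (ev : symbolic_event) =
    match ev with
    | New     -> `Return (Exp_Int (Random.int 1000))  (* choose data nondeterministically *)
    | Recvx   -> `Do Recv                             (* network events stay unchanged    *)
    | Sendx e -> `Do (Send (eval e))                  (* evaluate the expression and send *)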


Figure 6: Testing framework

If SMI wants to... (i.e. e)          then CMI should... (i.e. f(e))
Create a new expression datax        Choose an integer data nondeterministically
Send a boolean expression bx         Evaluate the expression and send its value (eval bx)

Figure 7: Interpretation from SMI to CMI. On the left are the events the SMI wants to perform; on the right are the translated events in the CMI.

If SMI wants to...               then ST should...
Receive a request                Observe a request from the client
Send a response responsex        Observe a response from the server, and guard that the response is unifiable with responsex
Create a new expression          Create a fresh variable x to construct the expression (Exp_Var x)

Figure 8: Interpretation from SMI to symbolic tester (ST)


let loop(_datax : expression) =
  (connection, request) := send();
  response := recv(connection);
  guard(response, Exp_Eqb(request, _datax));
  loop(_datax)
in
x := fresh();
let datax = Exp_Var x in
loop(datax)

Figure 9: Symbolic tester for Q protocol

If ST wants to...                              then DT should...
Create a fresh variable                        Map the fresh variable to is_not []
Guard that a value matches an expression       If the expression is unifiable with the value, then update the state by unification; otherwise report a violation

Figure 10: Interpretation from ST to DT. The unification method is the same as in Figure 2, except generalized to multiple variables.

4.2 From symbolic server to symbolic tester

The SMI defines the server's behavior. The tester we want to build stands on the other side of the network and expects to observe the behavior defined in the SMI. As shown in Figure 8, when the SMI sends or receives a message, the tester should receive or send that message, respectively. Since the SMI specifies the response contents sent by the server as a symbolic expression, the tester introduces a guard event that asserts that the response it observed matches responsex. Here we leverage the symbolic nature of responsex, so the assertion can be used to refine our knowledge of the server state in future steps.

As for the new events that represent choosing an arbitrary value, the tester creates a fresh variable to indicate that it initially knows nothing about the chosen value, until future guard events add constraints on the variable.

Using the interpretation rules in Figure 8, the SMI in Figure 5 is "dualized" into the symbolic tester (ST) in Figure 9.

4.3 From static tester to dynamic tester

We represent the internal state of the server symbolically to account for actions that are not visible to the tester. The dualized ST preserves the symbolic expressions in the SMI. To unify the symbolic events with concrete traces, we construct a stateful checker that maintains knowledge about the server's internal state.

The client-side knowledge is a global mapping from variables to constraints: each variable is either known to hold ("is") some concrete value or known not to hold ("is_not") any value in a list of wrong values (in the same way as in Figure 2).

When the ST creates a fresh variable, we use is_not [] to represent zero knowledge about the corresponding value. When guarding an equivalence expression to be true or false, the tester gains knowledge about whether the variable and the value being compared are equal. Here we show a simple constraint solver for our minimal example. To handle more complex expressions, we would need to expand our framework so that commodity SMT solvers can plug in.

The assertion guard(true, Exp_Eqb (i, x)) expects the existing constraint on x to be either is (i) or some is_not (l) where l doesn't contain i; in the latter case, we refine the constraint to is (i). Likewise, guard(false, Exp_Eqb (i, x)) expects the existing constraint on x to be either some is_not (l) or some is (j) where j != i; in the former case, we refine the constraint by adding i to the list l.
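
A minimal OCaml-style sketch of this constraint store and of the two guard rules just described (the names fact and guard_eqb, and the association-list representation, are our own illustrative choices, not the framework's actual code):

  (* Knowledge about one symbolic variable, as in Figure 2: either a known
     value ("is") or a list of values it is known not to be ("is_not"). *)
  type fact = Is of int | IsNot of int list

  (* Refine the knowledge map after guard(b, Exp_Eqb (i, Exp_Var x)).
     Returns None to signal a "Wrong Answer" violation; fresh variables are
     assumed to have been entered with IsNot []. *)
  let guard_eqb knowledge (b : bool) (i : int) (x : string) =
    let update f = Some ((x, f) :: List.remove_assoc x knowledge) in
    match List.assoc x knowledge, b with
    | Is j,    true  when j = i              -> Some knowledge    (* already known equal   *)
    | IsNot l, true  when not (List.mem i l) -> update (Is i)     (* learn the exact value *)
    | Is j,    false when j <> i             -> Some knowledge    (* already known unequal *)
    | IsNot l, false                         -> update (IsNot (i :: l))
    | _                                      -> None              (* contradiction         *)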

Interpreting the ST's events with the rules in Figure 10 yields a "dynamic tester" (DT) that reflects Figure 2.


Trace execute(size_t fuel) {
  Trace trace = empty;
  while (fuel-- > 0) {
    Request request = rand();
    send_request(&connection, request);
    append_trace(trace, Send(request));
    Response response = recv_response(connection);
    append_trace(trace, Recv(response));
  }
  return trace;
}

Figure 11: C equivalent to the extracted Q client.

4.4 From tester model to trace checker

The derived DT makes it possible to check whether a given trace could have been produced by the original server model (SMI). Suppose we have extracted a trace by executing some client against the server under test. The DT consumes the trace one event at a time and rejects it if (a) the DT expects one kind of event (e.g. sending a request to the server) but observes another kind (e.g. receiving a response from the server); (b) a response is malformed; or (c) a response is inconsistent with the tester's knowledge of the server state. In the explicitly written tester in Figure 2, these conditions correspond to (a) "Unexpected Response", (b) "Bad Response", and (c) "Wrong Answer". The DT accepts a trace if it consumes all events in the trace without detecting any violation ("All interactions checked"). A trace that contains a partial transaction, where the final request was not answered, is also acceptable ("All responses were valid"), since the client might have been halted before receiving the response, which does not indicate that the server is wrong.
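
As a usage illustration (hypothetical Q-protocol traces, reusing the trace notation sketched in Section 2), the DT's verdicts line up with these conditions as follows:

  (* Verdicts the derived DT would give on some hypothetical Q traces. *)
  let wrong_answer        = [Send 3; Recv '1'; Send 3; Recv '0']  (* (c): contradicts data = 3          *)
  let bad_response        = [Send 3; Recv 'x']                    (* (b): response is neither 0 nor 1   *)
  let unexpected_response = [Recv '1']                            (* (a): response before any request   *)
  let partial_accepted    = [Send 3]                              (* final request unanswered: accepted *)
  let complete_accepted   = [Send 3; Recv '0'; Send 4; Recv '1']  (* consistent with data = 4           *)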

4.5 From CMI to client program

Having constructed a property checker, the last thing we need is a tester client that feeds the checker with traces extracted from the server. By dualizing the CMI in a similar way to Figure 8, minus the handling of symbolic expressions, we get a model client. The client is then extracted to an executable program that interacts with the server over the network. Figure 11 shows a C program equivalent to the client derived from the CMI.

5 Testing nondeterministic branches

Nondeterminism affects the server's behavior in two ways, depending on where the symbolic variables appear in the server model: either (1) in the responses sent by the server, which we call data flow nondeterminism, or (2) in the conditions of if-else branches, which we call control flow nondeterminism. The Q protocol we have shown only exhibits data flow nondeterminism, but to test more realistic protocols like HTTP, we need to handle control flow nondeterminism as well.

5.1 Example: protocol KVS

To illustrate control flow nondeterminism, we introduce a "Key-Value-Swap" (KVS) protocol, which is simplified from HTTP.

A "Key-Value-Swap" (KVS) server maintains a mapping from addresses (keys) to their values, with the initial state undetermined. It accepts three methods:

• Request (GET k) queries the value stored at address k.


if (request.c == map[request.k]) map[request.k] = request.v;
else respond(412); /* Precondition Failed */

Figure 12: KVS server handling CAS requests

b := decide(Exp_Eqb(request.c, mapx[request.k]));
match b with
| true  ⇒ process(request)
| false ⇒ respond(412)
end

Figure 13: SMI for CAS requests

– Response (200:OK v) indicates that v == map[k].

• Request (PUT k v) modifies the value at address k.

– Response 204:NoContent indicates that map[k] has been updated to v.

• Request (CAS k c v) does a compare-and-swap operation at address k.

– Response 204:NoContent indicates that map[k] was equal to c before this request, and has been updated to v;

– Response 412:PreconditionFailed indicates that map[k] is not equal to c, and is not changed by this request.

The KVS protocol covers a subset of HTTP, featuring "If-Match" conditional requests [13]. Here we use a different message encoding, as the difficulty of parsing HTTP/1.1 messages (prone to request smuggling) is outside this paper's scope.

As shown in Figure 12, upon receiving a CAS request, a KVS server should process the request if the request's c field matches the target's value stored in the server. Otherwise, it should respond with "412 Precondition Failed" without performing any other operations. Without knowing the target's initial value (before performing any GET or PUT request on the target), the client cannot determine whether the server has processed or rejected the CAS request, and the two cases lead to different server states.

More generally, when an if-else branch (e.g. whether to execute the CAS request or not) depends on nondeterministic values (e.g. the server's initial state), the tester must somehow decide which branch was actually taken by the server under test, but this might not be revealed until more transactions are observed later. As a result, the tester should keep track of all possible branches taken, and discard those it finds to be impossible.
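
As a concrete, hypothetical illustration of this bookkeeping, suppose the tester knows nothing about map[0] and sends CAS 0 5 42. In OCaml-like notation (the record type is our own, purely for illustration), the branch set might look like this:

  (* One branch of the nondeterministic tester: the assumption it makes about
     the unknown initial value m0 = map[0], the status code it expects for
     the CAS, and what it believes map[0] to be afterwards. *)
  type branch = { assumes : string; expects : int; map0_after : string }

  let branches_after_cas =
    [ { assumes = "m0 = 5";  expects = 204; map0_after = "42" };   (* CAS succeeded *)
      { assumes = "m0 <> 5"; expects = 412; map0_after = "m0" } ]  (* CAS rejected  *)

  (* Observing a 204 response discards the second branch (mismatched status).
     A buggy server that answers 204 without updating its store (Bug 9 in
     Section 6) survives this step, but a later GET 0 that returns a stale
     value rather than 42 contradicts the surviving branch, so the whole
     trace is rejected. *)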

Figure 13 shows the part of the SMI that specifies how KVS servers should handle CAS requests. The condition is represented as a symbolic expression that depends on the server-side key-expression mapping. Since the model needs a concrete value to determine which branch to take, we introduce a new event, decide, to obtain the branch condition's boolean value.

5.2 Backtracking: explore all possible branches

When deriving a tester from the SMI, the newly introduced decide events remain unchanged in the dualization from SMI to ST. To interpret the ST into a DT, we handle decide events in two phases:

Phase 1: From static tester to nondeterministic tester. When the ST wants to decide the value of a boolean expression bx, we consider both cases, where the expression evaluates to true or to false; either might be correct. As shown in Figure 14, the tester flips a coin, taking both sides, and attempts to unify the expression with each chosen boolean value.


nt_of_st (b := decide(exp); M'(b)) =
  b := flip(); unify(b, exp); M'(b)

Figure 14: Interpreting decide events in ST

let simulate((r := e; M'(r)), others) =
  match e with
  | flip ⇒ simulate(M'(true), M'(false) :: others)
  | send | recv ⇒
      r := e; simulate(M'(r), match_event(e, r, others))
  | throw error ⇒            (* Current branch rejects;     *)
      match others with      (* try other branches, if any. *)
      | [] ⇒ throw error
      | other :: others' ⇒ simulate(other, others')
      end
  end
in
simulate(nt, [])

Figure 15: Simulating NT

This is similar to guarding the value of an expression, except that the target value is not observed from the wire, but picked with a flip. Notice that flipping in this tester is different from generating a random value, as the tester takes both sides of the coin to see whether either is satisfiable.

This "from decide to flip" phase is performed along with the other interpretations defined in Figure 10, and produces a semi-dynamic "nondeterministic tester" (NT), which is conceptually a nondeterministic Turing machine, to be interpreted into a dynamic, deterministic tester in the next phase.

Phase 2: From nondeterministic tester to deterministic tester. To simulate the NT, when a coin is flipped, the deterministic tester chooses one side (e.g. true) to unify with the expression, and stores the other side (i.e. false) in a set of "other possibilities." When the tester observes some event, it matches all possible branches against the observation and discards those branches whose events mismatch. If a violation is reported on the currently chosen branch, the tester discards the model on this branch and looks into the other possibilities, as shown in Figure 15. The tester accepts a trace if some branch consumes the whole trace, and rejects it if all possible branches report some violation.

6 Evaluation

To evaluate whether these derived testers are effective at finding bugs, we ran the KVS tester against multiple buggy variants of a KVS server implementation.

6.1 Experiment setup

As shown in Figure 16, we derived a tester (checker + client) from the SMI of the KVS protocol, using the methods described in Section 4 and Section 5. We then ran the checker against a KVS server written in C.1 The keys and values of the requests generated by the testing client were uniformly distributed in the range [0, 9]. For each server, we ran the test 29 times and analysed the performance in quartiles.

1 CPU: Intel i7-8700, 3.20 GHz; OS: openSUSE Leap 15.1; Network: Local Loopback; Round-trip delay time: 0.06 ± 0.04 ms (at 95% confidence).


Figure 16: Evaluation framework

6.2 Results

6.2.1 Mutation testing

The tester found one bug in the C server we wrote, and accepted the server after we fixed the bug. This acceptance indicates that either (1) the server is correct, or (2) the tester was incapable of finding the remaining bugs. To rule out possibility (2), we manually inserted 11 more bugs into the accepted server. The bugs were located in various parts of the code: message encoding and decoding, key-value mapping, evaluating compare-and-swap conditions, and the semantics of request methods.

Our tester rejected all buggy (intentionally or not) instances of the server. The fact that the tester can quickly detect a dozen different flaws gives us increased confidence that the server it accepted was correct.

6.2.2 Performance

All buggy instances were rejected within half a second. More specifically, the client extracted 852 ± 7 request and response messages per run, taking 0.41 ± 0.03 seconds (at 95% confidence). The reason the client extracted traces of different lengths between executions is that the client was an infinite model program derived from the SMI, with nondeterministic branching. To eventually reach a result, we ran the client for a finite number of steps, defined as fuel. Some branches consumed more fuel than others, resulting in the network interactions being triggered fewer times.

All buggy traces were rejected by the checker within 4 ms. As shown in Figure 17, the bugs differ in depth: some bugs, on average, were revealed by a longer prefix of the trace than others, and took the checker more time to detect. The deepest bugs (6 and 9) are related to the CAS method: Bug 6 updated the value upon a successful comparison, but sent back "412 Precondition Failed", indicating that the comparison had failed; Bug 9 sent back "204 No Content" upon a successful comparison, but did not update the corresponding value. These bugs are deep because the checker cannot tell the violation immediately, as the server did send a correct response; they can only be detected by future observations revealing that the server didn't update its internal state accordingly.

Some of these bugs come from real-world practice: Bug 1 was unintentionally introduced when we wrote the C server, and Bug 6 imitates a violation of RFC 7232 in Apache, which we previously found with a handwritten tester.


Figure 17: Time spent and trace consumed by the checker to detect each bug, shown as boxplots on a logarithmic scale: (a) time to failure (microseconds); (b) length of the failing trace prefix.

7 Related Work

Testing of transition systems has been well studied over the past decades [1, 6, 7, 8, 11, 14, 15]. Here we compare our method against the most closely related existing techniques we are aware of.

Note to readers of this preliminary draft: The literature on testing is huge, and we may have missed things that are relevant. We would greatly appreciate it if readers would let us know about other threads that we should look at.

We describe the literature in three aspects: (1) languages used for protocol specifications, (2) theories for testing against specifications, and (3) tools developed for real-world testing practice. Among the works we mention, the Network Semantics (NETSEM) project [2] is closely related, and will be discussed under each aspect.

7.1 Specification Languages

The TCP/IP protocol stack is specified by the Internet Engineering Task Force (IETF) in Request for Comments (RFC) documents. The informal nature of these specifications makes them inappropriate for mechanical conformance testing.

The International Organization for Standardization (ISO) has developed two formal description techniques for open distributed systems: the Language of Temporal Ordering Specification (LOTOS) [5], based on process algebra, and Estelle [9], where tasks are described as extended finite state machines (EFSM). The former has been adopted as an international standard.

Distributed concurrent systems in LOTOS are described as processes that interact via channels. From the structure of a process definition, one can systematically derive the labelled transitions the process might perform. These transitions form an "action tree", which is different from the "interaction tree" we use: an action tree branches when the process is allowed to interact with the environment via multiple channels, whereas our interaction tree treats interactions as nodes and branches on the different return values of an interaction. Despite the structural differences, our reference-implementation style of specification is similar to LOTOS. Whether these two structures can simulate each other is to be explored in future work.

Beyond ISO, labelled transition systems (LTS) have been used for specifying and testing networked systems like TCP [2, 17], and are capable of formal verification [3, 4]. In a previous experiment, we handwrote an interactive tester program as a functional correctness specification for HTTP.


These specification styles focus on validating existing implementations without providing an executable example for reference.

7.2 Test Algorithms

In property-based testing, high-level specifications are written as executable decision procedures, run with many generated inputs (chosen randomly, enumeratively, by search, ...). Claessen and Hughes [12] have applied property-based testing to stateful interfaces in various ways. In particular, they demonstrate how to test the conformance of a library with respect to a deterministic model. In contrast, our framework aims to provide a rich specification language of abstract servers to express fine-grained communication protocols, where requests strongly depend on previous responses, making it cumbersome to model the state explicitly. Rather than generating individual inputs such as sequences of actions, our framework transforms specifications into executable clients which can react dynamically to the server's responses.

For specifications that do not directly describe which traces are valid, we need to develop a function that analyzes whether a test result conforms to the protocol. The general idea is to check whether an observed behavior is included in the model's transitions.

TETRA [20] is a trace analysis tool that checks observed behavior against LOTOS specifications. It can handle control flow nondeterminism by enumeration and backtracking, but cannot handle data values chosen from an unbounded range.

To address the challenge of testing protocols with internally chosen data, Wu and von Bochmann have proposed an execution model [21] that randomly searches for a suitable process corresponding to the generated value. This execution model claims statistical completeness with probability 1, but we haven't seen implementations of it and doubt its feasibility, as the chance of guessing a value within an unbounded range is negligible.

Instead of searching for a concrete value for generic choices, Bishop et al. [2] applied symbolic evaluation to the testing process, and determine traces' conformance by solving constraints on the symbolic expressions that represent nondeterministic values. The same idea of constraint solving is used in our framework.

Testing of nondeterministic reactive systems is well studied under the theory of conformance testing [17, 18]. It studies formal relationships between implementations and specifications modeled as labelled transition systems, in particular by comparing their interactions with a common "observer." A shared goal with our work is to derive efficient observers, which we call "clients," to distinguish conforming implementations. It would be interesting to reframe our work in those terms. Since our testing framework is written in Coq, it is particularly amenable to formal verification, and formalizing I/O-conformance concepts could represent a significant step towards that goal.

7.3 Protocol Testing Pragmatics

Using a formal language strongly inspired by LOTOS, TorXakis [19] implements a test generation tool for symbolic transition systems. The testing framework unfolds the model definition into a behavior tree that defines the primitives for generating test cases.

Unlike our tester, which extracts an entire trace before sending it to the checker, TorXakis executes the generated test steps and checks each response immediately before generating the next step. Although this on-the-fly process is technically achievable in our framework, we chose to derive the client and checker separately for (1) the potential flexibility of tuning the client's effectiveness (generating test cases that capture bugs more quickly) without interfering with the checker, and (2) allowing the checker to work on its own against a trace extracted by any arbitrary client.

Although TorXakis has been used for testing real-world systems like Dropbox [19], it doesn't support generalized choice over infinite sets of values, the same limitation as TETRA [20]. In order to test the Q protocol, a TorXakis model needs to take one of three detours: (1) restrict the randomness of the generated value to a finite range; (2) specify the clients' knowledge of the generated value; or (3) expose the generated value on some channel. Detour (1) tests the server by enumerating all possible values the server might generate, each corresponding to a process to be checked; if the model generates an n-bit random number, the tester needs to search through up to 2^n processes for the one that corresponds to the server.


Detour (2) makes the model as complex as the tester in Figure 2, whose drawbacks are discussed in Section 2. Detour (3) requires the server to send the generated value explicitly, so the model doesn't accept a Q server like the one in Figure 1 that keeps the value hidden from clients.

As for specifications written as transition rules: Aichernig et al. have derived a test case generator for a web-service application, based on business-rule models that translate into an EFSM. The NETSEM project has tested network protocols like TCP, UDP, and Sockets based on rigorous specifications written as an operational semantics for an LTS [2].

Our work is similar to NETSEM in its testing mechanism, which handles internal nondeterminism by maintaining constraints during symbolic evaluation. Our work and NETSEM use different specification languages, reflecting different interests in how the specification is used: NETSEM provides full-scale specifications for real-world protocols like TCP, and the tests are generated by a separately developed program. Our project starts from a simplified protocol, but aims for a specification language with more uses than shown in NETSEM. In addition to validating existing implementations, our specification can (1) generate an interactive testing client to extract observable behavior from servers, (2) derive a server prototype as a reference implementation, and (3) verify the functional correctness of C implementations using the Verified Software Toolchain [10].

In previous work [16], we specified a simpler, deterministic protocol, in a simpler version of the framework, and used the specification for both formal verification and random testing against a networked server. However, the client was a separate artifact written manually. We show here that even the client can be derived, and that this technique extends to testing nondeterministic protocols. The extended specification should also be amenable to formal verification, as we discuss in the next section.

8 Future Work

8.1 Specification

We have presented a method for specifying and testing servers that serve one client at a time. In real-world practice, servers might maintain multiple connections simultaneously. Requests sent by different clients might arrive at the server in any order due to network delays. In a previous experiment, we handled such network-intrinsic nondeterminism by network refinement [16], a variant of observational refinement. With the help of network refinement, we believe the specification and testing methods in this paper can be applied to multi-connection servers as well.

8.2 Test case generation

The tester we’ve shown generates uniformly distributed requests, but there exist some patterns of test casesthat are more likely to trigger bugs. For example, when the client has requested an operation on some target,and we’ve gained some experience (by testing or debugging servers) that the operation handler is likely tohave bugs, then we’d be interested in a followup check of the target to see whether the requested operationwas performed properly. If more bugs are related to memory safety, we’d emphasize checking non-targetsto see if they’re not interfered. To reflect such interest in generating certain patterns of test cases, we planto expand the language for model implementation, allowing specification writers to embed domain-specificknowledges about what kind of requests are more likely to trigger violations. We’ll then derive the client toadjust its distribution of request generator dynamically, based on previous interactions and hints embeddedin the specification. By tuning the client to emphasize some patterns of test cases, we’d expect to triggerbugs more quickly.

8.3 Shrinking

When the tester rejects a server, it reports the first failing trace as a counterexample, which possibly contains irrelevant transactions. To locate the bug efficiently, we hope to shrink the counterexample to a minimal trace that clearly shows why the server is wrong.


Existing shrinking techniques for pure functional programs need to be adjusted to our scenario, where the response of an impure server differs from one execution to another. Shrinking test cases for interactive, nondeterministic programs deserves more exploration.

8.4 Verification

Our specification language is designed for both testing and verification purposes. We expect to verify a C server against this specification, which would provide an even more rigorous correctness guarantee than testing.

Although a limited number of test cases doesn't give us a full guarantee of the server's compliance on arbitrary inputs, we are still interested in proving that the tester is "exhaustive", i.e., for every server that contains some bug, the tester can eventually generate some counterexample that reveals the bug. Given an exhaustive tester, as the number of tests increases, our confidence in the server's correctness converges to 100%.

9 Conclusion

We have shown how to (1) specify servers with model implementations, (2) handle internal nondeterminism with symbolic evaluation, and (3) derive an interactive tester from the symbolic model implementation. Our derived tester can tell valid servers from buggy mutants, and the specification supports many other uses, such as prototyping and formal verification.

References

[1] Saswat Anand, Edmund K. Burke, Tsong Yueh Chen, John Clark, Myra B. Cohen, Wolfgang Grieskamp, Mark Harman, Mary Jean Harrold, Phil McMinn, Antonia Bertolino, J. Jenny Li, and Hong Zhu. An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software, 86(8):1978–2001, 2013.

[2] Steve Bishop, Matthew Fairbairn, Hannes Mehnert, Michael Norrish, Tom Ridge, Peter Sewell, Michael Smith, and Keith Wansbrough. Engineering with logic: Rigorous test-oracle specification and validation for TCP/IP and the Sockets API. J. ACM, 66(1), December 2018.

[3] Aaron Bohannon and Benjamin C. Pierce. Featherweight Firefox: Formalizing the core of a web browser. In Usenix Conference on Web Application Development (WebApps), June 2010.

[4] Aaron Bohannon, Benjamin C. Pierce, Vilhelm Sjoberg, Stephanie Weirich, and Steve Zdancewic. Reactive noninterference. In CCS '09: Proceedings of the 16th ACM Conference on Computer and Communications Security, pages 79–90, New York, NY, USA, 2009. ACM.

[5] Tommaso Bolognesi and Ed Brinksma. Introduction to the ISO specification language LOTOS. Computer Networks and ISDN Systems, 14(1):25–59, 1987.

[6] Ed Brinksma, Lex Heerink, and Jan Tretmans. Developments in testing transition systems, pages 143–166. Springer US, Boston, MA, 1997.

[7] Ed Brinksma and Jan Tretmans. Testing Transition Systems: An Annotated Bibliography, pages 187–195. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.

[8] Manfred Broy, Bengt Jonsson, J-P Katoen, Martin Leucker, and Alexander Pretschner. Model-based testing of reactive systems. In Volume 3472 of Springer LNCS. Springer, 2005.

[9] S. Budkowski and P. Dembinski. An introduction to Estelle: A specification language for distributed systems. Computer Networks and ISDN Systems, 14(1):3–23, 1987.

[10] Qinxiang Cao, Lennart Beringer, Samuel Gruetter, Josiah Dodds, and Andrew W. Appel. VST-Floyd: A separation logic tool to verify correctness of C programs. Journal of Automated Reasoning, 61(1-4):367–422, June 2018.

[11] Stephen Castro. The relationship between conformance testing of and interoperability between OSI systems. Computer Standards & Interfaces, 12(1):3–11, 1991.

[12] Koen Claessen and John Hughes. Testing monadic code with QuickCheck. In Proceedings of the 2002 ACM SIGPLAN Workshop on Haskell, Haskell '02, pages 65–77, New York, NY, USA, 2002. Association for Computing Machinery.

[13] Roy T. Fielding and Julian Reschke. Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests. RFC 7232, June 2014.

[14] Alan Hartman, Andrei Kirshin, Kenneth Nagin, Sergey Olvovsky, and Aviad Zlotnick. Model based test generation for validation of parallel and concurrent software, August 8, 2006. US Patent 7,089,534.

[15] Ramon Janssen, Frits Vaandrager, and Jan Tretmans. Relating alternating relations for conformance and refinement. In Wolfgang Ahrendt and Silvia Lizeth Tapia Tarifa, editors, Integrated Formal Methods, pages 246–264, Cham, 2019. Springer International Publishing.

[16] Nicolas Koh, Yao Li, Yishuai Li, Li-yao Xia, Lennart Beringer, Wolf Honore, William Mansky, Benjamin C. Pierce, and Steve Zdancewic. From C to interaction trees: Specifying, verifying, and testing a networked server. In Proceedings of the 8th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2019, pages 234–248, New York, NY, USA, 2019. ACM.

[17] Jan Tretmans. Conformance testing with labelled transition systems: Implementation relations and test generation. Computer Networks and ISDN Systems, 29(1):49–79, 1996. Protocol Testing.

[18] Jan Tretmans. Test generation with inputs, outputs, and quiescence. In Proceedings of the Second International Workshop on Tools and Algorithms for Construction and Analysis of Systems, TACAS '96, pages 127–146, Berlin, Heidelberg, 1996. Springer-Verlag.

[19] Jan Tretmans and Pierre van de Laar. Model-based testing with TorXakis: The mysteries of Dropbox revisited. In Strahonja, V. (ed.), CECIIS: 30th Central European Conference on Information and Intelligent Systems, October 2-4, 2019, Varazdin, Croatia. Proceedings, pages 247–258. Zagreb: Faculty of Organization and Informatics, University of Zagreb, 2019.

[20] Gregor von Bochmann, D. Desbiens, M. Duboc, Devin Ouimet, and F. Saba. Test result analysis and validation of test verdicts. In 3rd Int. Workshop on Protocol Test Systems, 1991.

[21] Cheng Wu and Gregor von Bochmann. An execution model for LOTOS specifications. In [Proceedings] GLOBECOM '90: IEEE Global Telecommunications Conference and Exhibition, pages 1890–1895 vol. 3, December 1990.

[22] Li-yao Xia, Yannick Zakowski, Paul He, Chung-Kil Hur, Gregory Malecha, Benjamin C. Pierce, and Steve Zdancewic. Interaction trees: Representing recursive and impure programs in Coq. Proc. ACM Program. Lang., 4(POPL):51:1–51:32, December 2019.
